HPC and MLOps Manager
- Donostia / San Sebastián
We are looking for a motivated person with proven experience managing IT systems and cloud infrastructure to join a team that designs and implements a common computing system for the entire organization in support of research projects. The system is intended to cover the computational needs of the organization's researchers in a more efficient and democratic way, and to make resource management and maintenance more efficient.
The candidate is expected to have the knowledge and experience to manage, following best practices, the configuration, maintenance, upgrade and monitoring of the computing system at two levels: (1) the HPC, Artificial Intelligence and Big Data hardware infrastructure, which includes local clusters of machines with the latest GPU technology and other hardware environments for training, testing and inference of Deep Learning models, distributed storage and CI/CD; and (2) the software platform, which provides applications and services that allow research work to be carried out efficiently and that simplify integration, maintenance and monitoring operations. Other objectives involve active participation in research projects, supporting and assisting researchers in the implementation of emerging and innovative technologies.
Candidates should show a proactive attitude towards problem solving, excellent information technology (IT) skills, the ability to work in a team and a commitment to understanding the needs of colleagues. Candidates should also have the necessary skills in cloud technologies, DevOps and systems, especially distributed computing and storage systems, scalability and security. Knowledge of machine learning processes and general data processing would be an asset.
Tasks and responsibilities include:
- Assess existing HW infrastructure (focused on GPU servers, file servers and networking), identify needs and participate in the system modernization design process.
- Derive future HW needs
- Assist in the implementation of the internal HPC platform (in collaboration with the center's external consultants)
- Maintain, upgrade and support the internal HPC platform
- Implement best practices in CI/CD and MLOps
- Develop middleware for MLOps
- Provide support/consulting for the implementation of MLOps for third parties, in private or public clouds/clusters
Candidates must have:
Education: At least a master's degree in computer science or telecommunications.
Experience: We are looking for a versatile engineer with demonstrable experience in the following areas:
- Linux environments (user management, scripting, service management, process monitoring and tuning)
- Network configuration (traffic monitoring in communication networks and security)
- Distributed Storage Systems (S3, BeeGFS, Lustre, Ceph, NAS configuration)
- HPC job scheduling system: Slurm
- Containerization technologies: Docker
- Microservices and orchestration technologies: Kubernetes
- CI/CD tools: GitLab
We value candidates with:
- Experience in HPC architectures, GPU servers, data-driven architectures, distributed storage.
- Bare-metal virtualization solutions: Proxmox, MAAS, OpenStack
- Implementation of Big Data and DB systems: Kafka, PostgreSQL, Spark, MongoDB, Cassandra
- Configuration automation tools: Ansible, Puppet, etc.
- Knowledge of different cloud service providers and their service offerings (e.g. IaaS, PaaS): Amazon Web Services, Google Cloud Platform, Microsoft Azure.
- Code-defined infrastructures: e.g. AWS CloudFormation, Terraform
- MLOps and AI workflow management tools: Airflow, Kubeflow, etc.
We offer:
- The opportunity to join a dynamic, innovative and internationally leading Center in the field of Artificial Intelligence and Visual Computing & Interaction, with work centers in Donostia.
- A multidisciplinary team in the Digital Positioning department.
- Creative freedom to conduct research aligned with the Center's management procedures.
- Personal development through training and educational opportunities.
- Career opportunities and professional progression.
- Work-life balance policies that help reconcile work and family life.
- Equal employment opportunities.