The Principal AI Ops Engineer manages the deployment, monitoring, and maintenance of AI models and systems. This role involves ensuring the reliability, scalability, and performance of AI systems, collaborating with cross-functional teams to optimize AI operations, and troubleshooting issues as they arise.
Responsibilities and Duties
- Deploy, monitor, and maintain AI models and systems to ensure optimal performance and reliability.
- Implement and manage CI/CD pipelines for the continuous integration and delivery of AI models.
- Collaborate with data scientists, AI engineers, and other stakeholders to understand model requirements and ensure successful deployment.
- Monitor the performance of AI models and systems, identifying and resolving issues promptly.
- Develop and maintain automated monitoring and alerting systems to ensure the health and performance of AI systems.
- Optimize AI models and infrastructure for scalability and efficiency.
- Ensure compliance with data governance, security, and regulatory standards in AI operations.
- Document deployment procedures, monitoring processes, and maintenance activities.
- Stay updated with the latest advancements in AI operations and infrastructure technologies.
- Provide technical support and guidance to junior AI Ops engineers and other team members.
- Participate in project planning and contribute to the development of project timelines and deliverables.
- Perform other duties relevant to the job as assigned by the Head of AI Ops & Infrastructure or senior management.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, or a related field
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer) are preferred
- Minimum of 8 years of experience in AI operations, DevOps, or related fields
- Experience in managing the deployment and maintenance of AI models
- Strong programming skills in languages such as Python
- Proficiency in AI and machine learning frameworks (e.g., TensorFlow, PyTorch)
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI)
- Excellent problem-solving and troubleshooting skills
- Strong communication and interpersonal skills
- In-depth knowledge of AI operations and infrastructure management
- Familiarity with cloud platforms (e.g., AWS, Azure, Google Cloud) and their AI services
- Understanding of data governance, security, and regulatory standards
- Ability to manage multiple tasks and prioritize effectively
- Strong attention to detail and commitment to delivering high-quality work
- Ability to work independently and as part of a team
- Programming languages (e.g., Python, Java, C++)
- AI and machine learning frameworks (e.g., TensorFlow, PyTorch)
- Monitoring and logging tools (e.g., Prometheus, ELK Stack)
- Collaboration and communication tools (e.g., Slack, Microsoft Teams)