Amazon SageMaker HyperPod Now Supports Amazon EKS: Enhancing AI Model Training with Kubernetes
Today marks an important milestone for the AI and machine learning community: Amazon has announced the integration of Amazon Elastic Kubernetes Service (EKS) with Amazon SageMaker HyperPod. This new capability combines the robustness of Kubernetes with the specialized infrastructure of Amazon SageMaker HyperPod, and is aimed at streamlining foundation model (FM) development and the training of large-scale AI models.
What is Amazon SageMaker HyperPod?
Amazon SageMaker HyperPod is purpose-built infrastructure for distributed training of large machine learning models. It lets users scale training workloads across thousands of AI accelerators, reducing training time by up to 40%. The service is engineered to provide a resilient environment that automatically detects and repairs issues, allowing uninterrupted model training over extended periods.
Key Features of Amazon SageMaker HyperPod with EKS Integration
- Kubernetes-Based Management:
With the new EKS support, users can now manage their SageMaker HyperPod clusters through a Kubernetes interface. This integration enables seamless switching between Slurm and Amazon EKS, optimizing various workloads such as training, fine-tuning, experimentation, and inference. Kubernetes is a popular choice for machine learning workloads due to its scalability and rich ecosystem of open-source tools.
- Enhanced Observability:
The CloudWatch Observability EKS add-on provides comprehensive monitoring capabilities. Users can gain insights into CPU usage, network performance, disk activity, and other low-level node metrics via a unified dashboard. This detailed observability extends to resource utilization across the entire cluster, making troubleshooting and optimization more efficient.
- Resilient Training Environment:
SageMaker HyperPod automatically detects and repairs faulty instances, ensuring uninterrupted training. This feature is crucial for training models over weeks or months, allowing data scientists to focus on model development without worrying about infrastructure management.
- Advanced Job Management:
Job management is streamlined with the optional HyperPod CLI, designed for Kubernetes environments. Users can also leverage their existing CLI tools for managing jobs. Integration with Amazon CloudWatch Container Insights provides advanced observability, offering deeper insights into cluster performance, health, and utilization.
Technical Overview
Setting Up Amazon SageMaker HyperPod with EKS
To get started with Amazon EKS support in Amazon SageMaker HyperPod, follow these steps:
- Prepare the Scenario:
Review the prerequisites and create an Amazon EKS cluster using an AWS CloudFormation stack. This setup includes configuring VPC and storage resources.
- Cluster Configuration:
Use the AWS Management Console or AWS Command Line Interface (CLI) to create and manage SageMaker HyperPod clusters. Specify the cluster configuration in a JSON file, including details such as the EKS cluster ARN, instance groups, and resilience configurations.
```json
{
  "ClusterName": "example-hp-cluster",
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "${EKS_CLUSTER_ARN}"
    }
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 32,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET_NAME}",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${EXECUTION_ROLE}",
      "ThreadsPerCore": 1,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "$SECURITY_GROUP"
    ],
    "Subnets": [
      "$SUBNET_ID"
    ]
  },
  "ResilienceConfig": {
    "NodeRecovery": "Automatic"
  }
}
```
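The configuration above uses shell-style placeholders such as `${EKS_CLUSTER_ARN}` and `$SUBNET_ID`, which must be replaced with real values before the file is passed to the CLI. As a minimal sketch, assuming you want to fill them in and check that the result is valid JSON before submitting it, Python's standard library is enough. The template below is a trimmed-down version of the configuration, and the ARN, security group, and subnet values are hypothetical, for illustration only.

```python
import json
from string import Template

# Trimmed-down template using the same placeholder names as the configuration above.
CONFIG_TEMPLATE = """
{
  "ClusterName": "example-hp-cluster",
  "Orchestrator": {"Eks": {"ClusterArn": "${EKS_CLUSTER_ARN}"}},
  "VpcConfig": {
    "SecurityGroupIds": ["$SECURITY_GROUP"],
    "Subnets": ["$SUBNET_ID"]
  }
}
"""

# Hypothetical values, for illustration only.
EXAMPLE_VALUES = {
    "EKS_CLUSTER_ARN": "arn:aws:eks:us-west-2:111122223333:cluster/example-eks",
    "SECURITY_GROUP": "sg-0123456789abcdef0",
    "SUBNET_ID": "subnet-0123456789abcdef0",
}


def render_config(template_text: str, values: dict) -> dict:
    """Substitute placeholders and parse the result as JSON.

    Template.substitute raises KeyError when a placeholder has no value,
    and json.loads raises json.JSONDecodeError on malformed JSON
    (for example, a trailing comma after the last array element).
    """
    rendered = Template(template_text).substitute(values)
    return json.loads(rendered)


if __name__ == "__main__":
    config = render_config(CONFIG_TEMPLATE, EXAMPLE_VALUES)
    print(json.dumps(config, indent=2))
```

Parsing the rendered text before calling the CLI surfaces missing values and syntax errors locally, instead of as a rejected API request.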
- Cluster Creation:
Run the following AWS CLI command to create the cluster:
```sh
aws sagemaker create-cluster --cli-input-json file://eli-cluster-config.json
```
Verify the cluster status in the SageMaker Console, ensuring it changes to "InService."
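Instead of watching the console, the status check can also be scripted. The sketch below assumes the boto3 SageMaker client and its `describe_cluster` call, which returns a `ClusterStatus` field. The helper name `wait_for_in_service`, the polling interval, and the poll limit are illustrative choices, not part of any AWS tooling. The describe call is passed in as a function so the polling logic can be exercised without AWS credentials.

```python
import time
from typing import Callable


def wait_for_in_service(
    describe: Callable[[], dict],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the cluster reports InService or Failed, or polls run out."""
    status = "Unknown"
    for _ in range(max_polls):
        status = describe()["ClusterStatus"]
        if status in ("InService", "Failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"cluster still in status {status!r} after {max_polls} polls")


def poll_real_cluster(cluster_name: str = "example-hp-cluster") -> str:
    """Poll a real cluster; requires boto3 and configured AWS credentials."""
    # Imported here so wait_for_in_service stays usable without boto3 installed.
    import boto3

    client = boto3.client("sagemaker")
    return wait_for_in_service(
        lambda: client.describe_cluster(ClusterName=cluster_name)
    )
```

Calling `poll_real_cluster()` from a shell with AWS credentials configured returns once the cluster named in the configuration reaches "InService" or "Failed".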
- Job Management:
Use kubectl commands to manage resources and jobs from your development environment. For advanced troubleshooting, use AWS Systems Manager (SSM) to log in to individual nodes.
- Running Jobs:
Follow the steps outlined in the Amazon SageMaker HyperPod EKS documentation to run jobs on the SageMaker HyperPod cluster orchestrated by EKS. Use the HyperPod CLI and kubectl commands for job submission and management.
Benefits of Amazon SageMaker HyperPod with EKS
Resilient Environment
This integration provides a more resilient training environment with deep health checks, automated node recovery, and job auto-resume features. SageMaker HyperPod can continuously train models for extended periods without interruptions, reducing training time by up to 40%.
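The article does not describe how job auto-resume works internally. A restarted job can only continue where it stopped, however, if the training script saves its progress and reloads it on startup. The following is a minimal sketch of that pattern, assuming a hypothetical JSON checkpoint file that records only the step counter. A real training script would also save model and optimizer state, typically to shared storage.

```python
import json
import os
import tempfile
from typing import Optional


def load_step(path: str) -> int:
    """Return the last completed step, or 0 when no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path: str, step: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, path)


def train(path: str, total_steps: int, fail_at: Optional[int] = None) -> int:
    """Run from the last checkpoint up to total_steps; optionally simulate a fault."""
    step = load_step(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        step += 1  # stand-in for one real training step
        save_step(path, step)
    return step
```

If a first call to `train` is interrupted, a second call with the same checkpoint path picks up at the saved step rather than starting again from zero.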
Enhanced GPU Observability
Amazon CloudWatch Container Insights offers detailed metrics and logs for containerized applications, enabling comprehensive monitoring of cluster performance and health. This feature is particularly beneficial for monitoring GPU usage, which is critical for AI model training.
Scientist-Friendly Tools
The integration includes a custom HyperPod CLI for job management, Kubeflow Training Operators for distributed training, and integration with SageMaker Managed MLflow for experiment tracking. These tools, along with SageMaker’s distributed training libraries, provide significant optimizations for training large models.
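To give a sense of what a job for the Kubeflow Training Operator looks like, the sketch below builds a `PyTorchJob` manifest as a Python dictionary and serializes it to JSON, which kubectl accepts alongside YAML. The job name, container image, and replica counts are hypothetical, and the manifest is deliberately minimal. The HyperPod EKS documentation may require additional labels or annotations that are not shown here.

```python
import json


def pytorch_job_manifest(name: str, image: str, workers: int, gpus_per_node: int) -> dict:
    """Build a minimal Kubeflow PyTorchJob manifest as a plain dictionary."""

    def replica(count: int) -> dict:
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # The training operator expects the container to be named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical image and sizes, for illustration only.
    manifest = pytorch_job_manifest("example-train", "example.com/train:latest", 3, 8)
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running `kubectl apply -f` on it would submit the job, assuming the training operator is installed on the cluster.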
Flexible Resource Utilization
The integration enhances the developer experience and scalability for FM workloads. Data scientists can efficiently share compute capacity across training and inference tasks, using existing Amazon EKS clusters or creating new ones. The flexibility to bring your own tools for job submission, queuing, and monitoring is also a significant advantage.
Getting Started
To explore the new capabilities of Amazon SageMaker HyperPod with EKS, you can refer to the following resources:
- SageMaker HyperPod EKS Workshop
- aws-do-hyperpod project
- awsome-distributed-training project
This release is generally available in AWS Regions where Amazon SageMaker HyperPod is offered, excluding Europe (London). For detailed pricing information, visit the Amazon SageMaker Pricing page.
Conclusion
The integration of Amazon EKS with Amazon SageMaker HyperPod marks a significant advancement in the field of AI and machine learning. By combining the power of Kubernetes with the specialized infrastructure of SageMaker HyperPod, AWS has provided a robust solution for training and deploying large-scale AI models. This integration offers a resilient, scalable, and efficient environment for developing cutting-edge AI solutions.
For further insights and a detailed walkthrough, you can explore the comprehensive documentation provided by AWS. This collaborative effort, contributed by experts like Manoj Ravi, Adhesh Garg, Tomonori Shimomura, Alex Iankoulski, and Anoop Saha, ensures that users have all the information required to leverage this powerful new feature.
– Eli