Exciting news for developers working with AI models on Windows: Docker Model Runner now supports vLLM on Docker Desktop for Windows, using WSL2 and NVIDIA GPUs. This is a significant step forward, as it lets Windows developers use vLLM's high-throughput inference capabilities directly through Docker Desktop, harnessing the power of NVIDIA GPUs for accelerated AI development on local machines.
Understanding Docker Model Runner
For those unfamiliar with it, Docker Model Runner is essentially a tool designed to simplify the process of running generative AI models. The primary goal of Docker Model Runner is to make the experience of running an AI model as straightforward as running a container. Here are some of its standout features:
- Simplified User Experience: The process has been streamlined to a single, intuitive command: `docker model run <model-name>` (a short example follows this list). This simplification makes the tool accessible to a wider range of users, reducing the complexity traditionally associated with AI model deployment.
- Comprehensive GPU Support: Initially limited to NVIDIA GPUs, Docker Model Runner has since added Vulkan support. This is noteworthy because it means the Model Runner can operate on virtually any modern GPU, including those from AMD and Intel, broadening accessibility for developers.
- vLLM Integration: With the integration of vLLM, developers can now perform high-throughput inference using an NVIDIA GPU, making it a powerful tool for those requiring efficient processing capabilities.
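As a rough sketch of that single-command workflow, the commands below pull a model from Docker Hub, list what is available locally, and run a one-shot prompt. The model name ai/smollm2 is used purely as an illustration; any model published for Docker Model Runner works the same way.

```bash
# Pull a model from Docker Hub, list local models, then run a one-shot prompt.
docker model pull ai/smollm2
docker model ls
docker model run ai/smollm2 "Summarize what Docker Model Runner does."
```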
Delving Into vLLM
vLLM stands out as a high-throughput inference engine specifically designed for large language models. It excels in efficient memory management of the KV cache and is capable of handling multiple concurrent requests with remarkable performance. This makes vLLM an excellent choice for developers building AI applications that need to cater to numerous requests or require high-throughput inference. More information on vLLM can be found on its GitHub repository.
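To make the concurrency point concrete, here is a minimal sketch that fires several requests in parallel at the Model Runner's OpenAI-compatible HTTP endpoint. It assumes the TCP endpoint has been enabled on port 12434 (as in Step 1 below), that the vLLM-optimized model ai/smollm2-vllm from later in this post has already been pulled, and that the endpoint path matches your Docker Desktop version.

```bash
# Send eight chat-completion requests in parallel and wait for all of them to return.
for i in $(seq 1 8); do
  curl -s http://localhost:12434/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ai/smollm2-vllm", "messages": [{"role": "user", "content": "Give me one fact about Docker."}]}' \
    -o "response_${i}.json" &
done
wait
echo "All requests finished."
```

Because vLLM batches concurrent requests, the total wall-clock time for the eight requests should be well below what running them one after another would take.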
Prerequisites for Getting Started
Before diving in, developers need to ensure they meet a few prerequisites to enable GPU support (a quick way to sanity-check them is sketched after this list):
- Docker Desktop for Windows, version 4.54 or later.
- The WSL2 backend enabled in Docker Desktop.
- An NVIDIA GPU with compute capability 8.0 or higher and up-to-date drivers.
- GPU support properly configured within Docker Desktop.
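One way to sanity-check the WSL2 and GPU side of these prerequisites is to run the commands below from a Windows terminal; exact output depends on your driver and Windows versions.

```bash
# Confirm WSL2 is installed and configured as the default version.
wsl --status
# Confirm the Docker CLI is available (the Docker Desktop version itself is shown in its About dialog).
docker --version
# On recent NVIDIA drivers, the compute_cap column should report 8.0 or higher.
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
```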
Steps to Get Started
Step 1: Enable Docker Model Runner
Begin by enabling the Docker Model Runner in Docker Desktop. This can be accomplished via the Docker Desktop settings or through the command line using:
```bash
docker desktop enable model-runner --tcp 12434
```

Step 2: Install the vLLM Backend
To utilize vLLM, install the vLLM runner with CUDA support by executing:
```bash
docker model install-runner --backend vllm --gpu cuda
```

Step 3: Verify the Installation
Verify that both inference engines are operational by checking the status:
```bash
docker model status
```

The expected output should confirm that both inference engines are running, indicating a successful setup.
Step 4: Run a Model with vLLM
With the setup complete, you can now pull and run models optimized for vLLM. Models with the -vllm suffix on Docker Hub are tailored for vLLM:

```bash
docker model run ai/smollm2-vllm "Tell me about Docker."
```
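As a small follow-up, omitting the prompt should drop you into an interactive chat session with the same model rather than a one-shot completion:

```bash
# Start an interactive chat session with the vLLM-backed model (press Ctrl+C to exit).
docker model run ai/smollm2-vllm
```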
Troubleshooting Tips: Addressing GPU Memory Issues
If you encounter GPU memory errors, such as insufficient free memory on the device, you can adjust the GPU memory utilization to manage the memory footprint more effectively. This adjustment allows the model to run alongside other GPU workloads:
```bash
docker model configure --gpu-memory-utilization 0.7 ai/smollm2-vllm
```
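To check whether that setting leaves enough headroom, one option is to watch GPU memory while the model is serving requests; this relies only on standard nvidia-smi flags:

```bash
# Report GPU memory usage every 5 seconds while the model handles requests (Ctrl+C to stop).
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```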
The Significance of This Update
This enhancement is particularly beneficial for Windows developers for several reasons:
- Production Parity: Developers can test their models with the same inference engine that will be used in production environments, ensuring consistency across development and deployment stages.
- Unified Workflow: By staying within the familiar Docker ecosystem, developers can maintain a seamless and efficient workflow.
- Local Development: This update allows developers to keep their data private and potentially reduce API costs during the development phase by working locally.
How to Engage with the Docker Model Runner Community
The Docker Model Runner thrives on community involvement, and there are multiple ways for developers to contribute:
- Star the Repository: By starring the Docker Model Runner repository, developers can show their support and help increase its visibility.
- Contribute Ideas: Those with ideas for new features or bug fixes can create issues to discuss them or fork the repository to make changes and submit pull requests.
- Spread the Word: Sharing information about the Docker Model Runner with friends, colleagues, and other interested parties can help grow its user base and community support.
This update marks an exciting new chapter for Docker Model Runner, and there is a strong sense of anticipation about the innovations that developers will create using this powerful tool. The community is encouraged to dive in, explore the possibilities, and contribute to this evolving project.
For more information, refer to this article.