Introduction to Cutting-Edge AI Tools: Promptfoo, Docker Model Runner, and Docker MCP Toolkit
In the ever-evolving landscape of artificial intelligence (AI), developers and engineers are continually seeking efficient tools to evaluate, manage, and deploy AI models. Among the noteworthy solutions is Promptfoo, an open-source command-line interface (CLI) and library designed to evaluate large language model (LLM) applications. Complementing this tool is Docker Model Runner, which simplifies the management, deployment, and execution of AI models using Docker technology. Furthermore, the Docker MCP Toolkit functions as a local gateway, enabling users to set up, manage, and run containerized Model Context Protocol (MCP) servers and seamlessly connect them to AI agents.
Together, these innovative tools provide developers with the capability to compare different models, assess MCP servers, and even conduct LLM red-teaming—all from the convenience of their development environment. Let’s delve deeper into how these tools work and explore practical examples of their application.
Prerequisites for Using Docker MCP Toolkit and Promptfoo
Before diving into the examples, it’s essential to ensure that the necessary tools are enabled and installed. Here’s a step-by-step guide to getting started:
- Enable Docker MCP Toolkit in Docker Desktop. This toolkit allows you to manage and run containerized MCP servers.
- Enable Docker Model Runner in Docker Desktop. This tool helps in managing and running AI models.
- Use the Docker Model Runner CLI to fetch specific models, such as:
- ai/gemma3:4B-Q4_K_M
- ai/smollm3:Q4_K_M
- ai/mxbai-embed-large:335M-F16
- Install Promptfoo to evaluate LLM applications.
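Assuming Docker Desktop with Model Runner and the MCP Toolkit enabled, the setup steps above might look like the following from a terminal (the global npm install is one of several ways to install Promptfoo):

```shell
# Pull the models used in the examples via the Docker Model Runner CLI
docker model pull ai/gemma3:4B-Q4_K_M
docker model pull ai/smollm3:Q4_K_M
docker model pull ai/mxbai-embed-large:335M-F16

# Confirm the pulls succeeded by listing local models
docker model list

# Install the Promptfoo CLI (requires Node.js)
npm install -g promptfoo
```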
Once these prerequisites are fulfilled, you’re ready to explore how these tools can be used effectively.
Utilizing Docker Model Runner and Promptfoo for Prompt Comparisons
One common challenge for developers is deciding whether their AI application’s prompt and context require purchasing tokens from an AI cloud provider or if an open-source model can deliver almost the same value at a lower cost. This decision needs regular reassessment due to evolving prompts, new model releases, or changes in token pricing. Fortunately, the Docker Model Runner provider in Promptfoo makes this process straightforward.
For instance, consider comparing the Gemma3 model running locally with the Claude Opus 4.1 model using a simple prompt about whales. Promptfoo offers a variety of assertions to evaluate and grade model outputs, ranging from deterministic checks like "contains" to model-assisted evaluations such as "llm-rubric." While the latter usually relies on OpenAI models, in this scenario, local models powered by Docker Model Runner are used instead. Specifically, the smollm3:Q4_K_M model is configured to judge outputs, while the mxbai-embed-large:335M-F16 model handles embeddings for semantic-similarity checks.
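A minimal promptfooconfig.yaml for this comparison might look like the sketch below. It assumes Docker Model Runner exposes its OpenAI-compatible API at http://localhost:12434/engines/v1 (the default when host-side TCP access is enabled) and reaches it through Promptfoo's OpenAI-compatible provider; the prompt text, threshold, and assertion values are illustrative, and exact provider IDs should be checked against the Promptfoo documentation for your version.

```yaml
prompts:
  - "Explain in two sentences why {{topic}} are important to ocean ecosystems."

providers:
  # Local Gemma3 served by Docker Model Runner (OpenAI-compatible endpoint)
  - id: openai:chat:ai/gemma3:4B-Q4_K_M
    config:
      apiBaseUrl: http://localhost:12434/engines/v1
      apiKey: not-needed
  # Hosted Claude Opus 4.1 (requires ANTHROPIC_API_KEY in the environment)
  - id: anthropic:messages:claude-opus-4-1

defaultTest:
  options:
    provider:
      # Local judge model for llm-rubric assertions
      text:
        id: openai:chat:ai/smollm3:Q4_K_M
        config:
          apiBaseUrl: http://localhost:12434/engines/v1
          apiKey: not-needed
      # Local embedding model for similarity assertions
      embedding:
        id: openai:embedding:ai/mxbai-embed-large:335M-F16
        config:
          apiBaseUrl: http://localhost:12434/engines/v1
          apiKey: not-needed

tests:
  - vars:
      topic: whales
    assert:
      - type: contains
        value: whale
      - type: similar
        value: Whales help cycle nutrients that support ocean food webs.
        threshold: 0.75
      - type: llm-rubric
        value: Output is factually accurate and mentions an ecological role of whales.
```

The defaultTest block is what redirects Promptfoo's model-graded assertions away from OpenAI and onto the two local Docker Model Runner models.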
By running these evaluations, developers can view results and determine model performance. In this case, both models produced similar scores, indicating that the locally running Gemma3 model suffices for the scenario. However, for more complex real-world applications, a broader set of assertions would be necessary.
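Once a Promptfoo config file is in place, the evaluation is run and inspected with the standard Promptfoo commands:

```shell
# Run all tests defined in promptfooconfig.yaml in the current directory
promptfoo eval

# Open the local web viewer to compare outputs and scores side by side
promptfoo view
```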
Evaluating MCP Tools with Docker Toolkit and Promptfoo
With MCP servers becoming increasingly prevalent, selecting the right tools for specific use cases and evaluating their performance for quality and security is critical. The Docker MCP Catalog serves as a centralized registry for discovering, sharing, and executing MCP servers, allowing seamless integration with Docker Desktop and easy evaluation using Promptfoo.
Consider an example of direct MCP testing, which is useful for validating server performance regarding authentication, authorization, and input validation. By quickly enabling Fetch, GitHub, and Playwright MCP servers in Docker Desktop with the MCP Toolkit, developers can configure the GitHub MCP server with built-in OAuth for authentication.
Developers can further configure the MCP Toolkit as a Promptfoo provider, facilitating the connection of containerized MCP servers. By enabling the mcp/youtube-transcript MCP server with a simple Docker run command, developers can validate that MCP server tools are available, authenticated, and functional through specific tests.
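A direct test of the youtube-transcript server might be sketched as below using Promptfoo's MCP provider, which launches the server over stdio. The provider schema should be verified against the Promptfoo MCP provider documentation for your version, and the tool name and its arguments are hypothetical placeholders for whatever tools the server actually exposes.

```yaml
providers:
  # Launch the containerized MCP server over stdio via docker run
  - id: mcp
    config:
      enabled: true
      server:
        command: docker
        args: ["run", "-i", "--rm", "mcp/youtube-transcript"]

prompts:
  # The MCP provider takes a JSON tool invocation as the prompt;
  # "get_transcript" is an assumed tool name for illustration
  - '{"tool": "get_transcript", "args": {"url": "{{video_url}}"}}'

tests:
  - vars:
      video_url: https://www.youtube.com/watch?v=dQw4w9WgXcQ
    assert:
      # Validate that the tool call succeeded and returned structured output
      - type: is-json
```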
Red-Teaming Your MCP
Beyond direct testing, evaluating the entire MCP stack for privacy, safety, and accuracy is crucial. This is where Promptfoo’s red-teaming capabilities come into play, allowing developers to conduct AI-assisted security assessments of agentic MCP applications.
For example, evaluating an agent that summarizes GitHub repositories involves configuring the provider with Claude Opus 4.1, connected to the Docker MCP Toolkit with the GitHub MCP server. The built-in OAuth integration in Docker Desktop ensures authentication.
Developers can define a prompt for the application agent, outlining guidelines for summarizing repositories and integrating tool outputs naturally. Additionally, a prompt for the red-team agent is defined, incorporating plugins and strategies for evaluating the MCP application.
By using the promptfoo redteam run command, developers can generate and execute a test plan, reviewing results to identify vulnerabilities such as Tool Discovery. Based on these insights, the application prompt can be updated to mitigate vulnerabilities and re-evaluated to ensure improved security.
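The red-team setup described above can be sketched in a Promptfoo config as follows. The plugin and strategy names are illustrative picks from Promptfoo's red-team catalog, the purpose text stands in for the full application prompt, and the provider-level MCP wiring through the Docker MCP Toolkit gateway (`docker mcp gateway run`) should be confirmed against the current Promptfoo and Docker documentation.

```yaml
targets:
  # Agent under test: Claude Opus 4.1 connected to the GitHub MCP server
  # through the Docker MCP Toolkit gateway (authenticated via OAuth)
  - id: anthropic:messages:claude-opus-4-1
    config:
      mcp:
        enabled: true
        server:
          command: docker
          args: ["mcp", "gateway", "run"]

redteam:
  purpose: >
    An agent that summarizes GitHub repositories for users, following
    guidelines to integrate tool outputs naturally into its responses.
  plugins:
    - mcp   # MCP-specific attacks, including tool discovery probes
    - pii   # probes for leaks of personal or sensitive data
  strategies:
    - jailbreak
```

Running `promptfoo redteam run` then generates adversarial test cases from these plugins and strategies, executes them against the target, and reports findings for review.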
Conclusion
In summary, Promptfoo, Docker Model Runner, and Docker MCP Toolkit empower teams to evaluate prompts with diverse models, conduct direct MCP tool tests, and perform AI-assisted red-team assessments of agentic MCP applications. For those interested in exploring these examples, the docker/docker-model-runner-and-mcp-with-promptfoo repository is available for cloning and experimentation.
Additional Resources
For readers who wish to delve deeper into these tools and their applications, the official documentation and community forums provide invaluable insights and support. Engaging with the community can also offer practical tips and shared experiences from fellow developers.
For more information, refer to this article.