Project Moonshot is one of the world’s first Large Language Model (LLM) evaluation toolkits. This open-source tool, hosted on GitHub, brings benchmarking and red teaming together to help LLM application developers and compliance teams test and evaluate their LLMs and LLM applications.
Comprehensive benchmark datasets. One place to access more than 100 (and still growing) benchmark datasets, with pre-built evaluators.
Simple: Project Moonshot covers LLMs, while the AI Verify Toolkit covers traditional AI.
In the LLM space, companies often ask “which foundation LLM best suits our goals?” and “how do we ensure the application we build on the model we chose is robust and safe?” Moonshot helps companies answer these questions through a comprehensive suite of benchmarking tests and scoring reports, so they can deploy responsible Generative AI systems with confidence.
We’re very excited about Project Moonshot, it is one of the first tangible representations globally of what it means to approach AI Safety, in a way that is actionable for companies and AI teams.
- JX Wee, Regional Director, Asia Pacific at DataRobot
Project Moonshot embodies the AI Verify Foundation’s commitment to involve the global community in making AI trustworthy and safe for humanity. The Foundation collaborates with industry, governments, and civil society to ensure that the unique culture, heritage, and values of our communities are represented and tested.

Your organisation’s background – Could you briefly share your organisation’s background (e.g. sector, goods/services offered, customers), the AI solution(s) developed, used, or deployed in your organisation, and what they are used for (e.g. product recommendation, improving operational efficiency)?
Your AI Verify use case – Could you share the AI model and use case that was tested with AI Verify? Which version of AI Verify did you use?
Your experience with AI Verify – Could you share your journey in using AI Verify? For example, the preparation work for testing, any challenges faced and how they were overcome? How did you find the testing process? Did it take long to complete the testing?
Your key learnings and insights – Could you share 2 to 3 key learnings and insights from the testing process? What actions have you taken after using AI Verify?