Tools, Datasets, and Evaluation

Contents

General AI Tools and Extensions

LLM for Robotics

  • PromptCraft-Robotics: Robotics and a robot simulator with ChatGPT integration git [Feb 2023] github stars
  • ChatGPT-Robot-Manipulation-Prompts: A set of prompts for communication between humans and robots when executing tasks. git [Apr 2023] github stars
  • Siemens Industrial Copilot ✍️ [31 Oct 2023]
  • LeRobot🤗: Hugging Face. LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. git [Jan 2024] github stars
  • Mobile ALOHA: Stanford’s mobile ALOHA robot learns from humans to cook, clean, do laundry. Mobile ALOHA extends the original ALOHA system by mounting it on a wheeled base ✍️ [4 Jan 2024] / ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation.
  • Figure 01 + OpenAI: Humanoid Robots Powered by OpenAI ChatGPT 📺 [Mar 2024]
  • Gemini Robotics: Robotics built on the foundation of Gemini 2.0 [12 Mar 2025]

Awesome demo

  • FRVR Official Teaser📺: Prompt to Game: AI-powered end-to-end game creation [16 Jun 2023]
  • rewind.ai: Rewind captures everything you’ve seen on your Mac and iPhone [Nov 2023]
  • Vercel announced V0.dev: Make a snake game with chat [Oct 2023]
  • Mobile ALOHA📺: A day of Mobile ALOHA [4 Jan 2024]
  • groq: An LPU Inference Engine. The LPU is reported to deliver 10 times the performance of NVIDIA’s GPUs ✍️ [Jan 2024]
  • Sora📺: Introducing Sora — OpenAI’s text-to-video model [Feb 2024]
  • Oasis✍️: Minecraft clone. Generated by AI in Real-Time. The first playable AI model that generates open-world games. ✍️ git [31 Oct 2024] github stars

Datasets for LLM Training

Pretrain for a base model

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}

databricks-dolly-15k: Instruction-Tuned 🤗: SFT training - QA pairs or Dialog

[
  {
    "prompt": "What is the capital of France?",
    "response": "The capital of France is Paris."
  },
  {
    "prompt": "Can you give me a recipe for chocolate chip cookies?",
    "response": "Sure! ..."
  }
]

Anthropic human-feedback🤗: RLHF training - Chosen and Rejected pairs

{
  "chosen": "I'm sorry to hear that. Is there anything I can do to help?",
  "rejected": "That's too bad. You should just get over it."
}
  • Dataset example

    🗣️

    SFT Dataset

    | # | Category | Instruction | Context | Response |
    |---|---|---|---|---|
    | 0 | Open QA | How do I get rid of mosquitos in my house? | | You can get rid of mosquitos in your house by ... |
    | 1 | Classification | Classify each country as "African" or "European" | | Nigeria: African; Rwanda: African; Portugal: European |
    | 2 | Information Extraction | Extract the unique names of composers from the text. | To some extent, European and the US traditions... | Pierre Boulez, Luigi Nono, Karlheinz Stockhausen |
    | 3 | General QA | Should investors time the market? | | Timing the market is based on predictions of t... |

    RLHF Dataset

    | Instruction | Chosen Response | Rejected Response |
    |---|---|---|
    | What is Depreciation | Depreciation is the drop in value of an asset ... | What is Depreciation – 10 Important Facts to K... |
    | What do you know about the city of Aberdeen in Scotland? | Aberdeen is a city located in the North East of Scotland. It is known for its granite architecture and its offshore oil industry. | As an AI language model, I don't have personal knowledge or experiences about Aberdeen. |
    | Describe thunderstorm season in the United States and Canada. | Thunderstorm season in the United States and Canada typically occurs during the spring and summer months, when warm, moist air collides with cooler, drier air, creating the conditions for thunderstorms to form. | Describe thunderstorm season in the United States and Canada. |
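
The SFT and RLHF record shapes above could be consumed roughly as follows. This is a minimal sketch, not the method of any dataset listed here: the prompt template, the function names `format_sft` and `preference_loss`, and the reward scores are all assumptions made for illustration. The loss shown is the common pairwise form, `-log(sigmoid(r_chosen - r_rejected))`, which is one way chosen/rejected pairs are used to train a reward model.

```python
import math


def format_sft(record: dict) -> str:
    """Join a prompt/response pair into one training string.

    The template below is a hypothetical choice, not a standard.
    """
    return f"### Prompt:\n{record['prompt']}\n### Response:\n{record['response']}"


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Written as log(1 + exp(-margin)), which is the same quantity.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


sft_record = {
    "prompt": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}
print(format_sft(sft_record))

# Hypothetical reward-model scores for a chosen and a rejected response.
print(preference_loss(reward_chosen=2.0, reward_rejected=0.0))  # small loss
print(preference_loss(reward_chosen=0.0, reward_rejected=2.0))  # large loss
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and equals `log(2)` when the two scores are tied.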

Evaluating Large Language Models

LLM Evaluation Benchmarks

Language Understanding and QA

  1. BIG-bench📑: Consists of 204 evaluations, contributed by over 450 authors, that span a range of topics from science to social reasoning. The bottom-up approach; anyone can submit an evaluation task. git [9 Jun 2022] github stars
  2. BigBench: 204 tasks. Predicting future potential [Published in 2023] github stars
  3. GLUE & SuperGLUE: GLUE (General Language Understanding Evaluation)
  4. HELM📑: Evaluation scenarios like reasoning and disinformation using standardized metrics like accuracy, calibration, robustness, and fairness. The top-down approach; experts curate and decide what tasks to evaluate models on. git [16 Nov 2022] github stars
  5. MMLU (Massive Multitask Language Understanding): Over 15,000 questions across 57 diverse tasks. [Published in 2021] github stars
  6. MMLU (Massive Multi-task Language Understanding)📑: LLM performance across 57 tasks including elementary mathematics, US history, computer science, law, and more. [7 Sep 2020]
  7. TruthfulQA🤗: Truthfulness. [Published in 2022]

Coding

  1. CodeXGLUE: Programming tasks. github stars
  2. HumanEval: Challenges coding skills. [Published in 2021] github stars
  3. MBPP: Mostly Basic Python Programming. [Published in 2021]
  4. SWE-bench: Software Engineering Benchmark. Real-world software issues sourced from GitHub. (GPT-5.2: 55.6% Pro, 80% Verified; Gemini 3: 76.2%)
  5. SWE-Lancer✍️: OpenAI. Tasks span the full engineering stack, from UI/UX to systems design, and range from $50 bug fixes to $32,000 feature implementations. [18 Feb 2025] (GPT-5.2: 74.6% IC Diamond)
  6. Vibe Code Bench: Claude Sonnet 4.5 (Thinking) and GPT 5.1 are head and shoulders above the competition. GPT 5.1 stands out especially for its low cost and high performance.
  7. LiveCodeBench Pro✍️: Algorithmic coding problems. (Gemini 3: Elo 2,439)
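
Code-generation benchmarks in the HumanEval style are commonly reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. A minimal sketch of the usual estimator, `1 - C(n-c, k) / C(n, k)`, where n samples were drawn and c of them passed. The function name `pass_at_k` and the sample counts are illustrative assumptions, not taken from any benchmark's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 samples per problem, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k=1 the estimate reduces to the plain pass rate c/n; larger k rewards models that solve a problem at least occasionally.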

Chatbot Assistance

  1. Chatbot Arena🤗: Elo-style ranking based on human pairwise votes.
  2. MT Bench: Multi-turn open-ended questions
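
To make the Elo idea concrete, here is a minimal sketch of the classic Elo update after one head-to-head comparison between two models. It illustrates the rating scheme only; the leaderboard's actual fitting procedure may differ. The function names, the starting ratings, and the K-factor of 32 are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A won the human vote, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; a human prefers model A's answer.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # → (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does, which is why the ranking converges as votes accumulate.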

Vision & Multimodal

  1. CharXiv Reasoning✍️: Scientific chart reasoning. (GPT-5.2: 88.7% with Python, 82.1% without tools)
  2. ScreenSpot-Pro✍️: UI screenshot understanding. (GPT-5.2: 86.3% with Python, 64.2% without tools; Gemini 3: high performance)
  3. MMMU-Pro✍️: Multimodal reasoning. (GPT-5.2: 80.4% with Python, 79.5% without tools; Gemini 3: 81.0%)
  4. Video-MMMU✍️: Video understanding. (GPT-5.2: 85.9%; Gemini 3: 87.6%)

Long Context

  1. OpenAI MRCRv2📑: Multi-round co-reference resolution. (GPT-5.2: 77.0% at 128k-256k tokens; Gemini 3: 77.0% at 128k)
  2. BrowseComp✍️: Long context web browsing (128k, 256k). (GPT-5.2: 92.0% at 128k, 89.8% at 256k; Gemini 3: reference available)

Tool Calling & Agentic

  1. Tau2-bench📑: Multi-turn tool usage in customer support. (GPT-5.2: 98.7% Telecom, 82.0% Retail)
  2. Vending-Bench 2✍️: Year-long business simulation. (Gemini 3: $5,478.16 mean net worth, 272% higher than GPT-5.1)
  3. LiveCodeBench Pro✍️: Algorithmic coding problems. (Gemini 3: Elo 2,439)

Reasoning

  1. ARC-AGI (Abstraction and Reasoning Corpus): Measures general fluid intelligence. (GPT-5.2: 86.2% ARC-AGI-1, 52.9% ARC-AGI-2; Gemini 3: 31.1% ARC-AGI-2, 45.1% with Deep Think) github stars
  2. DROP🤗: Evaluates discrete reasoning.
  3. HellaSwag: Commonsense reasoning. [Published in 2019] github stars
  4. LogicQA: Evaluates logical reasoning skills. github stars
  5. GPQA Diamond✍️: PhD-level scientific knowledge. (GPT-5.2: 92.4%; Gemini 3: 91.9%, 93.8% with Deep Think)
  6. Humanity's Last Exam✍️: Hardest reasoning benchmark. (GPT-5.2: 45.5% with search/Python; Gemini 3: 37.5%, 40%+ with Deep Think)

Translation

  1. WMT🤗: Evaluates translation skills.

Math

  1. GSM8K: Arithmetic Reasoning. [Published in 2021] github stars
  2. MATH: Tests ability to solve math problems. [Published in 2021] github stars
  3. AIME 2025✍️: Competition math benchmark. (GPT-5.2: 100%; Gemini 3: 100% with code, 95% without tools)
  4. FrontierMath✍️: Expert-level mathematics. (GPT-5.2: 40.3% Tier 1-3, 14.6% Tier 4; Gemini 3: >20x improvement on MathArena Apex)
  5. HMMT✍️: Harvard-MIT Mathematics Tournament, a high school math competition. (GPT-5.2: 99.4% Feb 2025)

Other Benchmarks

Evaluation Metrics

  • Evaluating LLMs and RAG Systems✍️ (Jan 2025)
  • Automated evaluation
    • n-gram metrics: ROUGE, BLEU, METEOR → compare overlap with reference text.
    • ROUGE: multiple variants (N, L, W, S, SU) based on n-gram, LCS, skip-bigrams.
    • BLEU: 0–1 score for translation quality.
    • METEOR: precision + recall + semantic similarity.
    • Probabilistic metrics: Perplexity → lower values mean better predictive performance.
    • Embedding metrics: Ada Similarity, BERTScore → semantic similarity using embeddings.
  • Human evaluation
    • Measures relevance, fluency, coherence, groundedness.
    • Can be automated with LLM-based evaluators.
  • Built-in methods
    • Prompt flow evaluation methods: ✍️ [Aug 2023] / ✍️
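
The n-gram and probabilistic metrics listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions: tokens are split on whitespace, `rouge_n_recall` covers only the ROUGE-N recall variant, and `clipped_precision` is only the modified n-gram precision component of BLEU (full BLEU also combines several n-gram orders and applies a brevity penalty). All function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Share of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0


def perplexity(token_log_probs: list) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
print(clipped_precision(candidate, reference, n=1))

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

Overlap metrics reward surface matches only, which is why the list above also includes embedding-based metrics for semantic similarity.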

LLMOps: Large Language Model Operations

  1. Agent Trace✍️: Data spec for recording AI agent attribution, reasoning steps, and tool calls.
  2. agenta: OSS LLMOps workflow: building (LLM playground, evaluation), deploying (prompt and configuration management), and monitoring (LLM observability and tracing). [Jun 2023] github stars
  3. Azure ML Prompt flow: A set of LLMOps tools designed to facilitate the creation of LLM-based AI applications [Sep 2023] > How to Evaluate & Upgrade Model Versions in the Azure OpenAI Service✍️ [14 Aug 2024]
  4. Azure Machine Learning studio Model Data Collector: Collect production data, analyze key safety and quality evaluation metrics on a recurring basis, receive timely alerts about critical issues, and visualize the results. ✍️ [Apr 2024]
  5. circuit‑tracer: Anthropic. Tool for finding and visualizing circuits within large language models. A circuit is a minimal, causal computation pathway inside a transformer model that shows how internal features lead to a specific output. [May 2025] github stars
  6. DeepEval: LLM evaluation framework. Similar to Pytest but specialized for unit testing LLM outputs. [Aug 2023] github stars
  7. DeepTeam: An LLM Red Teaming Framework. [Mar 2025] github stars
  8. Giskard: The testing framework for ML models, from tabular to LLMs [Mar 2022] github stars
  9. Langfuse: git LLMOps platform that helps teams to collaboratively monitor, evaluate and debug AI applications. [May 2023] github stars
  10. Language Model Evaluation Harness:💡Over 60 standard academic benchmarks for LLMs. A framework for few-shot evaluation. Hugging Face uses this for the Open LLM Leaderboard🤗 [Aug 2020] github stars
  11. LangWatch scenario:💡LangWatch Agentic testing for agentic codebases. Simulating agentic communication using autopilot [Apr 2025] github stars
  12. LLMOps Database: A curated knowledge base of real-world LLMOps implementations.
  13. Maxim AI: git End-to-end simulation, evaluation, and observability platform, helping teams ship their AI agents reliably and >5x faster. [Dec 2023]
  14. Machine Learning Operations (MLOps) For Beginners✍️: DVC (Data Version Control), MLflow, Evidently AI (Monitor a model). Insurance Cross Sell Prediction git [29 Aug 2024] github stars
  15. Netdata: AI-powered real-time infrastructure monitoring platform [Jun 2013] github stars
  16. OpenAI Evals: A framework for evaluating large language models (LLMs) [Mar 2023] github stars
  17. Opik: An open-source platform for evaluating, testing and monitoring LLM applications. Built by Comet. [2 Sep 2024] github stars
  18. Pezzo: Open-source, developer-first LLMOps platform [May 2023] github stars
  19. phoenix: AI Observability & Evaluation [Nov 2022] github stars
  20. promptfoo: Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality. [Apr 2023] github stars
  21. PromptTools: Open-source tools for prompt testing [Jun 2023] github stars
  22. Ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) [May 2023] github stars
  23. traceloop openllmetry: Quality monitoring for your LLM applications. [Sep 2023] github stars
  24. TruLens: Instrumentation and evaluation tools for large language model (LLM) based applications. [Nov 2020] github stars

Challenges in evaluating AI systems

  1. 30 requirements for an MLOps environment🗣️: Kirk Borne twitter [15 Jul 2023]
  2. Challenges in evaluating AI systems✍️: The challenges and limitations of various methods for evaluating AI systems, such as multiple-choice tests, human evaluations, red teaming, model-generated evaluations, and third-party audits. 🗄️ [4 Oct 2023]
  3. Economics of Hosting Open Source LLMs✍️: Comparison of cloud vendors such as AWS, Modal, BentoML, Replicate, Hugging Face Endpoints, and Beam, using metrics like processing time, cold start latency, and costs associated with CPU, memory, and GPU usage. git [13 Nov 2024]
  4. Pretraining on the Test Set Is All You Need📑: In this satirical paper, the author trains a small 1M-parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model, by training it on all downstream academic benchmarks. It reads as a subtle criticism of how easily benchmarks can be "cheated", intentionally or unintentionally (through data contamination). 🗣️ [13 Sep 2023]
  5. Sakana AI claimed 100x faster AI training, but a bug caused a 3x slowdown: Sakana’s AI resulted in a 3x slowdown, not a speedup. [21 Feb 2025]
  6. Your AI Product Needs Evals [29 Mar 2024] / How to Evaluate LLM Applications: The Complete Guide [7 Nov 2023]
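
Data contamination, as raised in the "Pretraining on the Test Set" entry above, can be screened for with a simple n-gram overlap check between benchmark items and training text. This is a rough sketch only: the function names, the lowercase whitespace tokenization, the window size, and the sample strings are all assumptions, and real contamination audits are considerably more involved.

```python
def ngram_set(text: str, n: int) -> set:
    """All contiguous n-grams of lowercase, whitespace-split tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: list, training_corpus: list, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training text."""
    if not benchmark_items:
        return 0.0
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngram_set(doc, n)
    flagged = [item for item in benchmark_items if ngram_set(item, n) & train_grams]
    return len(flagged) / len(benchmark_items)


# Hypothetical data: the first benchmark question leaked into the training text.
training_corpus = ["quiz dump: what is the capital of france paris"]
benchmark_items = [
    "What is the capital of France?",
    "Name a prime number above ten",
]
print(contamination_rate(benchmark_items, training_corpus, n=3))  # → 0.5
```

A short window such as n=3 flags common phrases too readily; longer windows trade recall for fewer false positives.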