Tools, Datasets, and Evaluation

General AI Tools and Extensions
LLM for Robotics
Awesome demo
Datasets for LLM Training
Evaluating Large Language Models
LLMOps: Large Language Model Operations

General AI Tools and Extensions

5 LLM-based Apps for Developers: GitHub Copilot, Cursor IDE, Tabnine, Warp, Replit Agent
AI Search engine:
- Phind: AI-Powered Search Engine for Developers [July 2022]
- Perplexity [Dec 2022]
- Perplexity comet: agentic browser [9 Jul 2025]
- GenSpark: AI agents engine perform research and generate custom pages called Sparkpages. [18 Jun 2024]
- felo.ai: Sparticle Inc. in Tokyo, Japan [04 Sep 2024]
- Goover
- oo.ai: Open Research. Fastest AI Search.
AI Tools: https://aitoolmall.com/
Ai2 Playground
Awesome AI Tools: Curated collection of 100+ AI tools. [Jun 2023]
Airtable list: Generative AI Index | AI Startups
AlphaXiv: an interactive extension of arXiv
AniDoc: Animation Creation Made Easier ✍️
Cherry Studio: a desktop client that supports multiple LLM providers.
Content writing: http://jasper.ai/chat / 🗣️
Duck.ai:💡Private, Useful, and Optional AI: DuckDuckGo offers free access to popular AI chatbots at Duck.ai
Edge and Chrome Extension & Plugin
- MaxAI.me
- BetterChatGPT
- ChatHub: All-in-one chatbot client. Webpage
- ChatGPT Retrieval Plugin
FLORA: an AI platform integrating text, image, and video models into a unified canvas.
Future Tools: https://www.futuretools.io/
God Tier Prompts: A community driven leaderboard where the best prompts rise to the top.
Open Source Image Creation Tool
- ComfyUI - https://github.com/comfyanonymous/ComfyUI
- Stable Diffusion web UI - https://github.com/AUTOMATIC1111/stable-diffusion-webui
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations ref📑 [5 Dec 2024]
MGX (MetaGPT X): Multi-agent collaboration platform to develop an application.
Msty:💡The easiest way to use local and online AI models
napkin.ai: a text-to-visual graphics generator [7 Aug 2024]
Newsletters & Tool Database: https://www.therundown.ai/
Open Source No-Code AI Tools
- Anything-LLM — https://anythingllm.com
- Budibase — https://budibase.com
- Coze Studio — https://www.coze.com
- Dify — https://dify.ai
- Flowise — https://flowiseai.com
- n8n — https://n8n.io
- NocoBase — https://www.nocobase.com
- NocoDB — https://nocodb.com
- Sim — https://www.sim.ai
- Strapi — https://strapi.io
- ToolJet — https://www.tooljet.ai
Oceans of AI - All AI Tools https://play.google.com/store/apps/details?id=in.blueplanetapps.oceansofai&hl=en_US
Open source (huggingface):🤗http://huggingface.co/chat
Pika AI - Free AI Video Generator
Product Hunt > AI
Quora Poe: A chatbot service that gives access to GPT-4, gpt-3.5-turbo, Claude from Anthropic, and a variety of other bots. [Feb 2023]
recraft.ai: Text-to-editable vector image generator
Same.dev: Clone Any Website in Minutes
skywork.ai: Deep Research is a multimodal generalist agent that can create documents, slides, and spreadsheets.
Smartsub: AI-powered transcription, translation, and subtitle creation
Terence Tao + Claude Code📺: Video discussion of Claude Code in advanced research workflows. [Mar 2026]
TEXT-TO-CAD: Generate CAD from text prompts
The leader: http://openai.com
The runner-up: http://bard.google.com -> https://gemini.google.com
Toolerific.ai: https://toolerific.ai/: Find the best AI tools for your tasks
Vercel AI: Vercel AI Playground / Vercel AI SDK git [May 2023]
websim.ai: a web editor and simulator that can generate websites. [1 Jul 2024]
allAIstartups: https://www.allaistartups.com/ai-tools

LLM for Robotics

PromptCraft-Robotics: Robotics and a robot simulator with ChatGPT integration git [Feb 2023]
ChatGPT-Robot-Manipulation-Prompts: A set of prompts for communication between humans and robots for executing tasks. git [Apr 2023]
Siemens Industrial Copilot ✍️ [31 Oct 2023]
LeRobot🤗: Hugging Face. LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. git [Jan 2024]
Mobile ALOHA: Stanford’s mobile ALOHA robot learns from humans to cook, clean, do laundry. Mobile ALOHA extends the original ALOHA system by mounting it on a wheeled base ✍️ [4 Jan 2024] / ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation.
Figure 01 + OpenAI: Humanoid Robots Powered by OpenAI ChatGPT 📺 [Mar 2024]
Gemini Robotics: Robotics built on the foundation of Gemini 2.0 [12 Mar 2025]

Awesome demo

FRVR Official Teaser📺: Prompt to Game: AI-powered end-to-end game creation [16 Jun 2023]
rewind.ai: Rewind captures everything you’ve seen on your Mac and iPhone [Nov 2023]
Vercel announced V0.dev: Make a snake game with chat [Oct 2023]
Mobile ALOHA📺: A day of Mobile ALOHA [4 Jan 2024]
groq: An LPU Inference Engine. The LPU is reported to be 10 times faster than NVIDIA’s GPUs ✍️ [Jan 2024]
Sora📺: Introducing Sora — OpenAI’s text-to-video model [Feb 2024]
Oasis✍️: Minecraft clone. Generated by AI in Real-Time. The first playable AI model that generates open-world games. ✍️ git [31 Oct 2024]

Datasets for LLM Training

LLM-generated datasets:
- Self-Instruct📑: Seed task pool with a set of human-written instructions. [20 Dec 2022]
- Self-Alignment with Instruction Backtranslation📑: Without human seeding, use LLM to produce instruction-response pairs. The process involves two steps: self-augmentation and self-curation. [11 Aug 2023]
LLMDataHub: Awesome Datasets for LLM Training: A quick guide (especially) for trending instruction finetuning datasets
Open LLMs and Datasets: A list of open LLMs available for commercial use.
SQuAD: The Stanford Question Answering Dataset (SQuAD), a set of Wikipedia articles, 100,000+ question-answer pairs on 500+ articles. [16 Jun 2016]
Synthetic Data Vault (SDV) : Synthetic data generation for tabular data [May 2018]
RedPajama: LLaMA training dataset of over 1.2 trillion tokens git [17 Apr 2023]
FineWeb🤗: 🤗Hugging Face. 15 trillion tokens of high-quality web data, crawled from the summer of 2013 to March 2024. [Apr 2024]
MS MARCO Web Search: A large-scale information-rich web dataset, featuring millions of real clicked query-document labels [Apr 2024]
Nemotron-Personas-Japan: Synthesized Data for Sovereign AI🤗: The first open synthetic dataset that captures Japan's demographic, geographic, and cultural spectrum. [23 Sep 2025]
Synthetic Data of LLMs: A reading list on LLM based Synthetic Data Generation [Oct 2024]
Open Thoughts: Fully Open Data Curation for Thinking Models [28 Jan 2025]

Pretrain for a base model

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}

databricks-dolly-15k: Instruction-Tuned 🤗: SFT training - QA pairs or Dialog

{
  "prompt": "What is the capital of France?",
  "response": "The capital of France is Paris."
},
{
  "prompt": "Can you give me a recipe for chocolate chip cookies?",
  "response": "Sure! ..."
}

Anthropic human-feedback🤗: RLHF training - Chosen and Rejected pairs

{
  "chosen": "I'm sorry to hear that. Is there anything I can do to help?",
  "rejected": "That's too bad. You should just get over it."
}
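The three record shapes above (pretraining text, SFT prompt/response pairs, and RLHF chosen/rejected pairs) can be told apart by their keys alone. The following is a minimal sketch under that assumption; `training_stage` is a hypothetical helper, not part of any dataset library, and real datasets may use other field names.

```python
from typing import Any, Dict

def training_stage(record: Dict[str, Any]) -> str:
    """Guess which training stage a record fits, based only on its keys."""
    keys = set(record)
    if {"chosen", "rejected"} <= keys:
        return "rlhf"      # preference pair
    if {"prompt", "response"} <= keys:
        return "sft"       # instruction/response pair
    if "text" in keys:
        return "pretrain"  # raw text plus metadata
    return "unknown"

# Illustrative records that mirror the shapes shown above.
examples = [
    {"text": "Some crawled page ...", "meta": {"source": "web"}, "red_pajama_subset": "c4"},
    {"prompt": "What is the capital of France?", "response": "The capital of France is Paris."},
    {"chosen": "I'm sorry to hear that. Is there anything I can do to help?",
     "rejected": "That's too bad. You should just get over it."},
]

stages = [training_stage(r) for r in examples]
print(stages)
```

Checking the preference keys first keeps a record that carries both a prompt and a chosen/rejected pair in the RLHF bucket.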

Dataset example

🗣️

SFT Dataset

Index	Category	Instruction	Context	Response
0	Open QA	How do I get rid of mosquitos in my house?	-	You can get rid of mosquitos in your house by ...
1	Classification	Classify each country as "African" or "European"	-	Nigeria: African Rwanda: African Portugal: European
2	Information Extraction	Extract the unique names of composers from the text.	To some extent, European and the US traditions...	Pierre Boulez, Luigi Nono, Karlheinz Stockhausen
3	General QA	Should investors time the market?	-	Timing the market is based on predictions of t...

RLHF Dataset

Instruction	Chosen Response	Rejected Response
What is Depreciation	Depreciation is the drop in value of an asset ...	What is Depreciation – 10 Important Facts to K...
What do you know about the city of Aberdeen in Scotland?	Aberdeen is a city located in the North East of Scotland. It is known for its granite architecture and its offshore oil industry.	As an AI language model, I don't have personal knowledge or experiences about Aberdeen.
Describe thunderstorm season in the United States and Canada.	Thunderstorm season in the United States and Canada typically occurs during the spring and summer months, when warm, moist air collides with cooler, drier air, creating the conditions for thunderstorms to form.	Describe thunderstorm season in the United States and Canada.

Evaluating Large Language Models

Artificial Analysis LLM Performance Leaderboard🤗: Performance benchmarks & pricing across API providers of LLMs
Awesome LLMs Evaluation Papers: Evaluating Large Language Models: A Comprehensive Survey git [Oct 2023]
Can Large Language Models Be an Alternative to Human Evaluations?📑 [3 May 2023]
ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?📑: Open-Source LLMs vs. ChatGPT; Benchmarks and Performance of LLMs [28 Nov 2023]
Docker cagent: Deterministic, replayable flows: the VCR (vcr.py) pattern records and replays HTTP calls, making LLM and agent tests fast, reliable, and CI-friendly. [Sep 2025]
Evaluation of Large Language Models: A Survey on Evaluation of Large Language Models📑: [6 Jul 2023]
Evaluation Papers for ChatGPT [28 Feb 2023]
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge):💡Key considerations and Use cases when using LLM-evaluators [Aug 2024]
LightEval:🤗 a lightweight LLM evaluation suite that Hugging Face has been using internally [Jan 2024]
LLM Model Evals vs LLM Task Evals: Model evals are for people who are building or fine-tuning an LLM, whereas the best LLM application builders use task evals, a tool to help builders build. [Feb 2024]
LLMPerf Leaderboard: Evaluating the performance of LLM APIs. [Dec 2023]
LLM-as-a-Judge:💡LLM-as-a-Judge offers a quick, cost-effective way to develop models aligned with human preferences and is easy to implement with just a prompt, but should be complemented by human evaluation to address biases. [Jul 2024]
OCR Arena: a free playground for testing and evaluating leading foundation VLMs and open source OCR models side-by-side. [Nov 2025]
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models📑: We utilize the FEEDBACK COLLECTION, a novel dataset, to train PROMETHEUS, an open-source large language model with 13 billion parameters, designed specifically for evaluation tasks. [12 Oct 2023]
The Leaderboard Illusion📑:💡Chatbot Arena's benchmarking is skewed by selective disclosures, private testing advantages, and data access asymmetries, leading to overfitting and unfair model rankings. [29 Apr 2025]
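The record/replay (VCR) pattern mentioned in the Docker cagent entry above can be sketched without any external library. This is a minimal illustration of the idea only: `Cassette` and `fake_llm_endpoint` are hypothetical names, the store is in memory rather than on disk, and none of this reflects the actual API of vcr.py or cagent.

```python
import hashlib
import json
from typing import Callable, Dict

class Cassette:
    """In-memory record/replay store keyed by a hash of the request."""

    def __init__(self) -> None:
        self.tape: Dict[str, str] = {}
        self.live_calls = 0

    @staticmethod
    def _key(request: dict) -> str:
        # Stable key: JSON with sorted keys, then SHA-256.
        blob = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def call(self, request: dict, live: Callable[[dict], str]) -> str:
        key = self._key(request)
        if key not in self.tape:
            # First time this request is seen: hit the live backend and record.
            self.live_calls += 1
            self.tape[key] = live(request)
        # Every later call with the same request replays the recording.
        return self.tape[key]

def fake_llm_endpoint(request: dict) -> str:
    """Hypothetical stand-in for a real HTTP call to a model API."""
    return "echo: " + request["prompt"]

cassette = Cassette()
req = {"model": "demo", "prompt": "hello"}
first = cassette.call(req, fake_llm_endpoint)   # recorded
second = cassette.call(req, fake_llm_endpoint)  # replayed
print(first, second, cassette.live_calls)
```

Because the second call never reaches the backend, a test built this way is deterministic and does not depend on network access once the recording exists.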

LLM Evaluation Benchmarks

Language Understanding and QA

BIG-bench📑: Consists of 204 evaluations, contributed by over 450 authors, that span a range of topics from science to social reasoning. The bottom-up approach; anyone can submit an evaluation task. git [9 Jun 2022]
BigBench: 204 tasks. Predicting future potential [Published in 2023]
GLUE & SuperGLUE: GLUE (General Language Understanding Evaluation)
HELM📑: Evaluation scenarios like reasoning and disinformation using standardized metrics like accuracy, calibration, robustness, and fairness. The top-down approach; experts curate and decide what tasks to evaluate models on. git [16 Nov 2022]
MMLU (Massive Multitask Language Understanding): Over 15,000 questions across 57 diverse tasks. [Published in 2021]
MMLU (Massive Multi-task Language Understanding)📑: LLM performance across 57 tasks including elementary mathematics, US history, computer science, law, and more. [7 Sep 2020]
TruthfulQA🤗: Truthfulness. [Published in 2022]

Coding

CodeXGLUE: Programming tasks.
HumanEval: Challenges coding skills. [Published in 2021]
MBPP: Mostly Basic Python Programming. [Published in 2021]
SWE-bench: Software Engineering Benchmark. Real-world software issues sourced from GitHub. (GPT-5.2: 55.6% Pro, 80% Verified; Gemini 3: 76.2%)
SWE-Lancer✍️: OpenAI. Tasks span the full engineering stack, from UI/UX to systems design, and include a range of task types, from $50 bug fixes to $32,000 feature implementations. [18 Feb 2025] (GPT-5.2: 74.6% IC Diamond)
Vibe Code Bench: Claude Sonnet 4.5 (Thinking) and GPT 5.1 are head and shoulders above the competition. GPT 5.1 stands out especially for its low cost and high performance.
LiveCodeBench Pro✍️: Algorithmic coding problems. (Gemini 3: Elo 2,439)

Chatbot Assistance

Chatbot Arena🤗: Elo ranking based on human votes.
MT Bench: Multi-turn open-ended questions

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena📑 [9 Jun 2023]
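An Elo-style ranking turns pairwise human votes into ratings. The sketch below shows the textbook Elo update for one head-to-head vote; the K-factor of 32 and the starting rating of 1000 are assumptions for illustration, and the leaderboard's actual rating computation may differ.

```python
from typing import Tuple

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one vote.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start level; a human voter prefers model A.
model_a, model_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

The update is zero-sum: whatever A gains, B loses, so the average rating across models stays constant.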

Vision & Multimodal

CharXiv Reasoning✍️: Scientific chart reasoning. (GPT-5.2: 88.7% with Python, 82.1% without tools)
ScreenSpot-Pro✍️: UI screenshot understanding. (GPT-5.2: 86.3% with Python, 64.2% without tools; Gemini 3: high performance)
MMMU-Pro✍️: Multimodal reasoning. (GPT-5.2: 80.4% with Python, 79.5% without tools; Gemini 3: 81.0%)
Video-MMMU✍️: Video understanding. (GPT-5.2: 85.9%; Gemini 3: 87.6%)

Long Context

OpenAI MRCRv2📑: Multi-round co-reference resolution. (GPT-5.2: 77.0% at 128k-256k tokens; Gemini 3: 77.0% at 128k)
BrowseComp✍️: Long context web browsing (128k, 256k). (GPT-5.2: 92.0% at 128k, 89.8% at 256k; Gemini 3: reference available)

Tool Calling & Agentic

Tau2-bench📑: Multi-turn tool usage in customer support. (GPT-5.2: 98.7% Telecom, 82.0% Retail)
Vending-Bench 2✍️: Year-long business simulation. (Gemini 3: $5,478.16 mean net worth, 272% higher than GPT-5.1)

Reasoning

ARC-AGI (Abstraction and Reasoning Corpus): Measures general fluid intelligence. (GPT-5.2: 86.2% ARC-AGI-1, 52.9% ARC-AGI-2; Gemini 3: 31.1% ARC-AGI-2, 45.1% with Deep Think)
DROP🤗: Evaluates discrete reasoning.
HellaSwag: Commonsense reasoning. [Published in 2019]
LogicQA: Evaluates logical reasoning skills.
GPQA Diamond✍️: PhD-level scientific knowledge. (GPT-5.2: 92.4%; Gemini 3: 91.9%, 93.8% with Deep Think)
Humanity's Last Exam✍️: Hardest reasoning benchmark. (GPT-5.2: 45.5% with search/Python; Gemini 3: 37.5%, 40%+ with Deep Think)

Translation

WMT🤗: Evaluates translation skills.

Math

GSM8K: Arithmetic Reasoning. [Published in 2021]
MATH: Tests ability to solve math problems. [Published in 2021]
AIME 2025✍️: Competition math benchmark. (GPT-5.2: 100%; Gemini 3: 100% with code, 95% without tools)
FrontierMath✍️: Expert-level mathematics. (GPT-5.2: 40.3% Tier 1-3, 14.6% Tier 4; Gemini 3: >20x improvement on MathArena Apex)
HMMT✍️: High school math tournament. (GPT-5.2: 99.4% Feb 2025)

Other Benchmarks

Alpha Arena: a benchmark designed to measure AI's investing abilities. [Oct 2025]
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering📑 [14 Nov 2024]
Korean SAT LLM Leaderboard: Benchmarking 10 years of Korean CSAT (College Scholastic Ability Test) exams [Oct 2024]
OpenAI BrowseComp✍️: A benchmark assessing AI agents’ ability to use web browsing tools to complete tasks requiring up-to-date information, reasoning, and navigation skills. Boost from tools + reasoning. Human trainer success ratio = 29.2% × 86.4% ≈ 25.2% [10 Apr 2025]
OpenAI GDPval✍️: OpenAI's benchmark evaluating AI performance on real-world tasks across 44 occupations [25 Sep 2025]
OpenAI MLE-bench📑: A benchmark for measuring the performance of AI agents on ML tasks using Kaggle. git [9 Oct 2024] > Among the agent frameworks used in MLE-bench, GPT-4o with AIDE achieves more medals on average than with MLAB or OpenHands (8.7% vs. 0.8% and 4.4% respectively).
OpenAI Paper Bench✍️: a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. git [2 Apr 2025]
OpenAI SimpleQA Benchmark✍️: SimpleQA, a factuality benchmark for short fact-seeking queries, narrows its scope to simplify factuality measurement. git [30 Oct 2024]
Social Sycophancy: A Broader Understanding of LLM Sycophancy📑: ELEPHANT, a benchmark to assess LLM sycophancy. Dataset (query): OEQ (Open-Ended Questions) and Reddit. LLMs prompted as judges assess the presence of sycophancy in outputs. [20 May 2025]

Evaluation Metrics

Evaluating LLMs and RAG Systems✍️ (Jan 2025)
Automated evaluation
- n-gram metrics: ROUGE, BLEU, METEOR → compare overlap with reference text.
- ROUGE: multiple variants (N, L, W, S, SU) based on n-gram, LCS, skip-bigrams.
- BLEU: 0–1 score for translation quality.
- METEOR: precision + recall + semantic similarity.
- Probabilistic metrics: Perplexity → lower is better predictive performance.
- Embedding metrics: Ada Similarity, BERTScore → semantic similarity using embeddings.
Human evaluation
- Measures relevance, fluency, coherence, groundedness.
- Can be automated with LLM-based evaluators.
Built-in methods
- Prompt flow evaluation methods: ✍️ [Aug 2023] / ✍️
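Two of the automated metrics listed above are small enough to sketch directly: ROUGE-N as clipped n-gram overlap against a reference, and perplexity as the exponential of the average negative log-probability. This is a simplified illustration, assuming whitespace tokenization and natural-log token probabilities; production implementations add stemming, multiple references, and other details.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """ROUGE-N precision, recall, and F1 with whitespace tokenization."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to its minimum count.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

def perplexity(token_logprobs: List[float]) -> float:
    """exp of the mean negative log-probability; lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

scores = rouge_n("the cat sat on the mat", "the cat is on the mat")
ppl = perplexity([math.log(0.5), math.log(0.5)])
print(scores, ppl)
```

In the example, five of the six unigrams overlap, so precision and recall are both 5/6; a model that assigns probability 0.5 to every token has perplexity 2.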

LLMOps: Large Language Model Operations

Agent Trace✍️: Data spec for recording AI agent attribution, reasoning steps, and tool calls.
agenta: OSS LLMOps workflow: building (LLM playground, evaluation), deploying (prompt and configuration management), and monitoring (LLM observability and tracing). [Jun 2023]
Azure ML Prompt flow: A set of LLMOps tools designed to facilitate the creation of LLM-based AI applications [Sep 2023] > How to Evaluate & Upgrade Model Versions in the Azure OpenAI Service✍️ [14 Aug 2024]
Azure Machine Learning studio Model Data Collector: Collect production data, analyze key safety and quality evaluation metrics on a recurring basis, receive timely alerts about critical issues, and visualize the results. ✍️ [Apr 2024]
circuit‑tracer: Anthropic. Tool for finding and visualizing circuits within large language models. A circuit is a minimal, causal computation pathway inside a transformer model that shows how internal features lead to a specific output. [May 2025]
DeepEval: LLM evaluation framework. similar to Pytest but specialized for unit testing LLM outputs. [Aug 2023]
DeepTeam: A LLM Red Teaming Framework. [Mar 2025]
Giskard: The testing framework for ML models, from tabular to LLMs [Mar 2022]
Langfuse: git LLMOps platform that helps teams to collaboratively monitor, evaluate and debug AI applications. [May 2023]
Language Model Evaluation Harness:💡Over 60 standard academic benchmarks for LLMs. A framework for few-shot evaluation. Hugging Face uses this for the Open LLM Leaderboard🤗 [Aug 2020]
LangWatch scenario:💡LangWatch Agentic testing for agentic codebases. Simulating agentic communication using autopilot [Apr 2025]
LLMOps Database: A curated knowledge base of real-world LLMOps implementations.
Maxim AI: git End-to-end simulation, evaluation, and observability platform, helping teams ship their AI agents reliably and >5x faster. [Dec 2023]
Machine Learning Operations (MLOps) For Beginners✍️: DVC (Data Version Control), MLflow, Evidently AI (Monitor a model). Insurance Cross Sell Prediction git [29 Aug 2024]
Netdata: AI-powered real-time infrastructure monitoring platform [Jun 2013]
OpenAI Evals: A framework for evaluating large language models (LLMs) [Mar 2023]
Opik: an open-source platform for evaluating, testing and monitoring LLM applications. Built by Comet. [2 Sep 2024]
Pezzo: Open-source, developer-first LLMOps platform [May 2023]
phoenix: AI Observability & Evaluation [Nov 2022]
promptfoo: Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality. [Apr 2023]
PromptTools: Open-source tools for prompt testing [Jun 2023]
Ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) [May 2023]
traceloop openllmetry: Quality monitoring for your LLM applications. [Sep 2023]
TruLens: Instrumentation and evaluation tools for large language model (LLM) based applications. [Nov 2020]
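Several tools above frame LLM evaluation as unit testing. The idea can be sketched with plain assertions in the style of a pytest test function. Everything here is hypothetical: `fake_llm` stands in for a real model call and `contains_all` is an illustrative check, not the API of DeepEval, promptfoo, or any other listed tool.

```python
from typing import List

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a canned answer."""
    canned = {"What is the capital of France?": "The capital of France is Paris."}
    return canned.get(prompt, "I don't know.")

def contains_all(output: str, required: List[str]) -> bool:
    """Case-insensitive check that every required term appears in the output."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in required)

def test_capital_of_france() -> None:
    output = fake_llm("What is the capital of France?")
    assert contains_all(output, ["Paris"])  # factual content present
    assert len(output) < 200                # answer stays concise

# A test runner such as pytest would discover the function by its name;
# calling it directly works as a quick check.
test_capital_of_france()
```

Keyword checks like this are brittle for free-form answers, which is one reason dedicated frameworks add semantic or LLM-based scoring.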

Challenges in evaluating AI systems

30 requirements for an MLOps environment🗣️: Kirk Borne twitter [15 Jul 2023]
Challenges in evaluating AI systems✍️: The challenges and limitations of various methods for evaluating AI systems, such as multiple-choice tests, human evaluations, red teaming, model-generated evaluations, and third-party audits. 🗄️ [4 Oct 2023]
Economics of Hosting Open Source LLMs✍️: Comparison of cloud vendors such as AWS, Modal, BentoML, Replicate, Hugging Face Endpoints, and Beam, using metrics like processing time, cold start latency, and costs associated with CPU, memory, and GPU usage. git [13 Nov 2024]
Pretraining on the Test Set Is All You Need📑: In this satirical paper, the author trains a small 1M-parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model, by training it on all downstream academic benchmarks. It appears to be a subtle criticism of how easily benchmarks can be "cheated", intentionally or unintentionally (through data contamination). 🗣️ [13 Sep 2023]
Sakana AI claimed 100x faster AI training, but a bug caused a 3x slowdown: Sakana’s AI produced a 3x slowdown, not a speedup. [21 Feb 2025]
Your AI Product Needs Evals [29 Mar 2024] / How to Evaluate LLM Applications: The Complete Guide [7 Nov 2023]
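Data contamination, raised in the entry on pretraining on the test set above, is often screened for by looking for long word n-grams shared between a benchmark item and training text. The sketch below is one simple way to do that, assuming whitespace tokenization; the window size and the function names are illustrative choices, not a standard.

```python
from typing import Set, Tuple

def ngram_set(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_ratio(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the training text."""
    item = ngram_set(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item & ngram_set(training_text, n)) / len(item)

question = "What is the capital of France and where is it located"
leaked = "quiz dump: " + question + " answer: Paris"
clean = "Paris has been the capital of France since the tenth century"

leaked_ratio = contamination_ratio(question, leaked, n=5)
clean_ratio = contamination_ratio(question, clean, n=5)
print(leaked_ratio, clean_ratio)
```

A verbatim copy scores 1.0 while text that merely discusses the same topic scores 0.0; paraphrased leaks would slip past this kind of exact-match check.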
