- General AI Tools and Extensions
- LLM for Robotics
- Awesome Demo
- Datasets for LLM Training
- Evaluating Large Language Models
- LLMOps: Large Language Model Operations
- 5 LLM-based Apps for Developers: GitHub Copilot, Cursor IDE, Tabnine, Warp, Replit Agent
- AI Search engine:
- Phind: AI-Powered Search Engine for Developers [July 2022]
- Perplexity [Dec 2022]
- Perplexity comet: agentic browser [9 Jul 2025]
- GenSpark: An AI agent engine that performs research and generates custom pages called Sparkpages. [18 Jun 2024]
- felo.ai: Sparticle Inc. in Tokyo, Japan [04 Sep 2024]
- Goover
- oo.ai: Open Research. Fastest AI Search.
- AI Tools: https://aitoolmall.com/
- Ai2 Playground
- Awesome AI Tools: Curated collection of 100+ AI tools. [Jun 2023]
- Airtable list: Generative AI Index | AI Startups
- AlphaXiv: an interactive extension of arXiv
- AniDoc: Animation Creation Made Easier ✍️
- Cherry Studio: a desktop client that supports multiple LLM providers.
- Content writing: http://jasper.ai/chat / 🗣️
- Duck.ai:💡Private, Useful, and Optional AI: DuckDuckGo offers free access to popular AI chatbots at Duck.ai
- Edge and Chrome Extension & Plugin
- MaxAI.me
- BetterChatGPT
- ChatHub All-in-one chatbot client Webpage
- ChatGPT Retrieval Plugin
- FLORA: an AI platform integrating text, image, and video models into a unified canvas.
- Future Tools: https://www.futuretools.io/
- God Tier Prompts: A community driven leaderboard where the best prompts rise to the top.
- Open Source Image Creation Tool
- ComfyUI - https://github.com/comfyanonymous/ComfyUI
- Stable Diffusion web UI - https://github.com/AUTOMATIC1111/stable-diffusion-webui
- INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations ref📑 [5 Dec 2024]
- MGX (MetaGPT X): Multi-agent collaboration platform to develop an application.
- Msty:💡The easiest way to use local and online AI models
- napkin.ai: a text-to-visual graphics generator [7 Aug 2024]
- Newsletters & Tool Database: https://www.therundown.ai/
- Open Source No-Code AI Tools
- Anything-LLM — https://anythingllm.com
- Budibase — https://budibase.com
- Coze Studio — https://www.coze.com
- Dify — https://dify.ai
- Flowise — https://flowiseai.com
- n8n — https://n8n.io
- NocoBase — https://www.nocobase.com
- NocoDB — https://nocodb.com
- Sim — https://www.sim.ai
- Strapi — https://strapi.io
- ToolJet — https://www.tooljet.ai
- Oceans of AI - All AI Tools https://play.google.com/store/apps/details?id=in.blueplanetapps.oceansofai&hl=en_US
- Open source (huggingface):🤗http://huggingface.co/chat
- Pika AI - Free AI Video Generator
- Product Hunt > AI
- Quora Poe A chatbot service that gives access to GPT-4, gpt-3.5-turbo, Claude from Anthropic, and a variety of other bots. [Feb 2023]
- recraft.ai: Text-to-editable vector image generator
- Same.dev: Clone Any Website in Minutes
- skywork.ai: Deep Research is a multimodal generalist agent that can create documents, slides, and spreadsheets.
- Smartsub: AI-powered transcription, translation, and subtitle creation
- Terence Tao + Claude Code📺: Video discussion of Claude Code in advanced research workflows. [Mar 2026]
- TEXT-TO-CAD: Generate CAD from text prompts
- The leader: http://openai.com
- The runner-up: http://bard.google.com -> https://gemini.google.com
- Toolerific.ai: https://toolerific.ai/: Find the best AI tools for your tasks
- Vercel AI Vercel AI Playground / Vercel AI SDK git [May 2023]
- websim.ai: a web editor and simulator that can generate websites. [1 Jul 2024]
- allAIstartups: https://www.allaistartups.com/ai-tools
- PromptCraft-Robotics: Robotics and a robot simulator with ChatGPT integration git [Feb 2023]
- ChatGPT-Robot-Manipulation-Prompts: A set of prompts for Communication between humans and robots for executing tasks. git [Apr 2023]
- Siemens Industrial Copilot ✍️ [31 Oct 2023]
- LeRobot🤗: Hugging Face. LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. git [Jan 2024]
- Mobile ALOHA: Stanford’s mobile ALOHA robot learns from humans to cook, clean, do laundry. Mobile ALOHA extends the original ALOHA system by mounting it on a wheeled base ✍️ [4 Jan 2024] / ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation.
- Figure 01 + OpenAI: Humanoid Robots Powered by OpenAI ChatGPT 📺 [Mar 2024]
- Gemini Robotics: Robotics built on the foundation of Gemini 2.0 [12 Mar 2025]
- FRVR Official Teaser📺: Prompt to Game: AI-powered end-to-end game creation [16 Jun 2023]
- rewind.ai: Rewind captures everything you’ve seen on your Mac and iPhone [Nov 2023]
- Vercel announced V0.dev: Make a snake game with chat [Oct 2023]
- Mobile ALOHA📺: A day of Mobile ALOHA [4 Jan 2024]
- groq: An LPU Inference Engine, the LPU is reported to be 10 times faster than NVIDIA’s GPU performance ✍️ [Jan 2024]
- Sora📺: Introducing Sora — OpenAI’s text-to-video model [Feb 2024]
- Oasis✍️: Minecraft clone. Generated by AI in Real-Time. The first playable AI model that generates open-world games. ✍️ git [31 Oct 2024]
- LLM-generated datasets:
- Self-Instruct📑: Seed task pool with a set of human-written instructions. [20 Dec 2022]
- Self-Alignment with Instruction Backtranslation📑: Without human seeding, use LLM to produce instruction-response pairs. The process involves two steps: self-augmentation and self-curation. [11 Aug 2023]
- LLMDataHub: Awesome Datasets for LLM Training: A quick guide (especially) for trending instruction finetuning datasets
- Open LLMs and Datasets: A list of open LLMs available for commercial use.
- SQuAD: The Stanford Question Answering Dataset (SQuAD), a set of Wikipedia articles, 100,000+ question-answer pairs on 500+ articles. [16 Jun 2016]
- Synthetic Data Vault (SDV) : Synthetic data generation for tabular data [May 2018]
- RedPajama: LLaMA training dataset of over 1.2 trillion tokens git [17 Apr 2023]
- FineWeb🤗:🤗Hugging Face. 15 trillion tokens of high-quality web data crawled from the summer of 2013 to March 2024. [Apr 2024]
- MS MARCO Web Search: A large-scale information-rich web dataset, featuring millions of real clicked query-document labels [Apr 2024]
- Nemotron-Personas-Japan: Synthesized Data for Sovereign AI🤗: The first open synthetic dataset that captures Japan's demographic, geographic, and cultural spectrum. [23 Sep 2025]
- Synthetic Data of LLMs: A reading list on LLM based Synthetic Data Generation [Oct 2024]
- Open Thoughts: Fully Open Data Curation for Thinking Models [28 Jan 2025]
Pretrain for a base model

    {
      "text": ...,
      "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
      "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
    }

databricks-dolly-15k: Instruction-Tuned 🤗: SFT training - QA pairs or Dialog

    {
      "prompt": "What is the capital of France?",
      "response": "The capital of France is Paris."
    },
    {
      "prompt": "Can you give me a recipe for chocolate chip cookies?",
      "response": "Sure! ..."
    }

Anthropic human-feedback🤗: RLHF training - Chosen and Rejected pairs

    {
      "chosen": "I'm sorry to hear that. Is there anything I can do to help?",
      "rejected": "That's too bad. You should just get over it."
    }

- Dataset example

|   | Category | Instruction | Context | Response |
|---|----------|-------------|---------|----------|
| 0 | Open QA | How do I get rid of mosquitos in my house? | | You can get rid of mosquitos in your house by ... |
| 1 | Classification | Classify each country as "African" or "European" | | Nigeria: African<br>Rwanda: African<br>Portugal: European |
| 2 | Information Extraction | Extract the unique names of composers from the text. | To some extent, European and the US traditions... | Pierre Boulez, Luigi Nono, Karlheinz Stockhausen |
| 3 | General QA | Should investors time the market? | | Timing the market is based on predictions of t... |

| Instruction | Chosen Response | Rejected Response |
|-------------|-----------------|-------------------|
| What is Depreciation | Depreciation is the drop in value of an asset ... | What is Depreciation – 10 Important Facts to K... |
| What do you know about the city of Aberdeen in Scotland? | Aberdeen is a city located in the North East of Scotland. It is known for its granite architecture and its offshore oil industry. | As an AI language model, I don't have personal knowledge or experiences about Aberdeen. |
| Describe thunderstorm season in the United States and Canada. | Thunderstorm season in the United States and Canada typically occurs during the spring and summer months, when warm, moist air collides with cooler, drier air, creating the conditions for thunderstorms to form. | Describe thunderstorm season in the United States and Canada. |
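
The three record shapes above (pretraining text, prompt/response pairs, chosen/rejected pairs) can be told apart by their keys. The following is a minimal sketch, not part of any of the listed datasets' tooling: the key sets and the function names `training_stage` and `classify_jsonl` are hypothetical, chosen only to mirror the samples shown above.

```python
import json

# Hypothetical key sets, copied from the three sample record shapes above.
PRETRAIN_KEYS = {"text", "meta"}
SFT_KEYS = {"prompt", "response"}
PREFERENCE_KEYS = {"chosen", "rejected"}


def training_stage(record: dict) -> str:
    """Guess which training stage a record is shaped for, based only on its keys."""
    keys = set(record)
    if PREFERENCE_KEYS <= keys:
        return "rlhf"
    if SFT_KEYS <= keys:
        return "sft"
    if PRETRAIN_KEYS <= keys:
        return "pretrain"
    return "unknown"


def classify_jsonl(lines):
    """Parse JSON Lines strings and return the guessed stage for each non-empty line."""
    return [training_stage(json.loads(line)) for line in lines if line.strip()]


sample = [
    '{"text": "Some web page text", "meta": {"url": "https://example.com"}}',
    '{"prompt": "What is the capital of France?", "response": "The capital of France is Paris."}',
    '{"chosen": "Is there anything I can do to help?", "rejected": "You should just get over it."}',
]
stages = classify_jsonl(sample)
print(stages)
```

A check like this is only a quick sanity pass before training; real datasets often use different field names (for example `instruction` and `context`, as in the table above), so the key sets would need adjusting per dataset.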
- Artificial Analysis LLM Performance Leaderboard🤗: Performance benchmarks & pricing across API providers of LLMs
- Awesome LLMs Evaluation Papers: Evaluating Large Language Models: A Comprehensive Survey git [Oct 2023]
- Can Large Language Models Be an Alternative to Human Evaluations?📑 [3 May 2023]
- ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?📑: Open-Source LLMs vs. ChatGPT; Benchmarks and Performance of LLMs [28 Nov 2023]
- Docker cagent: Deterministic, replayable flows: the vcr (vcr.py) pattern records/replays http calls, making LLM and agent tests fast, reliable, and CI-friendly. [Sep 2025]
- Evaluation of Large Language Models: A Survey on Evaluation of Large Language Models📑: [6 Jul 2023]
- Evaluation Papers for ChatGPT [28 Feb 2023]
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge):💡Key considerations and Use cases when using LLM-evaluators [Aug 2024]
- LightEval:🤗 a lightweight LLM evaluation suite that Hugging Face has been using internally [Jan 2024]
- LLM Model Evals vs LLM Task Evals: `Model Evals` are really for people who are building or fine-tuning an LLM, whereas the best LLM application builders use `Task evals`. It's a tool to help builders build. [Feb 2024]
- LLMPerf Leaderboard: Evaluating the performance of LLM APIs. [Dec 2023]
- LLM-as-a-Judge:💡LLM-as-a-Judge offers a quick, cost-effective way to develop models aligned with human preferences and is easy to implement with just a prompt, but should be complemented by human evaluation to address biases. [Jul 2024]
- OCR Arena: a free playground for testing and evaluating leading foundation VLMs and open source OCR models side-by-side. [Nov 2025]
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models📑: We utilize the FEEDBACK COLLECTION, a novel dataset, to train PROMETHEUS, an open-source large language model with 13 billion parameters, designed specifically for evaluation tasks. [12 Oct 2023]
- The Leaderboard Illusion📑:💡Chatbot Arena's benchmarking is skewed by selective disclosures, private testing advantages, and data access asymmetries, leading to overfitting and unfair model rankings. [29 Apr 2025]
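
As a rough illustration of the LLM-as-a-Judge entries above, here is a minimal sketch of a pairwise judge that is "implemented with just a prompt". Everything in it is an assumption for illustration: the prompt wording, the `pairwise_judge` function, and the `call_llm` callable (a stand-in for whatever model client is used). Asking twice with the answer order swapped is one common way to reduce position bias; it is not taken from the linked articles.

```python
from typing import Callable

# Hypothetical judge prompt; the wording is illustrative only.
JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A' or 'B' for the better answer.\n"
    "Question: {question}\nAnswer A: {a}\nAnswer B: {b}\nVerdict:"
)


def pairwise_judge(question: str, answer_1: str, answer_2: str,
                   call_llm: Callable[[str], str]) -> str:
    """Return 'answer_1', 'answer_2', or 'tie'.

    The judge is queried twice with the answer order swapped. A win is
    counted only when both verdicts agree, otherwise the result is a tie.
    """
    first = call_llm(JUDGE_PROMPT.format(question=question, a=answer_1, b=answer_2))
    second = call_llm(JUDGE_PROMPT.format(question=question, a=answer_2, b=answer_1))
    first, second = first.strip().upper(), second.strip().upper()
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"


# Stub judges standing in for a real model call.
def always_first(prompt: str) -> str:
    """A position-biased judge that always picks whatever is shown first."""
    return "A"


def prefers_paris(prompt: str) -> str:
    """A content-based judge that picks the answer mentioning Paris."""
    answer_a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    return "A" if "Paris" in answer_a else "B"


question = "What is the capital of France?"
biased_result = pairwise_judge(question, "Paris", "Lyon", always_first)
content_result = pairwise_judge(question, "Paris", "Lyon", prefers_paris)
print(biased_result, content_result)
```

The position-biased stub ends in a tie because its two verdicts contradict each other, while the content-based stub yields a consistent winner. As the entry above notes, such automated verdicts should still be complemented by human evaluation.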
- BIG-bench📑: Consists of 204 evaluations, contributed by over 450 authors, that span a range of topics from science to social reasoning. The bottom-up approach; anyone can submit an evaluation task. git [9 Jun 2022]
- BigBench: 204 tasks. Predicting future potential [Published in 2023]
- GLUE & SuperGLUE: GLUE (General Language Understanding Evaluation)
- HELM📑: Evaluation scenarios like reasoning and disinformation using standardized metrics like accuracy, calibration, robustness, and fairness. The top-down approach; experts curate and decide what tasks to evaluate models on. git [16 Nov 2022]
- MMLU (Massive Multitask Language Understanding): Over 15,000 questions across 57 diverse tasks. [Published in 2021]
- MMLU (Massive Multi-task Language Understanding)📑: LLM performance across 57 tasks including elementary mathematics, US history, computer science, law, and more. [7 Sep 2020]
- TruthfulQA🤗: Truthfulness. [Published in 2022]
- CodeXGLUE: Programming tasks.
- HumanEval: Challenges coding skills. [Published in 2021]
- MBPP: Mostly Basic Python Programming. [Published in 2021]
- SWE-bench: Software Engineering Benchmark. Real-world software issues sourced from GitHub. (GPT-5.2: 55.6% Pro, 80% Verified; Gemini 3: 76.2%)
- SWE-Lancer✍️: OpenAI. Tasks span the full engineering stack, from UI/UX to systems design, and include a range of task types, from $50 bug fixes to $32,000 feature implementations. [18 Feb 2025] (GPT-5.2: 74.6% IC Diamond)
- Vibe Code Bench: Claude Sonnet 4.5 (Thinking) and GPT 5.1 are head and shoulders above the competition. GPT 5.1 stands out especially for its low cost and high performance.
- LiveCodeBench Pro✍️: Algorithmic coding problems. (Gemini 3: Elo 2,439)
- Chatbot Arena🤗: Human-ranked ELO ranking.
- MT Bench: Multi-turn open-ended questions
- CharXiv Reasoning✍️: Scientific chart reasoning. (GPT-5.2: 88.7% with Python, 82.1% without tools)
- ScreenSpot-Pro✍️: UI screenshot understanding. (GPT-5.2: 86.3% with Python, 64.2% without tools; Gemini 3: high performance)
- MMMU-Pro✍️: Multimodal reasoning. (GPT-5.2: 80.4% with Python, 79.5% without tools; Gemini 3: 81.0%)
- Video-MMMU✍️: Video understanding. (GPT-5.2: 85.9%; Gemini 3: 87.6%)
- OpenAI MRCRv2📑: Multi-round co-reference resolution. (GPT-5.2: 77.0% at 128k-256k tokens; Gemini 3: 77.0% at 128k)
- BrowseComp✍️: Long context web browsing (128k, 256k). (GPT-5.2: 92.0% at 128k, 89.8% at 256k; Gemini 3: reference available)
- Tau2-bench📑: Multi-turn tool usage in customer support. (GPT-5.2: 98.7% Telecom, 82.0% Retail)
- Vending-Bench 2✍️: Year-long business simulation. (Gemini 3: $5,478.16 mean net worth, 272% higher than GPT-5.1)
- ARC-AGI (Abstraction and Reasoning Corpus): Measures general fluid intelligence. (GPT-5.2: 86.2% ARC-AGI-1, 52.9% ARC-AGI-2; Gemini 3: 31.1% ARC-AGI-2, 45.1% with Deep Think)
- DROP🤗: Evaluates discrete reasoning.
- HellaSwag: Commonsense reasoning. [Published in 2019]
- LogicQA: Evaluates logical reasoning skills.
- GPQA Diamond✍️: PhD-level scientific knowledge. (GPT-5.2: 92.4%; Gemini 3: 91.9%, 93.8% with Deep Think)
- Humanity's Last Exam✍️: Hardest reasoning benchmark. (GPT-5.2: 45.5% with search/Python; Gemini 3: 37.5%, 40%+ with Deep Think)
- WMT🤗: Evaluates translation skills.
- GSM8K: Arithmetic Reasoning. [Published in 2021]
- MATH: Tests ability to solve math problems. [Published in 2021]
- AIME 2025✍️: Competition math benchmark. (GPT-5.2: 100%; Gemini 3: 100% with code, 95% without tools)
- FrontierMath✍️: Expert-level mathematics. (GPT-5.2: 40.3% Tier 1-3, 14.6% Tier 4; Gemini 3: >20x improvement on MathArena Apex)
- HMMT✍️: High school math tournament. (GPT-5.2: 99.4% Feb 2025)
- Alpha Arena: a benchmark designed to measure AI's investing abilities. [Oct 2025]
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering📑 [14 Nov 2024]
- Korean SAT LLM Leaderboard: Benchmarking 10 years of Korean CSAT (College Scholastic Ability Test) exams [Oct 2024]
- OpenAI BrowseComp✍️: A benchmark assessing AI agents’ ability to use web browsing tools to complete tasks requiring up-to-date information, reasoning, and navigation skills. Boost from tools + reasoning. Human trainer success ratio = 29.2% × 86.4% ≈ 25.2% [10 Apr 2025]
- OpenAI GDPval✍️: OpenAI's benchmark evaluating AI performance on real-world tasks across 44 occupations [25 Sep 2025]
- OpenAI MLE-bench📑: A benchmark for measuring the performance of AI agents on ML tasks using Kaggle. git [9 Oct 2024] > Agent frameworks used in MLE-bench: GPT-4o (AIDE) achieves more medals on average than both MLAB and OpenHands (8.7% vs. 0.8% and 4.4% respectively)
- OpenAI Paper Bench✍️: a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. git [2 Apr 2025]
- OpenAI SimpleQA Benchmark✍️: SimpleQA, a factuality benchmark for short fact-seeking queries, narrows its scope to simplify factuality measurement. git [30 Oct 2024]
- Social Sycophancy: A Broader Understanding of LLM Sycophancy📑: ELEPHANT, a benchmark for assessing LLM sycophancy. Dataset (queries): OEQ (Open-Ended Questions) and Reddit. LLMs prompted as judges assess the presence of sycophancy in outputs. [20 May 2025]
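
The Chatbot Arena entry in the list above mentions a human-ranked Elo ranking. The sketch below shows the textbook Elo update for a single pairwise vote. The K-factor of 32 and the starting rating of 1000 are assumptions, and this is not necessarily how Chatbot Arena computes its published leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; a human voter prefers model A.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

Because the two adjustments are equal and opposite, the sum of the ratings stays constant after every vote.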
- Evaluating LLMs and RAG Systems✍️ (Jan 2025)
- Automated evaluation
- n-gram metrics: ROUGE, BLEU, METEOR → compare overlap with reference text.
- ROUGE: multiple variants (N, L, W, S, SU) based on n-gram, LCS, skip-bigrams.
- BLEU: 0–1 score for translation quality.
- METEOR: precision + recall + semantic similarity.
- Probabilistic metrics: Perplexity → lower is better predictive performance.
- Embedding metrics: Ada Similarity, BERTScore → semantic similarity using embeddings.
- Human evaluation
- Measures relevance, fluency, coherence, groundedness.
- Automated with LLM-based evaluators.
- Built-in methods
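
The n-gram and probabilistic metrics listed above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: whitespace tokenization, lowercase matching, and no stemming or smoothing, so the numbers will differ from those of established metric libraries. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """ROUGE-N style overlap: clipped n-gram matches against a single reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # counts clipped to the minimum of both sides
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, the basis of ROUGE-L."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def perplexity(token_logprobs) -> float:
    """exp of the mean negative natural-log probability per token; lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
scores = rouge_n(candidate, reference, n=1)
lcs = lcs_length(candidate.split(), reference.split())
ppl = perplexity([math.log(0.5)] * 4)
print(scores, lcs, ppl)
```

In this example five of the six unigrams match, so precision and recall are both 5/6, and a model that assigns probability 0.5 to every token has a perplexity of about 2.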
- Agent Trace✍️: Data spec for recording AI agent attribution, reasoning steps, and tool calls.
- agenta: OSS LLMOps workflow: building (LLM playground, evaluation), deploying (prompt and configuration management), and monitoring (LLM observability and tracing). [Jun 2023]
- Azure ML Prompt flow: A set of LLMOps tools designed to facilitate the creation of LLM-based AI applications [Sep 2023] > How to Evaluate & Upgrade Model Versions in the Azure OpenAI Service✍️ [14 Aug 2024]
- Azure Machine Learning studio Model Data Collector: Collect production data, analyze key safety and quality evaluation metrics on a recurring basis, receive timely alerts about critical issues, and visualize the results. ✍️ [Apr 2024]
- circuit‑tracer: Anthropic. Tool for finding and visualizing circuits within large language models. A circuit is a minimal, causal computation pathway inside a transformer model that shows how internal features lead to a specific output. [May 2025]
- DeepEval: LLM evaluation framework. similar to Pytest but specialized for unit testing LLM outputs. [Aug 2023]
- DeepTeam: A LLM Red Teaming Framework. [Mar 2025]
- Giskard: The testing framework for ML models, from tabular to LLMs [Mar 2022]
- Langfuse: git LLMOps platform that helps teams to collaboratively monitor, evaluate and debug AI applications. [May 2023]
- Language Model Evaluation Harness:💡Over 60 standard academic benchmarks for LLMs. A framework for few-shot evaluation. Hugging Face uses this for Open LLM Leaderboard🤗 [Aug 2020]
- LangWatch scenario:💡LangWatch Agentic testing for agentic codebases. Simulating agentic communication using autopilot [Apr 2025]
- LLMOps Database: A curated knowledge base of real-world LLMOps implementations.
- Maxim AI: git End-to-end simulation, evaluation, and observability platform, helping teams ship their AI agents reliably and >5x faster. [Dec 2023]
- Machine Learning Operations (MLOps) For Beginners✍️: DVC (Data Version Control), MLflow, Evidently AI (Monitor a model). Insurance Cross Sell Prediction git [29 Aug 2024]
- Netdata: AI-powered real-time infrastructure monitoring platform [Jun 2013]
- OpenAI Evals: A framework for evaluating large language models (LLMs) [Mar 2023]
- Opik: an open-source platform for evaluating, testing and monitoring LLM applications. Built by Comet. [2 Sep 2024]
- Pezzo: Open-source, developer-first LLMOps platform [May 2023]
- phoenix: AI Observability & Evaluation [Nov 2022]
- promptfoo: Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality. [Apr 2023]
- PromptTools: Open-source tools for prompt testing [Jun 2023]
- Ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) [May 2023]
- traceloop openllmetry: Quality monitoring for your LLM applications. [Sep 2023]
- TruLens: Instrumentation and evaluation tools for large language model (LLM) based applications. [Nov 2020]
- 30 requirements for an MLOps environment🗣️: Kirk Borne twitter [15 Jul 2023]
- Challenges in evaluating AI systems✍️: The challenges and limitations of various methods for evaluating AI systems, such as multiple-choice tests, human evaluations, red teaming, model-generated evaluations, and third-party audits. 🗄️ [4 Oct 2023]
- Economics of Hosting Open Source LLMs✍️: Comparison of cloud vendors such as AWS, Modal, BentoML, Replicate, Hugging Face Endpoints, and Beam, using metrics like processing time, cold start latency, and costs associated with CPU, memory, and GPU usage. git [13 Nov 2024]
- Pretraining on the Test Set Is All You Need📑: In this satirical paper, the author trains a small 1M parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model, by training it on all downstream academic benchmarks. It appears to be a subtle criticism underlining how easily benchmarks can be "cheated" intentionally or unintentionally (due to data contamination). 🗣️ [13 Sep 2023]
- Sakana AI claimed 100x faster AI training, but a bug caused a 3x slowdown: Sakana’s AI produced a 3x slowdown, not a speedup. [21 Feb 2025]
- Your AI Product Needs Evals [29 Mar 2024] / How to Evaluate LLM Applications: The Complete Guide [7 Nov 2023]