Models

There are three things that matter when picking a model: how good is it at the task, how fast is it, and how much does it cost. Everything else is a footnote.

Frontier families

Family	Notes
Claude (Anthropic)	Strongest at long contexts, agentic loops, and code. Opus is the heavy, Sonnet is the workhorse, Haiku is the fast cheap one.
GPT (OpenAI)	Broad capability, deepest tool/function ecosystem. o-series for reasoning. Mini for sub-tasks.
Gemini (Google)	Multimodal native, longest context, cheapest at the high end. Flash for speed.
Grok (xAI)	Improving fast. Strong reasoning at the top tier.

Pick by task, not family. Code-writing agents tend to do well with Claude. Multimodal RAG over PDFs tends to favor Gemini. Search-heavy work tends to favor GPT.

Open models worth knowing

Llama (Meta). The workhorse open model.
Qwen (Alibaba). Top open model on most benchmarks lately, strong at code and math.
Mistral. European, smaller models, often good fits for fine-tuning.
DeepSeek. Strong reasoning models at low cost via API.

Hosted via Together, Fireworks, Groq, or Replicate. Self-hosted via vLLM or Ollama.

Specialty models

Embeddings. OpenAI text-embedding-3-large, Cohere embed-v3, Voyage. See RAG.
Rerankers. Cohere Rerank, Voyage Rerank. Cheap to add, large quality bump.
Whisper / Deepgram for speech-to-text.
ElevenLabs for TTS.
Replicate / Fal for image and video.

Selection process

Start with the best frontier model. Get the prompt right.
Once it works, try the smaller/cheaper model in the same family. Run evals. If it passes, keep it.
Only consider open models when cost, privacy, or latency forces it.

Reading

Artificial Analysis: independent benchmarks across providers
Simon Willison: Things we learned about LLMs