Model selection should account for accuracy, latency, cost, and task risk rather than defaulting to the strongest model for every task.
OpenAI model selection plus Kerp fireside summary
A source-backed wiki draft on choosing between frontier, smaller, cheaper, and local AI models for everyday builder workflows, with criteria for routing, evaluation, and escalation.
Model selection for everyday builders is the practice of choosing which AI model or model-running environment to use for a task. The choice depends on the task's risk, required reasoning depth, accuracy needs, latency, cost, privacy requirements, local-control needs, and the available ways to evaluate or escalate the output.
The problem is practical rather than abstract. In the June 2026 fireside session with Adam Kerpelman, the group identified a common workflow issue: builders often reach for the strongest available model even when a smaller, cheaper, or local model may be sufficient. The missing piece is confidence. A builder needs a way to decide when a cheaper route is good enough, when the task should move to a stronger hosted model, and when human review is required before the output is used.
The rapid spread of general-purpose AI tools has made model choice part of ordinary work. A builder may use AI for drafting, summarizing, code review, form-filling, research preparation, grading support, project planning, or local experiments. These tasks do not all require the same model.
Official model-selection guidance from OpenAI frames the choice around accuracy, latency, and cost. Anthropic's prompt-engineering guidance adds an evaluation-oriented framing: define success criteria, test against them, and consider changing the model when latency or cost is not acceptable. Together, these sources support a simple principle: model choice is not only about capability. It is also about fit for purpose.
A useful model-selection decision starts with the task rather than the model list.
Task risk is the first filter. Low-risk tasks include drafts, summaries, internal notes, formatting, extraction from trusted text, and brainstorming. Higher-risk tasks include public claims, legal or financial decisions, security-sensitive operations, production code changes, user-facing support, and actions that can affect other people or systems. As risk rises, the need for stronger models, clearer tests, or human review also rises.
Reasoning depth is another filter. A simple rewrite or classification task may not require a frontier model. A task that requires multi-step reasoning, unfamiliar domain judgment, long-context synthesis, or tool coordination may justify a stronger model or a more explicit evaluation loop.
Latency and cost matter when work is repeated. A model that is acceptable for one-off analysis may be too slow or expensive for a background agent, batch job, or frequent personal workflow. Smaller models, cheaper hosted models, or local runtimes can be appropriate when the task is bounded and quality can be checked.
Privacy and control also affect the route. Local models and self-controlled runtimes can be attractive for experiments, offline work, or sensitive inputs, but local execution does not automatically solve every security or privacy concern. The actual risk depends on the full setup: the model, machine, storage, logs, network access, and surrounding tools.
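The filters above can be sketched as a small decision function. This is a minimal illustration, not a recommended policy: the `TaskProfile` fields, their allowed values, and the `Route` names are all hypothetical, and the branch order simply mirrors the order in which the criteria are discussed here.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    SMALL_HOSTED = "small_hosted"
    FRONTIER = "frontier"


@dataclass
class TaskProfile:
    # Hypothetical fields and levels, chosen only for illustration.
    risk: str               # "low" or "high"
    reasoning_depth: str    # "shallow" or "deep"
    sensitive_input: bool   # True if inputs should stay on the builder's machine


def recommend(task: TaskProfile) -> tuple[Route, bool]:
    """Return (route, needs_human_review) for a task profile."""
    demanding = task.risk == "high" or task.reasoning_depth == "deep"
    if task.sensitive_input:
        # Keep the input local, but ask for review when the task is demanding,
        # since a local model may not be reliable enough on its own.
        return Route.LOCAL, demanding
    if demanding:
        # High-risk work gets a stronger model and a human check.
        return Route.FRONTIER, task.risk == "high"
    # Bounded, low-risk work: a cheaper hosted model is the default.
    return Route.SMALL_HOSTED, False


route, review = recommend(TaskProfile("low", "shallow", False))
print(route.value, review)
```

A real workflow would likely need more levels than "low" and "high", and the privacy branch assumes the local setup itself has been reviewed, since local execution does not settle every privacy concern.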
Model routing can mean several different things. These should be kept separate.
Manual selection is the simplest pattern. A person chooses a model for the task based on risk, complexity, cost, and confidence. This is often enough for everyday builder workflows.
Automatic model selection routes prompts to a model based on factors such as prompt complexity, task type, model capabilities, and configured quality or cost preferences. OpenRouter documents this kind of auto-routing as a tool pattern.
Fallback routing changes the route after a failure or an unsuitable response. OpenRouter documents fallbacks for conditions such as rate limits, provider downtime, context-length issues, and moderation-related failures.
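The general idea of a fallback chain can be shown without any particular router. The sketch below is not OpenRouter's API: `RouteError`, `call_with_fallback`, and the two stub model functions are hypothetical names, and the stubs stand in for real model calls.

```python
from typing import Callable, Sequence


class RouteError(Exception):
    """Hypothetical error a route raises when it cannot serve a request."""


def call_with_fallback(
    prompt: str,
    routes: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each route in order and return (route_name, output) from the first success."""
    errors = []
    for name, call in routes:
        try:
            return name, call(prompt)
        except RouteError as exc:
            # Record the failure and move on to the next route.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all routes failed: " + "; ".join(errors))


# Stub routes standing in for real model calls.
def flaky_small_model(prompt: str) -> str:
    raise RouteError("rate limited")


def stable_large_model(prompt: str) -> str:
    return f"answer to: {prompt}"


name, text = call_with_fallback(
    "Summarize the notes",
    [("small", flaky_small_model), ("large", stable_large_model)],
)
print(name, text)
```

Only `RouteError` triggers a fallback here. Other exceptions propagate, which keeps programming mistakes from being silently masked by a switch to another model.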
Deployment routing and load balancing distribute requests across model deployments or providers. LiteLLM Router documents strategies such as weighted routing, rate-limit-aware routing, latency-based routing, and cost-based routing.
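Two of these strategies, weighted and latency-based routing, reduce to small selection functions. This is a plain-Python sketch of the ideas, not LiteLLM's interface: the function names, the deployment dictionaries, and the latency figures are all made up for illustration.

```python
import random


def pick_weighted(deployments: list[dict], rng=random) -> str:
    """Pick a deployment name at random, in proportion to its weight."""
    names = [d["name"] for d in deployments]
    weights = [d["weight"] for d in deployments]
    return rng.choices(names, weights=weights, k=1)[0]


def pick_lowest_latency(latencies_ms: dict[str, list[float]]) -> str:
    """Pick the deployment with the lowest average observed latency."""
    return min(
        latencies_ms,
        key=lambda name: sum(latencies_ms[name]) / len(latencies_ms[name]),
    )


# Hypothetical deployments and measurements.
deployments = [{"name": "provider-a", "weight": 3}, {"name": "provider-b", "weight": 1}]
observed = {"provider-a": [420.0, 380.0], "provider-b": [250.0, 310.0]}

print(pick_weighted(deployments))
print(pick_lowest_latency(observed))
```

Weighted routing spreads load even when one deployment is slower, while latency-based routing concentrates traffic on whichever deployment has recently been fastest. Which trade-off fits depends on rate limits and on how stable the measurements are.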
Agent or source routing dispatches work to specialized agents, tools, or information sources. LangChain describes a router pattern for classifying or decomposing inputs and sending them to an appropriate destination.
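A router of this kind has two parts: a classifier and a dispatch table. The sketch below uses a keyword classifier as a deliberately crude stand-in for the classification step, which in practice is often done by a model. It does not use LangChain, and the labels and handler functions are hypothetical.

```python
from typing import Callable


def classify(text: str) -> str:
    """Assign a coarse label using keywords (a stand-in for a model-based classifier)."""
    lowered = text.lower()
    if any(word in lowered for word in ("bug", "traceback", "function")):
        return "code"
    if any(word in lowered for word in ("summarize", "summary")):
        return "summary"
    return "general"


# Stub handlers standing in for specialized agents, tools, or sources.
def code_agent(text: str) -> str:
    return f"[code agent] {text}"


def summary_agent(text: str) -> str:
    return f"[summary agent] {text}"


def general_agent(text: str) -> str:
    return f"[general agent] {text}"


HANDLERS: dict[str, Callable[[str], str]] = {
    "code": code_agent,
    "summary": summary_agent,
    "general": general_agent,
}


def route(text: str) -> tuple[str, str]:
    """Classify the input, then dispatch it to the matching handler."""
    label = classify(text)
    return label, HANDLERS[label](text)


print(route("Please summarize the meeting notes"))
```

Keeping the classifier separate from the dispatch table means the classification step can later be swapped for a model call without touching the handlers.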
For an everyday builder, these patterns can be used without adopting a large orchestration system. A simple workflow can start with manual selection, add evaluation checks, and only later add automatic routing or fallbacks when repetition justifies the complexity.
Local and smaller models are not merely weaker versions of frontier models. They fit different constraints.
A smaller hosted model can be useful for repetitive, bounded, or low-risk work: formatting, extracting fields, generating first-pass drafts, classifying short inputs, summarizing known material, or running cheap pre-checks before escalation.
A local model can be useful when a builder wants local control, offline operation, rapid experimentation, or a workflow that avoids sending every input to a hosted provider. Ollama and llama.cpp are examples of tooling that support local model workflows and local APIs or servers.
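As one example of a local API, Ollama serves an HTTP endpoint on the local machine. The sketch below builds a request for its `/api/generate` endpoint using only the standard library. It assumes Ollama is running on its default port and that the model named in the call has already been pulled; the helper names `build_request` and `generate` and the model name in the usage comment are illustrative choices.

```python
import json
import urllib.request

# Ollama's default local address; change it if the server is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call, assuming a model with this name is installed locally:
# print(generate("llama3.2", "Rewrite this note in one sentence: ..."))
```

The request never leaves the machine, but the privacy of the workflow still depends on the rest of the setup, such as logs, storage, and any tools that read the output.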
Local and small-model workflows still need evaluation. A local model may be fast and private enough for a draft but not reliable enough for a final answer. A smaller model may handle extraction but fail at planning. The route should match the task and include a way to detect failure.
Evaluation is the confidence layer in model selection. Without evaluation, cheaper or local models are hard to trust. With evaluation, a builder can use less expensive or more controlled routes for the work that fits them.
A simple evaluation loop has four parts:
1. Define the success criteria before running the task.
2. Run the task with the selected model.
3. Check the output against the criteria.
4. Accept, retry, revise the prompt, switch models, or escalate to human review.
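The four steps above can be sketched as a short loop. This is a simplified illustration under stated assumptions: `run_with_evaluation` is a hypothetical helper, models are plain callables, criteria are boolean checks, and the only responses to failure are switching to the next model or escalating to a person. Retrying and prompt revision are left out to keep the sketch small.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def run_with_evaluation(
    task: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
    criteria: dict[str, Check],
) -> dict:
    """Run models in order, cheapest first, until one output passes every check."""
    attempts = []
    for name, run in models:
        output = run(task)                                   # step 2: run the task
        failed = [label for label, check in criteria.items()
                  if not check(output)]                      # step 3: check the output
        attempts.append((name, failed))
        if not failed:                                       # step 4: accept
            return {"decision": "accept", "model": name,
                    "output": output, "attempts": attempts}
    # Step 4, last resort: no model passed, so hand off to a person.
    return {"decision": "escalate_to_human", "model": None,
            "output": None, "attempts": attempts}


# Step 1: define the criteria before running anything.
criteria = {
    "non_empty": lambda out: bool(out.strip()),
    "short": lambda out: len(out.split()) <= 20,
}
```

The `attempts` list records which checks each model failed, which gives a reviewer a starting point when the loop ends in escalation.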
The success criteria should be concrete. For a summary, they might include factual coverage, no invented claims, and a specific length. For code review, they might include finding regressions, citing file locations, and distinguishing certainty from suspicion. For public writing, they might include source support, tone, and removal of private operational detail.
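For the summary case, each criterion can be turned into a cheap automatic check. The functions below are rough proxies with hypothetical names, not a complete evaluation. In particular, comparing numbers between summary and source catches only one narrow kind of invented claim.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"


def within_length(summary: str, max_words: int) -> bool:
    """Length criterion: the summary stays under a word limit."""
    return len(summary.split()) <= max_words


def covers_terms(summary: str, required_terms: list[str]) -> bool:
    """Coverage proxy: every required term appears in the summary."""
    lowered = summary.lower()
    return all(term.lower() in lowered for term in required_terms)


def numbers_are_sourced(summary: str, source: str) -> bool:
    """Invented-claim proxy: every number in the summary also appears in the source."""
    return set(re.findall(NUMBER, summary)) <= set(re.findall(NUMBER, source))


source = "The batch job ran 40 times and cost 12.50 dollars."
good = "The batch job ran 40 times."
bad = "The batch job ran 45 times."

print(numbers_are_sourced(good, source), numbers_are_sourced(bad, source))
```

Checks like these are fast enough to run on every output. Anything they cannot judge, such as tone or whether a claim is supported, still needs a stronger model or a human reviewer.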
Escalation is not a failure. It is part of the route. A small or local model can handle a first pass, while a stronger model or human reviewer handles ambiguous, high-risk, or public-facing decisions.
OpenRouter illustrates automatic model selection and fallback routing. It is relevant to this topic as a router example, but the source session supports OpenRouter only as a chat suggestion, not as confirmed usage by Kerp.
LiteLLM Router illustrates routing across deployments and providers using strategies such as rate-limit-aware, latency-based, and cost-based routing.
LangChain's router pattern illustrates dispatching work to specialized agents or sources after classifying or decomposing the input.
Ollama and llama.cpp illustrate local model runtime patterns. They are useful references for local and small-model workflows, but the final choice of a local runtime depends on hardware, model compatibility, privacy needs, and operational comfort.
Several parts of model selection remain unsettled for everyday builders.
The first is taxonomy. A decision matrix can help, but it can also become stale if it names live models, prices, or context windows too directly. Durable guidance should focus on task properties and escalation rules.
The second is privacy. Local execution can support privacy and control goals, but it should not be treated as automatically safe. The surrounding system matters.
The third is evaluation. Builders need lightweight tests that are practical enough for daily work. Overly formal evals may be ignored, while having no evals at all leaves the builder guessing.
The fourth is when to automate routing. Manual selection may be better for varied, judgment-heavy work. Automated routing becomes more useful when tasks repeat, criteria are clear, and failures can be detected.
Model Orchestration is the broader infrastructure and workflow pattern for coordinating models, fallbacks, routes, and task dispatch.
Local And Small-Model Workflows covers the local runtime and lower-resource branch of the topic in more detail.
AI Workflow Evaluation Loops covers the confidence layer: how outputs are tested, accepted, retried, or escalated.
Context Management And Memory Systems is adjacent but separate. Context and memory affect model performance, but they are not the same topic as model selection.
- OpenAI: Model selection
- OpenAI: Evaluation best practices
- Anthropic: Prompt engineering overview
- OpenRouter: Auto Router
- OpenRouter: Model Fallbacks
- LiteLLM: Router - Load Balancing
- LangChain: Router pattern
- Ollama: API introduction
- llama.cpp: local build documentation
Evaluation loops help builders decide whether a cheaper, smaller, or local model is sufficient and when to escalate.
OpenAI evaluation best practices and Anthropic prompt-engineering guidance
Model routing includes several distinct patterns: manual selection, automatic model choice, fallback routing, deployment load balancing, and agent/source routing.
OpenRouter, LiteLLM, and LangChain docs
OpenRouter should be treated as a router-tool example and chat suggestion, not confirmed Kerp usage.
Session transcript plus OpenRouter docs
Local and small-model workflows are related but distinct from general model routing because they involve runtime, hardware, privacy, and local-control constraints.
Ollama, llama.cpp, and session transcript
Task routing prompt
Classify this task by risk, required reasoning depth, privacy sensitivity, latency need, and output confidence. Recommend whether to use a frontier, cheaper/smaller, or local model, and say what evidence would trigger escalation.
Evaluation loop prompt
Given this model output and task success criteria, identify what passed, what remains uncertain, and whether to accept, retry with the same model, or escalate to a stronger model.