The Model Architecture Decisions That Define Your System's Ceiling
Framework selection, fine-tuning vs. RAG analysis, training approach specification, and a benchmark methodology — the model architecture decisions made with rigor, not gut feel.
The Model Design & Selection sprint resolves the architecture decisions that determine your ML system’s capability ceiling — before you build infrastructure around the wrong choice.
Why Model Architecture Decisions Fail
Most ML model architecture decisions fail in one of three ways:
By default — the team uses the architecture they already know. PyTorch because the last project used PyTorch. Fine-tuning because someone read a fine-tuning tutorial. The decision is never made — it just happens.
By debate — the team identifies the right frameworks and approaches, forms opinions, and then cannot converge. The debate runs for weeks because there is no structured decision process and no agreed criteria.
By premature commitment — the team makes a decision quickly to unblock execution, without documenting the constraints that drove it. Six months later, when those constraints change, the decision gets relitigated — and nobody remembers why the original choice was made.
The Constraints That Drive Correct Decisions
Every model architecture decision is correct or incorrect relative to a specific set of constraints. The same use case has a different correct answer depending on:
Latency requirements — a fine-tuned model answers in a single forward pass, while a RAG pipeline adds a retrieval step to every request. If you need sub-100ms inference, that difference matters.
Training data availability — fine-tuning requires thousands of high-quality labelled examples. RAG requires a document corpus and retrieval infrastructure. The correct choice depends on what you have and can obtain.
Inference budget — a large fine-tuned model running on GPU is expensive at scale. A retrieval-augmented pipeline over a smaller model may achieve comparable quality at lower cost. The cost model needs to be explicit.
Team capabilities — the correct architecture for a team that has run RAG pipelines before is different from the correct architecture for a team that has never done retrieval. We design for your team’s actual capabilities, not an idealised team.
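For illustration, these constraints can be captured as structured data rather than tribal knowledge, so the decision criteria survive beyond the meeting where they were agreed. A minimal sketch — the field names and values below are hypothetical examples, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class UseCaseConstraints:
    """Explicit constraints that drive the architecture decision.
    All values are illustrative, gathered during constraints mapping."""
    p95_latency_ms: int                # end-to-end inference latency budget
    labelled_examples: int             # high-quality labelled examples available today
    monthly_inference_budget_usd: int  # what the business will pay to serve the model
    team_has_run_rag: bool             # prior retrieval-pipeline experience
    team_has_fine_tuned: bool          # prior fine-tuning experience

# A hypothetical support-assistant use case, written down once and versioned
support_bot = UseCaseConstraints(
    p95_latency_ms=800,
    labelled_examples=1_200,
    monthly_inference_budget_usd=5_000,
    team_has_run_rag=True,
    team_has_fine_tuned=False,
)
```

Because the constraints are explicit, a later debate ("why not fine-tuning?") can point at `labelled_examples=1_200` instead of relitigating from memory.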
What the Decision Documentation Delivers
Architecture decisions documented with explicit rationale survive team changes. When a new engineer joins and asks why the system uses RAG instead of fine-tuning, the answer is in the decision log — not in someone’s memory, not in a Slack thread, not in a document that says “we chose RAG” without saying why.
Engagement Phases
Use Case & Constraints Mapping
Structured analysis of your use case requirements — latency, throughput, accuracy targets, training data availability, inference budget, and team capabilities. We map every constraint that will determine the correct model architecture and framework choice. This phase produces the decision criteria that drive the rest of the sprint.
Model Architecture Evaluation
Systematic evaluation of the model architecture options against your documented constraints. For LLM use cases: fine-tuning vs. RAG vs. prompt engineering vs. hybrid approaches, with analysis of data requirements, cost, latency, and maintenance burden. For classical ML: model family selection with complexity-performance tradeoff analysis. Framework evaluation where relevant — PyTorch, JAX, TensorFlow, XGBoost, scikit-learn.
Benchmark Methodology & Decision Documentation
Design of the benchmark methodology that will be used to validate the chosen architecture: evaluation metrics, test set construction, baseline comparisons, and acceptance criteria. Delivery of the full decision documentation package — framework recommendation, fine-tuning vs. RAG decision doc, training approach specification, and architecture decision log.
Before & After
| Metric | Before | After |
|---|---|---|
| Decision Confidence | Framework debate running for weeks — team split, milestone at risk | Documented decision with explicit rationale — team aligned and execution unblocked |
| Benchmark Clarity | No formal benchmark methodology — model comparison based on intuition and ad hoc tests | Defined metrics, test set, baselines, and acceptance criteria — model selection objective and defensible |
| Architecture Risk Reduction | Major model architecture commitment made without structured analysis — risk of costly pivot | Constraints documented, options evaluated, decision justified — architecture risk quantified and mitigated |
Frequently Asked Questions
What are the actual tradeoffs between fine-tuning and RAG — when does each make sense?
RAG is the right default for most LLM use cases where the knowledge base changes frequently, where factual grounding is critical, or where you cannot collect 10,000+ high-quality labelled examples. Fine-tuning is appropriate when you need the model to adopt a specific style or format, when latency is critical and retrieval overhead is unacceptable, or when you have a narrow, well-defined task with sufficient labelled data. The decision document we deliver maps these tradeoffs against your specific requirements — not the general case.
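The tradeoffs above can be read as a decision rule. The sketch below is a toy encoding of that rule under the stated assumptions — the 10,000-example threshold and the return labels are illustrative, and a real decision weighs these constraints jointly rather than in strict order:

```python
def recommend_llm_approach(kb_changes_frequently: bool,
                           needs_factual_grounding: bool,
                           labelled_examples: int,
                           needs_style_control: bool,
                           latency_critical: bool) -> str:
    """Toy decision rule mirroring the fine-tuning vs. RAG tradeoffs.
    Thresholds are illustrative, not universal."""
    if kb_changes_frequently or needs_factual_grounding:
        return "RAG"
    if labelled_examples < 10_000:
        return "RAG"  # not enough data to fine-tune; retrieval fills the gap
    if needs_style_control or latency_critical:
        return "fine-tuning"
    return "prompt engineering"

# A frequently changing knowledge base points to RAG regardless of data volume
print(recommend_llm_approach(True, False, 50_000, False, False))   # RAG
# Ample labelled data plus strict latency points to fine-tuning
print(recommend_llm_approach(False, False, 20_000, False, True))   # fine-tuning
```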
When should we use open-source models vs. API-based models (GPT-4, Claude)?
API-based models are the right default for prototyping and for use cases where data privacy allows it — they are faster to iterate with and the capability ceiling is high. Open-source models (Llama, Mistral, Qwen) become the right choice when data privacy requirements preclude sending data to third-party APIs, when inference volume makes API costs prohibitive at scale, or when you need fine-tuning control that API providers do not offer. We document this analysis with cost modelling specific to your expected inference volume.
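The cost crossover mentioned above is simple arithmetic once the inputs are pinned down. A minimal sketch — all prices and volumes below are hypothetical placeholders, not current vendor rates:

```python
def monthly_api_cost(requests_per_day: int,
                     tokens_per_request: int,
                     usd_per_1k_tokens: float) -> float:
    """Token-metered API cost over a 30-day month."""
    return requests_per_day * 30 * tokens_per_request / 1_000 * usd_per_1k_tokens

def monthly_gpu_cost(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Always-on self-hosted inference cost over a 30-day month."""
    return gpu_count * usd_per_gpu_hour * 24 * 30

# Hypothetical rates: $0.01 per 1k tokens vs. two GPUs at $2.50/hour
print(monthly_api_cost(2_000, 2_000, 0.01))    # 1200.0 -> API cheaper at low volume
print(monthly_gpu_cost(2, 2.50))               # 3600.0 -> fixed self-hosting floor
print(monthly_api_cost(20_000, 2_000, 0.01))   # 12000.0 -> self-hosting wins at 10x volume
```

The point of writing the model down is the crossover: metered API cost scales linearly with volume, while self-hosted cost is a step function of GPU count, so the correct answer depends entirely on your expected inference volume.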
When do gradient-boosted trees beat a neural network on structured data?
For most tabular data tasks — fraud detection, demand forecasting, credit scoring — gradient-boosted trees (XGBoost, LightGBM) outperform neural networks, especially when the dataset is under 1M rows, latency is tight, or interpretability matters to stakeholders or regulators. Neural networks become competitive when the dataset is large, the feature space includes unstructured data (text, images), or the task requires learning complex cross-feature interactions at scale. We benchmark both approaches against your data and make a recommendation based on results, not convention.
How do you design a benchmark methodology that business stakeholders will accept?
A benchmark stakeholders accept has three properties: the metrics map to business outcomes (not just ML metrics), the test set reflects real-world distribution (not a held-out slice of training data), and the acceptance criteria are defined before evaluation starts. We design each of these with your team. The benchmark methodology we deliver includes the offline evaluation protocol and the online success metrics — connecting model performance to the business outcomes that justify the investment.
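The "acceptance criteria defined before evaluation starts" property is mechanical to enforce once the criteria are written down. A minimal sketch — the metric names and thresholds are hypothetical examples:

```python
def passes_acceptance(results: dict[str, float],
                      criteria: dict[str, float]) -> bool:
    """True only if every agreed metric meets its minimum threshold.
    A metric missing from results counts as a failure."""
    return all(results.get(metric, float("-inf")) >= minimum
               for metric, minimum in criteria.items())

# Thresholds agreed with stakeholders BEFORE any model is evaluated
criteria = {"recall_at_5": 0.85, "answer_accuracy": 0.90}

print(passes_acceptance({"recall_at_5": 0.91, "answer_accuracy": 0.93}, criteria))  # True
print(passes_acceptance({"recall_at_5": 0.80, "answer_accuracy": 0.93}, criteria))  # False
```

Freezing `criteria` in the decision log before evaluation is what makes the final model selection defensible: the bar cannot quietly move after the results are in.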
Build ML that scales.
Book a free 30-minute ML architecture scope call with our experts. We review your stack and tell you exactly what to fix before it breaks at scale.
Talk to an Expert