LLM Inference Optimization: How to Answer Production‑Scale Questions

TL;DR

The only way to serve billions of queries per day is to treat inference as a product‑engineered system, not a research experiment. You must lock down latency budgets, shard models across specialized hardware, and automate cost‑vs‑accuracy trade‑offs before the first production request hits your endpoint. Anything less is a recipe for outages and runaway spend.

Who This Is For

You are a senior ML engineer or technical PM who has shipped at least one model to production, now tasked with scaling a large language model (LLM) to handle millions of daily users. Your current stack runs on a single GPU node, you’re hitting 600 ms per token, and the finance team is asking for a 30 % reduction in cloud bill within 90 days. You need concrete, battle‑tested tactics, not academic “research‑only” tips.

How do I set realistic latency targets for an LLM in production?

Answer: Define a per‑token latency budget that aligns with user experience goals, then back‑calculate the required throughput and hardware pool. In a recent Q2 debrief at a Tier‑1 cloud AI team, the hiring manager insisted we could not accept >150 ms per token: the UI timed out at 2 seconds, and a typical 12‑token response at 150 ms per token already takes 1.8 seconds. We therefore set 100 ms per token as the hard ceiling (1.2 seconds for 12 tokens), which leaves a safety margin for network jitter.

Judgment: Not “pick a nice round number like 200 ms” – but anchor the target to the product’s SLA and the worst‑case user path. Framework: Latency Budget Decomposition – break the budget into network, kernel, kernel‑to‑host, and compute slices; allocate 30 ms, 20 ms, 10 ms, and 40 ms respectively, then verify each slice in isolation. Script:

“Our SLA demands responses under 2 seconds for a 12‑token answer. With a 100 ms per‑token ceiling, generation takes 1.2 seconds, which leaves 800 ms of headroom for encoding/decoding overhead. Let’s validate each stage against that budget before we scale.”
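The decomposition above can be checked mechanically. The following is a minimal sketch, assuming the four slice allocations quoted in this section and a 2‑second SLA for a 12‑token answer; the measured values in the usage lines are made up for illustration, and `check_budget` is a hypothetical helper name.

```python
from dataclasses import dataclass

# Per-token slice allocations in milliseconds, taken from the decomposition above.
SLICES_MS = {"network": 30, "kernel": 20, "kernel_to_host": 10, "compute": 40}


@dataclass(frozen=True)
class BudgetReport:
    per_token_ms: int   # sum of all slice allocations
    total_ms: int       # per-token budget times tokens per response
    headroom_ms: int    # SLA minus total generation time
    over_budget: list   # slices whose measured latency exceeds the allocation


def check_budget(measured_ms, tokens=12, sla_ms=2000, slices=SLICES_MS):
    """Compare measured per-slice latency (ms) against each slice's allocation."""
    per_token = sum(slices.values())
    total = per_token * tokens
    over = [name for name, limit in slices.items()
            if measured_ms.get(name, 0) > limit]
    return BudgetReport(per_token, total, sla_ms - total, over)


# Illustrative measurements only; replace with numbers from your own profiling.
report = check_budget(
    {"network": 28, "kernel": 22, "kernel_to_host": 9, "compute": 38}
)
print(report.per_token_ms, report.total_ms, report.headroom_ms)
print(report.over_budget)
```

Measuring each slice separately matters because a slice can overrun its allocation (here the kernel slice) while the end‑to‑end total still looks healthy.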

Which hardware configuration maximizes throughput without blowing the budget?

Answer: Deploy a heterogeneous fleet: use A100‑40GB for the embedding and the first 12 transformer blocks and H100‑80GB for the deeper layers, then shard the model across two nodes with tensor‑parallelism = 2. During a hiring committee for a senior systems engineer, the senior engineer argued that a single “big GPU” approach would be cheaper, but the hiring manager countered that the cost of a single point of failure outweighs the nominal savings.

Judgment: Not “buy the newest GPU and hope it works” – but orchestrate layer‑wise placement to exploit memory bandwidth where it matters most. Counter‑intuitive insight #1: The 40 GB A100 can process the first 12 layers 1.3× faster than an H100 because its memory‑controller latency is lower, even though the H100 has higher TFLOPs overall. Numbers: Two‑node, tensor‑parallel = 2 deployment delivers 2,400 tokens/s at $0.12 per 1,000 tokens, versus a single‑node baseline of 1,200 tokens/s at $0.14 per 1,000 tokens. Script:

“We’ll allocate the embedding and first 12 transformer blocks to A100‑40GB for lower latency, then shift the remaining 24 layers to H100‑80GB. This split reduces average per‑token cost from $0.14 to $0.12 while meeting our 100 ms target.”
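One way to make the placement explicit before touching any serving framework is a plain mapping from layer to hardware tier. This sketch assumes a 36‑block model (12 early blocks plus the remaining 24, as in the script above). `plan_placement` and `cost_delta_pct` are hypothetical helpers that only describe the plan; they do not move weights onto devices.

```python
def plan_placement(num_blocks=36, early_blocks=12,
                   early_tier="A100-40GB", late_tier="H100-80GB"):
    """Map the embedding and each transformer block index to a hardware tier."""
    if not 0 <= early_blocks <= num_blocks:
        raise ValueError("early_blocks must be between 0 and num_blocks")
    placement = {"embedding": early_tier}
    for i in range(num_blocks):
        placement[f"block_{i}"] = early_tier if i < early_blocks else late_tier
    return placement


def cost_delta_pct(baseline_per_1k, new_per_1k):
    """Percentage saving of the new per-1,000-token cost versus the baseline."""
    return 100.0 * (baseline_per_1k - new_per_1k) / baseline_per_1k


placement = plan_placement()
early = sum(1 for tier in placement.values() if tier == "A100-40GB")
late = sum(1 for tier in placement.values() if tier == "H100-80GB")
print(early, late)  # embedding + 12 blocks on the early tier, 24 blocks on the late tier
print(f"{cost_delta_pct(0.14, 0.12):.1f}% saving per 1,000 tokens")
```

Keeping the plan as data makes it easy to review the split in a design discussion and to try a different boundary (say, 8 or 16 early blocks) before benchmarking it.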

How can I automate the cost‑vs‑accuracy trade‑off for different request profiles?

Answer: Implement a dynamic routing layer that selects a quantized or distilled variant of the model based on request urgency and budget tag. In a Q3 debrief, a product manager pushed back on “one model fits all” because 30 % of the traffic was internal “draft‑only” queries that could tolerate a 0.5 % BLEU loss. We built a policy engine that routes that 30 % of traffic to an 8‑bit quantized version, cutting cost by 45 % for that slice.

Judgment: Not “hard‑code a single model version” – but treat model variants as product features you can A/B test. Framework: Policy‑Driven Model Selection – define tags (e.g., “draft”, “live”, “premium”) and map each to a model fingerprint (full‑precision, 8‑bit, or LoRA‑tuned). Use a latency‑aware router that falls back to a cheaper model if the primary exceeds the SLA. Numbers: 8‑bit quantized model costs $0.065 per 1,000 tokens, 16‑bit costs $0.11, full‑precision $0.18. By routing 30 % of requests to the quantized tier, overall spend drops from $0.14 to $0.10 per 1,000 tokens, a saving of roughly 29 %. Script:

“For ‘draft’ tags we’ll invoke the 8‑bit checkpoint; for ‘live’ we’ll use the 16‑bit baseline. The router will monitor per‑request latency and automatically switch to a higher‑precision model if the 8‑bit path exceeds 120 ms.”
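A policy engine of this kind can start as a lookup table plus one fallback rule. The sketch below is illustrative, not a production router: the per‑1,000‑token prices come from the figures in this section, while the variant names, the “premium” mapping, the fallback order, and the `route` and `blended_cost` helpers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    cost_per_1k: float  # USD per 1,000 tokens


VARIANTS = {
    "int8": Variant("int8", 0.065),
    "fp16": Variant("fp16", 0.11),
    "full": Variant("full", 0.18),
}

# Hypothetical tag-to-variant policy; mapping "premium" to full precision is an assumption.
POLICY = {"draft": "int8", "live": "fp16", "premium": "full"}
# Next-cheaper variant to use when the primary is breaching the latency SLA.
FALLBACK = {"full": "fp16", "fp16": "int8", "int8": "int8"}


def route(tag, recent_p95_ms, sla_ms=100.0):
    """Pick a variant for a tag, stepping down one tier if the primary is too slow."""
    primary = POLICY.get(tag, "fp16")  # unknown tags get the 16-bit baseline
    if recent_p95_ms.get(primary, 0.0) > sla_ms:
        return VARIANTS[FALLBACK[primary]]
    return VARIANTS[primary]


def blended_cost(shares):
    """Traffic-weighted cost per 1,000 tokens; shares maps variant name to fraction."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(VARIANTS[name].cost_per_1k * share for name, share in shares.items())


print(route("draft", {}).name)              # draft traffic goes to the 8-bit variant
print(route("live", {"fp16": 140.0}).name)  # fp16 is over the SLA, so step down a tier
print(f"${blended_cost({'int8': 0.30, 'fp16': 0.70}):.4f} per 1,000 tokens")
```

With 30 % of traffic on the 8‑bit variant and the rest on 16‑bit, the blend works out to about $0.0965 per 1,000 tokens. Because the policy is data rather than code, product owners can change tag mappings without a redeploy.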

What monitoring signals should I surface to detect inference degradation early?

Answer: Track a triad of metrics: per‑token latency percentile (p95), GPU utilization variance, and output quality drift (e.g., semantic similarity to a baseline). During a hiring committee for a senior reliability engineer, the candidate focused on CPU usage alone; the hiring manager shut it down, saying “you’re missing the model‑drift signal that kills user trust.”

Judgment: Not “monitor only infrastructure health” – but couple it with content‑level quality checks. Counter‑intuitive insight #2: A 2 % rise in p95 latency often precedes a 10 % drop in semantic similarity because the scheduler is spilling to lower‑precision kernels. Numbers: Set alerts at p95 > 130 ms, GPU util variance > 15 %, and cosine similarity to the reference output < 0.92 (healthy baseline 0.97). These alerts reduced MTTR from 4 hours to 45 minutes in our production run. Script:

“If p95 latency climbs above 130 ms, trigger a warm‑restart of the tensor‑parallel workers. Simultaneously, compute semantic similarity against a cached reference; if it falls below 0.92, roll back to the previous checkpoint.”
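The three alert conditions can be evaluated together over one monitoring window. This is a minimal standard‑library sketch under stated assumptions: the thresholds are the ones quoted above, “variance > 15 %” is read as a standard deviation above 15 percentage points, embeddings arrive as plain lists of floats, and `evaluate` is a hypothetical function name.

```python
import math
import statistics


def p95(samples):
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(latencies_ms, gpu_util_pct, output_vec, reference_vec,
             p95_limit=130.0, spread_limit=15.0, sim_floor=0.92):
    """Return the names of the alerts that fire for one monitoring window."""
    alerts = []
    if p95(latencies_ms) > p95_limit:
        alerts.append("p95_latency")
    # Assumption: utilization "variance" is measured as population standard deviation.
    if statistics.pstdev(gpu_util_pct) > spread_limit:
        alerts.append("gpu_util_spread")
    if cosine_similarity(output_vec, reference_vec) < sim_floor:
        alerts.append("quality_drift")
    return alerts


# Illustrative window: two slow requests out of twenty, stable GPUs, stable output.
window = [90.0] * 18 + [150.0, 160.0]
print(evaluate(window, [70, 72, 68, 71], [1.0, 0.0], [1.0, 0.1]))
```

Returning alert names instead of acting directly keeps the remediation (warm restart, checkpoint rollback) in a separate, auditable step.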

How do I negotiate a cloud contract that keeps inference spend predictable?

Answer: Lock in a committed use discount (CUD) that matches your projected token volume and negotiate a per‑token surcharge ceiling. In a negotiation with a major cloud provider, the senior PM tried to get a “pay‑as‑you‑go” rate of $0.19 per 1,000 tokens; the finance lead argued that a 2‑year CUD at $0.13 per 1,000 tokens with a 10 % overage cap was the only way to protect the $3.2 M annual budget.

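Judgment: Not “accept the on‑demand price because it’s simpler” – but structure the contract around a realistic token forecast and a hard overage limit. Numbers: Forecast 12 billion tokens per quarter; at that volume, on‑demand at $0.19 per 1,000 tokens costs $2.28 M per quarter and the 2‑year CUD at $0.13 costs $1.56 M, a saving of $720 k per quarter ($2.88 M per year). A 10 % overage ceiling limits extra usage to 1.2 billion tokens, which at a $0.15 surcharge rate caps unexpected spend at $180 k per quarter. Script: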

“We’ll commit to 48 billion tokens for FY24‑25 at $0.13 per 1,000 tokens, with overage capped at 10 % of the commitment and billed at $0.15 per 1,000 tokens. If usage exceeds the cap, we trigger a cost‑review meeting within 5 business days.”
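The contract arithmetic is worth scripting so finance can rerun it with a different forecast. The sketch below uses the rates quoted in this section and assumes the overage clause means at most 10 % extra tokens per quarter, billed at the $0.15 surcharge rate; `cost_usd` is a hypothetical helper.

```python
def cost_usd(tokens, rate_per_1k):
    """Cost in USD for a token count at a per-1,000-token rate."""
    return tokens / 1000 * rate_per_1k


QUARTERLY_TOKENS = 12_000_000_000
ON_DEMAND, CUD, OVERAGE = 0.19, 0.13, 0.15  # USD per 1,000 tokens
OVERAGE_CAP = 0.10  # assumption: overage limited to 10 % of the quarterly commitment

on_demand_q = cost_usd(QUARTERLY_TOKENS, ON_DEMAND)
cud_q = cost_usd(QUARTERLY_TOKENS, CUD)
saving_q = on_demand_q - cud_q
max_overage_q = cost_usd(QUARTERLY_TOKENS * OVERAGE_CAP, OVERAGE)

print(f"on-demand per quarter: ${on_demand_q:,.0f}")
print(f"CUD per quarter:       ${cud_q:,.0f}")
print(f"saving per quarter:    ${saving_q:,.0f}")
print(f"saving per year:       ${saving_q * 4:,.0f}")
print(f"max overage per qtr:   ${max_overage_q:,.0f}")
```

Under these assumptions the quarterly saving is $720,000 and the worst‑case overage is $180,000, so the committed rate still beats on‑demand even in a quarter that hits the cap.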

Preparation Checklist

Review the Latency Budget Decomposition worksheet; ensure each slice (network, kernel, host, compute) has a concrete number.
Map model layers to hardware tiers (A100‑40GB vs. H100‑80GB) in the Hardware Placement Matrix.
Define request tags and corresponding model variants in the Policy‑Driven Model Selection config file.
Instrument the three‑metric triad (p95 latency, GPU util variance, semantic drift) in your observability stack.
Draft a cloud CUD request that aligns with your token forecast; include a 10 % overage clause.
Work through a structured preparation system (the PM Interview Playbook covers “Product‑Engineering Trade‑off Framing” with real debrief examples) – treat each bullet as a mock negotiation or design review.

Mistakes to Avoid

BAD: “We’ll just spin up a bigger GPU fleet when latency spikes.” GOOD: Deploy a heterogeneous fleet and a latency‑aware router; add auto‑scaling rules that respect both compute and cost constraints.

BAD: “Only monitor GPU metrics; if the hardware is fine, the service is fine.” GOOD: Couple infrastructure metrics with semantic similarity checks; set alerts on quality drift, not just utilization.

BAD: “Negotiate the lowest possible on‑demand rate and hope usage stays low.” GOOD: Use a token‑based committed use discount with a capped overage surcharge; align the contract to a forecasted production volume.

FAQ

What latency should I target for a chatbot that returns 20‑token answers? Aim for ≤100 ms per token, which translates to ≤2 seconds total for a 20‑token reply. Anything higher will breach typical UI timeouts and cause user churn.

Is 8‑bit quantization safe for all production queries? No. Use it only for low‑risk tags like “draft” or “search‑assist”. For “live” or revenue‑critical paths, keep at least 16‑bit precision to avoid measurable quality drift.

How do I prove to finance that a CUD is cheaper than on‑demand? Show a token forecast (e.g., 12 billion per quarter), multiply by the on‑demand price ($0.19 per 1,000 tokens) to get $2.28 M, then compare to the CUD rate ($0.13 per 1,000 tokens), which yields $1.56 M: a clear $720 k quarterly saving, plus a capped overage limit for protection.