Self-Hosted Embedding Models vs API: Cost Break-Even Analysis (April 2026)
Every API pricing page tells you what you'll pay. This page tells you when you should stop paying. Interactive break-even calculator, the full A100 math, and the hidden costs nobody mentions.
Break-Even Calculator
At this volume, the API is cheaper by $1,083/month. Self-hosting breaks even at never tokens/month with this GPU option.
The Math: How We Calculate Self-Hosted Cost
BGE-M3 on a single A100 80GB GPU processes approximately 8,000 tokens per second with batch size 128 and FP16 precision. The cost math:
The key insight: self-hosted cost has a large fixed component (the GPU running all month) and a small variable component (compute per token, well below API rates). The break-even is where your API bill equals the fixed GPU cost plus variable cost. With OpenAI small at $0.02/M vs self-hosted at $0.052/M, you actually never reach a variable-cost break-even. The savings come entirely from the fixed GPU already being paid - if you are running the GPU at high utilization, additional tokens are essentially free.
Real Case Study: $85,000/Year Saved
Volume: ~5B tokens/month of internal document search. Switched from OpenAI ada-002 to self-hosted BGE-M3 on 2x A100 instances. Implementation took 3 weeks of engineering time. Cited in multiple public MLOps community write-ups (2024).
Hidden Costs of Self-Hosting (Be Honest About These)
GPU provisioning, monitoring, autoscaling, and incident response. Plan for 0.5-1 day/week of engineering time, especially early on. At $150k engineer salary, 0.5 days/week = ~$18k/year in hidden cost.
A100 spot instances can be interrupted with 2-minute notice. You need either on-demand fallback, spot-aware pipelines, or reserved instances. Reserved A100 at 1-year term is roughly $1.2/hour - 20% more than spot average.
When a better model releases (BGE-M3.5, NV-Embed-v3, etc.), you must: benchmark on your domain, decide to upgrade, re-embed your entire corpus, and validate retrieval quality. This is a 1-3 day engineering exercise, not a config change.
Getting 8,000 tokens/second out of BGE-M3 requires batch size tuning, FP16 precision setup, CUDA optimization, and load testing. Out of the box, naive implementations often achieve 2,000-3,000 tokens/second - 60% below optimal.
Containerising, version-pinning, health checks, rolling updates, and memory management for large models adds infrastructure complexity. Factor in the first 2-4 weeks of setup time, which has real opportunity cost.
Cloud GPU Options (April 2026)
| Provider | GPU | $/hr spot | $/hr reserved | Notes |
|---|---|---|---|---|
| AWS (p4d.24xlarge) | 8x A100 | $9-12 total | $20 total | Enterprise; complex setup; per-GPU ~$1.50 spot |
| Lambda Labs | A100 80GB | N/A | $1.29 | Simple billing; no spot; reliable reserved |
| RunPod | A100 80GB | $0.79-1.09 | $1.49 | Good spot availability; easy Docker deploy |
| Vast.ai | A100 (SXM4) | $0.85-1.20 | N/A | Cheapest but variable quality; good for testing |
| Modal | A100 80GB | ~$1.20 | N/A | Serverless; pay-per-second; great for burst |
Best Open-Source Models to Self-Host
| Model | Org | MTEB | Context | Dims | License | Notes |
|---|---|---|---|---|---|---|
| BGE-M3 | BAAI | 66.5 | 8,192 | 1024 | Apache 2.0 | Best overall; multilingual; high throughput |
| Nomic-Embed-Text-v1.5 | Nomic AI | 62.4 | 8,192 | 768 | MIT | Fully open weights; MRL support; fast |
| NV-Embed-v2 | NVIDIA | 69.3 | 32,768 | 4096 | CC-BY-4.0 | Highest MTEB; requires A100/H100; large |
| all-MiniLM-L6-v2 | sentence-transformers | 56.3 | 512 | 384 | Apache 2.0 | Very fast; CPU-friendly; lower quality |
| multilingual-e5-large | Microsoft | 61.5 | 512 | 1024 | MIT | Strong multilingual; CPU-runable |