Independent resource. Not affiliated with any provider. Always verify pricing on provider sites.

Self-Hosted Embedding Models vs API: Cost Break-Even Analysis (April 2026)

Every API pricing page tells you what you'll pay. This page tells you when you should stop paying. Interactive break-even calculator, the full A100 math, and the hidden costs nobody mentions.

Verified April 2026 • GPU rates updated monthly

Break-Even Calculator

Example snapshot at 100M tokens/month: the API costs $2.00/month ($0.02/M tokens), while self-hosting costs $1,085/month ($1,080 fixed GPU + $5.21 variable). At this volume the API is cheaper by $1,083/month, and with this GPU option self-hosting never breaks even against text-embedding-3-small, because the self-hosted variable cost per token already exceeds the API rate.

The Math: How We Calculate Self-Hosted Cost

BGE-M3 on a single A100 80GB GPU processes approximately 8,000 tokens per second with batch size 128 and FP16 precision. The cost math:

# A100 throughput assumptions (BGE-M3, batch size 128, FP16)
tps = 8_000                       # tokens per second
tokens_per_hour = tps * 3_600     # 28,800,000 tokens/hour (28.8M)
# Cost at $1.50/hr spot
cost_per_m_tokens = 1.50 / 28.8   # ≈ $0.052 per million tokens
# vs OpenAI text-embedding-3-small
openai_rate = 0.020               # $ per million tokens
# Fixed monthly cost of keeping the GPU on 24/7
fixed_monthly = 1.50 * 720        # $1,080/month

The key insight: self-hosted cost has a large fixed component (the GPU running all month) and a small variable component (compute per token, well below API rates). The break-even is where your API bill equals the fixed GPU cost plus variable cost. With OpenAI small at $0.02/M vs self-hosted at $0.052/M, you actually never reach a variable-cost break-even. The savings come entirely from the fixed GPU already being paid - if you are running the GPU at high utilization, additional tokens are essentially free.
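The break-even condition above can be sketched as a small function. The rates here are the article's own figures; the function simply solves "API bill = fixed GPU cost + self-hosted variable cost" and reports "never" when the self-hosted variable rate exceeds the API rate:

```python
def break_even_tokens_m(api_rate, fixed_monthly, self_variable_rate):
    """Monthly volume (millions of tokens) at which the API bill equals
    fixed GPU cost plus self-hosted variable cost.
    Returns None when no break-even exists (API is cheaper per token)."""
    margin = api_rate - self_variable_rate   # net savings per million tokens
    if margin <= 0:
        return None   # variable cost alone exceeds the API rate: "never"
    return fixed_monthly / margin

# OpenAI text-embedding-3-small ($0.02/M) vs spot A100 ($0.052/M variable)
be_small = break_even_tokens_m(0.020, 1080.0, 0.052)   # None: no break-even
# A higher-priced API, e.g. ada-002 at $0.10/M
be_ada = break_even_tokens_m(0.100, 1080.0, 0.052)     # ≈ 22,500 M tokens/month
```

Note that ignoring the variable term (i.e. treating marginal tokens as free on an already-paid GPU), the ada-002 figure drops to 1,080 / 0.10 ≈ 10.8 billion tokens/month.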

Real Case Study: $85,000/Year Saved

A 5,000-employee tech company switching from API to self-hosted:
- API bill (before): $7,800/mo
- Self-hosted bill (after): $730/mo
- Annual saving: $84,840

Volume: ~5B tokens/month of internal document search. Switched from OpenAI ada-002 to self-hosted BGE-M3 on 2x A100 instances. Implementation took 3 weeks of engineering time. Cited in multiple public MLOps community write-ups (2024).
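The case-study numbers imply a fast payback even after accounting for the migration itself. A minimal sketch, assuming the 3 weeks of engineering time is priced at the $150k salary used elsewhere in this article (an assumption, not a figure from the case study):

```python
api_before = 7_800.0            # $/month on the API
self_after = 730.0              # $/month self-hosted (2x A100)
impl_cost = 3 * 150_000 / 52    # ≈ $8,654 one-off: 3 weeks at a $150k salary (assumed)

monthly_saving = api_before - self_after      # $7,070/month
annual_saving = monthly_saving * 12           # $84,840/year
payback_months = impl_cost / monthly_saving   # ≈ 1.2 months
```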

Hidden Costs of Self-Hosting (Be Honest About These)

DevOps time

GPU provisioning, monitoring, autoscaling, and incident response. Plan for 0.5-1 day/week of engineering time, especially early on. At $150k engineer salary, 0.5 days/week = ~$18k/year in hidden cost.

GPU availability and spot interruption

A100 spot instances can be interrupted with 2-minute notice. You need either on-demand fallback, spot-aware pipelines, or reserved instances. Reserved A100 at 1-year term is roughly $1.2/hour - 20% more than spot average.
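A spot-aware pipeline mostly means making embedding jobs resumable. A minimal sketch, with a hypothetical `embed_batch` callable standing in for your actual embedding call (in practice you would persist the vectors alongside the checkpoint, not just the batch index):

```python
import json
import os

def embed_resumable(texts, embed_batch, state_path, batch_size=128):
    """Embed `texts` in batches, checkpointing the next batch index to disk
    so a spot interruption can resume instead of restarting from zero.
    `embed_batch` is your embedding call (hypothetical placeholder here)."""
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["next"]
    results = []
    for i in range(start, len(texts), batch_size):
        results.extend(embed_batch(texts[i:i + batch_size]))
        with open(state_path, "w") as f:
            json.dump({"next": i + batch_size}, f)   # checkpoint progress
    return results
```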

Model evaluation overhead

When a better model releases (BGE-M3.5, NV-Embed-v3, etc.), you must: benchmark on your domain, decide to upgrade, re-embed your entire corpus, and validate retrieval quality. This is a 1-3 day engineering exercise, not a config change.
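Re-embedding cost is easy to estimate up front from the article's throughput and GPU-rate figures. A rough sketch, assuming one saturated A100 at the $1.50/hr spot rate:

```python
def reembed_estimate(corpus_tokens, tps=8_000, gpu_rate=1.50):
    """Rough wall-clock hours and GPU cost to re-embed a corpus after
    a model upgrade, assuming one saturated A100 at a spot rate."""
    hours = corpus_tokens / tps / 3_600
    return hours, hours * gpu_rate

# The case study's ~5B-token corpus:
hours, cost = reembed_estimate(5_000_000_000)
# ≈ 174 GPU-hours (about a week of wall-clock time) and ≈ $260 in spot compute
```

The GPU compute is cheap; the 1-3 days of engineering (benchmarking, validation) dominates the real cost.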

Throughput engineering

Getting 8,000 tokens/second out of BGE-M3 requires batch size tuning, FP16 precision setup, CUDA optimization, and load testing. Out of the box, naive implementations often achieve only 2,000-3,000 tokens/second, roughly 60-75% below the tuned figure.
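Throughput feeds directly into per-token cost, so an untuned deployment is paying a large markup. A quick sketch using the article's $1.50/hr spot rate:

```python
def cost_per_million(tps, hourly=1.50):
    """$ per million tokens at a given sustained throughput."""
    tokens_per_hour_m = tps * 3_600 / 1e6   # millions of tokens per hour
    return hourly / tokens_per_hour_m

untuned = cost_per_million(2_500)   # ≈ $0.167/M: naive implementation
tuned = cost_per_million(8_000)     # ≈ $0.052/M: after batching/FP16 work
```

At the same GPU rate, the naive deployment pays roughly 3x more per token than the tuned one.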

Deployment complexity

Containerising, version-pinning, health checks, rolling updates, and memory management for large models adds infrastructure complexity. Factor in the first 2-4 weeks of setup time, which has real opportunity cost.
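Putting the hidden costs above together gives a more honest total cost of ownership than the GPU bill alone. A hedged sketch using this article's own figures (the setup cost is derived from the assumed $150k salary):

```python
# Annualized self-hosting TCO including the hidden items listed above.
gpu_fixed = 1_080 * 12        # $12,960/yr: one spot A100 running 24/7
devops = 18_000               # $/yr: ~0.5 day/week of engineering time
setup = 3 * 150_000 / 52      # ≈ $8,654 one-off: ~3 weeks of setup (assumed salary)

first_year_tco = gpu_fixed + devops + setup   # ≈ $39,600
steady_state = gpu_fixed + devops             # $30,960/yr thereafter
```

Compare this full figure, not just the $1,080/month GPU line, against your API bill.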

Cloud GPU Options (April 2026)

| Provider | GPU | $/hr spot | $/hr reserved | Notes |
|---|---|---|---|---|
| AWS (p4d.24xlarge) | 8x A100 | $9-12 total | $20 total | Enterprise; complex setup; per-GPU ~$1.50 spot |
| Lambda Labs | A100 80GB | N/A | $1.29 | Simple billing; no spot; reliable reserved |
| RunPod | A100 80GB | $0.79-1.09 | $1.49 | Good spot availability; easy Docker deploy |
| Vast.ai | A100 (SXM4) | $0.85-1.20 | N/A | Cheapest but variable quality; good for testing |
| Modal | A100 80GB | ~$1.20 | N/A | Serverless; pay-per-second; great for burst |

Best Open-Source Models to Self-Host

| Model | Org | MTEB | Context | Dims | License | Notes |
|---|---|---|---|---|---|---|
| BGE-M3 | BAAI | 66.5 | 8,192 | 1024 | Apache 2.0 | Best overall; multilingual; high throughput |
| Nomic-Embed-Text-v1.5 | Nomic AI | 62.4 | 8,192 | 768 | MIT | Fully open weights; MRL support; fast |
| NV-Embed-v2 | NVIDIA | 69.3 | 32,768 | 4096 | CC-BY-4.0 | Highest MTEB; requires A100/H100; large |
| all-MiniLM-L6-v2 | sentence-transformers | 56.3 | 512 | 384 | Apache 2.0 | Very fast; CPU-friendly; lower quality |
| multilingual-e5-large | Microsoft | 61.5 | 512 | 1024 | MIT | Strong multilingual; CPU-runnable |
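The Dims column matters beyond quality: embedding dimensionality drives vector-store size and query cost. A rough sketch of raw index footprint for float32 vectors (index overhead and any quantization excluded):

```python
def index_size_gb(n_vectors, dims, bytes_per_dim=4):
    """Raw storage for float32 vectors, before index overhead."""
    return n_vectors * dims * bytes_per_dim / 1e9

# 10M document chunks: BGE-M3 (1024-dim) vs NV-Embed-v2 (4096-dim)
small = index_size_gb(10_000_000, 1024)   # ≈ 41 GB
large = index_size_gb(10_000_000, 4096)   # ≈ 164 GB
```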

5-Question Decision Framework: API or Self-Host?

Is your monthly embedding volume above 15M tokens?
- If no: stay on the API
- If yes: self-hosting is worth evaluating

Do you have a dedicated ML infrastructure engineer?
- If no: self-hosting risk is high
- If yes: proceed with evaluation

Does your use case require data residency or air-gap?
- If no: the API is fine
- If yes: self-hosting is required

Is embedding cost above $500/month?
- If no: optimization is easier than migration
- If yes: migration ROI is likely positive

Can you tolerate occasional GPU interruption?
- If no: on-demand or reserved GPU is required
- If yes: spot instances reduce cost significantly
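The five questions above can be encoded as a simple rule chain. A sketch, with the thresholds taken directly from the framework:

```python
def recommend(tokens_m_per_month, has_ml_infra_engineer, needs_residency,
              monthly_api_cost, tolerates_interruption):
    """Encode the 5-question framework above as a rule chain."""
    if needs_residency:
        return "self-host"             # data residency/air-gap forces it
    if tokens_m_per_month < 15 or monthly_api_cost < 500:
        return "api"                   # volume or spend too low to bother
    if not has_ml_infra_engineer:
        return "api"                   # self-hosting risk is high
    return "self-host (spot)" if tolerates_interruption else "self-host (reserved)"
```

Example: the case-study company (~5B tokens/month, $7,800/month API bill, dedicated infra team) lands on "self-host (spot)".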

Frequently Asked Questions

At what volume does self-hosting become cheaper than the API?
Against OpenAI text-embedding-3-small ($0.02/M), a dedicated spot A100 never breaks even: the self-hosted variable cost (≈$0.052/M at 8,000 tokens/second) already exceeds the API rate. Against higher-priced APIs such as ada-002 ($0.10/M), the $1,080/month fixed GPU cost is recovered at roughly 10-20 billion tokens per month, i.e. only at sustained high utilization.
What is the best open-source embedding model to self-host?
BGE-M3 from BAAI is the top choice: MTEB 66.5, 100+ languages, 8,192-token context, and high throughput on A100 hardware. Nomic-Embed-Text-v1.5 is a strong alternative with a permissive MIT license.
What are the hidden costs of self-hosting embedding models?
DevOps time (0.5-1 day/week), GPU availability engineering, model evaluation on new releases, throughput optimization, and deployment infrastructure. At $150k engineer salary, 0.5 days/week = ~$18k/year in hidden costs.
How fast is BGE-M3 on an A100?
BGE-M3 on a single A100 80GB achieves approximately 7,000-9,000 tokens per second with batch sizes of 128-256 and FP16 precision. At $1.50/hour spot, that is roughly $0.052/M tokens at full utilization.
Related:
- Full calculator: compare self-hosted vs all API providers
- Compare all models: self-hosted BGE-M3 vs commercial APIs
- Optimization tips: reduce API costs before self-hosting
Disclaimer: GPU pricing estimates are approximations from public cloud pricing pages as of April 2026. Spot prices vary by availability zone and time. Always verify current GPU rates before making infrastructure decisions.