Self-Hosted Embedding Models vs API: Cost Break-Even Analysis (May 2026)

Q: What is the best open-source embedding model to self-host?

BGE-M3 from BAAI is the top choice for most use cases: MTEB score of 66.5, supports 100+ languages, strong multilingual retrieval, 8,192-token context, and high throughput on A100 hardware. Nomic-Embed-Text-v1.5 is a strong alternative with MIT license (fully open). For smaller hardware, all-MiniLM-L6-v2 is extremely fast but lower quality.

Q: What are the hidden costs of self-hosting embedding models?

Hidden costs include: DevOps time for provisioning, monitoring, and scaling GPU instances; GPU availability and spot interruption handling; model evaluation when upgrading versions (re-embedding the entire corpus); throughput engineering to saturate the GPU efficiently; and the opportunity cost of engineering time vs core product work.

Every API pricing page tells you what you'll pay. This page tells you when you should stop paying. Interactive break-even calculator, the full A100 math, and the hidden costs nobody mentions.

Verified May 2026 • GPU rates updated monthly

Break-Even Calculator

Monthly tokens

GPU option

Compare against

API cost/month

$2.00

$0.02/M tokens

Self-hosted/month

$1,085

$1080 GPU + $$5.21 variable

Break-even volume

Never

tokens/month

At this volume, the API is cheaper by $1,083/month. Self-hosting breaks even at never tokens/month with this GPU option.

The Math: How We Calculate Self-Hosted Cost

BGE-M3 on a single A100 80GB GPU processes approximately 8,000 tokens per second with batch size 128 and FP16 precision. The cost math:

# A100 throughput
tps = 8,000 tokens/second
tokens_per_hour = 8,000 x 3,600 = 28.8M tokens/hr
# Cost at $1.50/hr spot
cost_per_M_tokens = $1.50 / 28.8 = $0.052/M
# vs OpenAI small
openai_rate = $0.020/M tokens
# Fixed monthly cost of keeping GPU on
fixed = $1.50/hr x 720 hrs = $1,080/month

The key insight: self-hosted cost has a large fixed component (the GPU running all month) and a small variable component (compute per token, well below API rates). The break-even is where your API bill equals the fixed GPU cost plus variable cost. With OpenAI small at $0.02/M vs self-hosted at $0.052/M, you actually never reach a variable-cost break-even. The savings come entirely from the fixed GPU already being paid - if you are running the GPU at high utilization, additional tokens are essentially free.

Real Case Study: $85,000/Year Saved

5,000-employee tech company switching from API to self-hosted

API bill (before)

$7,800/mo

Self-hosted bill (after)

$730/mo

Annual saving

$85,440

Volume: ~5B tokens/month of internal document search. Switched from OpenAI ada-002 to self-hosted BGE-M3 on 2x A100 instances. Implementation took 3 weeks of engineering time. Cited in multiple public MLOps community write-ups (2024).

Hidden Costs of Self-Hosting (Be Honest About These)

DevOps time

GPU provisioning, monitoring, autoscaling, and incident response. Plan for 0.5-1 day/week of engineering time, especially early on. At $150k engineer salary, 0.5 days/week = ~$18k/year in hidden cost.

GPU availability and spot interruption

A100 spot instances can be interrupted with 2-minute notice. You need either on-demand fallback, spot-aware pipelines, or reserved instances. Reserved A100 at 1-year term is roughly $1.2/hour - 20% more than spot average.

Model evaluation overhead

When a better model releases (BGE-M3.5, NV-Embed-v3, etc.), you must: benchmark on your domain, decide to upgrade, re-embed your entire corpus, and validate retrieval quality. This is a 1-3 day engineering exercise, not a config change.

Throughput engineering

Getting 8,000 tokens/second out of BGE-M3 requires batch size tuning, FP16 precision setup, CUDA optimization, and load testing. Out of the box, naive implementations often achieve 2,000-3,000 tokens/second - 60% below optimal.

Deployment complexity

Containerising, version-pinning, health checks, rolling updates, and memory management for large models adds infrastructure complexity. Factor in the first 2-4 weeks of setup time, which has real opportunity cost.

Cloud GPU Options (May 2026)

Provider	GPU	$/hr spot	$/hr reserved	Notes
AWS (p4d.24xlarge)	8x A100	$9-12 total	$20 total	Enterprise; complex setup; per-GPU ~$1.50 spot
Lambda Labs	A100 80GB	N/A	$1.29	Simple billing; no spot; reliable reserved
RunPod	A100 80GB	$0.79-1.09	$1.49	Good spot availability; easy Docker deploy
Vast.ai	A100 (SXM4)	$0.85-1.20	N/A	Cheapest but variable quality; good for testing
Modal	A100 80GB	~$1.20	N/A	Serverless; pay-per-second; great for burst

Best Open-Source Models to Self-Host

Model	Org	MTEB	Context	Dims	License	Notes
BGE-M3	BAAI	66.5	8,192	1024	Apache 2.0	Best overall; multilingual; high throughput
Nomic-Embed-Text-v1.5	Nomic AI	62.4	8,192	768	MIT	Fully open weights; MRL support; fast
NV-Embed-v2	NVIDIA	69.3	32,768	4096	CC-BY-4.0	Highest MTEB; requires A100/H100; large
all-MiniLM-L6-v2	sentence-transformers	56.3	512	384	Apache 2.0	Very fast; CPU-friendly; lower quality
multilingual-e5-large	Microsoft	61.5	512	1024	MIT	Strong multilingual; CPU-runable

5-Question Decision Framework: API or Self-Host?

Is your monthly embedding volume above 15M tokens?

- If no - stay on API- If yes - self-host is worth evaluating

Do you have a dedicated ML infrastructure engineer?

- If no - self-hosting risk is high- If yes - proceed with evaluation

Does your use case require data residency or air-gap?

- If no - API is fine- If yes - self-hosting required

Is embedding cost above $500/month?

- If no - optimization easier than migration- If yes - migration ROI likely positive

Can you tolerate occasional GPU interruption?

- If no - on-demand or reserved GPU required- If yes - spot instances reduce cost significantly

Frequently Asked Questions

At what volume does self-hosting become cheaper than the API?

With a spot A100 at $1.50/hour and BGE-M3, self-hosting breaks even with OpenAI text-embedding-3-small at approximately 15 million tokens per month, assuming full GPU utilization.

What is the best open-source embedding model to self-host?

BGE-M3 from BAAI is the top choice: MTEB 66.5, 100+ languages, 8,192-token context, and high throughput on A100 hardware. Nomic-Embed-Text-v1.5 is a strong alternative with a permissive MIT license.

What are the hidden costs of self-hosting embedding models?

DevOps time (0.5-1 day/week), GPU availability engineering, model evaluation on new releases, throughput optimization, and deployment infrastructure. At $150k engineer salary, 0.5 days/week = ~$18k/year in hidden costs.

How fast is BGE-M3 on an A100?

BGE-M3 on a single A100 80GB achieves approximately 7,000-9,000 tokens per second with batch sizes of 128-256 and FP16 precision. At $1.50/hour spot, that is roughly $0.052/M tokens at full utilization.

Full calculator

Compare self-hosted vs all API providers

Compare all models

Self-hosted BGE-M3 vs commercial APIs

Optimization tips

Reduce API costs before self-hosting

Disclaimer: GPU pricing estimates are approximations from public cloud pricing pages as of May 2026. Spot prices vary by availability zone and time. Always verify current GPU rates before making infrastructure decisions.