Each technique is shown with concrete savings math, not vague percentages. Implement in order of impact for your current volume.
1. Use the Batch API
50% off (OpenAI), 33% off (Voyage)
The OpenAI Batch API processes embedding requests asynchronously in exchange for a 50% discount: submit a JSONL file and retrieve results within 24 hours. Voyage AI's batch tier gives 33% off with a 12-hour window.
Example: Indexing 1B tokens with OpenAI small: standard = $20.00, batch = $10.00. Saving: $10/run.
!The Batch API is only suitable for indexing workloads; real-time query embeddings always pay the standard rate.
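A minimal sketch of preparing the batch input file, assuming a `chunks` list of strings. The per-line request shape (`custom_id`, `method`, `url`, `body`) is the format the OpenAI Batch API documents for `/v1/embeddings`:

```python
import json

def build_batch_file(chunks, model="text-embedding-3-small", path="embed_batch.jsonl"):
    """Write one JSONL request line per chunk in the Batch API format."""
    with open(path, "w") as f:
        for i, text in enumerate(chunks):
            line = {
                "custom_id": f"chunk-{i}",   # echoed back in the results file
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {"model": model, "input": text},
            }
            f.write(json.dumps(line) + "\n")
    return path
```

Upload the resulting file via the Files API with `purpose="batch"`, then create the batch with `completion_window="24h"`; results come back as a JSONL file keyed by `custom_id`.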
2. Pick the Smaller Model
Up to 6.5x cheaper
OpenAI text-embedding-3-large costs 6.5x more than text-embedding-3-small for roughly a 2-3 point MTEB advantage. For most production RAG applications, small is the right default. Test on a sample of your domain-specific queries before committing to large.
Example: 100M tokens/month: small = $2.00/mo, large = $13.00/mo. Annual saving: $132.
!If your use case is medical, legal, or multi-domain technical, test carefully. The quality difference is real, just usually not worth 6.5x.
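The arithmetic above as a reusable helper. Prices are hard-coded from the figures in this section; re-check the provider's pricing page before relying on them:

```python
# Per-1M-token prices (USD) from the example above - an assumption, verify before use.
PRICE_PER_1M = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}

def monthly_cost(tokens_per_month, model):
    """Embedding spend per month for a given token volume and model."""
    return tokens_per_month / 1e6 * PRICE_PER_1M[model]

def annual_saving_small_vs_large(tokens_per_month):
    """What a year of choosing small over large saves at this volume."""
    return 12 * (monthly_cost(tokens_per_month, "text-embedding-3-large")
                 - monthly_cost(tokens_per_month, "text-embedding-3-small"))
```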
3. Matryoshka Dimensions
Up to 4x storage reduction
OpenAI text-embedding-3-large and Google Gemini embedding models support Matryoshka Representation Learning. Set the 'dimensions' parameter in the API call to get a truncated but still mathematically valid embedding. The API token price is unchanged - the savings are entirely in downstream vector storage.
Example: 100M vectors at 3072 dims = 1,144 GB. At 1536 dims = 572 GB. On Pinecone: $378/mo vs $189/mo. Saving: $189/mo.
!MTEB quality drops roughly 1-2 points at 1536d, 3-5 points at 768d. Only available on MRL-trained models.
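If you prefer to shorten vectors client-side, OpenAI's embeddings documentation describes truncating an MRL embedding and re-normalizing as equivalent to passing `dimensions` in the request. A pure-Python sketch:

```python
import math

def truncate_mrl(vec, dims):
    """Truncate an MRL-trained embedding to `dims` and re-normalize to unit length.
    Only valid for models trained with Matryoshka Representation Learning."""
    v = vec[:dims]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```

Re-normalization matters: cosine similarity assumes unit vectors, and a truncated slice is no longer unit-length.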
4. Smarter Chunking
10-25% token reduction
Chunk overlap inflates token counts 10-25% over raw text size. Moving from 25% to 10% overlap on 500-token chunks reduces billed tokens by ~15% with minimal retrieval-quality impact. Smaller chunks (400-500 tokens) typically outperform larger ones (800-1000 tokens) for precise technical retrieval.
Example: 1GB corpus: 25% overlap = ~940M tokens ($18.80 at OAI small). 10% overlap = ~820M tokens ($16.40). Saving: $2.40/GB indexed.
!Chunk strategy affects retrieval quality. Test your specific content type before changing production pipelines.
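A toy sliding-window chunker that makes the inflation visible - every overlapping token is billed again (window sizes here are illustrative, not a recommendation):

```python
def chunk_token_ids(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows; `overlap` tokens repeat per chunk."""
    stride = size - overlap
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), stride)]
    billed = sum(len(c) for c in chunks)  # total tokens you pay to embed
    return chunks, billed
```

On a 1,000-token document, 100-token windows with 25-token overlap bill noticeably more than the same windows with 10-token overlap - the same effect the example above shows at corpus scale.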
5. Cache Embeddings
Variable - up to 30%+ on queries
Many production RAG applications see 20-40% duplicate or near-duplicate queries, especially in customer support. An LRU cache keyed on query text (or a similarity hash) eliminates re-embedding identical queries. Even a simple in-memory cache with a 1-hour TTL captures same-session repeats.
Example: Support bot with 30% duplicate queries at 2k queries/day x 30 tokens = 1.8M tokens/month. Caching saves 540k tokens = $0.01/month at OAI small, or $0.03/month at Voyage prices - trivial in dollars at this volume; the real wins are latency and rate-limit headroom.
!Cache invalidation complexity. Near-duplicate caching requires a secondary embedding lookup which has its own cost. Start with exact-match caching only.
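A minimal exact-match LRU sketch, assuming an `embed_fn` that wraps your provider call (names and the lowercase/strip normalization are illustrative choices):

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """Exact-match LRU cache in front of an embed() call."""

    def __init__(self, embed_fn, max_size=10_000):
        self.embed_fn = embed_fn
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, query: str):
        # Light normalization so trivial variants share a key.
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)     # mark as recently used
            return self._store[key]
        vec = self.embed_fn(query)           # cache miss: pay for the API call
        self._store[key] = vec
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least-recently-used entry
        return vec
```

Add a TTL or version the key on model changes before trusting this in production; stale vectors from a retired model are a classic cache-invalidation bug.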
6. Self-Host Above the Break-Even
80-95% cost reduction at scale
At sufficient volume, self-hosting BGE-M3 on an A100 GPU undercuts per-token API pricing, but the break-even depends heavily on which API you are replacing and on whether you pay for the GPU only while it is embedding. One published case study reports roughly $100/month self-hosted against $7,800/month on ada-002 for the same workload; against text-embedding-3-small's $0.02/1M tokens the crossover sits far higher.
Example: 5B tokens/month: OpenAI small = $100/mo. A dedicated A100 spot instance at $1.50/hr running the full month = $1,080 fixed + $260 variable = $1,340/mo, so at this volume self-hosting loses against small's pricing at any utilization. It wins against ada-002 ($500/mo at this volume), or when the GPU is spun up only for batch indexing runs.
!DevOps overhead, GPU availability risk, model maintenance burden. Not worth it below 15-50M tokens/month depending on your team.
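A back-of-the-envelope break-even helper using the example's $1.50/hr spot price and $260/month variable cost (all inputs are assumptions to replace with your own):

```python
def breakeven_tokens_per_month(api_price_per_1m, gpu_hourly,
                               hours_per_month=720, variable_monthly=0.0):
    """Monthly token volume at which a dedicated GPU matches API spend.
    Ignores DevOps time, which the text above warns is the real hidden cost."""
    fixed = gpu_hourly * hours_per_month + variable_monthly
    return fixed / api_price_per_1m * 1e6
```

Against text-embedding-3-small ($0.02/1M) the break-even lands in the tens of billions of tokens per month; against ada-002 ($0.10/1M) it is roughly 5x lower - which is why the headline savings figures in self-hosting case studies are usually measured against older, pricier APIs.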
7. Binary Embeddings (AWS Titan V2)
32x storage reduction
Amazon Titan Text Embeddings V2 supports binary quantization: each dimension is stored as 1 bit instead of 32 bits, cutting storage from 4 bytes/dimension to 0.125 bytes/dimension - a 32x reduction in raw storage. At the cost of a 5-10% loss in retrieval accuracy, this is an aggressive but effective storage optimization.
Example: 100M Titan V2 vectors (1024 dims) in float32: 381 GB. Binary: 11.9 GB. At Pinecone serverless: $125/mo vs $3.93/mo. Saving: $121/mo.
!5-10% reduction in retrieval accuracy. Of the API models covered here, only Amazon Titan V2 offers it natively. Binary retrieval requires compatible search infrastructure.
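A provider-agnostic illustration of what binary quantization does - sign-binarize the vector, then compare with Hamming distance. Titan V2 can return binary embeddings directly from the API, so you would not normally roll this yourself:

```python
def binarize(vec):
    """Sign-binarize a float vector into bytes: 1 bit per dimension (32x smaller)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    n_bytes = (len(vec) + 7) // 8
    return bits.to_bytes(n_bytes, "big")

def hamming(a: bytes, b: bytes) -> int:
    """Distance metric for binary vectors - lower means more similar."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

A 1024-dim Titan V2 vector binarizes to 128 bytes, matching the 32x figure above; Hamming distance on packed bytes is also far cheaper to compute than float cosine similarity.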
8. De-Duplicate Before Embedding
Variable - 5-50% on indexing
Hash documents before embedding and skip re-embedding content that has not changed. Many document corpora contain 20-50% duplicate or near-duplicate content (boilerplate, repeated headers, copied sections). Embedding only unique content eliminates these wasted tokens.
Example: 10k ticket knowledge base with 30% near-duplicates: 25M tokens without dedup vs 17.5M tokens with exact dedup. Saving at OAI small: $0.15 per re-indexing pass.
!Exact dedup (hash) is easy. Semantic dedup (near-duplicate detection) requires an embedding pass itself - only economical at very large scale.
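Exact-match dedup is a few lines: hash normalized text and keep each document once. The whitespace-collapsing normalization here is an assumption - tune it to your corpus:

```python
import hashlib

def dedup_for_embedding(docs):
    """Drop exact duplicates (after whitespace normalization) before embedding."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Persist the `seen` hashes between indexing runs and you also get change detection for free: any document whose hash is already stored can be skipped entirely on the next pass.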
Embedding Bill Audit: 10 Questions
1. Are you using the Batch API for bulk indexing operations?
2. Have you compared text-embedding-3-small vs large quality on your domain-specific eval set?
3. Are you on ada-002? (If yes, migrate immediately - 5x cost reduction with better quality.)
4. What is your chunk overlap percentage? Is it above 20%?
5. Are you re-embedding documents that have not changed since the last indexing pass?
6. Do you cache frequent query embeddings (at least exact-match)?
7. What percentage of your monthly token volume is for querying vs indexing?
8. Are you storing full 3072-dim vectors when 1536-dim would be sufficient?
9. Is your embedding spend above $500/month? (Self-hosting evaluation threshold.)
10. Are you on a managed vector DB? Have you evaluated pgvector for your scale?