Each technique is shown with concrete savings math, not vague percentages. Implement in order of impact for your current volume.
1. Use the Batch API
50% off (OpenAI), 33% off (Voyage)
The OpenAI Batch API processes embedding requests asynchronously in exchange for a 50% discount: submit a JSONL file and retrieve results within 24 hours. Voyage AI's batch tier gives 33% off with a 12-hour window.
Example: Indexing 1B tokens with OpenAI small: standard = $20.00, batch = $10.00. Saving: $10/run.
!The Batch API is only suitable for indexing workloads; real-time query embeddings always pay the standard rate.
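A minimal sketch of preparing the batch input file, assuming a `chunks` list of strings. The per-line request shape (`custom_id`, `method`, `url`, `body`) is the format the OpenAI Batch API documents for `/v1/embeddings`:

```python
import json

def build_batch_file(chunks, model="text-embedding-3-small", path="embed_batch.jsonl"):
    """Write one JSONL request line per chunk in the Batch API format."""
    with open(path, "w") as f:
        for i, text in enumerate(chunks):
            line = {
                "custom_id": f"chunk-{i}",   # echoed back in the results file
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {"model": model, "input": text},
            }
            f.write(json.dumps(line) + "\n")
    return path
```

Upload the resulting file via the Files API with `purpose="batch"`, then create the batch with `completion_window="24h"`; results come back as a JSONL file keyed by `custom_id`.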
2. Pick the Smaller Model
Up to 6.5x cheaper
OpenAI text-embedding-3-large costs 6.5x more than text-embedding-3-small for roughly a 2-3 point MTEB advantage. For most production RAG applications, small is the right default. Test on a sample of your domain-specific queries before committing to large.
Example: 100M tokens/month: small = $2.00/mo, large = $13.00/mo. Annual saving: $132.
!If your use case is medical, legal, or multi-domain technical, test carefully. The quality difference is real, just usually not worth 6.5x.
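The arithmetic above as a reusable helper. Prices are hard-coded from the figures in this section; re-check the provider's pricing page before relying on them:

```python
# Per-1M-token prices (USD) from the example above - an assumption, verify before use.
PRICE_PER_1M = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}

def monthly_cost(tokens_per_month, model):
    """Embedding spend per month for a given token volume and model."""
    return tokens_per_month / 1e6 * PRICE_PER_1M[model]

def annual_saving_small_vs_large(tokens_per_month):
    """What a year of choosing small over large saves at this volume."""
    return 12 * (monthly_cost(tokens_per_month, "text-embedding-3-large")
                 - monthly_cost(tokens_per_month, "text-embedding-3-small"))
```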
3. Matryoshka Dimensions
Up to 4x storage reduction
OpenAI text-embedding-3-large and Google Gemini embedding models support Matryoshka Representation Learning. Set the 'dimensions' parameter in the API call to get a truncated but still mathematically valid embedding. The API token price is unchanged - the savings are entirely in downstream vector storage.
Example: 100M vectors at 3072 dims = 1,144 GB. At 1536 dims = 572 GB. On Pinecone: $378/mo vs $189/mo. Saving: $189/mo.
!MTEB quality drops roughly 1-2 points at 1536d, 3-5 points at 768d. Only available on MRL-trained models.
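If you prefer to shorten vectors client-side, OpenAI's embeddings documentation describes truncating an MRL embedding and re-normalizing as equivalent to passing `dimensions` in the request. A pure-Python sketch:

```python
import math

def truncate_mrl(vec, dims):
    """Truncate an MRL-trained embedding to `dims` and re-normalize to unit length.
    Only valid for models trained with Matryoshka Representation Learning."""
    v = vec[:dims]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```

Re-normalization matters: cosine similarity assumes unit vectors, and a truncated slice is no longer unit-length.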
4. Smarter Chunking
10-25% token reduction
Chunk overlap inflates token counts 10-25% over raw text size. Moving from 25% to 10% overlap on 500-token chunks reduces billed tokens by ~15% with minimal retrieval-quality impact. Smaller chunks (400-500 tokens) typically outperform larger ones (800-1000 tokens) for precise technical retrieval.
Example: 1GB corpus: 25% overlap = ~940M tokens ($18.80 at OAI small). 10% overlap = ~820M tokens ($16.40). Saving: $2.40/GB indexed.
!Chunk strategy affects retrieval quality. Test your specific content type before changing production pipelines.
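A toy sliding-window chunker that makes the inflation visible - every overlapping token is billed again (window sizes here are illustrative, not a recommendation):

```python
def chunk_token_ids(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows; `overlap` tokens repeat per chunk."""
    stride = size - overlap
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), stride)]
    billed = sum(len(c) for c in chunks)  # total tokens you pay to embed
    return chunks, billed
```

On a 1,000-token document, 100-token windows with 25-token overlap bill noticeably more than the same windows with 10-token overlap - the same effect the example above shows at corpus scale.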
5. Cache Embeddings
Variable - up to 30%+ on queries
Many production RAG applications see 20-40% duplicate or near-duplicate queries, especially in customer support. An LRU cache keyed on query text (or a similarity hash) eliminates re-embedding identical queries. Even a simple in-memory cache with a 1-hour TTL captures same-session repeats.
Example: Support bot with 30% duplicate queries at 2k queries/day x 30 tokens = 1.8M tokens/month. Caching saves 540k tokens = $0.01/month at OAI small, or $0.03/month at Voyage prices - trivial in dollars at this volume; the real wins are latency and rate-limit headroom.
!Cache invalidation complexity. Near-duplicate caching requires a secondary embedding lookup which has its own cost. Start with exact-match caching only.
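A minimal exact-match LRU sketch, assuming an `embed_fn` that wraps your provider call (names and the lowercase/strip normalization are illustrative choices):

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """Exact-match LRU cache in front of an embed() call."""

    def __init__(self, embed_fn, max_size=10_000):
        self.embed_fn = embed_fn
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, query: str):
        # Light normalization so trivial variants share a key.
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)     # mark as recently used
            return self._store[key]
        vec = self.embed_fn(query)           # cache miss: pay for the API call
        self._store[key] = vec
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least-recently-used entry
        return vec
```

Add a TTL or version the key on model changes before trusting this in production; stale vectors from a retired model are a classic cache-invalidation bug.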
6. Self-Host Above the Break-Even
80-95% cost reduction at scale
At sufficient volume, self-hosting BGE-M3 on an A100 GPU undercuts per-token API pricing, but the break-even depends heavily on which API you are replacing and on whether you pay for the GPU only while it is embedding. One published case study reports roughly $100/month self-hosted against $7,800/month on ada-002 for the same workload; against text-embedding-3-small's $0.02/1M tokens the crossover sits far higher.
Example: 5B tokens/month: OpenAI small = $100/mo. A dedicated A100 spot instance at $1.50/hr running the full month = $1,080 fixed + $260 variable = $1,340/mo, so at this volume self-hosting loses against small's pricing at any utilization. It wins against ada-002 ($500/mo at this volume), or when the GPU is spun up only for batch indexing runs.
!DevOps overhead, GPU availability risk, model maintenance burden. Not worth it below 15-50M tokens/month depending on your team.
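A back-of-the-envelope break-even helper using the example's $1.50/hr spot price and $260/month variable cost (all inputs are assumptions to replace with your own):

```python
def breakeven_tokens_per_month(api_price_per_1m, gpu_hourly,
                               hours_per_month=720, variable_monthly=0.0):
    """Monthly token volume at which a dedicated GPU matches API spend.
    Ignores DevOps time, which the text above warns is the real hidden cost."""
    fixed = gpu_hourly * hours_per_month + variable_monthly
    return fixed / api_price_per_1m * 1e6
```

Against text-embedding-3-small ($0.02/1M) the break-even lands in the tens of billions of tokens per month; against ada-002 ($0.10/1M) it is roughly 5x lower - which is why the headline savings figures in self-hosting case studies are usually measured against older, pricier APIs.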
7. Binary Embeddings (AWS Titan V2)
32x storage reduction
Amazon Titan Text Embeddings V2 supports binary quantization: each dimension is stored as 1 bit instead of 32 bits, cutting storage from 4 bytes/dimension to 0.125 bytes/dimension - a 32x reduction in raw storage. At the cost of a 5-10% loss in retrieval accuracy, this is an aggressive but effective storage optimization.
Example: 100M Titan V2 vectors (1024 dims) in float32: 381 GB. Binary: 11.9 GB. At Pinecone serverless: $125/mo vs $3.93/mo. Saving: $121/mo.
!5-10% reduction in retrieval accuracy. Of the API models covered here, only Amazon Titan V2 offers it natively. Binary retrieval requires compatible search infrastructure.
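A provider-agnostic illustration of what binary quantization does - sign-binarize the vector, then compare with Hamming distance. Titan V2 can return binary embeddings directly from the API, so you would not normally roll this yourself:

```python
def binarize(vec):
    """Sign-binarize a float vector into bytes: 1 bit per dimension (32x smaller)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    n_bytes = (len(vec) + 7) // 8
    return bits.to_bytes(n_bytes, "big")

def hamming(a: bytes, b: bytes) -> int:
    """Distance metric for binary vectors - lower means more similar."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

A 1024-dim Titan V2 vector binarizes to 128 bytes, matching the 32x figure above; Hamming distance on packed bytes is also far cheaper to compute than float cosine similarity.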
8. De-Duplicate Before Embedding
Variable - 5-50% on indexing
Hash documents before embedding and skip re-embedding content that has not changed. Many document corpora contain 20-50% duplicate or near-duplicate content (boilerplate, repeated headers, copied sections). Embedding only unique content eliminates these wasted tokens.
Example: 10k ticket knowledge base with 30% near-duplicates: 25M tokens without dedup vs 17.5M tokens with exact dedup. Saving at OAI small: $0.15 per re-indexing pass.
!Exact dedup (hash) is easy. Semantic dedup (near-duplicate detection) requires an embedding pass itself - only economical at very large scale.
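Exact-match dedup is a few lines: hash normalized text and keep each document once. The whitespace-collapsing normalization here is an assumption - tune it to your corpus:

```python
import hashlib

def dedup_for_embedding(docs):
    """Drop exact duplicates (after whitespace normalization) before embedding."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Persist the `seen` hashes between indexing runs and you also get change detection for free: any document whose hash is already stored can be skipped entirely on the next pass.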
Embedding Bill Audit: 10 Questions
1. Are you using the Batch API for bulk indexing operations?
2. Have you compared text-embedding-3-small vs large quality on your domain-specific eval set?
3. Are you on ada-002? (If yes, migrate immediately - 5x cost reduction with better quality.)
4. What is your chunk overlap percentage? Is it above 20%?
5. Are you re-embedding documents that have not changed since the last indexing pass?
6. Do you cache frequent query embeddings (at least exact-match)?
7. What percentage of your monthly token volume is for querying vs indexing?
8. Are you storing full 3072-dim vectors when 1536-dim would be sufficient?
9. Is your embedding spend above $500/month? (Self-hosting evaluation threshold.)
10. Are you on a managed vector DB? Have you evaluated pgvector for your scale?