venturebeat
Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month over month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: users ask the same questions in different ways. "What's your return policy?", "How do I return something?" and "Can I get a refund?" were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So, I implemented semantic caching based on what queries mean, not how they're worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

## Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

```python
# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
```

But users don't phrase questions identically. My analysis of 100,000 production queries found:

- Only 18% were exact duplicates of previous queries
- 47% were semantically similar to previous queries (same intent, different wording)
- 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically similar query triggered a full LLM call, generating a response nearly identical to one we'd already computed.

## Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

```python
from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })
```

The key insight: instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.

## The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be "the same question," right?

Wrong. At 0.85, we got cache hits like:

- Query: "How do I cancel my subscription?"
- Cached: "How do I cancel my order?"
- Similarity: 0.87

These are different questions with different answers.
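Under the hood, that hit-or-miss decision is just a cosine-similarity comparison against a threshold. A minimal sketch in pure Python, using small hand-picked vectors as stand-ins for real embeddings (production systems compare high-dimensional model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two lexically close but semantically
# different queries ("cancel my subscription" vs. "cancel my order")
query_vec = [1.0, 0.0]
cached_vec = [0.87, 0.49]

similarity = cosine_similarity(query_vec, cached_vec)
print(round(similarity, 2))                     # ~0.87
print("hit" if similarity >= 0.85 else "miss")  # clears a 0.85 threshold
```

A pair like this sails past a 0.85 cutoff even though the intents differ, which is exactly the false-hit pattern in the example above.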
Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

| Query type | Optimal threshold | Rationale |
|---|---|---|
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |

I implemented query-type-specific thresholds:

```python
class AdaptiveSemanticCache:
    def __init__(self):
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
```

## Threshold tuning methodology

I couldn't tune thresholds blindly. I needed ground truth on which query pairs were actually "the same."

Our methodology:

**Step 1: Sample query pairs.** I sampled 5,000 query pairs at various similarity levels (0.80–0.99).

**Step 2: Human labeling.** Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and took a majority vote.

**Step 3: Compute precision/recall curves.** For each threshold, we computed:

- Precision: Of cache hits, what fraction had the same intent?
- Recall: Of same-intent pairs, what fraction did we cache-hit?

```python
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
```

**Step 4: Select threshold based on cost of errors.** For FAQ queries, where wrong answers damage trust, I optimized for precision (a 0.94 threshold gave 98% precision). For search queries, where missing a cache hit just costs money, I optimized for recall (0.88 threshold).

## Latency overhead

Semantic caching adds latency: you must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

| Operation | Latency (p50) | Latency (p99) |
|---|---|---|
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2,400ms |

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

- Before: 100% of queries × 850ms = 850ms average
- After: (33% × 870ms) + (67% × 20ms) ≈ 287ms + 13ms ≈ 300ms average

Net latency improvement of 65% alongside the cost reduction.

## Cache invalidation

Cached responses go stale.
Product information changes, policies update, and yesterday's correct answer becomes today's wrong answer.

I implemented three invalidation strategies:

**Time-based TTL.** Simple expiration based on content type:

```python
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # Changes frequently
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
```

**Event-based invalidation.** When underlying data changes, invalidate related cache entries:

```python
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
```

**Staleness detection.** For responses that might become stale without explicit events, I implemented periodic freshness checks:

```python
def check_freshness(self, cached_response: dict) -> bool:
    """Verify cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
```

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

## Production results

After three months in production:

| Metric | Before | After | Change |
|---|---|---|---|
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | — |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

## Pitfalls to avoid

**Don't use a single global threshold.** Different query types have different tolerance for errors. Tune thresholds per category.

**Don't skip the embedding step on cache hits.** You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.

**Don't forget invalidation.** Semantic caching without an invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

**Don't cache everything.** Some queries shouldn't be cached: personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules:

```python
def should_cache(self, query: str, response: str) -> bool:
    """Determine if a response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True
```

## Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based invalidation and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems.
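As a sanity check, the headline latency figure follows directly from the hit rate and the measured components. A quick sketch using the numbers reported above (every query pays the cache lookup; only misses pay the LLM call on top of it):

```python
def expected_latency_ms(hit_rate, lookup_ms, llm_ms):
    """Average latency: all queries pay the lookup,
    misses additionally pay the LLM call."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * lookup_ms + miss_rate * (lookup_ms + llm_ms)

# 67% hit rate, 20ms cache lookup, 850ms LLM call
avg = expected_latency_ms(0.67, 20, 850)
print(f"{avg:.0f}ms")
```

At a 0% hit rate this degrades to 870ms, since you pay the lookup for nothing, so the optimization only pays off once the hit rate covers the lookup overhead; at 20ms against an 850ms call, that happens almost immediately.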
The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.
