venturebeat
Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month over month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: users ask the same questions in different ways. "What's your return policy?", "How do I return something?" and "Can I get a refund?" were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So, I implemented semantic caching based on what queries mean, not how they're worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

## Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

```python
# Exact-match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
```

But users don't phrase questions identically. My analysis of 100,000 production queries found:

- Only 18% were exact duplicates of previous queries
- 47% were semantically similar to previous queries (same intent, different wording)
- 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically similar query triggered a full LLM call, generating a response nearly identical to one we'd already computed.

## Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

```python
from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })
```

The key insight: instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.

## The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be "the same question," right?

Wrong. At 0.85, we got cache hits like:

- Query: "How do I cancel my subscription?"
- Cached: "How do I cancel my order?"
- Similarity: 0.87

These are different questions with different answers.
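Under the hood, that hit-or-miss decision is just a cosine-similarity comparison against a threshold. A minimal sketch in pure Python, using small hand-picked vectors as stand-ins for real embeddings (production systems compare high-dimensional model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two lexically close but semantically
# different queries ("cancel my subscription" vs. "cancel my order")
query_vec = [1.0, 0.0]
cached_vec = [0.87, 0.49]

similarity = cosine_similarity(query_vec, cached_vec)
print(round(similarity, 2))                     # ~0.87
print("hit" if similarity >= 0.85 else "miss")  # clears a 0.85 threshold
```

A pair like this sails past a 0.85 cutoff even though the intents differ, which is exactly the false-hit pattern in the example above.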
Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

| Query type | Optimal threshold | Rationale |
|---|---|---|
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |

I implemented query-type-specific thresholds:

```python
class AdaptiveSemanticCache:
    def __init__(self):
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
```

## Threshold tuning methodology

I couldn't tune thresholds blindly. I needed ground truth on which query pairs were actually "the same."

Our methodology:

**Step 1: Sample query pairs.** I sampled 5,000 query pairs at various similarity levels (0.80–0.99).

**Step 2: Human labeling.** Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and took a majority vote.

**Step 3: Compute precision/recall curves.** For each threshold, we computed:

- Precision: Of cache hits, what fraction had the same intent?
- Recall: Of same-intent pairs, what fraction did we cache-hit?

```python
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall at given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
```

**Step 4: Select threshold based on cost of errors.** For FAQ queries, where wrong answers damage trust, I optimized for precision (a 0.94 threshold gave 98% precision). For search queries, where missing a cache hit just costs money, I optimized for recall (0.88 threshold).

## Latency overhead

Semantic caching adds latency: you must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

| Operation | Latency (p50) | Latency (p99) |
|---|---|---|
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2,400ms |

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

- Before: 100% of queries × 850ms = 850ms average
- After: (33% × 870ms) + (67% × 20ms) ≈ 287ms + 13ms ≈ 300ms average

Net latency improvement of 65% alongside the cost reduction.

## Cache invalidation

Cached responses go stale.
Product information changes, policies update, and yesterday's correct answer becomes today's wrong answer.

I implemented three invalidation strategies:

**Time-based TTL.** Simple expiration based on content type:

```python
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # Changes frequently
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
```

**Event-based invalidation.** When underlying data changes, invalidate related cache entries:

```python
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that referenced this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
```

**Staleness detection.** For responses that might become stale without explicit events, I implemented periodic freshness checks:

```python
def check_freshness(self, cached_response: dict) -> bool:
    """Verify cached response is still valid."""
    # Re-run the query against current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare semantic similarity of responses
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If responses diverged significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
```

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.

## Production results

After three months in production:

| Metric | Before | After | Change |
|---|---|---|---|
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | — |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

## Pitfalls to avoid

**Don't use a single global threshold.** Different query types have different tolerance for errors. Tune thresholds per category.

**Don't skip the embedding step on cache hits.** You might be tempted to skip embedding overhead when returning cached responses, but you need the embedding for cache key generation. The overhead is unavoidable.

**Don't forget invalidation.** Semantic caching without an invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

**Don't cache everything.** Some queries shouldn't be cached: personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules:

```python
def should_cache(self, query: str, response: str) -> bool:
    """Determine if a response should be cached."""
    # Don't cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Don't cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Don't cache transactional confirmations
    if self.is_transactional(query):
        return False
    return True
```

## Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based invalidation and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems.
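As a sanity check, the headline latency figure follows directly from the hit rate and the measured components. A quick sketch using the numbers reported above (every query pays the cache lookup; only misses pay the LLM call on top of it):

```python
def expected_latency_ms(hit_rate, lookup_ms, llm_ms):
    """Average latency: all queries pay the lookup,
    misses additionally pay the LLM call."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * lookup_ms + miss_rate * (lookup_ms + llm_ms)

# 67% hit rate, 20ms cache lookup, 850ms LLM call
avg = expected_latency_ms(0.67, 20, 850)
print(f"{avg:.0f}ms")
```

At a 0% hit rate this degrades to 870ms, since you pay the lookup for nothing, so the optimization only pays off once the hit rate covers the lookup overhead; at 20ms against an 850ms call, that happens almost immediately.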
The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.
