Building RAG with Elasticsearch as a Vector Store

Implement a unified RAG architecture using Elasticsearch, LangChain, and Ollama for production-grade retrieval.

Termtrix
Jan 29, 2026
4 min read

Retrieval-Augmented Generation (RAG) is the industry-standard pattern for building applications that reason over private data, logs, and internal knowledge bases. While many tutorials suggest using separate databases for structured data and vector embeddings, using Elasticsearch as a unified storage layer offers significant architectural advantages.

In this guide, we will implement a production-relevant RAG setup using Elasticsearch as both the primary data store and the vector store, orchestrated by LangChain and powered by local Ollama embeddings.

Why Use Elasticsearch for RAG?

Most RAG implementations separate concerns by using a traditional database for metadata and a dedicated vector database for embeddings. However, Elasticsearch is capable of handling both, providing several benefits for enterprise environments:

  • Unified Storage: Store raw documents, metadata, and vector embeddings in a single location.
  • Advanced Filtering: Apply complex structured queries and permissions alongside semantic search.
  • Scalability: Benefit from production-grade reliability and horizontal scaling.
  • Hybrid Search: Combine traditional keyword matching with modern vector-based retrieval.

This approach is particularly effective for log analysis assistants, internal knowledge base search, and documentation Q&A systems.

High-Level Architecture

The workflow for this implementation follows a linear path from raw data to retrieved context:

  1. Data Source: A JSON file containing text, document IDs, and associated metadata (a sample shape is sketched after this list).
  2. Embedding Layer: Ollama converts text into dense vectors using a local embedding model.
  3. Elasticsearch Vector Store: Acts as the repository for embeddings and metadata, performing similarity searches.
  4. Query Layer: Executes semantic search with optional metadata filters to retrieve relevant document chunks.
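
For concreteness, here is a minimal sketch of what such a data source might look like once loaded into Python. The field names (id, text, metadata) match the ingestion code later in this post; the values themselves are placeholders.

# Illustrative records only; in practice this list would come from json.load()
data_list = [
    {
        "id": "doc-001",
        "text": "Elasticsearch stores dense vectors alongside document metadata.",
        "metadata": {"source": "tech_knowledge_base", "author": "termtrix"}
    },
    {
        "id": "doc-002",
        "text": "LangChain orchestrates embedding, storage, and retrieval steps.",
        "metadata": {"source": "tech_knowledge_base", "author": "termtrix"}
    }
]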

Technical Stack

To follow this implementation, you will need:

  • Python: The core implementation language.
  • LangChain: For orchestration.
  • Elasticsearch: For vector and document storage.
  • Ollama: To run local embedding models.

The embedding model used in this setup is qwen3-embedding:4b, allowing for a fully local pipeline without reliance on external cloud APIs.

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="qwen3-embedding:4b"
)
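
As a quick sanity check (assuming Ollama is running locally and the model has already been pulled), you can embed a sample string and inspect the dimensionality of the resulting vector:

# Embed a test query and print the vector length reported by the model
vector = embeddings.embed_query("What is a vector store?")
print(len(vector))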

Core RAG Class Design

A clean implementation encapsulates the logic within a single class. The TermmtrixRag class handles initialization, document loading, and search operations.

from typing import List
from langchain_elasticsearch import ElasticsearchStore

class TermmtrixRag:
    def __init__(self):
        # Create the Elasticsearch-backed vector store once and reuse it
        self.vector_store = self.load_es_store()
        self.docs: List = []
        self.doc_ids: List = []

    def load_es_store(self):
        # A single index holds the raw text, metadata, and embeddings
        return ElasticsearchStore(
            index_name="termtrix",
            embedding=embeddings,
            es_url="http://localhost:9200",
            es_user="elastic",
            es_password="your_password"
        )

Document Ingestion and Embedding

Documents are typically ingested from structured formats like JSON. Each entry must be converted into a LangChain Document object. This ensures that the metadata remains attached to the text chunk, which is vital for filtering during retrieval.

from langchain_core.documents import Document 

# Example ingestion logic (intended as a method of TermmtrixRag)
def ingest_data(self, data_list):
    for item in data_list:
        # Wrap each record as a LangChain Document so metadata stays attached
        self.docs.append(
            Document(
                page_content=item['text'],
                metadata=item['metadata']
            )
        )
        self.doc_ids.append(item['id'])

    # Embed and index all documents in one call
    self.vector_store.add_documents(
        documents=self.docs,
        ids=self.doc_ids
    )
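
Assuming ingest_data is attached as a method of the TermmtrixRag class above, a typical ingestion run is only a few lines; the file name here is a placeholder:

import json

# Load the raw records and index them in one pass
rag = TermmtrixRag()

with open("documents.json") as f:  # placeholder path
    data_list = json.load(f)

rag.ingest_data(data_list)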

Advantages of this Approach:

  • One-time Embedding: Vectors are generated once and stored, saving compute during retrieval.
  • Metadata Persistence: Keywords, authors, and timestamps are stored alongside the vector for granular control.

Semantic Search and Filtering

Once the data is indexed, the system performs semantic search to find relevant context. Unlike keyword search, vector similarity search understands the intent behind a query.
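
As a baseline, a plain similarity search (no metadata filter) against the store looks like this; the query string is only illustrative:

# Top-3 semantic matches, no filtering
results = rag.vector_store.similarity_search(
    query="How does Elasticsearch store dense vectors?",
    k=3
)
for doc in results:
    print(doc.page_content)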

Filtered Similarity Search

In production environments, raw similarity search is often insufficient. You frequently need to restrict searches based on document type, user permissions, or categories.

def similarity_search(self, query):
    # Restrict the search to one metadata source, then return the best match
    results = self.vector_store.similarity_search_with_score(
        query=query,
        k=1,
        filter=[{"term": {"metadata.source": "tech_knowledge_base"}}],
    )
    for doc, score in results:
        print(f"* [Score={score:.3f}] {doc.page_content} [{doc.metadata}]")

This filtered search is critical for multi-tenant systems or domain-specific retrieval, ensuring the RAG system only accesses the appropriate subset of data.

Practical Use Cases

This Elasticsearch-based RAG pattern is highly effective for:

  • Log Analysis: Filtering logs by service name or timestamp before performing a semantic search for error patterns (see the filter sketched after this list).
  • Internal Tooling: Building Q&A bots that search through specific departments (e.g., HR vs. Engineering) using metadata filters.
  • Documentation Search: Providing accurate answers by retrieving the most recent version of a technical manual.
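
For the log-analysis case, for example, the filter can combine a service name with a time window before the semantic query runs. The metadata fields used here (service, timestamp) are illustrative assumptions rather than part of the index defined earlier:

# Restrict the search to one service and the last 24 hours, then rank by similarity
results = rag.vector_store.similarity_search_with_score(
    query="connection timeout errors",
    k=5,
    filter=[
        {"term": {"metadata.service": "payments-api"}},         # hypothetical field
        {"range": {"metadata.timestamp": {"gte": "now-24h"}}}   # hypothetical field
    ]
)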

Limitations and Considerations

While Elasticsearch is powerful, there are a few considerations:

  • Memory Management: Vector search can be memory-intensive; ensure your Elasticsearch nodes are appropriately sized for the number of dimensions in your embedding model (a quick mapping check is sketched after this list).
  • Re-indexing: If you change your embedding model (e.g., switching from qwen3 to another model), you must re-index all documents as the vector representations will change.
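
A quick way to gauge the memory impact is to inspect the index mapping and confirm the vector dimensionality Elasticsearch recorded. This sketch uses the official elasticsearch Python client and assumes the default vector field name ("vector") used by ElasticsearchStore:

from elasticsearch import Elasticsearch

# Check the dims recorded for the dense_vector field in the "termtrix" index
es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "your_password"))
mapping = es.indices.get_mapping(index="termtrix")
print(mapping["termtrix"]["mappings"]["properties"]["vector"]["dims"])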

Conclusion

By using Elasticsearch as both a data layer and a vector store, you simplify your RAG architecture without sacrificing features. This setup provides enterprise-ready scalability, sophisticated filtering, and the ability to run entirely local embedding models via Ollama.

The next logical step is to pass these retrieved documents into a Large Language Model (LLM) to generate natural, conversational responses, transforming a search engine into a fully functional AI assistant.
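
A minimal sketch of that step, assuming a chat model is also served by the local Ollama instance (the model name and prompt wording are placeholders):

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(base_url="http://localhost:11434", model="llama3")

# Retrieve context, stuff it into a prompt, and let the LLM answer
question = "How are documents ingested into the vector store?"
docs = rag.vector_store.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in docs)

answer = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)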

Key Takeaways

  • Elasticsearch eliminates the need for a separate vector-only database.
  • Metadata filtering is essential for production-grade RAG systems.
  • Local embedding models like those provided by Ollama offer privacy and cost benefits.
  • LangChain provides the necessary abstractions to link these components together seamlessly.
#elasticsearch #rag #vector-search #langchain #ollama #generative-ai
