Building RAG with Elasticsearch as a Vector Store
Retrieval-Augmented Generation (RAG) is the industry-standard pattern for building applications that reason over private data, logs, and internal knowledge bases. While many tutorials suggest using separate databases for structured data and vector embeddings, using Elasticsearch as a unified storage layer offers significant architectural advantages.
In this guide, we will implement a production-relevant RAG setup using Elasticsearch as both the primary data store and the vector store, orchestrated by LangChain and powered by local Ollama embeddings.
Why Use Elasticsearch for RAG?
Most RAG implementations separate concerns by using a traditional database for metadata and a dedicated vector database for embeddings. However, Elasticsearch is capable of handling both, providing several benefits for enterprise environments:
- Unified Storage: Store raw documents, metadata, and vector embeddings in a single location.
- Advanced Filtering: Apply complex structured queries and permissions alongside semantic search.
- Scalability: Benefit from production-grade reliability and horizontal scaling.
- Hybrid Search: Combine traditional keyword matching with modern vector-based retrieval (a configuration sketch follows at the end of this section).
This approach is particularly effective for log analysis assistants, internal knowledge base search, and documentation Q&A systems.
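If you want to try hybrid retrieval, recent releases of the langchain-elasticsearch package expose a DenseVectorStrategy that can blend BM25 keyword scoring with kNN vector search. The snippet below is a minimal sketch under those assumptions (a recent package version, a local cluster, and the embeddings object defined later in this guide); depending on your Elasticsearch version and license, RRF-based hybrid ranking may not be available.

from langchain_elasticsearch import DenseVectorStrategy, ElasticsearchStore

# Sketch: a store configured for hybrid (BM25 + kNN) retrieval.
# Assumes a langchain-elasticsearch release that ships DenseVectorStrategy.
hybrid_store = ElasticsearchStore(
    index_name="termtrix",
    embedding=embeddings,  # the Ollama embeddings defined later in this guide
    es_url="http://localhost:9200",
    es_user="elastic",
    es_password="your_password",
    strategy=DenseVectorStrategy(hybrid=True),  # combine keyword and vector relevance
)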
High-Level Architecture
The workflow for this implementation follows a linear path from raw data to retrieved context:
- Data Source: A JSON file containing text, document IDs, and associated metadata (a sample record is shown after this list).
- Embedding Layer: Ollama converts text into dense vectors using a local embedding model.
- Elasticsearch Vector Store: Acts as the repository for embeddings and metadata, performing similarity searches.
- Query Layer: Executes semantic search with optional metadata filters to retrieve relevant document chunks.
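For concreteness, here is one shape such a JSON file could take. The field names below (id, text, metadata, and the keys inside metadata) are illustrative assumptions rather than a required schema; the ingestion code later in this guide only expects each record to carry a text string, an identifier, and a metadata dictionary.

[
  {
    "id": "doc-001",
    "text": "Elasticsearch stores dense vectors alongside regular fields and supports kNN search over them.",
    "metadata": {
      "source": "tech_knowledge_base",
      "author": "platform-team",
      "created_at": "2024-11-02"
    }
  }
]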
Technical Stack
To follow this implementation, you will need:
- Python: The core logic.
- LangChain: For orchestration.
- Elasticsearch: For vector and document storage.
- Ollama: To run local embedding models.
The embedding model used in this setup is qwen3-embedding:4b, allowing for a fully local pipeline without reliance on external cloud APIs.
from langchain_community.embeddings import OllamaEmbeddings  # also available in the newer langchain-ollama package

embeddings = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="qwen3-embedding:4b",
)
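A quick sanity check confirms that Ollama is reachable and the model has been pulled (ollama pull qwen3-embedding:4b); embed_query returns a plain list of floats whose length is the model's embedding dimension.

# Verify the local embedding endpoint before wiring it into Elasticsearch
vector = embeddings.embed_query("What is a vector store?")
print(len(vector))  # prints the embedding dimension reported by the model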
Core RAG Class Design
A clean implementation encapsulates the logic within a single class. The TermmtrixRag class handles initialization, document loading, and search operations.
from typing import List

from langchain_elasticsearch import ElasticsearchStore


class TermmtrixRag:
    def __init__(self):
        self.vector_store = self.load_es_store()
        self.docs: List = []
        self.doc_ids: List = []

    def load_es_store(self):
        # Connects to a local Elasticsearch instance; swap in your own credentials
        return ElasticsearchStore(
            index_name="termtrix",
            embedding=embeddings,
            es_url="http://localhost:9200/",
            es_password="your_password",
            es_user="elastic",
        )
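Hardcoding credentials is acceptable for local experimentation, but in practice you would read them from the environment. Here is a small variation on load_es_store; the variable names (ES_URL, ES_USER, ELASTIC_PASSWORD) are conventions assumed for this sketch, not anything required by the library.

import os

def load_es_store(self):
    # Read connection details from the environment, with local defaults
    return ElasticsearchStore(
        index_name="termtrix",
        embedding=embeddings,
        es_url=os.environ.get("ES_URL", "http://localhost:9200"),
        es_user=os.environ.get("ES_USER", "elastic"),
        es_password=os.environ["ELASTIC_PASSWORD"],  # fail fast if unset
    )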
Document Ingestion and Embedding
Documents are typically ingested from structured formats like JSON. Each entry must be converted into a LangChain Document object. This ensures that the metadata remains attached to the text chunk, which is vital for filtering during retrieval.
from langchain_core.documents import Document

# Example ingestion logic
def ingest_data(self, data_list):
    for item in data_list:
        self.docs.append(
            Document(
                page_content=item['text'],
                metadata=item['metadata'],
            )
        )
        self.doc_ids.append(item['id'])

    # Embed and index everything in one call, reusing the source IDs
    self.vector_store.add_documents(
        documents=self.docs,
        ids=self.doc_ids,
    )
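Putting the pieces together, a minimal driver could look like this. It assumes the JSON layout sketched earlier (a top-level list of records with id, text, and metadata keys) and a file named data.json; both are illustrative choices.

import json

# Load the source records and index them into the "termtrix" index
with open("data.json", "r", encoding="utf-8") as f:
    data_list = json.load(f)

rag = TermmtrixRag()
rag.ingest_data(data_list)  # embeds each document via Ollama and writes it to Elasticsearch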
Advantages of this Approach:
- One-time Embedding: Vectors are generated once and stored, saving compute during retrieval.
- Metadata Persistence: Keywords, authors, and timestamps are stored alongside the vector for granular control.
Semantic Search and Filtering
Once the data is indexed, the system performs semantic search to find relevant context. Unlike keyword search, vector similarity search understands the intent behind a query.
Filtered Similarity Search
In production environments, raw similarity search is often insufficient. You frequently need to restrict searches based on document type, user permissions, or categories.
def similarity_search(self, query):
    results = self.vector_store.similarity_search_with_score(
        query=query,
        k=1,
        filter=[{"term": {"metadata.source": "tech_knowledge_base"}}],
    )
    for doc, score in results:
        print(f"* [Score={score:.3f}] {doc.page_content} [{doc.metadata}]")
This filtered search is critical for multi-tenant systems or domain-specific retrieval, ensuring the RAG system only accesses the appropriate subset of data.
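One way to support that is to build the filter clauses per request rather than hardcoding them. The sketch below shows how the method above might be extended; the tenant, department, and created_at metadata keys are hypothetical, and depending on your index mapping, term filters on text fields may need a .keyword suffix.

def filtered_search(self, query, tenant, department=None, since=None):
    # Assemble Elasticsearch filter clauses from the request context
    clauses = [{"term": {"metadata.tenant": tenant}}]
    if department:
        clauses.append({"term": {"metadata.department": department}})
    if since:
        clauses.append({"range": {"metadata.created_at": {"gte": since}}})

    return self.vector_store.similarity_search_with_score(
        query=query,
        k=4,
        filter=clauses,
    )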
Practical Use Cases
This Elasticsearch-based RAG pattern is highly effective for:
- Log Analysis: Filtering logs by service name or timestamp before performing a semantic search for error patterns.
- Internal Tooling: Building Q&A bots that search through specific departments (e.g., HR vs. Engineering) using metadata filters.
- Documentation Search: Providing accurate answers by retrieving the most recent version of a technical manual.
Limitations and Considerations
While Elasticsearch is powerful, there are a few considerations:
- Memory Management: Vector search can be memory-intensive; ensure your Elasticsearch nodes are appropriately sized for the number of dimensions in your embedding model.
- Re-indexing: If you change your embedding model (e.g., switching from qwen3 to another model), you must re-index all documents, because the vector representations will change; a minimal sketch follows below.
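If you do switch models, the safest path is to write into a fresh index with the new embedding object and repoint the application once it is populated. A minimal sketch, reusing only the APIs shown above (the new index name and model are placeholders):

from langchain_community.embeddings import OllamaEmbeddings
from langchain_elasticsearch import ElasticsearchStore

# New embedding model -> new index; the old index stays untouched until cut-over
new_embeddings = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="another-embedding-model",  # placeholder model name
)
new_store = ElasticsearchStore(
    index_name="termtrix_v2",
    embedding=new_embeddings,
    es_url="http://localhost:9200",
    es_user="elastic",
    es_password="your_password",
)
new_store.add_documents(documents=rag.docs, ids=rag.doc_ids)  # re-embeds every document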
Conclusion
By using Elasticsearch as both a data layer and a vector store, you simplify your RAG architecture without sacrificing features. This setup provides enterprise-ready scalability, sophisticated filtering, and the ability to run entirely local embedding models via Ollama.
The next logical step is to pass these retrieved documents into a Large Language Model (LLM) to generate natural, conversational responses, transforming a search engine into a fully functional AI assistant.
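As a teaser for that step, the sketch below stitches retrieved chunks into a prompt and sends it to a local chat model served by Ollama. The ChatOllama class and the llama3.1 model name are assumptions about your local setup, not part of the pipeline built above.

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(base_url="http://localhost:11434", model="llama3.1")  # placeholder model

def answer(rag, question):
    # Retrieve context from Elasticsearch, then ask the LLM to answer from it
    results = rag.vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in results)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content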
Key Takeaways
- Elasticsearch eliminates the need for a separate vector-only database.
- Metadata filtering is essential for production-grade RAG systems.
- Local embedding models like those provided by Ollama offer privacy and cost benefits.
- LangChain provides the necessary abstractions to link these components together seamlessly.