# What embeddings are and how they help AI understand meaning

An embedding is a way of turning a piece of text into a list of numbers that captures its meaning, so a computer can compare meaning instead of just looking for identical words.

## What is an embedding?

When you read a sentence, you understand its meaning even if you see different words that say the same thing. Computers do not have that intuition. An embedding model solves this by converting text into a vector—a fixed‑length list of numbers. Sentences with similar meanings end up with vectors that are close to each other in this numerical space, while unrelated sentences are farther apart.

For example, the phrases “I forgot my VPN password” and “I cannot log into the company network” share few words, but their vectors will be close because both talk about a login problem.
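"Close" is usually measured with cosine similarity. Here is a minimal sketch of that calculation. The three-dimensional vectors are made-up values for illustration only; real embeddings come from the model and have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT real model output.
vpn_password = [0.9, 0.1, 0.0]   # "I forgot my VPN password"
network_login = [0.8, 0.3, 0.1]  # "I cannot log into the company network"
lunch_menu = [0.0, 0.2, 0.9]     # an unrelated sentence

related = cosine_similarity(vpn_password, network_login)
unrelated = cosine_similarity(vpn_password, lunch_menu)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

With these toy values, the two login-related sentences score much higher than the unrelated pair, which is the behavior we expect from a good embedding model.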

## Why embeddings matter for everyday AI

Embeddings power many features you use without noticing:

- **Semantic search:** Instead of requiring the exact words you typed, the system finds documents that talk about the same topic.
- **Retrieval‑augmented generation (RAG):** When you ask a question about internal policies, the assistant first looks for the most relevant pieces of text (using embeddings), then feeds those pieces to a language model to produce a grounded answer.
- **Clustering and recommendation:** You can automatically group similar customer feedback, support tickets, or product reviews by meaning, which helps spot trends without manual tagging.

In short, embeddings let the AI work with meaning, not just spelling.
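To make the clustering idea concrete, here is a small sketch that groups feedback by similarity. It uses a simple greedy threshold rule, which is our own simplification and not a production clustering algorithm. The vectors and the `0.8` threshold are hypothetical values chosen for illustration.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_clusters(vectors: Dict[str, List[float]], threshold: float = 0.8) -> List[List[str]]:
    """Put each item into the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for item_id, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(item_id)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([item_id])
    return clusters


# Hypothetical toy vectors standing in for real embeddings.
feedback_vectors = {
    "app crashes on login": [0.9, 0.1, 0.0],
    "cannot sign in": [0.8, 0.2, 0.1],
    "love the new dark theme": [0.1, 0.9, 0.2],
    "dark mode looks great": [0.0, 0.8, 0.3],
}
clusters = greedy_clusters(feedback_vectors)
print(clusters)
```

The login complaints end up in one group and the dark-theme comments in another, even though the texts share almost no words.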

## Example: giving a chatbot knowledge with the Regolo API

Embeddings give a chatbot “memory” of your knowledge base by turning your documents into vectors, saving them in a vector database, and then using those vectors to fetch the right pieces of information every time a user asks a question. At runtime, we do not ask the LLM to “remember everything”, but to answer based on the few, highly relevant chunks that we retrieve via embeddings.

## From documents to vectors

We start from raw content: PDFs, policies, FAQs, tickets, product docs, whatever you want the chatbot to know. We split this content into small chunks (for example 300–800 tokens), because smaller pieces are easier to retrieve precisely. For each chunk, we call the embeddings API (for example with the `gte-Qwen2` model) and we obtain a vector: a list of numbers that represents the meaning of that chunk.
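A minimal chunking sketch could look like the following. It counts words as a rough proxy for tokens, and the overlap between consecutive chunks is our own assumption (a common way to avoid cutting a sentence's context in half). The function name and default values are illustrative.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (words are only a rough proxy for tokens)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(450))  # stand-in for a real document
chunks = split_into_chunks(document, max_words=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

In a real pipeline you would probably split on headings or paragraphs first and measure length with the tokenizer of your embedding model.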

We then store three things together in a vector database: the original text chunk, its metadata (source, title, URL, tags), and its embedding vector. The vector database is what lets us later ask “which stored vectors are closest to this new query vector?”. This is the core of “knowledge retrieval” for the chatbot.
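To show what that question means in practice, here is a toy in-memory stand-in for a vector database. The class and method names (`InMemoryVectorStore`, `upsert`, `search`) are hypothetical and do not correspond to any specific product. It compares the query with every stored vector, which is fine for a demo but not for large collections.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    id: str
    vector: List[float]
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def _cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy stand-in for a vector database: brute-force cosine search."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def upsert(self, records: List[Record]) -> None:
        for record in records:
            self._records[record.id] = record  # same id overwrites the old entry

    def search(self, query_vector: List[float], top_k: int = 5) -> List[Tuple[float, Record]]:
        scored = [(_cosine(query_vector, r.vector), r) for r in self._records.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.upsert([
    # Hypothetical two-dimensional vectors, not real embeddings.
    Record("policy-001", [1.0, 0.0], "Remote work policy", {"source": "HR policy"}),
    Record("policy-002", [0.0, 1.0], "Support hours", {"source": "IT policy"}),
])
best_score, best_record = store.search([0.9, 0.1], top_k=1)[0]
print(best_record.id, round(best_score, 3))
```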

## Indexing documents with the Regolo embeddings API

```python
import requests
from typing import List, Dict

API_KEY = "YOUR_REGOLO_API_KEY"
EMBEDDINGS_URL = "https://api.regolo.ai/v1/embeddings"
MODEL = "gte-Qwen2"

def embed_texts(texts: List[str]) -> List[List[float]]:
    resp = requests.post(
        EMBEDDINGS_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "model": MODEL,
            "input": texts,
        },
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    return [item["embedding"] for item in data]

# Example: chunks to index (in real life you will generate them by splitting documents)
chunks = [
    {
        "id": "policy-001",
        "text": "Employees can request remote work up to 3 days per week.",
        "metadata": {"source": "HR policy", "section": "remote_work"}
    },
    {
        "id": "policy-002",
        "text": "Support is available 24/7 via the internal ticketing system.",
        "metadata": {"source": "IT policy", "section": "support_hours"}
    },
]

embeddings = embed_texts([c["text"] for c in chunks])

# Here you would upsert into your vector database of choice
# (e.g. Pinecone, Qdrant, Weaviate, pgvector, etc.)
records_for_vector_db: List[Dict] = []
for chunk, vec in zip(chunks, embeddings):
    records_for_vector_db.append({
        "id": chunk["id"],
        "vector": vec,
        "text": chunk["text"],
        "metadata": chunk["metadata"],
    })

# pseudo-code: vector_db.upsert(records_for_vector_db)
```

In this phase we are building the chatbot’s knowledge store. This is an offline or periodic process (for example every time documents change).
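One way to keep periodic re-indexing cheap is to embed only the chunks whose text actually changed. The sketch below assumes we saved a content hash per chunk id during the previous run; the function names and the idea of a stored hash map are our own assumptions, not part of the Regolo API.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_chunks: Dict[str, str],
    indexed_hashes: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Compare current chunk texts with the hashes saved at the last indexing run."""
    # New or modified chunks need a fresh embedding.
    to_embed = [
        chunk_id for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)
    ]
    # Chunks that no longer exist should be removed from the vector database.
    to_delete = [chunk_id for chunk_id in indexed_hashes if chunk_id not in current_chunks]
    return to_embed, to_delete


indexed_hashes = {
    "policy-001": content_hash("Remote work up to 3 days per week."),
    "policy-002": content_hash("Support is available 24/7."),
    "policy-003": content_hash("Old parking rules."),
}
current_chunks = {
    "policy-001": "Remote work up to 3 days per week.",     # unchanged
    "policy-002": "Support is available on business days.",  # modified
    "policy-004": "New travel policy.",                      # added
}
to_embed, to_delete = plan_reindex(current_chunks, indexed_hashes)
print(to_embed, to_delete)
```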

## How the chatbot uses embeddings at runtime

When a user asks a question, we do almost the same thing, but with their query instead of a document chunk:

1. We send the user question to the embeddings API and get the query embedding.
2. We ask the vector database: “give me the top N chunks whose embedding is closest to this query embedding”.
3. We take those chunks (their text and metadata) and build a prompt for the LLM, telling it: “Answer using only the information below, and say you don't know if the answer is not there”.
4. The LLM generates the final answer, grounded in the retrieved chunks.

This is how embeddings “provide knowledge” to the chatbot: not by changing the LLM itself, but by selecting which pieces of your data we feed to the model for each question.

## Example: query + retrieval + chat completion

```python
import requests

API_KEY = "YOUR_REGOLO_API_KEY"
EMBEDDINGS_URL = "https://api.regolo.ai/v1/embeddings"
CHAT_URL = "https://api.regolo.ai/v1/chat/completions"
EMBEDDING_MODEL = "gte-Qwen2"
CHAT_MODEL = "YOUR_CHAT_MODEL_ID"  # e.g. qwen3.5-9b or similar

def get_embedding(text: str):
    resp = requests.post(
        EMBEDDINGS_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "model": EMBEDDING_MODEL,
            "input": text,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def retrieve_from_vector_db(query_vector, top_k=5):
    # This is pseudo-code: replace with your actual vector DB call
    # return vector_db.search(vector=query_vector, top_k=top_k)
    raise NotImplementedError

def answer_with_llm(question: str, context_chunks):
    context_text = "\n\n".join(
        f"Source: {c['metadata'].get('source', 'unknown')}\n{c['text']}"
        for c in context_chunks
    )
    system_prompt = (
        "You are a helpful assistant. Answer the user's question using ONLY "
        "the information in the CONTEXT. If the answer is not in the context, say you don't know.\n\n"
        f"CONTEXT:\n{context_text}"
    )

    resp = requests.post(
        CHAT_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "model": CHAT_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def chatbot(question: str):
    query_vec = get_embedding(question)
    context_chunks = retrieve_from_vector_db(query_vec, top_k=4)
    return answer_with_llm(question, context_chunks)
```

In a real system, `retrieve_from_vector_db` will call your vector database (for example via HTTP or a client library), perform a similarity search with the query embedding, and return the best matches with their stored text and metadata.

## Why the vector database is essential

Without a vector database we would have to compare the query vector with every document vector, one by one, on each request. That works for a small dataset but gets slower as the collection grows. A vector database is optimized to:

- Store millions (or billions) of vectors.
- Use approximate nearest neighbor search to find the closest vectors quickly.
- Attach metadata and filters (for example: “only HR policies in Italian”).

So the pipeline is always:

1. We embed content once (or when it changes) and save it in the vector DB.
2. On each question, we embed the query and search the vector DB.
3. We pass only the most relevant chunks to the LLM, which then answers.

This separation keeps the LLM “stateless” with respect to your knowledge, and moves all your custom knowledge into the retrieval layer driven by embeddings.
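The metadata filtering mentioned in the list above can be sketched as follows: we first keep only the records whose metadata matches, then rank the survivors by similarity. The record layout follows the indexing example, while the `language` key, the function name, and the toy vectors are assumptions made for illustration. Real vector databases expose their own filter syntax.

```python
import math
from typing import Any, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def filtered_search(
    records: List[Dict[str, Any]],
    query_vector: List[float],
    filters: Dict[str, str],
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    """Keep records whose metadata matches every filter, then rank them by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return candidates[:top_k]


# Hypothetical records with toy vectors.
records = [
    {"id": "hr-it", "vector": [0.9, 0.1], "text": "...",
     "metadata": {"source": "HR policy", "language": "it"}},
    {"id": "hr-en", "vector": [1.0, 0.0], "text": "...",
     "metadata": {"source": "HR policy", "language": "en"}},
    {"id": "it-it", "vector": [0.95, 0.05], "text": "...",
     "metadata": {"source": "IT policy", "language": "it"}},
]
hits = filtered_search(records, [1.0, 0.0], {"source": "HR policy", "language": "it"})
print([h["id"] for h in hits])
```

Note that `hr-en` is the closest vector overall, but the filter excludes it, so only the Italian HR record is returned.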

---

## FAQ

**FAQ: Using Embeddings to Power a Chatbot’s Knowledge Base with Regolo**

**1. What does it mean to give a chatbot “knowledge” using embeddings?**
Embeddings turn pieces of text (documents, policies, FAQs) into numerical vectors that capture their meaning. By storing those vectors in a vector database and retrieving the most relevant ones for each user question, the chatbot can answer based on your actual data instead of relying only on the LLM’s training.

**2. Why can’t we just put all our documents into the LLM’s prompt?**
LLMs have a limited context window. Its size depends on the model, but a large knowledge base can easily exceed it, and long prompts cost more tokens and add latency even when they fit. Retrieval lets us send only the few most relevant chunks, keeping the prompt small and the answer focused.
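One simple way to keep the prompt within a size limit is to add retrieved chunks in ranked order until a token budget is used up. The sketch below estimates tokens as roughly four characters each, which is only a rule of thumb for English text and an assumption on our side; for exact counts, use the tokenizer of your chat model.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption, English text)."""
    return max(1, len(text) // 4)


def select_within_budget(ranked_chunks: List[str], max_tokens: int) -> List[str]:
    """Keep the highest-ranked chunks whose estimated total size fits the budget."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]  # about 100 estimated tokens each
selected = select_within_budget(ranked_chunks, max_tokens=250)
print(len(selected))
```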

**3. How do we create the embeddings for our documents?**
We split the documents into small chunks (e.g., 300–800 tokens) and call the Regolo embeddings endpoint with a model such as `gte-Qwen2`. The API returns a vector for each chunk, which we then store together with the original text and any metadata (source, tags, etc.) in a vector database.

**4. Which vector database should we use?**
Any vector database that supports similarity search works—common choices include Qdrant, Pinecone, Weaviate, Milvus, or pgvector (PostgreSQL). The key is that it can store millions of vectors and return the nearest neighbors to a query vector quickly.

**5. Do we need to re‑embed the documents every time they change?**
Yes, but only the parts that changed. When a source document is added or updated, re‑embed the affected chunks and upsert them into the vector database. When a document is deleted, remove its chunks from the vector database. This keeps the chatbot’s knowledge current.

**6. What happens at runtime when a user asks a question?**

1. We embed the user query with the same embedding model.
2. We ask the vector database for the top‑N chunks whose vectors are closest to the query vector.
3. We take those chunks (their text and metadata) and build a prompt that tells the LLM to answer using only that information.
4. The LLM generates the final answer, grounded in your data.

**7. Can we improve the relevance of the retrieved chunks?**
Absolutely. A common two‑stage approach is:

- **First stage:** Use embeddings to recall a broad set of candidates (high recall).
- **Second stage:** Pass those candidates to a reranker model (e.g., `Qwen3‑Reranker‑4B`) that scores them more precisely for the specific query, improving precision before the final answer is generated.
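The second stage can be sketched like this. The candidates are assumed to come from the embedding search, and `word_overlap_score` is a deliberately naive placeholder for a real reranker model: in practice you would replace it with a call to the reranker, whose request format we do not show here.

```python
from typing import Callable, List


def word_overlap_score(query: str, text: str) -> float:
    """Hypothetical placeholder for a reranker model: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    final_k: int = 4,
) -> List[str]:
    """Re-order first-stage candidates with a more precise scoring function."""
    return sorted(candidates, key=lambda text: score(query, text), reverse=True)[:final_k]


# Pretend these came back from the embedding search (first stage).
candidates = [
    "Support is available 24/7 via the internal ticketing system.",
    "Employees can request remote work up to 3 days per week.",
    "Office parking is assigned by the facilities team.",
]
top = rerank("remote work days", candidates, word_overlap_score, final_k=2)
print(top[0])
```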

**8. Do embeddings generate the answer themselves?**
No. Embeddings only produce vectors. The actual answer is generated by a language model (chat/completions model) that receives the retrieved context as part of its prompt.

**9. Is this approach GDPR‑friendly when using Regolo?**
Regolo hosts its models on European infrastructure, and you retain full control over the data you embed and store in your own vector database. That makes it easier to run the pipeline in a GDPR‑compliant way, but compliance still depends on how you handle personal data: you need a lawful basis for processing it, your vector database must be compliant, and you must be able to delete stored chunks and vectors on request.

**10. What are the main benefits of this architecture?**

- **Scalability:** Knowledge bases can grow to millions of documents without hitting LLM context limits.
- **Freshness:** New information is available to the chatbot as soon as the vector database is updated, with no change to the LLM itself.
- **Accuracy:** Answers are grounded in your verified sources, reducing hallucinations.
- **Cost‑effectiveness:** The LLM only processes a small amount of relevant text per query, lowering token usage and latency.

---

**Start your free 30-day trial at [regolo.ai](https://regolo.ai/) and deploy LLMs with complete privacy by design.**

👉 [Talk with our Engineers](https://regolo.ai/contacts/) or [Start your 30 days free →](https://regolo.ai/pricing)


*Built with ❤️ by the Regolo team. Questions? [regolo.ai/contact](https://regolo.ai/contact)* or chat with us on [Discord](https://discord.gg/ZzZvuR2y)