Google recently announced that it will add a new tab called AI Mode. The answers and results shown under this new tab are, according to Google itself, "based on our quality and ranking systems, and we also use novel approaches with the model's reasoning capabilities to improve veracity".

Google also gives us clues about some of its features, such as: "It uses a query distribution technique, which generates multiple related searches simultaneously on subtopics and multiple data sources, and then brings those results together to provide an easy-to-understand answer".
But what does this really mean? To better understand how a system like this might work, we need to talk about RAG (Retrieval-Augmented Generation).
What is a RAG system?
LLMs often lack sufficient information on certain topics, making them prone to hallucinations or inaccurate answers. For example, if we ask about current news or our company's new products, the model won't have this information and its answer will likely be inaccurate or even invented.
A RAG system gives LLMs access to up-to-date and/or specific data, for example by basing the answer on specific documents, thus minimizing the risk of hallucinations or inaccurate answers.
Components of RAG Systems
A RAG system has two components or subsystems: an information retrieval system and an LLM.
The retrieval system
This system is responsible for extracting relevant pieces of information from a knowledge base in response to a given query.
This knowledge base could be a set of specific documents (for example, your company's PDFs), a database, or a knowledge graph.
LLM
The role of the LLM is to receive the query and the information returned by the retrieval system as input, and to generate an answer based on those results.
RAG systems follow some key steps (a minimal code sketch follows this list):

- Indexing the corpus (or knowledge base). In this process, data is converted to embeddings that are stored in a vector database.
- When a user makes a query, the retrieval system searches for relevant documents based on the query.
- The user's query, along with the returned results, is used to create a prompt.
- The prompt is used to instruct the LLM to generate a final answer.
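To make these steps concrete, here is a minimal sketch in Python using the sentence-transformers library. The corpus, model name, and prompt template are illustrative assumptions, not a prescription; in production the embeddings would live in a vector database rather than in memory.

```python
from sentence_transformers import SentenceTransformer, util

# 1. Index the corpus: convert each document to an embedding.
#    (Kept in memory here; a real system would store these in a vector DB.)
corpus = [
    "Our new product X launches in June with feature Y.",
    "The company was founded in 2010.",
]
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# 2. Retrieve: embed the user's query and find the most similar documents.
query = "When does product X launch?"
query_embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
retrieved = [corpus[hit["corpus_id"]] for hit in hits]

# 3. Build a prompt combining the query with the retrieved context.
prompt = (
    "Answer the question using only the context below.\n"
    "Context:\n" + "\n".join(retrieved) + f"\nQuestion: {query}"
)

# 4. Send the prompt to an LLM of your choice to generate the final answer
#    (call omitted; any chat/completion API works).
print(prompt)
```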
Different retrieval systems
Within each component of a RAG system, there are a variety of possibilities that we can use depending on the use case. In this article, we will focus mainly on the retrieval system.
Simple Similarity-Based Retrieval (Bi-Encoders)
Retrieval systems based on simple similarity (using Bi-Encoders) are a very efficient method for obtaining results from a knowledge base with many elements.
Bi-Encoder systems produce fixed-size embeddings regardless of the length of the text they process. Computing these embeddings does not require many resources, so we can search over millions of documents using cosine distance (or other vector metrics) without excessive latency.
These Bi-Encoder systems are neural models that transform the text of the question and the text of the answer into embeddings separately.

Texts with similar meanings will be close together in the vector space, and these embeddings capture general semantic relationships, but the results returned often do not match the expected answer. It is often said that similarity is not the same as relevance. As a typical example, the question "Why is the sky blue?" has a meaning very different from a good answer such as "The scattering of sunlight causes the blue color." The meaning of that answer is very dissimilar to the text of the query, something SEOs should keep in mind!
To perform a similarity-based search, we compare the embedding of the query with the embedding of each candidate answer, using vector metrics such as cosine distance or Euclidean distance.
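A quick way to see the similarity-vs-relevance gap for yourself is to score the example above with a Bi-Encoder. The model choice here is just an example (it is one of the models cited later in this article):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "Why is the sky blue?"
candidates = [
    "The scattering of sunlight causes the blue color.",  # an actual answer
    "Why does the sky look blue?",  # similar wording, but not an answer
]

# Encode query and candidates independently (the defining Bi-Encoder trait).
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate.
scores = util.cos_sim(query_emb, cand_embs)[0]
for text, score in zip(candidates, scores):
    print(f"{score:.3f}  {text}")
# The paraphrased question will typically outscore the real answer,
# illustrating that similarity is not the same as relevance.
```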
This method is suitable for tasks such as semantic similarity between two texts, simple semantic search, and clustering. It is also used as the first step of advanced (two-stage) information retrieval, where a Cross-Encoder model (also known as a re-ranker) re-orders the list of results obtained by the Bi-Encoder.
Common Bi-Encoder models include paraphrase-multilingual-MiniLM-L12-v2 and distiluse-base-multilingual-cased-v2 (see the comparison table below for these and other examples).
The advantage of Bi-Encoders over Cross-Encoders is scalability and efficiency when you have a large amount of data. Since the embeddings for the query and for the results are generated independently, document embeddings can be pre-computed and stored in a vector database, so searches can later be performed with low latency.
Generating the query and result embeddings independently also has disadvantages, such as a loss of precision for complex or ambiguous queries. This means Bi-Encoders may fall short for tasks that require deeper context analysis.
Neural Matching-Based Retrieval (Cross-Encoders)
These systems, unlike Bi-Encoders, calculate the similarity given a pair of texts, that is, they process both the query and the texts of the corpus at the same time.

This ability to process the query and the document at the same time allows them to capture relationships and semantic details much better and makes them better understand search intentions in ambiguous or complex queries.
The downside of these models is that they are slower and more resource-intensive, because they must run the calculation for each query-document pair. This prevents pre-processing and indexing: we would need the search query in advance, and we don't have it until the user actually searches.
Since we cannot pre-compute and index a score for every query-document pair, the computational cost grows with the size of the corpus. If you have to scan many records and score each query-document pair, it will take a long time or demand a lot of processing power; often it is simply not feasible, or not cost-effective compared with other methods.
For this reason, Cross-Encoder models are normally used to reorder (Re-Ranking) a limited number of results obtained from the retrieval using a Bi-Encoder model.
There are a variety of pre-trained Cross-Encoder models for different tasks; many of them, such as cross-encoder/ms-marco-MiniLM-L-6-v2, are trained on the MS MARCO dataset (see the comparison table below).
We can say that, in general, Cross-Encoders are useful for semantic similarity search or re-ranking a small number of documents.
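As a minimal sketch of this pairwise scoring, using one of the MS MARCO Cross-Encoders mentioned in this article (the texts are illustrative):

```python
from sentence_transformers import CrossEncoder

# A Cross-Encoder trained on MS MARCO; it scores (query, document) pairs.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Why is the sky blue?"
documents = [
    "The scattering of sunlight causes the blue color.",
    "Why does the sky look blue?",
]

# Unlike a Bi-Encoder, the query and each document are processed together,
# so nothing can be pre-computed: one forward pass per pair.
scores = model.predict([(query, doc) for doc in documents])
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
# A good re-ranker should score the actual answer above the lookalike
# question, unlike the pure similarity search shown earlier.
```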
Bi-Encoders vs. Cross-Encoders: Which is suitable for a RAG system?
As we have seen, Bi-Encoders are faster than Cross-Encoders, which makes them more scalable and very useful in the first stage of retrieval, especially when the number of documents/texts to search is very high.
Cross-Encoders have higher accuracy but require much more computation, making them much slower; this is the main reason Cross-Encoders are not recommended for the initial retrieval stage.
Comparative table between Bi-Encoders and Cross-Encoders:

| Feature | Bi-Encoders | Cross-Encoders |
| --- | --- | --- |
| Query and document processing | Processed independently with two separate encoders. | Processed simultaneously with a single encoder. |
| Speed | Fast due to independent encoding and ANN search. | Slow due to pairwise processing and lack of pre-indexing. |
| Accuracy | Generally lower than Cross-Encoders. | Generally higher than Bi-Encoders. |
| Scalability | Highly scalable due to embedding pre-computation. | Less scalable due to pairwise processing. |
| Embedding pre-computation | Possible for documents. | Not possible. |
| Capture of detailed interactions | Limited, as texts are processed separately. | High, as it considers the interaction between texts. |
| Common use cases in RAG | Initial retrieval, large-scale semantic search. | Re-ranking of retrieved results, high-precision tasks. |
| Example models | paraphrase-multilingual-MiniLM-L12-v2, distiluse-base-multilingual-cased-v2 | cross-encoder/ms-marco-MiniLM-L-6-v2, cross-encoder/stsb-roberta-base |
Ideally, you should use a hybrid method: a Bi-Encoder in the first stage that returns a limited number of candidates, and a Cross-Encoder in the second stage to re-order those candidates.
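A minimal sketch of this hybrid, two-stage setup, reusing the example models from the table above (corpus and query are illustrative):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "The scattering of sunlight causes the blue color of the sky.",
    "Rayleigh scattering affects short wavelengths the most.",
    "The sky at sunset often turns red or orange.",
    "Our company was founded in 2010.",
]
query = "Why is the sky blue?"

# Stage 1: Bi-Encoder retrieval over the whole corpus (fast, scalable).
bi_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=3)[0]
candidates = [corpus[hit["corpus_id"]] for hit in hits]

# Stage 2: Cross-Encoder re-ranking of the few candidates (slow, precise).
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # best candidate according to the re-ranker
```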
Symmetric vs. Asymmetric Search
Something we must take into account is whether our use case calls for a symmetric or an asymmetric search, since this determines which model we should use to obtain the embeddings and what the appropriate retrieval strategy will be.
Symmetric search
In symmetric search, the main characteristic is that the length of the query and the result are similar. For example, the query "How to learn SEO online?" could return results like "How to learn SEO on the internet?".
Symmetric search is useful for identifying duplicate titles or very similar paragraphs. A "trick" to know whether our search is symmetric is to ask whether swapping the query and the answer would alter the result and/or the semantic relationship between them.
Asymmetric search
In asymmetric search, the length of the query is usually much shorter than the length of the answer (for example, a paragraph).
This is the type of search to use for question-answering systems or chat, since given a short, specific query we want to find an answer that is usually more detailed.
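In practice, this distinction guides model choice. As an illustrative (not prescriptive) sketch, the sentence-transformers library ships models trained on question/answer pairs for asymmetric search, such as the multi-qa family:

```python
from sentence_transformers import SentenceTransformer, util

# A model trained for asymmetric question-to-passage search (example choice).
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

question = "How to learn SEO online?"
passages = [
    "There are many free courses, blogs and communities where you can study "
    "search engine optimization at your own pace from home.",
    "How to learn SEO on the internet?",  # symmetric lookalike, not an answer
]

q_emb = model.encode(question, convert_to_tensor=True)
p_embs = model.encode(passages, convert_to_tensor=True)
print(util.cos_sim(q_emb, p_embs))
```

For a purely symmetric task (deduplicating titles, clustering similar paragraphs), a paraphrase-style model like the ones shown earlier is usually the better fit.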
Two-stage search: Retrieval and Re-ranking
The most advanced semantic search systems today use a two-stage system:
Retrieval stage
The objective in this stage is to return a limited number of results (either documents or text fragments) that are potentially relevant to the query.
In this retrieval stage, various techniques are used to expand the scope or coverage. For example, keyword-matching search (algorithms like BM25) is combined with semantic search and augmented search to obtain a set of results, trying not to leave out potentially relevant documents.
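A hedged sketch of combining keyword matching with semantic search (BM25 via the rank_bm25 package; the naive union used here stands in for more sophisticated fusion schemes such as reciprocal rank fusion):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "BM25 is a classic keyword-matching ranking function.",
    "Semantic search uses embeddings to match by meaning.",
    "Hybrid retrieval combines lexical and semantic signals.",
]
query = "combine keyword and embedding search"

# Lexical leg: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
bm25_top = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:2]

# Semantic leg: Bi-Encoder embeddings plus cosine similarity.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_embs = model.encode(corpus, convert_to_tensor=True)
q_emb = model.encode(query, convert_to_tensor=True)
sem_top = [h["corpus_id"] for h in util.semantic_search(q_emb, doc_embs, top_k=2)[0]]

# Naive fusion: union of both candidate sets, to be re-ranked later.
candidates = sorted(set(bm25_top) | set(sem_top))
print([corpus[i] for i in candidates])
```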
This is also where a technique called query expansion or distribution comes in. Given an initial query, it uses various methods to obtain other equivalent queries that cover different conceptually related terms or contexts, and it sounds very similar to the technique Google says it uses for AI Mode!
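The general shape of query expansion is easy to sketch, even though we can only guess at Google's actual implementation. The generate_related_queries helper below is hypothetical; in a real system an LLM would be prompted to produce the sub-queries:

```python
from sentence_transformers import SentenceTransformer, util

def generate_related_queries(query: str) -> list[str]:
    # Hypothetical helper: in a real system, an LLM would be asked for
    # sub-queries covering related subtopics. Hard-coded here so the
    # sketch stays self-contained and runnable.
    return [
        query,
        "What causes the color of the sky?",
        "Rayleigh scattering explained simply",
    ]

corpus = [
    "Rayleigh scattering affects short (blue) wavelengths the most.",
    "Sunsets look red because sunlight travels through more atmosphere.",
    "The ocean often reflects the color of the sky.",
]
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_embs = model.encode(corpus, convert_to_tensor=True)

# Run every expanded query against the corpus and pool the results.
result_ids = set()
for q in generate_related_queries("Why is the sky blue?"):
    q_emb = model.encode(q, convert_to_tensor=True)
    for hit in util.semantic_search(q_emb, corpus_embs, top_k=1)[0]:
        result_ids.add(hit["corpus_id"])
print([corpus[i] for i in sorted(result_ids)])
```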
But Google surely uses more refined methods. For example, I recently wrote How Google selects the snippet that answers a question, where I explain the method it could be using, based on a recent Google patent. Here is a fragment of that article to give an approximate idea:
Embeddings and similarity of the snippet to the user's query
Once Google retrieves an initial set of relevant documents, for example the "Top 10,000", the next step is to find the exact snippet of text that best answers your query. To do this, Google appears to perform the following steps:
- Generation of Query Embeddings: First, Google converts your query into a vector of embeddings.
- Similarity Comparison: Then, it compares this vector with the embeddings of snippets from each of the retrieved documents to find the one that most closely resembles it.
- Consideration of Header Hierarchy: Google also takes into account the similarity/proximity of each candidate snippet to its hierarchically superior headers (H1, H2, H3, etc.). Possibly it "adds" the similarity between the candidate snippet and the concatenation of the Hx headers at each level immediately above it, up to the title.

Go to the article: How does Google select the snippet that answers a question?
But even with these query-improvement techniques, the results may not be as relevant as we would like. For example, in a question-answering system, a result may be similar to the question yet not actually answer it; in these cases we must remember that similarity is not the same as relevance.
Re-ranking stage
To improve the results obtained in the retrieval stage, a second reordering of the returned results (Re-Ranking) can be applied.
The candidates obtained in the retrieval stage are passed to a Cross-Encoder model along with the query; this generates a score between 0 and 1 that indicates the relevance of the document to the query.
The main advantage of using Cross-Encoders at this stage is their ability to pay attention to both the query and the document together. Cross-Encoder models capture contextual relationships and semantic details, something that Bi-Encoders might not do.
Other useful techniques for RAG
In addition to what we have seen, which are, let's say, the basic techniques of current RAG systems, there are other techniques that improve or complement them.
For example, how you split a document's text into chunks, paragraphs, or sentences is also vital to how well the system works (a naive chunking sketch follows below).
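As a naive baseline sketch of chunking (fixed-size word windows with overlap; real systems often split on sentence or paragraph boundaries instead, so each chunk stays semantically coherent):

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    # Naive fixed-size chunker with overlap. The overlap keeps context that
    # would otherwise be cut off at a chunk boundary.
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

document = "Retrieval quality depends heavily on how you chunk documents. " * 40
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk.split()), "words")
```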
Another technique, or rather a separate case, is integrating the RAG system with a knowledge graph, which lets the system understand connections and answer questions that require several hops or chains of thought.
Using an LLM to generate the answer
As the last step in the RAG system, we instruct the LLM to generate the answer based solely on the results returned by the retrieval system. Depending on our objective and the LLM, we craft an appropriate prompt so that it responds the way we want. We could even tell it to re-order the results!
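A hedged sketch of such a prompt (the wording is just an example to adapt to your own objective and model):

```python
query = "When does product X launch?"
retrieved = [
    "Product X launches in June.",
    "Product X will include feature Y.",
]

# Ground the model in the retrieved passages only.
context = "\n".join(f"- {passage}" for passage in retrieved)
prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {query}
Answer:"""

print(prompt)
# Send `prompt` to the LLM of your choice (any chat/completion API).
# You could even add an instruction asking it to re-order the passages
# by relevance before answering.
```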
In future articles, I will try to explain what nearest neighbor search is, approximate nearest neighbor indexes, and how these indexes are also crucial to taking a RAG system into production.