Matryoshka representations : A guide to faster semantic search

Ujjwal
9 min read · Feb 12, 2024


Use MRL-powered embeddings to get faster semantic search without sacrificing relevance!


A few days back, OpenAI released their new embedding models, text-embedding-3-small and text-embedding-3-large. These models boasted lower prices, better performance and impressive scores on the MTEB benchmark leaderboard on Hugging Face. While all of this was somewhat “ordinary” in OpenAI’s book of AI achievements, there was another small announcement that many people missed: these embedding models were trained using Matryoshka representation learning.

As per the OpenAI announcement, Matryoshka representation learning (MRL) enables developers to shorten embeddings without the embedding losing its concept-representing properties. Naturally, the shortening comes with some drop in accuracy.
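As an aside, the text-embedding-3 models expose this directly: the embeddings API accepts a dimensions parameter that returns an already-shortened (and re-normalized) vector. A minimal sketch:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# text-embedding-3 models accept a `dimensions` parameter and return
# a vector that is already truncated and re-normalized.
response = client.embeddings.create(
    input=["the godfather"],
    model="text-embedding-3-small",
    dimensions=256,
)
print(len(response.data[0].embedding))  # 256

In this guide we will instead slice the full embeddings ourselves, so each document only has to be embedded once at full length.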

Note : I have used the terms embeddings and vectors interchangeably.

Why is this important?

Adjusting the length of embeddings dynamically opens up a new world of possibilities. In this guide, we are going to implement an example that uses these new MRL-powered embeddings (text-embedding-3-small, specifically) to perform semantic search on a movies dataset.

How will we use these embeddings?

In simple semantic search, the query vector is compared against every vector in the index, and the top n scores are selected. To utilize MRL-powered embeddings, we will first generate full-length embeddings for the data and store them in an index. Additionally, we will slice these embeddings, normalize them, and store them in a second, shorter index. At search time, we first compare a shortened version of the query vector against the short index and keep the top 2n candidates. We then retrieve the original, full-length embeddings of those 2n candidates, compare them with the full-length query embedding, and select the top n results. In summary: the short embeddings identify a promising pool of candidates, and the high-dimensional embeddings pick the most relevant results from that pool.
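Here is a minimal numpy sketch of that two-phase flow (the function name, the unit-norm assumption on the index rows, and the 2n pool size are illustrative; the real implementation follows below):

import numpy as np

def two_phase_search(query_vec, short_index, full_index, n=10):
    # Assumes every row of short_index and full_index is L2-normalized,
    # as OpenAI embeddings are, so a dot product equals cosine similarity.

    # Phase 1: score the short index (e.g. N x 256) against a truncated,
    # re-normalized query, and keep a pool of 2n candidates.
    short_query = query_vec[:short_index.shape[1]]
    short_query = short_query / np.linalg.norm(short_query)
    candidates = np.argsort(short_index @ short_query)[::-1][:2 * n]

    # Phase 2: re-score only those 2n candidates with full-length vectors.
    scores = full_index[candidates] @ query_vec
    return candidates[np.argsort(scores)[::-1][:n]]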

To really see the impact of MRL, we will compare the performance of MRL-enabled search against non-MRL search using the same set of embeddings. We are going to use Python and libraries like numpy and pandas to help us out. No third-party tools to tamper with our metrics!

We will first use the OpenAI REST API to convert all of our movie data into embeddings. Then, we will set up two functions to measure non-MRL and MRL semantic search performance.

All the code is available here : https://github.com/ujjwalm29/movie-search/tree/master/level_4_1_openai_embeddings_MRL

Let’s get started.

Step 1 : Create embeddings from the raw data

Full Code : https://github.com/ujjwalm29/movie-search/blob/master/level_4_1_openai_embeddings_MRL/embeddings_util.py

The idea here is to create embeddings from the title, overview and genres of popular English movies and store them permanently in a .pkl file.

Creating embeddings using the OpenAI API has advantages and disadvantages. The advantage: you don’t need a powerful computer or any downloaded models to embed your data. The disadvantage: you pay to create the embeddings, and your application needs an internet connection to work. To follow along with this tutorial, you need an OpenAI API key. You can generate one using this guide.

The first thing we want to do is install the OpenAI Python library.

pip install openai

Then, we initialize our openai client.

import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("API_KEY"))

Next, create a function that makes the REST API call to get embeddings.

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
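A quick sanity check, since text-embedding-3-small returns 1536-dimensional vectors by default:

embedding = get_embedding("A mob drama about family and power")
print(len(embedding))  # 1536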

First, we will create a string by concatenating the title, overview and categories (genres) of the movie. This string will be used to create the embedding.

import json

def get_embedding_for_row(row):
    """
    Generate the embedding for a single row.
    """
    categories = ""
    col_val = json.loads(row['genres'].replace("'", '"'))
    for category in col_val:
        categories += f" {category['name']}"
    concatenated_text = str(row['title']) + " " + str(row['overview']) + " " + categories
    return get_embedding(concatenated_text)
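For example, here is how it behaves on a hypothetical row shaped like the movies dataset (the genres column stores a stringified list of dicts, which is why the code swaps quotes before json.loads):

import pandas as pd

# Hypothetical row for illustration only.
row = pd.Series({
    'title': 'Heat',
    'overview': 'A group of professional bank robbers start to feel the heat...',
    'genres': "[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}]",
})
vector = get_embedding_for_row(row)  # a 1536-dimensional list of floats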

To execute this process for 40K+ rows, we will use multithreading, which significantly speeds things up. Once the embeddings are created, they are appended to the original dataframe and stored in a file. If the file already exists, the function simply loads the dataframe from it and won’t attempt to create the embeddings from scratch.

import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd

def create_embeddings(df, file_name='df_with_openai_embeddings.pkl'):
    start_time = time.time()

    if os.path.isfile(file_name):
        df_with_embeddings = pd.read_pickle(file_name)
    else:
        # Use ThreadPoolExecutor to parallelize embedding generation
        with ThreadPoolExecutor(max_workers=10) as executor:
            # Submit all embedding generation tasks
            future_to_row = {executor.submit(get_embedding_for_row, row): index for index, row in df.iterrows()}

            embeddings = [None] * len(df)  # Initialize a list to hold embeddings
            for future in as_completed(future_to_row):
                index = future_to_row[future]
                print(index)
                try:
                    embedding = future.result()  # Obtain result
                    if embedding is not None:
                        embeddings[index] = embedding
                        # Update token counts here if needed
                except Exception as exc:
                    print(f"Row {index} generated an exception: {exc}")
                    embeddings[index] = None  # Handle failed requests if necessary

        df['embeddings'] = embeddings
        df.to_pickle(file_name)
        df_with_embeddings = df

    end_time = time.time()
    print(f"The operation took {end_time - start_time} seconds.")

    return df_with_embeddings
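Putting it together (the CSV name and its columns here are assumptions about the raw dataset; adjust to whatever your movie data looks like):

# Hypothetical raw file with 'title', 'overview' and 'genres' columns.
df = pd.read_csv("movies_metadata.csv", low_memory=False)
df_with_embeddings = create_embeddings(df)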

We indexed 45K+ movies, which should be a decent sample size for our test. All of this was done for less than 8 cents. Pretty cool.
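That figure checks out on the back of an envelope: text-embedding-3-small costs $0.02 per million input tokens, so if a title-plus-overview-plus-genres string averages roughly 80 tokens (an assumption), 45,000 movies come to about 3.6 million tokens, or roughly 7 cents.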

Step 2 : Search and Rank

Let’s get into the juicy part.

First, we need to get our data from the file and filter out the movies for which embeddings don’t exist.

import os

import pandas as pd

def get_dataframe_from_file(file_name: str):
    if os.path.isfile(file_name):
        # Load DataFrame from the file
        df_with_embeddings = pd.read_pickle(file_name)
    else:
        print("File does not exist")
        return

    # Filter rows where 'embeddings' is not None
    final_df = df_with_embeddings[df_with_embeddings['embeddings'].notnull()]

    return final_df

Next, we will implement semantic search using full-length embeddings. The function takes a search query and a dataframe as input. The idea is to create the embedding representation of the search query and compare it against ALL of the movie embeddings we created, using cosine_similarity as our scoring mechanism. There is also a bit of converting to numpy arrays and reshaping along the way.

import time

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity  # assuming scikit-learn's cosine_similarity

def search_and_rank_without_MRL(query, df):
    print(f"Search query : {query}")
    # Generate the embedding for the query
    query_embedding = get_embedding(query)
    query_embedding_arr = np.array(query_embedding)

    # Convert the 'embeddings' column back to a numpy array
    df_embeddings = np.array(df['embeddings'].tolist())

    start_time = time.time()
    # Calculate similarities (cosine similarity)
    similarities = cosine_similarity(query_embedding_arr.reshape(1, -1), df_embeddings).flatten()

    # Get the indices of the top 10 most similar entries
    top_indices = np.argsort(similarities)[::-1][:10]

    # Get the titles of the top 10 results
    top_titles = df.iloc[top_indices]['title']

    # End time
    end_time = time.time()

    # Calculate duration
    duration = end_time - start_time

    print(f"The operation took {duration} seconds.")

    return top_titles.tolist()

Before we move on to search and rank using short embeddings, we need some utility functions to create short embeddings from the longer ones. We take the long embeddings, chop them off at 256 dimensions, and normalize them.

def normalize_l2(x):
    x = np.array(x)
    if x.ndim == 1:
        norm = np.linalg.norm(x)
        if norm == 0:
            return x
        return x / norm
    else:
        norm = np.linalg.norm(x, 2, axis=1, keepdims=True)
        return np.where(norm == 0, x, x / norm)


def get_short_embeddings(df_w_embeddings):
    df_w_embeddings['normalized_embeddings'] = df_w_embeddings['embeddings'].apply(lambda x: normalize_l2(x[:256]))
    return df_w_embeddings

The normalize function is taken from the OpenAI Cookbook. Shoutout to them for providing examples of shortening embeddings locally instead of asking users to send another (paid) API request.
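Why re-normalize after slicing? OpenAI embeddings come back unit-length, but a 256-dimension slice of a unit vector is no longer unit-length. Re-normalizing restores that convention, which keeps scores comparable across vectors and lets a plain dot product double as cosine similarity. A quick sanity sketch:

full = np.array(get_embedding("the godfather"))
print(np.linalg.norm(full))                 # ~1.0: OpenAI embeddings are unit-length

short = full[:256]
print(np.linalg.norm(short))                # < 1.0 after truncation

print(np.linalg.norm(normalize_l2(short)))  # ~1.0 again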

Next, let’s implement the search and rank function using MRL-powered embeddings. In this function, search happens in two phases. In the first phase, we compare a shortened query embedding against the normalized_embeddings column and get the top 20 results. Next, we compare the full-length query embedding against the original embeddings of the 20 results obtained in the previous step. This gives us the top 10 most relevant results.

def search_and_rank_with_MRL(query, df):
    print(f"Search query : {query}")
    # Generate the embedding for the query
    query_embedding = get_embedding(query)
    query_embedding_arr_short = np.array(normalize_l2(query_embedding[:256]))

    # Phase 1
    # Convert the 'normalized_embeddings' column back to a numpy array
    df_embeddings = np.array(df['normalized_embeddings'].tolist())

    start_time = time.time()
    # Calculate similarities (cosine similarity)
    similarities = cosine_similarity(query_embedding_arr_short.reshape(1, -1), df_embeddings).flatten()

    # Get the indices of the top 20 most similar entries
    top_indices_unranked = np.argsort(similarities)[::-1][:20]

    # Phase 2
    top_unranked_embeddings = np.array(df.iloc[top_indices_unranked]['embeddings'].tolist())
    refined_similarities = cosine_similarity(np.array(query_embedding).reshape(1, -1), top_unranked_embeddings).flatten()

    # Get the indices of the top 10 most similar entries, based on refined similarities
    top_indices = top_indices_unranked[np.argsort(refined_similarities)[::-1][:10]]

    # Get the titles of the top 10 results
    top_titles = df.iloc[top_indices]['title']

    # End time
    end_time = time.time()

    # Calculate duration
    duration = end_time - start_time

    print(f"The operation took {duration} seconds.")

    return top_titles.tolist()

We are all set! Let’s see the results!

Step 3 : Execute
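For reference, a minimal driver along these lines produced the outputs below (the file name and query list are assumptions based on the steps above):

df = get_dataframe_from_file('df_with_openai_embeddings.pkl')
df = get_short_embeddings(df)

queries = ["the godfather", "godfather", "God father", "Man of Steel", "flying super hero"]

for query in queries:
    print("Results :", search_and_rank_without_MRL(query, df))

for query in queries:
    print("Results :", search_and_rank_with_MRL(query, df))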

Results for simple semantic search.

Search query : the godfather
The operation took 0.3428826332092285 seconds.
Results : ['The Godfather Trilogy: 1972-1990', 'The Godfather: Part III', 'The Godfather: Part II', 'The Godfather', 'The Last Godfather', 'Counselor at Crime', 'The New Godfathers', "Jane Austen's Mafia!", 'Pay or Die!', 'Big Deal on Madonna Street']

Search query : godfather
The operation took 0.23723793029785156 seconds.
Results : ['Counselor at Crime', 'The Last Godfather', 'The Godfather Trilogy: 1972-1990', 'The Godfather: Part III', 'The Godfather', 'House of Strangers', 'The Godfather: Part II', 'Street People', 'Gotti', "Jane Austen's Mafia!"]

Search query : God father
The operation took 0.2135179042816162 seconds.
Results : ['The Father and the Foreigner', 'God Willing', 'Just a Father', 'Oh, God!', "Father's Day", 'God Tussi Great Ho', 'Godfather', 'Walter', 'I Am Your Father', 'Varalaru']

Search query : Man of Steel
The operation took 0.22432422637939453 seconds.
Results : ['Man of Steel', 'The Mad Scientist', 'Superman II', 'Superman Returns', "It's A Bird, It's A Plane, It's Superman!", 'Batman v Superman: Dawn of Justice', 'Superman vs. The Elite', 'Atom Man vs Superman', 'Superman III', 'Superman IV: The Quest for Peace']

Search query : flying super hero
The operation took 0.24869990348815918 seconds.
Results : ['The Flying Man', 'A Flying Jatt', 'Phantom Boy', 'Sky High', 'Superhero Movie', 'Superheroes', 'American Hero', 'Hero at Large', 'Up, Up, and Away', 'Crumbs']

Results for MRL semantic search.

Search query : the godfather
The operation took 0.04521799087524414 seconds.
Results : ['The Godfather Trilogy: 1972-1990', 'The Godfather: Part III', 'The Godfather: Part II', 'The Godfather', 'The Last Godfather', 'Counselor at Crime', 'The New Godfathers', "Jane Austen's Mafia!", 'Pay or Die!', 'Big Deal on Madonna Street']

Search query : godfather
The operation took 0.04109597206115723 seconds.
Results : ['Counselor at Crime', 'The Last Godfather', 'The Godfather Trilogy: 1972-1990', 'The Godfather: Part III', 'The Godfather', 'The Godfather: Part II', 'Street People', 'Gotti', 'The Mobfathers', 'Padre Padrone']

Search query : God father
The operation took 0.03357100486755371 seconds.
Results : ['The Father and the Foreigner', 'God Willing', 'Just a Father', 'Oh, God!', 'Godfather', 'I Am Your Father', 'Varalaru', 'The Holy Man', 'For My Father', 'The Little Devil']

Search query : Man of Steel
The operation took 0.03288698196411133 seconds.
Results : ['Man of Steel', 'The Mad Scientist', 'Superman Returns', "It's A Bird, It's A Plane, It's Superman!", 'Batman v Superman: Dawn of Justice', 'Superman vs. The Elite', 'Superman IV: The Quest for Peace', 'Superman', 'Superman and the Mole-Men', 'Kryptonita']

Search query : flying super hero
The operation took 0.03297901153564453 seconds.
Results : ['The Flying Man', 'A Flying Jatt', 'Phantom Boy', 'Sky High', 'Superheroes', 'American Hero', 'Up, Up, and Away', 'Crumbs', 'Alter Egos', 'LEGO DC Comics Super Heroes: Justice League - Gotham City Breakout']

Woah!

Analysis

First, let’s talk about relevance. The top results in both searches are pretty much the same, and they look highly relevant to the queries we searched for. So both searches are certainly relevant!

Secondly, let’s talk about performance. The MRL-powered search takes almost 10x less time than the non-MRL search. This is the real power of Matryoshka representations! If you think about it, the MRL search_and_rank function has several extra steps and two cosine similarity computations. But the use of smaller vectors to filter candidates really pays off!
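A rough back-of-the-envelope explains the gap. Counting only the multiply-adds in the similarity scores, the full search computes 45,000 x 1536, or about 69 million operations per query, while the MRL search computes 45,000 x 256 plus 20 x 1536, or about 11.6 million, roughly a 6x reduction. The observed ~10x speedup plausibly also benefits from better cache behavior on the smaller matrix.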

Conclusion

As far as I can tell, these OpenAI embedding models are among the first to support shortening of embeddings. In my opinion, this is going to change a lot about how semantic search is implemented in applications today: no need to support multiple models for multiple levels of filtering. It would be exciting to see some open-source MRL models.

Hope you enjoyed the guide! Don’t forget to give it a like below! For corrections or feedback, please post a comment. Also, I am looking for a full-time job! If you or a connection is hiring a relevance engineer/backend engineer, shoot me an email at ujjwalm29[at]gmail.com or ping me on LinkedIn.

Edits:

  1. Previously, the movies were being filtered before embeddings were created. Now, all movies are part of the search.
  2. The time taken to search included the network call to get the search query embedding. The location of start_time has been corrected to exclude the network call.
