In this post, we’ll share how the Pavilion Engineering team deployed and evaluated an open source text embedding model in two weeks, along with some helpful tips and tricks we learned along the way.

Background

First, some background: Pavilion is a marketplace where government purchasing officials looking for things like street lights, safety equipment, and staffing services can find shareable contracts that let them buy what they need. The ability to effectively search through our corpus of over 100k contracts is critical to those buyers; it's hard to use contracts if you can't find them, after all. Around 6 months ago, we dramatically improved our search by switching from a keyword-based strategy to a semantic search strategy (a story for a future blog post). Making this switch required choosing and using an embedding model. We tried a few different models, including OpenAI's text-embedding-ada-002 and the open source multi-qa-mpnet-base-dot-v1. After some initial experimentation, we decided that OpenAI's model was good enough for our purposes at the time. It provided solid recall for our contract dataset, didn't require any extra effort to implement, and was more affordable than running our own infrastructure.

However, as it usually goes when the rubber meets the road, we encountered shortcomings with OpenAI's embeddings soon after widely adopting the model in production, most notably long response times and frequent outages.

Since search is critical to our government buyers, slowness and unreliability in a service underlying our core infrastructure quickly drew our attention. We couldn't truly rely on embeddings to power our search until they were at least as reliable as the rest of our services. So, we explored alternative embedding model options, and that's where this story begins.

Finding alternative models

While text-embedding-ada-002 performs respectably on the MTEB Leaderboard, there are open source alternatives that perform as well as or better on common benchmarks (ada-002 sits in 20th place as of this post). Knowing this, we decided it was worth investigating other models after our first pass with OpenAI’s model.

Our first step was selecting a few high-performing open source models. We were especially keen on models that excelled at retrieval and reranking, which are more indicative of success for our semantic search use-case than summarization or question-answering benchmarks. The bge and gte families of models seemed ideal, and from these we tested both base and large options. Since we're most commonly embedding short text, we wanted to understand if the larger models gave enough of an improvement in our search metrics to warrant the slight increase in latency that comes with them.

Deploying the models

After finding alternatives and deciding on a data set, we had to deploy the models to make them usable. To do this, we deployed a barebones Bottle server via a Docker image in a dedicated ECS service so we could minimize overhead and maximize GPU utilization while generating embeddings.

While building the Docker image and deploying the infrastructure, we encountered a few gotchas that you might see as well.

Building a Docker Image

The first step in deploying our own embedding infrastructure was crafting a Dockerfile to build the image that’s eventually deployed to Amazon ECS.

When deploying your own GPU-enabled infrastructure and building a Docker image that can utilize it, you'll first want to extend from a compatible base Docker image. Rather than using more common images like python:slim or ubuntu:23.10, we needed to use one of NVIDIA's CUDA images since AWS GPU instances use NVIDIA GPUs. For example, our Dockerfile is based on the nvidia/cuda base image:

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

Since this image doesn't come with python pre-installed, the next step is installing python and necessary dependencies. At the least, you'll want to install your preferred tools for generating embeddings in python (we like sentence-transformers and pytorch):

RUN apt-get update && apt-get -y install python3 python-is-python3 python3-dev build-essential python3-pip
RUN pip install sentence-transformers
RUN pip install torch --index-url https://download.pytorch.org/whl/cu118

Note: when installing pytorch, ensure the --index-url matches the CUDA version of the base image (cu118 for the CUDA 11.8 image above).

Next, add a step to pre-download your default model. This drastically reduces startup time when you need to scale up or redeploy; it's better to spend the download time once during the build than every time your service starts or serves its first request:

RUN python -c 'from sentence_transformers import SentenceTransformer; model = SentenceTransformer("$MODEL_NAME")'

Finally, you'll want to add your embedding server code and startup commands.
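
To make this concrete, here's a rough sketch of what that server code might look like using Bottle and sentence-transformers. This is illustrative rather than our production code, and the route, port, and default model name are assumptions:

# embed_server.py - minimal sketch of an embedding endpoint (illustrative; not our production code)
from bottle import Bottle, request, run
from sentence_transformers import SentenceTransformer

# Assumed default; use whichever model you pre-downloaded during the image build.
DEFAULT_MODEL = "thenlper/gte-large"

app = Bottle()
model = SentenceTransformer(DEFAULT_MODEL)  # loads from the local cache baked into the image

@app.post("/embed")
def embed():
    # Expects a JSON body like {"texts": ["..."]}
    texts = request.json.get("texts", [])
    # encode() batches internally and uses the GPU automatically when one is available
    vectors = model.encode(texts, normalize_embeddings=True)
    return {"embeddings": [v.tolist() for v in vectors]}

if __name__ == "__main__":
    run(app, host="0.0.0.0", port=8080)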

Deploying to ECS

Using ECS with GPU instances isn't as straightforward as using Fargate or non-GPU EC2 instance types. In addition to the standard setup, there are a few extra pieces. Most notably, GPU support has to be enabled in the ECS config on the container instances themselves:

#!/bin/bash
echo "ECS_CLUSTER=$YOUR_CLUSTER_NAME" >> /etc/ecs/ecs.config
echo "ECS_DISABLE_PRIVILEGED=true" >> /etc/ecs/ecs.config
echo "ECS_ENABLE_GPU_SUPPORT=true" >> /etc/ecs/ecs.config
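
The other GPU-specific piece is the task definition: the container has to explicitly reserve the GPU via resourceRequirements. As a rough sketch (family name, image URI, port, and memory below are illustrative placeholders, not our actual values), registering such a task definition with boto3 looks something like this:

# Sketch: an ECS task definition that reserves the instance GPU for the embedding container.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="embedding-server",            # hypothetical family name
    requiresCompatibilities=["EC2"],      # GPU tasks need the EC2 launch type, not Fargate
    containerDefinitions=[{
        "name": "embedding-server",
        "image": "<your-ecr-repo>/embedding-server:latest",  # placeholder image URI
        "memory": 8192,
        "portMappings": [{"containerPort": 8080}],
        # The GPU-specific part: reserve one GPU for this container.
        "resourceRequirements": [{"type": "GPU", "value": "1"}],
    }],
)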

Evaluating the models

Now that we could effectively run other models over our test dataset, it was time to evaluate them against OpenAI's text-embedding-ada-002 model. To be specific, we tested e5-large-v2, gte-large, bge-large-en-v1.5, and all-mpnet-base-v2. These models range from relatively small (all-mpnet-base-v2) to much larger (the -large variants), which let us weigh not only performance on our benchmark dataset but also the latency of generating the embeddings themselves. We might be willing to tolerate slightly worse benchmark performance in exchange for extremely fast embeddings, or vice versa.

After identifying the models we'd like to test against OpenAI's, we created a test set of 2000 queries and 3500 contracts over which we could compare the relevance of each model's top results for a given query. Since generic benchmarks aren't always great indicators for a particular use-case, building our own data set was essential to evaluating how each model would actually work for our case.
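
As a rough illustration of the kind of comparison we ran (simplified, and using hypothetical `queries` and `contracts` inputs rather than our actual data), each model can be scored on top-k hit rate and embedding latency along these lines:

# Simplified sketch of the evaluation loop; our real harness also queried
# OpenAI's API for the ada-002 comparison. `queries` maps query text to the set
# of relevant contract ids, `contracts` maps contract id to contract text.
import time
from sentence_transformers import SentenceTransformer, util

def evaluate(model_name, queries, contracts, k=10):
    model = SentenceTransformer(model_name)
    ids = list(contracts)
    corpus_emb = model.encode([contracts[i] for i in ids], convert_to_tensor=True)

    hits, latencies = 0, []
    for query, relevant in queries.items():
        start = time.perf_counter()
        query_emb = model.encode(query, convert_to_tensor=True)
        latencies.append(time.perf_counter() - start)

        # Rank contracts by cosine similarity and check the top k for a relevant one.
        scores = util.cos_sim(query_emb, corpus_emb)[0]
        top_k = [ids[i] for i in scores.topk(k).indices.tolist()]
        hits += any(cid in relevant for cid in top_k)

    return hits / len(queries), sum(latencies) / len(latencies)

# e.g. hit_rate, avg_latency = evaluate("thenlper/gte-large", queries, contracts)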

We then tested each model across the 2000 queries, tracking embedding latency and result relevance for each. Embedding generation latency remained fairly consistent at 25-50ms regardless of model size, and, most importantly, was a huge improvement over OpenAI's 250ms-2s response times. This let us focus squarely on result relevance. Our ranking ended up looking like this:

  1. gte-large
  2. bge-large-en-v1.5
  3. text-embedding-ada-002
  4. e5-large-v2
  5. all-mpnet-base-v2 (a pretty distant last place)

Interestingly, these results differ from the generic benchmarks, which is a great reminder of their limitations. Additionally, e5 and all-mpnet performed significantly worse than the top 3. With our benchmarks complete, we then performed some manual testing to determine the open source model that'd ultimately face off against OpenAI in a real-world A/B test. Our manual evaluation confirmed the benchmark result that gte-large performed slightly better than bge-large, so it moved forward to the faceoff.

After waiting with bated breath for the A/B test to run its course, the results rolled in: gte-large improved our search clickthrough rate by ~7% and dropped end-to-end latency by 200ms (20%), in addition to being significantly more reliable than OpenAI's API!

Takeaways

In the end, we migrated completely away from OpenAI's embeddings API and have enjoyed a significantly faster, more stable, and more relevant search experience since. Last week, we were able to watch the goings-on at OpenAI, rooting for our friends on their team, with no fear of our search going down if they had trouble keeping the lights on – which was very reassuring. We also learned a few things along the way:

  1. Generic benchmarks like MTEB are a useful starting point, but an evaluation set built from your own queries and documents is what actually tells you which model fits your use-case.
  2. Open source embedding models can beat a hosted API on relevance, latency, and reliability, provided you're willing to run the infrastructure yourself.
  3. The deployment details matter: picking the right CUDA base image, matching pytorch to it, pre-downloading the model at build time, and enabling GPU support in ECS saved us a lot of debugging later.

Interested in building software to empower public servants and improve lives at scale by making government purchasing work better? We're hiring!