How to use OpenAI's embeddings to make expert chatbots

April 13, 2023

If you want to create an AI-powered tool that knows about a specific domain or body of content and can answer questions about it, you may be tempted to reach for GPT-3’s fine-tuning capabilities. But this could be an expensive, lengthy and impractical approach.

In this article, we will show step by step how to create a GPT-3-powered chatbot that can answer questions about any content source you want, without having to fine-tune the model.

First, let’s start with an interactive demo of what you can expect in terms of results. In this demo, we will show an AI bot that can answer questions about a specific Wikipedia article of your choice. We can enter any Wikipedia URL into the first input field, and then our AI bot will process that content and we can immediately start asking any questions about it in the second input field.

These questions can be specific, such as asking for a detail or number buried within the text of the article, or they can also be more broad or high-level, such as, “What is this article about?”

And again, we are not leveraging any fine-tuning capabilities from GPT-3 to achieve this. In fact, doing so could be prohibitively expensive and slow for such a dynamic use case as this one, where we are allowing you to enter any Wikipedia URL, even if it was created or updated just moments ago.

Let’s dive into how we built this.

First, let’s explore what we can achieve with prompt engineering alone, as we’ll use this in our solution later. With some basic prompt engineering, you can give GPT-3 a chunk of text to use as context and then ask it a question about that context, for which you can also specify the format. For example:

Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know."

Context: <chunk of text>

Q: <your question about that text>


This prompt was largely inspired by OpenAI’s own tutorial, and it does a good job of mitigating hallucinations or inaccurate answers and restricting GPT-3 to using the information provided as context.

This is great for cases where you have that specific context at hand when writing your prompt, but it doesn’t work so well if that context exceeds the maximum token budget of your prompt, for example if you want to ask a question about a long article, blog post, project documentation, an HR policies manual, etc.

Let’s say that you want to build an AI-powered chatbot that knows all about your company’s wiki, so employees can ask it questions instead of relying on your wiki’s search functionality or having to sift through long documents to find the answer they’re looking for. The entire content of your wiki would be prohibitively large to inject into a single prompt, so that’s not going to work.

You could consider fine-tuning GPT-3 with your company’s wiki content so you can ask questions about it. And yes, that may work, but it’s also the most complex, time-consuming and expensive solution to this problem, and potentially overkill.

This is the type of situation where OpenAI embeddings can shine, and we will use this wiki use case for the rest of this article as an example of what embeddings can achieve.

OpenAI Embeddings

Before talking about OpenAI’s take on embeddings, it’s useful to know about “word embeddings,” which are a key concept in machine learning and have been around for a long time. If you want to learn more about how they work and what they’re used for, there are many free, high-quality resources online. We recommend Google’s crash course, or AssemblyAI’s great introductory video, which will give you a comprehensive overview.

In this article, however, we will focus on OpenAI’s embeddings, which you can read more about in their official documentation. While OpenAI has supported embeddings for a while now, they announced their new and improved model for embeddings in December 2022, which greatly reduced costs and improved performance and accuracy compared to their prior offering, making embeddings even more compelling. You can learn more about model improvements at a glance in OpenAI’s announcement, including stronger performance, unification of capabilities, larger context, smaller embedding size and reduced price.

With our example of using embeddings in the context of company wikis, it’s interesting to consider that this approach is precisely what connected workspace platform Notion is planning to use for their own wiki AI search functionality, as stated in OpenAI’s “examples of the embeddings API in action.” This was not part of Notion’s initial suite of AI features recently announced, which instead were more focused on helping wiki users generate and refine content, but it’s nonetheless a “low hanging fruit” feature for them to add, and it may be coming next.

OpenAI embeddings are not traditional word embeddings, but rather more generalized “text embeddings,” a higher level abstraction that allows you to transform an entire chunk of text into a vector of floating point numbers, as opposed to just a word.

Unlike traditional word embeddings, which represent each word as a fixed vector in a high-dimensional space, OpenAI's embeddings are contextualized, meaning that the representation of each word depends on the surrounding context in which it appears.

Moreover, because OpenAI is leveraging state-of-the-art general language models, their embeddings do an incredible job of deriving the intent or the essential meaning behind the words you use. It’s so powerful, in fact, that you can even compare embeddings of texts written in different languages and still be able to determine how related they are to each other conceptually.

This mathematical representation of text then allows us to perform useful operations on these vectors, such as calculating the Euclidean distance or cosine similarity between them to determine how related they are to each other.

Did you know?

Euclidean distance vs cosine similarity:

When calculating the similarity between two embeddings, cosine similarity and Euclidean distance are two commonly used measures, but they have different properties and are suited for different types of applications:

Cosine similarity measures the cosine of the angle between two vectors in a high-dimensional space. It ranges from -1 (the vectors point in opposite directions) to 1 (they point in the same direction), with 0 indicating that they are orthogonal, i.e., unrelated. Cosine similarity is often used in text-based applications, where the goal is to measure the similarity of two pieces of text based on their content, regardless of their length. It is particularly useful in such cases because it depends only on the direction of the vectors, not on their magnitude.

Euclidean distance, on the other hand, measures the distance between two vectors in a high-dimensional space. It is calculated as the square root of the sum of squared differences between the corresponding elements of the two vectors. Euclidean distance ranges from 0 (perfectly similar) to infinity (perfectly dissimilar), with larger values indicating greater dissimilarity. Euclidean distance is often used in image or signal processing applications, where the goal is to compare two signals or images based on their pixel-by-pixel differences.
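
To make the difference concrete, here is a minimal sketch in TypeScript. The helper names (`dot`, `norm`, `cosine`, `euclidean`, `normalize`) are ours and not part of any library. It shows that scaling a vector changes its Euclidean distance to another vector but not its cosine similarity, and that for vectors of length 1 the two measures are tied together by distance² = 2 − 2 × similarity, so they rank neighbors identically. OpenAI’s embeddings documentation describes its vectors as normalized to length 1, in which case either measure gives the same ranking.

```typescript
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);
const norm = (a: number[]): number => Math.sqrt(dot(a, a));
const cosine = (a: number[], b: number[]): number =>
  dot(a, b) / (norm(a) * norm(b));
const euclidean = (a: number[], b: number[]): number =>
  Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
const normalize = (a: number[]): number[] => a.map((x) => x / norm(a));

const vecA = [1, 2, 3];
const vecB = [2, 1, 0];
const vecAScaled = vecA.map((x) => x * 10); // same direction, 10x the magnitude

// Cosine similarity ignores magnitude; Euclidean distance does not.
const cosineUnchanged =
  Math.abs(cosine(vecA, vecB) - cosine(vecAScaled, vecB)) < 1e-12;
const euclideanChanged = euclidean(vecAScaled, vecB) > euclidean(vecA, vecB);

// For unit-length vectors: distance^2 = 2 - 2 * cosine similarity.
const unitA = normalize(vecA);
const unitB = normalize(vecB);
const identityGap = Math.abs(
  euclidean(unitA, unitB) ** 2 - (2 - 2 * cosine(unitA, unitB))
);
```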

So, following the same example, how can embeddings help us make a chatbot for your company’s wiki? Our method will consist of the following steps:

  1. Prepare our contextual text data (all of your wiki pages, in this use case) and transform all of it into OpenAI embeddings (more on that below), as many as needed to ensure each complies with our desired size, based on token limitations.
  2. When a user asks a question related to that context, we will transform the question itself into an OpenAI embedding as well. From there, we find the contextual embedding generated in step one that is most related to the question.
  3. We will then inject the original text that we used to create that particular embedding into our prompt and ask GPT-3 to use that context to answer the user’s question, leveraging OpenAI’s Completions endpoint.

With this approach, we can leverage an arbitrarily large amount of text as contextual information, making it “fit” into a single prompt by only using the bits and pieces of that contextual data that are relevant to the question at hand.

Implementing the Proof of Concept

All of the code shown below is written in TypeScript.

Preparing our contextual data

This step varies depending on your own data: where it’s stored, what format it’s in and how you can access it. Following our company wiki example, perhaps this step would require you to use an API from your wiki provider to retrieve the content, or you may download wiki pages as individual files and process them one at a time, or something along those lines.

For our wiki example here, we created individual files—in some cases with entire wiki pages, and in other cases we created one file per section of a wiki page—in order to ensure that we keep each file under 1,000 words.
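
We prepared our files per page or per section, but a similar split can be automated. The following is a minimal sketch, not the exact process we used: `countWords` and `splitIntoChunks` are hypothetical helpers that break text between paragraphs and keep each chunk under a word limit (1,000 by default, matching the example above).

```typescript
// Count whitespace-separated words.
function countWords(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Split text into chunks of at most `maxWords` words, breaking between
// paragraphs where possible. A single paragraph longer than the limit is
// hard-split on word boundaries.
function splitIntoChunks(text: string, maxWords = 1000): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let currentWords = 0;

  // Emit the paragraphs collected so far as one chunk.
  const flush = () => {
    if (current.length > 0) {
      chunks.push(current.join("\n\n"));
      current = [];
      currentWords = 0;
    }
  };

  for (const paragraph of text.split(/\n\s*\n/)) {
    const words = paragraph.split(/\s+/).filter(Boolean);
    if (words.length === 0) continue;

    if (words.length > maxWords) {
      // Oversized paragraph: emit what we have, then hard-split it.
      flush();
      for (let i = 0; i < words.length; i += maxWords) {
        chunks.push(words.slice(i, i + maxWords).join(" "));
      }
      continue;
    }

    // Start a new chunk if this paragraph would exceed the limit.
    if (currentWords + words.length > maxWords) flush();
    current.push(paragraph.trim());
    currentWords += words.length;
  }

  flush();
  return chunks;
}
```

Each returned chunk could then be written to its own file, or embedded directly.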

This size limit on words is just an example and you should explore what works best for you. Below are some factors you should consider when deciding on an input size limit.

The latest OpenAI Embeddings model supports up to 8,191 input tokens, and usage of this API is priced per input token, at a rate of $0.0004 per 1,000 tokens.

Did you know?

What’s the relationship between word counts and tokens?

In natural language processing (NLP), a token is a sequence of characters that the model treats as a single unit. A token can be a whole word, part of a word, a punctuation mark or another symbol. This is why token counts are typically larger than word counts.

For example, a short, common word is usually a single token, while a longer or rarer word may be broken into several sub-word pieces, each counted as its own token. Punctuation marks count as tokens as well.

Check out OpenAI’s documentation for more information on how regular words translate to tokens, but a reasonable approximation is that 100 tokens correspond to about 75 words.

In our case, our 1,000 words would translate to approximately 1,300 tokens.
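
The arithmetic behind these estimates can be sketched as follows. The function names are hypothetical, and the two constants are simply the rule of thumb and the price quoted in this article, so treat the results as rough estimates rather than exact token counts.

```typescript
const TOKENS_PER_WORD = 100 / 75; // rule of thumb: ~100 tokens per 75 words
const EMBEDDING_PRICE_PER_1K_TOKENS = 0.0004; // USD, price quoted in this article

// Rough token estimate for a given number of words.
function estimateTokens(wordCount: number): number {
  return Math.ceil(wordCount * TOKENS_PER_WORD);
}

// Rough cost in USD of embedding a text with the given number of words.
function estimateEmbeddingCost(wordCount: number): number {
  return (estimateTokens(wordCount) / 1000) * EMBEDDING_PRICE_PER_1K_TOKENS;
}

const tokensPerFile = estimateTokens(1000); // 1334
const costFor500Files = 500 * estimateEmbeddingCost(1000); // about 0.27 USD
```

In other words, embedding 500 files of 1,000 words each would cost well under a dollar at that price.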

With this in mind, you may be tempted to create very large embeddings, maybe even using the full input budget allowed. That won’t work so well for our purposes, since we plan to incorporate the text we used to create the embeddings into our GPT-3 prompt later on.

The GPT-3 family of models unfortunately does not have the same input size capabilities as the embeddings model. Even the latest GPT-3.5-Turbo model only supports up to 4,096 tokens.

Moreover, the maximum number of output tokens that the model can generate is directly tied to the size of the input provided. This is because the prompt and the generated answer share the same context window: every token used by the input is one fewer token available for the output.

So, the larger your input, the fewer output tokens the model can generate in response. Conversely, a smaller input leaves room for a longer answer.
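
A minimal sketch of that trade-off, assuming the prompt and the completion share a single 4,096-token window. `maxCompletionTokens` is a hypothetical helper, and the token counts in the example are estimates rather than measured values.

```typescript
// Assumption: prompt and completion share one context window of this size.
const CONTEXT_WINDOW_TOKENS = 4096;

// Upper bound on the tokens the model can still generate for a given prompt.
function maxCompletionTokens(
  promptTokens: number,
  contextWindow: number = CONTEXT_WINDOW_TOKENS
): number {
  return Math.max(0, contextWindow - promptTokens);
}

// A ~1,334-token context chunk plus ~60 tokens of instructions and question.
const roomForAnswer = maxCompletionTokens(1334 + 60); // 2702
// A prompt larger than the window leaves no room at all.
const noRoomLeft = maxCompletionTokens(5000); // 0
```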

Ultimately, this step will require some experimentation for best results, but in our case, limiting the contextual embedding to about 1,000 words worked great.

Creating the Embeddings

For our proof of concept, we chose to store our newly created embeddings in memory, which worked well enough. But for a more robust solution, we would recommend using a database or search engine with support for vectors, such as Pinecone, Weaviate or even Redis, which recently added support for vectors as well.

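import fs from "fs";
import path from "path";

// Shape of each stored embedding; the attributes are described below.
interface IEmbedding { file: string; vector: number[]; text: string }

const Embeddings: IEmbedding[] = [];

const loadContent = async () => {
  const files = await fs.promises.readdir(`./${CONTENT_FOLDER}`);
  await Promise.all(
    files.map(async (file) => {
      const filePath = path.join(`./${CONTENT_FOLDER}`, file);
      // Await the read so Promise.all waits for every embedding to be stored.
      const text = await fs.promises.readFile(filePath, "utf8");
      const vector = await createEmbedding(text);
      Embeddings.push({ file, vector, text });
    })
  );
};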

Here we are loading each of our pre-prepared content files and creating an OpenAI embedding from their content. To create the embedding, we are using OpenAI’s SDK, and our `createEmbedding` function looks like this:

// openaiClient is an OpenAIApi instance from the openai SDK, configured elsewhere.
async function createEmbedding(text: string) {
  const response = await openaiClient.createEmbedding({
    model: process.env.OPENAI_EMBEDDINGS_MODEL as string,
    input: text,
  });

  return response["data"]["data"][0]["embedding"];
}

This is just a basic proof of concept, and we would of course add proper error handling to make this code production-ready.

The embeddings model we are using here is the latest one available: text-embedding-ada-002.

After creating the embedding, we add it to an array. Each of our embeddings objects has the following attributes:

  • File: the name of the file that we used to create the embedding. At the moment, we are only using this for logging and debugging purposes.
  • Vector: this is an array of floating point numbers, which is the output generated by OpenAI for our input text when we create the embedding.
  • Text: the text content found in this file that we used to generate this vector.

You only need to follow this process once, but of course you could also create triggers in your application to ensure these embeddings are automatically updated whenever a new wiki page is created, modified or removed.
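
As a rough sketch of what such triggers could call, the two hypothetical helpers below keep an in-memory array in sync, keyed by file name. They are not part of our proof of concept. The `IEmbedding` shape mirrors the attributes listed above, and in practice the vector would come from `createEmbedding`.

```typescript
interface IEmbedding {
  file: string;
  vector: number[];
  text: string;
}

// Add a new embedding, or replace the existing one for the same file.
function upsertEmbedding(store: IEmbedding[], entry: IEmbedding): void {
  const index = store.findIndex((e) => e.file === entry.file);
  if (index === -1) {
    store.push(entry);
  } else {
    store[index] = entry;
  }
}

// Drop the embedding of a page that was deleted.
function removeEmbedding(store: IEmbedding[], file: string): void {
  const index = store.findIndex((e) => e.file === file);
  if (index !== -1) {
    store.splice(index, 1);
  }
}
```

A vector database would typically offer equivalent upsert and delete operations, so this code only matters for the in-memory approach.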

Finding the most relevant embedding

We will calculate the cosine similarity between two vectors to quantify how close or related they are to each other. If you are using a vector search engine or database, this operation may come built into that tool so you won’t need to write this code, but otherwise you can do it yourself. Our cosine similarity function looks like this:

function cosineSimilarity(vec1: number[], vec2: number[]) {
  if (vec1.length !== vec2.length) {
    throw new Error("Vectors must have the same length");
  }

  let dotProduct = 0;
  let magnitude1 = 0;
  let magnitude2 = 0;

  // Accumulate the dot product and the squared magnitudes in a single pass.
  for (let i = 0; i < vec1.length; i++) {
    dotProduct += vec1[i] * vec2[i];
    magnitude1 += Math.pow(vec1[i], 2);
    magnitude2 += Math.pow(vec2[i], 2);
  }

  magnitude1 = Math.sqrt(magnitude1);
  magnitude2 = Math.sqrt(magnitude2);

  return dotProduct / (magnitude1 * magnitude2);
}

Now, what we really want to do is find out which one of our contextual embeddings is most relevant to a question a user asks. To do that, we will convert the question itself into an embedding, calculate the cosine similarity between that new question embedding and all contextual embeddings, and then pick the most similar one. Here’s what that looks like in code:

const mostRelevantEmbedding = async (question: string) => {
  const questionVector = await createEmbedding(question);

  const mostSimilarEmbedding = Embeddings.reduce(
    (best, embedding) => {
      const similarity = cosineSimilarity(questionVector, embedding.vector);
      if (similarity > best.similarity) {
        return { embedding, similarity };
      } else {
        return best;
      }
    },
    { embedding: null, similarity: -Infinity } as {
      embedding: IEmbedding | null;
      similarity: number;
    }
  );

  // The logged value was cut off in the source; the file name is assumed here.
  console.log(
    `Most relevant embedding for '${question}' was found in ${
      mostSimilarEmbedding.embedding?.file
    }`
  );

  return mostSimilarEmbedding.embedding;
};

So, let’s say a user asks, “How many vacation days do I get at this company?” Provided we have a wiki page that explains that, we will get a console message like “Most relevant embedding for ‘how many vacation days do I get at this company?’ was found in”

All that is left is for us to pull the content found in that file, ask GPT-3 to analyze that content and use it to answer the user’s question.

Answering user questions

To answer user questions, we will use the same prompt we discussed at the beginning of this article. But now we’ll need a function that creates it dynamically:

function questionWithEmbedding(question: string, embedding: IEmbedding) {
  const prompt = `Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"
  Context: ${embedding.text}
  Q: ${question}`;

  return prompt;
}

Once we have our prompt ready, we can use the completions endpoint to submit our prompt to GPT-3 and get a response:

export const createCompletion = async (prompt: string) => {
  const response = await openaiClient
    .createCompletion({
      model: process.env.OPENAI_COMPLETIONS_MODEL as string,
      prompt: prompt,
      max_tokens: parseInt(process.env.OPENAI_COMPLETIONS_MAX_TOKENS as string),
      temperature: parseFloat(
        process.env.OPENAI_COMPLETIONS_TEMPERATURE as string
      ),
    })
    .catch(function (error: any) {
      // The handler body is missing in the source; logging is a placeholder.
      console.error(error);
      return undefined;
    });
  return response && response.data.choices[0].text;
};

Finally, here’s the function that we will use as our API to interact with our AI chatbot:

export const askQuestion = async (question: string) => {
  const embedding = (await mostRelevantEmbedding(question)) as IEmbedding;

  const prompt = questionWithEmbedding(question, embedding);

  return await createCompletion(prompt);
};

One more thing…

At this point, we already have the foundations for a chatbot capable of answering specific questions about a body of context that can arbitrarily exceed the input size limitations of GPT-3 models. This is pretty cool and useful already! But we can make it even more powerful still.

What about the case when someone has a question that is not necessarily about a specific detail, but rather about the big picture of the content? Even questions as simple as “What’s this wiki page about?” or “Summarize this wiki page” would start getting troublesome if the wiki page is large enough to require multiple embeddings, so we would not be able to submit the full context to GPT-3 to summarize it.

But there is something we can do to greatly mitigate this problem. We can create an additional “virtual” embedding that holds a summary of the entire wiki page or any large document we want, and that will be used to answer those big-picture questions. Here’s how we can do that:

  1. Let’s say we have split a very long wiki page into multiple sections or “chunks” to keep each one under our desired word count. For this example, let’s say all of those split chunks are stored in a folder, one file for each.
  2. We will ask GPT-3 to make a summary of each of those chunks.
  3. We will then concatenate each of those summaries.
  4. Lastly, we will make a new embedding from our new full summary, and this embedding will be selected as the most relevant one when appropriate.

async function createSummary() {
  const allSummaries: string[] = [];
  const files = await fs.promises.readdir(`./${YOUR_WIKI_PAGE_SECTIONS_FOLDER}`);
  await Promise.all(
    files.map(async (file, index) => {
      const filePath = path.join(`./${YOUR_WIKI_PAGE_SECTIONS_FOLDER}`, file);

      const text = await fs.promises.readFile(filePath, "utf8");

      const prompt = `Summarize the provided text in 1 or 2 paragraphs.
      Context: ${text}`;
      const sectionSummary = (await createCompletion(prompt)) as string;
      // Store by index so the summaries keep the order of the files.
      allSummaries[index] = sectionSummary;
    })
  );

  return allSummaries;
}

export const generateArticleSummary = async () => {
  const summaries = await createSummary();
  const summaryGroup = summaries.map((summary) => summary.trim()).join("\n\n");

  // The output file name is missing in the source; append one to this path.
  const filePath = `${YOUR_WIKI_PAGE_SECTIONS_FOLDER}/`;
  fs.writeFile(filePath, summaryGroup, (err) => {
    if (err) throw err;
    console.log(`The file: ${filePath} has been saved!`);
  });
};

How big you make each chunk summary will depend on the length of your wiki page or document, which will determine how big the full summary embedding can get. This is also a place to explore and calibrate as needed.

Note that you could also repeat this process as needed to make lower and lower resolution versions until your size limit is met. I.e., you can make a summary of the summary, further missing out on some details but making it more compact. Again, your mileage may vary, but we’ve had some good results using this method in our tests.
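
The “summary of the summary” idea can be sketched as a loop. This is an illustration under stated assumptions rather than the code we ran: `summarizeUntilFits` and the `Summarize` type are hypothetical, the summarizer is passed in as a function (in practice it could wrap `createCompletion` with the summarization prompt), and the sketch assumes each intermediate summary is small enough to fit in a single prompt.

```typescript
// Any function that turns a text into a shorter text.
type Summarize = (text: string) => Promise<string>;

const countSummaryWords = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

async function summarizeUntilFits(
  chunks: string[],
  summarize: Summarize,
  maxWords: number,
  maxRounds = 5
): Promise<string> {
  // Round 1: summarize every chunk and join the results in order.
  const summaries = await Promise.all(chunks.map((chunk) => summarize(chunk)));
  let combined = summaries.map((s) => s.trim()).join("\n\n");

  // Further rounds: summarize the summary until it fits or we give up.
  for (
    let round = 1;
    round < maxRounds && countSummaryWords(combined) > maxWords;
    round++
  ) {
    combined = (await summarize(combined)).trim();
  }

  return combined;
}
```

The `maxRounds` cap is there so that a summarizer that fails to shorten its input cannot loop forever; if the cap is reached, the result may still exceed `maxWords`.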

Features and Limitations

Some interesting features or beneficial side effects worth mentioning:

  • Users can ask questions in any language, and not only will our embeddings still be selected appropriately, but moreover, the answer to the question will be written in the question's own language.

    E.g., “What’s our vacations policy” -> you get an answer in English
    E.g., “¿Cuál es nuestra política sobre vacaciones?” -> our chatbot selects the right embedding, and then GPT-3 accurately answers the question in Spanish.
  • This approach doesn’t rely on exact word matches. For example, even if your time off policy doesn’t include the word “vacations” but uses “paid time off” instead, users that ask about vacations will still get the correct answer anyway. This is because these models dig down to the intent and meaning behind the words you use, which is precisely what makes this technology incredibly powerful compared to traditional search engines.

  • Continually adding context to the chatbot is easy, cheap and fast. In our wiki example, the instant someone adds a new wiki page, a simple webhook would suffice to ensure our bot can start answering questions about it.

Some limitations or negative side effects:

  • Extremely large documents may require a different approach for summarization. We already covered how some broader or big-picture questions can challenge this approach and have offered a workaround for that. But still, if your content is extremely long (like the entire Lord of the Rings trilogy) and you want to ask a question like “How does Frodo become the ring-bearer?” then it simply may not be possible to inject all the relevant context to answer that question into a single GPT-3 prompt, no matter how much we try to summarize each individual chapter of the books.
  • The quality of the answers is only as good as the quality of the context you provide to GPT-3. This means that if you are using this approach to create a Slack bot to answer employee questions about your HR manual, for example, you had better make sure that your HR manual is well written and up to date.
  • This chatbot doesn’t remember previous messages in the conversation. Unlike more advanced chat interfaces like ChatGPT, this chatbot won’t remember your previous messages, so follow-up prompts like “But what about X?” may be met with confusion.

Both GPT-3 and embedding models have inherent limitations that should be considered carefully as well. In this article we covered technical considerations such as input size, but there are other, more subtle side effects which may or may not apply to your use cases. For example, take this excerpt from OpenAI’s own limitations & risks documentation: “...We found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with Black women.”

Taking this one step further

Throughout this article we’ve used “your company’s wiki” as an example or vehicle to talk about OpenAI embeddings and what you can do with them. But the possibilities are much greater and more ambitious than the idea we discussed.

Again, following our example, you could make a full AI Wiki Assistant that not only provides a chat interface to answer any questions about wiki content, but can also help proactively in other ways:

  • It can alert you if the content of two or more wiki pages is redundant or incompatible, so you can review and decide what to do about it.
  • It can let you know if the language and tone used throughout is consistent or otherwise detect outliers for your review, or it could even attempt to do something about that automatically and already present you with its proposed modification.
  • You don’t just need to use OpenAI embeddings and GPT-3. You could also extend this assistant with traditional custom software development tools and business rules to organize the content, assign DRIs, rate the quality of each piece of content, provide recommendations to improve it, and in general make it an invaluable resource to maintain a healthy and high-quality wiki at your company.

What’s Next

We have a few more interesting ideas to explore what is possible with embeddings. We also want to share more about the situations in which fine-tuning works best and how to get the most out of that, so subscribe to get notified about future entries in this series!

And if you are interested in developing AI solutions, or enhancing tools or systems that you already have with AI features, our team can help you throughout the entire process—from ideation and discovery, to planning and implementation.
