Speed up your AI & LLM integration with this simple trick

LLMs are the latest trend and probably occupy at least one ticket in the backlog of every product. Their integration via common APIs such as OpenAI's has become increasingly easy. In this article, we want to show how the use of data streams can drastically improve the perceived performance of your AI applications.

We were always taught how important it is to finish your thoughts before you start speaking. In the case of AI integrations, however, this is counterproductive. LLMs can start speaking without knowing how they are going to end their sentence, and we can leverage that! Let's look at the following comparison:

You can see a small app that generates short stories. With a click on the button, it generates two stories from the same keywords, but via two different endpoints. The left column calls an endpoint that fetches data from OpenAI's completion API using async/await to retrieve the full response in one chunk, while the right column uses HTTP streaming. Both reach the complete story in more or less the same time, but in the right column, users can already start reading after about half a second of delay, which creates a much smoother experience.
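
For reference, the non-streaming endpoint behind the left column might look roughly like this. This is only a sketch that reuses the Fastify app, the OpenAI client, and the getPrompt helper shown later in this article:

app.get("/full", async (req) => {
  const keywords =
    (req.query as { q: string }).q || "Hansel and Gretel";

  // Wait for the entire completion before answering the request.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini-2024-07-18",
    messages: getPrompt(keywords),
  });

  // Only now, after the full generation time, does the browser receive anything.
  return completion.choices[0].message.content ?? "";
});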

How to do that?

We have to tackle this problem from two sides. First, the server needs to start processing the AI response and forward the messages to the browser as soon as they arrive. Second, the frontend also needs to handle the streaming response, adding text to the HTML as soon as it's available. Let's start with the browser here.

Consuming an HTTP stream in the browser

Luckily, async iterables and the fetch API make this really easy! First we have to make a regular fetch call and check for any network errors:

const response = await fetch(
  `/api/stream?q=${encodeURIComponent(keywords)}`
);

if (!response.ok) {
  output.textContent = `Error: ${response.statusText}`;
  return;
}

Next, we have to transform the actual bytes that are sent over the wire into a proper string. This is tricky to do manually, since a single UTF-8 character can be split across two separate chunks of the stream. Luckily, the browser has a built-in helper for that: TextDecoderStream.

We can take the ReadableStream from response.body and pipe it through a TextDecoderStream to get a stream of proper UTF-8 strings:

const sourceStream = response.body!;

const textStream = sourceStream.pipeThrough(
  new TextDecoderStream()
);

And now, with this readable stream, we can make use of the fact that ReadableStreams implement the AsyncIterable interface and consume it with a simple loop:

for await (const chunk of textStream) {
  output.textContent += chunk;
}
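
Async iteration over a ReadableStream is still a relatively new feature. If you need to support browsers without it, the same loop can be written with a manual reader. This fallback is just a sketch and not part of the original app:

const reader = textStream.getReader();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  output.textContent += value;
}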

Producing an HTTP stream from Node.js

Now the only thing left is to produce this HTTP stream in our Node.js backend.

First of all, we need to define our request handler. It doesn't really matter which web framework you are using; here, we are using Fastify.

app.get("/stream", async (req, res) => {
  const keywords =
    (req.query as { q: string }).q || "Hansel and Gretel";
  // Prepare content type to send headers early.
  res.header("Content-Type", "text/plain; charset=utf-8");

Next, we need to create the chat completion stream instead of waiting for the full response. For OpenAI, this can be done by simply setting the stream flag to true:

const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini-2024-07-18",
  messages: getPrompt(keywords),
  stream: true,
});
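
Each item of this stream is a ChatCompletionChunk. Abridged to the fields we actually use, a single chunk looks roughly like this:

// Abridged example of one streamed chunk:
const exampleChunk = {
  object: "chat.completion.chunk",
  choices: [
    {
      index: 0,
      delta: { content: "Once upon a time" }, // the next piece of text
      finish_reason: null,
    },
  ],
};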

And now comes the magic. We can use the pipeline function from Node.js to take this OpenAI stream (which implements the AsyncIterable interface), pipe it through a function that extracts just the text, and write the result into the writable network response, and thus directly to the client:

await pipeline(stream, extractText, res.raw);

The extractText function uses an async generator and is itself quite small:

async function* extractText(
  source: AsyncIterable<OpenAI.Chat.Completions.ChatCompletionChunk>
) {
  for await (const data of source) {
    const next = data.choices[0].delta.content;
    if (next) yield next;
  }
}
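
Conceptually, the pipeline call above does roughly the same as the manual loop below, but it additionally takes care of backpressure, error propagation, and closing the response. The loop is only a simplified sketch for illustration:

// Simplified equivalent of pipeline(stream, extractText, res.raw)
// without backpressure or error handling:
for await (const text of extractText(stream)) {
  res.raw.write(text);
}
res.raw.end();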

And now, we can put it all together:

Complete code

Frontend

async function streamStory(
  keywords: string,
  output: HTMLDivElement
) {
  const response = await fetch(
    `/api/stream?q=${encodeURIComponent(keywords)}`
  );

  if (!response.ok) {
    output.textContent = `Error: ${response.statusText}`;
    return;
  }

  output.textContent = "";

  const sourceStream = response.body!;

  const textStream = sourceStream.pipeThrough(
    new TextDecoderStream()
  );

  for await (const chunk of textStream) {
    output.textContent += chunk;
  }
}
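
To round off the frontend, wiring streamStory up to a button could look like this. The element IDs are assumptions and not part of the original app:

const keywordsInput = document.querySelector<HTMLInputElement>("#keywords")!;
const generateButton = document.querySelector<HTMLButtonElement>("#generate")!;
const storyOutput = document.querySelector<HTMLDivElement>("#story")!;

generateButton.addEventListener("click", () => {
  streamStory(keywordsInput.value, storyOutput);
});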

Backend

import fastify from "fastify";
import OpenAI from "openai";
import { pipeline } from "stream/promises";

const app = fastify();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.get("/stream", async (req, res) => {
  const keywords =
    (req.query as { q: string }).q || "Hansel and Gretel";

  res.header("Content-Type", "text/plain; charset=utf-8");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini-2024-07-18",
    messages: getPrompt(keywords),
    stream: true,
  });

  await pipeline(stream, extractText, res.raw);
});

await app.listen({ port: 3000 });

async function* extractText(
  source: AsyncIterable<OpenAI.Chat.Completions.ChatCompletionChunk>
) {
  for await (const data of source) {
    const next = data.choices[0].delta.content;
    if (next) yield next;
  }
}

function getPrompt(
  keywords: string
): OpenAI.Chat.Completions.ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content:
        "Du bist ein Profi-Märchenautor! " +
        "Deine Geschichten enthalten keinerlei böse Wörter, " +
        "die nicht geeignet für sind für Kinder. Nutzer geben " +
        "dir Stichworte und du antwortest mit einer ca. 100 Wörter-langen Geschichte. " +
        "Schreibe die Geschichte auf Englisch:",
    },
    {
      role: "user",
      content: keywords,
    },
  ];
}