Sharon Rosario — Decoupling AI: From Blocking Calls to Event-Driven Microservices

The 504 Error

Your AI feature is killing your server — here's why

The product demo worked perfectly. The investor loved it. You deployed to production. Then the 504 gateway timeout errors started.

This is the single most common failure mode I have seen in early-stage AI SaaS products. The demo environment has low concurrency, no load balancer timeouts, and a fast network. Then you go live.

In production, your Load Balancer has a 30-second timeout. Your Node.js Express server receives a request from User A asking the AI to summarize a long document. Your server makes an HTTP call to the OpenAI API and then waits. And waits. The AI is generating 800 tokens. That takes 45 seconds.

At the 30-second mark, your load balancer gives up and returns a 504 Gateway Timeout error to the user. Meanwhile, your Express server is still blocked, still holding that connection open, still waiting for OpenAI to respond.

Now multiply this by 20 concurrent users. Your Node.js server, which runs on a single event loop, is now holding 20 long-lived connections open at once. Each one pins a socket, buffers, and request state in memory, even though the load balancer has already given up on most of them. As users retry, the backlog grows until memory and connection limits are exhausted and new requests stall. Your entire application becomes unresponsive: not just the AI feature, but everything.

You have accidentally turned a slow AI feature into a denial-of-service attack against your own product.

The problem is not the AI. The problem is that you are treating AI like a REST API. It is not. A REST API returns in milliseconds. An LLM returns in seconds or minutes. They require fundamentally different architectural patterns.
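The timeline above can be reproduced in miniature. The sketch below is an illustration, not the production setup: it scales the numbers down by a factor of 100 (0.30 s for the 30-second limit, 0.45 s for the 45-second generation), and `fake_llm_call` and `gateway` are hypothetical stand-ins for the OpenAI request and the load balancer.

```python
import asyncio

GATEWAY_TIMEOUT = 0.30  # stands in for the 30-second load balancer limit
LLM_LATENCY = 0.45      # stands in for the 45-second generation time


async def fake_llm_call(latency: float) -> str:
    """Hypothetical stand-in for the upstream LLM request."""
    await asyncio.sleep(latency)
    return "summary text"


async def gateway(latency: float) -> int:
    """Return the HTTP status the end user would see."""
    try:
        await asyncio.wait_for(fake_llm_call(latency), timeout=GATEWAY_TIMEOUT)
        return 200
    except asyncio.TimeoutError:
        # The gateway gives up before the upstream call has finished
        return 504


def simulate(latency: float) -> int:
    return asyncio.run(gateway(latency))


if __name__ == "__main__":
    print(simulate(LLM_LATENCY))  # generation slower than the gateway limit
    print(simulate(0.05))         # generation well inside the limit
```

One simplification to keep in mind: `asyncio.wait_for` cancels the slow call when the timeout fires, whereas in the scenario described here the Express handler keeps waiting on OpenAI after the load balancer has already answered the user.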

Key takeaway

"Synchronous AI calls inside a web server are an architectural mistake. The solution is not better error handling or longer timeouts. The solution is decoupling the request lifecycle from the AI processing lifecycle entirely."

Why It Breaks

The three failure modes of synchronous AI

Understanding exactly how your current architecture fails will make the solution obvious. Each failure mode compounds the others.

Phase 1

Failure Mode 1: Blocking the Event Loop

Even async/await does not fully protect you here.

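Node.js has a single-threaded event loop. async/await keeps that loop free while a request waits on OpenAI, but it does not make the wait cheap: with 50 requests awaiting OpenAI responses simultaneously, the process is holding 50 open sockets, 50 pending promises, and 50 request contexts.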
Memory grows linearly with each open connection since each awaiting request holds its full context in memory.
If OpenAI experiences a slow patch (which happens during peak hours), all 50 requests slow down simultaneously, cascading into a full application stall.

The Anti-Pattern — Synchronous AI in Express

// DO NOT DO THIS in production with multiple concurrent users
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json()); // needed so req.body is populated
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post('/api/summarize', async (req, res) => {
  const { text } = req.body;

  // This holds the HTTP request open for 10-60 seconds PER REQUEST
  // Under load, this will cause 504 timeouts and memory spikes
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: `Summarize: ${text}` }]
  });

  // By the time we get here, the load balancer may have
  // already closed the connection with a 504 error
  res.json({ summary: response.choices[0].message.content });
});

Phase 2

Failure Mode 2: No Retry or Fault Tolerance

A single API failure loses the job forever.

OpenAI experiences rate limits, temporary outages, and model errors. These are not edge cases; they are normal operating conditions.
In a synchronous architecture, when OpenAI returns an error, the entire user request fails immediately with no automatic retry.
The user must manually retry. If they do not, the work is lost and they have a negative experience. There is no dead-letter queue to inspect and replay failed jobs.
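To make the missing piece concrete, here is a minimal sketch of automatic retry with exponential backoff, the behavior a job queue supplies for you. The helper names `backoff_delays` and `call_with_retry` are hypothetical, and the schedule (base delay doubled after each failure) is an assumption chosen to mirror a typical exponential policy.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base_delay: float) -> List[float]:
    """Waits between attempts: base, 2*base, 4*base, ... (attempts - 1 entries)."""
    return [base_delay * (2 ** i) for i in range(attempts - 1)]


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 2.0,
) -> T:
    """Run `operation`, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base_delay)
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: surface the error so the caller can
                # park the job somewhere it can be inspected and replayed
                raise
            await asyncio.sleep(delays[attempt])
```

With `attempts=3` and `base_delay=2.0` the operation is tried at most three times, waiting 2 seconds and then 4 seconds between tries. In a synchronous handler even this helper is of limited use, because the waits still happen while the user's HTTP request is open.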

Phase 3

Failure Mode 3: No Horizontal Scalability

You cannot scale AI independently from your web server.

In a monolithic synchronous setup, if your AI feature is under high load, you must scale your entire web server — even the parts that handle simple CRUD operations and have no AI involvement.
This is both expensive and wasteful. You cannot independently scale your AI processing capacity based on the demand for AI features specifically.
An event-driven architecture lets you spin up more AI worker instances independently during peak AI demand, without touching your web server.
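The effect of adding workers can be sketched with an in-process queue. This is a toy model under stated assumptions: coroutines stand in for separately deployed worker services, `asyncio.Queue` stands in for the message queue, and a 0.05-second sleep stands in for a slow LLM call. The names `worker` and `run_pool` are hypothetical.

```python
import asyncio
import time
from typing import List, Tuple


async def worker(name: str, queue: asyncio.Queue, done: List[str]) -> None:
    """Pull jobs forever; each job simulates one slow LLM call."""
    while True:
        job_id = await queue.get()
        await asyncio.sleep(0.05)  # stands in for the LLM call
        done.append(f"{job_id}@{name}")
        queue.task_done()


async def run_pool(num_workers: int, num_jobs: int) -> Tuple[float, int]:
    """Process num_jobs with num_workers; return (elapsed seconds, jobs done)."""
    queue: asyncio.Queue = asyncio.Queue()
    done: List[str] = []
    for i in range(num_jobs):
        queue.put_nowait(f"job-{i}")

    tasks = [
        asyncio.create_task(worker(f"w{n}", queue, done))
        for n in range(num_workers)
    ]
    started = time.perf_counter()
    await queue.join()  # wait until every job has been processed
    elapsed = time.perf_counter() - started

    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return elapsed, len(done)


if __name__ == "__main__":
    for workers in (1, 4):
        elapsed, count = asyncio.run(run_pool(workers, 8))
        print(f"{workers} worker(s): {count} jobs in {elapsed:.2f}s")
```

The producer side never changes between the two runs; only the number of consumers does. That is the property that lets AI capacity scale without touching the web server.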

Key takeaway

"The pattern that fixes all three failure modes at once is an event-driven architecture with a message queue. One line replaces a blocking await with a non-blocking queue push. The rest of the architecture handles reliability, retries, and scalability automatically."

The Architecture

Decoupling the request from the AI — how it works

The architecture has three components. They communicate asynchronously, meaning each one operates at its own pace and no component waits for another.

Component 1: The Express API Server (The Dispatcher)

The Express server's only job for AI requests is to validate the input, create a job ID, push the job to the Redis queue, and immediately return a `202 Accepted` response with the job ID. The entire operation takes under 10 milliseconds. The server is free to handle the next request immediately.

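Component 2: The Redis Message Queue (The Brain)

Redis holds the pending AI jobs in a durable queue. If the AI worker crashes, the job stays in the queue. When the worker comes back online, it picks up where it left off. The queue also absorbs backpressure: if the workers are busy, jobs simply wait in line. Note that plain Redis Pub/Sub does not give you this durability, because a message published while no subscriber is listening is dropped. Use a Redis list (as BullMQ does) or Redis Streams for the job queue itself, and reserve Pub/Sub for streaming results back.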

For production systems with strict delivery guarantees, Redis Streams provides persistent, ordered, consumer-group-based job processing. For simpler setups, BullMQ (a Node.js library built on Redis) provides a full-featured job queue with retries and rate limiting, and it pairs with dashboards such as bull-board for monitoring.
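As a rough illustration of the consumer-group option, the sketch below uses redis-py's stream commands (`xgroup_create`, `xadd`, `xreadgroup`, `xack`). It is not the queue used in the rest of this guide, which relies on BullMQ. The stream name, group name, and helper functions are hypothetical, and `redis` is imported lazily so the helpers stay importable without a Redis installation.

```python
from typing import Any, Callable, Dict, Optional

STREAM = "ai-jobs-stream"  # hypothetical stream name
GROUP = "ai-workers"       # hypothetical consumer group name


def connect(url: str) -> Any:
    """Create a redis-py client (requires `pip install redis`)."""
    import redis  # lazy import: only needed when talking to a real server
    return redis.Redis.from_url(url, decode_responses=True)


def ensure_group(client: Any) -> None:
    """Create the consumer group (and the stream) if they do not exist yet."""
    try:
        client.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception as exc:  # redis-py raises ResponseError with BUSYGROUP
        if "BUSYGROUP" not in str(exc):
            raise


def enqueue(client: Any, job: Dict[str, str]) -> str:
    """Append a job to the stream and return its entry ID."""
    return client.xadd(STREAM, job)


def process_one(
    client: Any,
    consumer: str,
    handler: Callable[[Dict[str, str]], None],
    block_ms: int = 5000,
) -> Optional[str]:
    """Read one new entry for this consumer, handle it, then acknowledge it."""
    response = client.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=1, block=block_ms)
    if not response:
        return None  # nothing arrived within block_ms
    _stream_name, entries = response[0]
    entry_id, fields = entries[0]
    handler(fields)  # if this raises, the entry stays pending (unacknowledged)
    client.xack(STREAM, GROUP, entry_id)
    return entry_id
```

Acknowledging only after the handler succeeds is the point of the design: an entry whose worker crashed mid-job remains in the group's pending list, where it can be claimed and retried.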

Component 3: The Python FastAPI Worker (The Thinker)

A completely independent Python service consumes jobs from the Redis queue, picks them up one by one, calls the OpenAI API (or any LLM), and publishes the result to a separate channel. Python is an excellent choice for this worker because the entire AI/ML ecosystem (LangChain, LlamaIndex, Transformers) is Python-first. You can swap out OpenAI for a local Ollama model or an Anthropic Claude model without touching your Node.js server.

The Return Path: WebSockets

Once the Python worker publishes the result, the Node.js server (or a dedicated WebSocket server) receives it and pushes it to the waiting browser connection via WebSocket. The user sees their result appear on screen in real-time, token by token, as the LLM generates it, exactly like ChatGPT's interface.
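The whole flow (dispatch, queue, worker, return path) fits in a short in-memory sketch. Everything here is a stand-in under stated assumptions: `asyncio.Queue` plays both the job queue and the per-job result channel, uppercasing words plays the LLM, and the `Pipeline` class and `demo` function are hypothetical names.

```python
import asyncio
import uuid
from typing import Dict, List, Tuple


class Pipeline:
    """In-memory model of dispatcher -> queue -> worker -> result channel."""

    def __init__(self) -> None:
        self.jobs: asyncio.Queue = asyncio.Queue()
        self.channels: Dict[str, asyncio.Queue] = {}

    def dispatch(self, text: str) -> Tuple[int, str]:
        """Dispatcher: enqueue the job and answer immediately with 202."""
        job_id = str(uuid.uuid4())
        self.channels[job_id] = asyncio.Queue()
        self.jobs.put_nowait({"jobId": job_id, "text": text})
        return 202, job_id

    async def worker(self) -> None:
        """Worker: take jobs at its own pace and publish tokens one by one."""
        while True:
            job = await self.jobs.get()
            channel = self.channels[job["jobId"]]
            for word in job["text"].split():
                await asyncio.sleep(0)  # yield, as a real network call would
                await channel.put({"type": "token", "token": word.upper()})
            await channel.put({"type": "complete"})

    async def stream(self, job_id: str) -> List[str]:
        """Return path: collect tokens until the completion message arrives."""
        tokens: List[str] = []
        channel = self.channels[job_id]
        while True:
            message = await channel.get()
            if message["type"] == "complete":
                return tokens
            tokens.append(message["token"])


async def demo(text: str) -> Tuple[int, List[str]]:
    pipeline = Pipeline()
    worker_task = asyncio.create_task(pipeline.worker())
    status, job_id = pipeline.dispatch(text)  # returns before any work is done
    tokens = await pipeline.stream(job_id)
    worker_task.cancel()
    await asyncio.gather(worker_task, return_exceptions=True)
    return status, tokens


if __name__ == "__main__":
    print(asyncio.run(demo("decouple the request from the work")))
```

One difference from the real system is worth noting: the per-job queue here buffers messages until someone reads them, whereas Redis Pub/Sub delivers only to subscribers that are connected at publish time. A client that subscribes late can miss early tokens unless the result is also stored somewhere it can be fetched.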

When NOT to use this architecture

If your AI responses are consistently under 2 seconds and you have very low concurrency (under 10 simultaneous users), a simple async/await approach with a generous timeout is acceptable for early MVP stages. Introduce this architecture when you start seeing 504 errors in production, or proactively when you are building for more than 50 concurrent users.

Implementation

Building the three components

Build in this exact order: the queue first, then the API endpoint that dispatches to it, then the worker, then the WebSocket bridge. Test each layer independently before connecting them.

Dependencies for this guide

Node.js side: express, ioredis, redis, socket.io, bullmq, uuid. Python side: fastapi, openai, redis, uvicorn. Frontend: react, socket.io-client. Infrastructure: Redis 7+ (managed services like Upstash or Railway work perfectly). Minimal setup: no Kafka, no Kubernetes, no complex infra required for early stage.

Step 1: Set up BullMQ queue in Node.js

BullMQ provides a robust, Redis-backed job queue with automatic retries and rate limiting, and it can be monitored through a dashboard such as bull-board. This replaces raw Redis Pub/Sub for the job queue and handles failure cases automatically.

queue.js — the shared job queue definition

import { Queue, QueueEvents } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null, // Required by BullMQ
});

// The queue that holds pending AI jobs
export const aiJobQueue = new Queue('ai-jobs', { connection });

// Events emitter — used to listen for job completion
export const aiJobEvents = new QueueEvents('ai-jobs', { connection });

console.log('AI job queue initialized');

Step 2: Update the Express route to dispatch, not process

The Express route becomes a dispatcher. It validates the request, creates a job, and immediately returns a job ID. The AI processing happens elsewhere.

routes/ai.js — the non-blocking dispatcher

import express from 'express';
import { v4 as uuidv4 } from 'uuid';
import { aiJobQueue } from '../queue.js';

const router = express.Router();

router.post('/summarize', async (req, res) => {
  const { text, userId } = req.body;
  
  if (!text) {
    return res.status(400).json({ error: 'text is required' });
  }

  // Create a unique job ID the client will use to receive results
  const jobId = uuidv4();

  // Add to queue — this takes ~1ms and returns immediately
  await aiJobQueue.add(
    'summarize',
    { text, userId, jobId },
    {
      jobId,
      attempts: 3,           // Up to 3 attempts in total (2 retries)
      backoff: {
        type: 'exponential', // Wait 2s, then 4s, between attempts
        delay: 2000,
      },
    }
  );

  // Return immediately — the client will connect via WebSocket
  // to receive the result when the worker finishes
  return res.status(202).json({
    jobId,
    message: 'Job queued. Connect to WebSocket with this jobId to receive results.',
    wsEndpoint: `ws://your-domain/ws/jobs/${jobId}`
  });
});

export default router;

Step 3: Build the Python FastAPI AI worker

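The Python worker runs as a completely independent service. It reads the jobs that BullMQ has stored in Redis, processes the LLM call, and publishes the result back to a Redis channel where the Node.js WebSocket server is listening. The queue reader in this file is deliberately simplified: it pops job IDs straight from Redis rather than implementing BullMQ's full job lifecycle, so BullMQ's retry settings are not applied to jobs it consumes.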

worker.py — the Python AI worker service

import asyncio
import json
import redis.asyncio as aioredis
from openai import AsyncOpenAI
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
REDIS_URL = os.getenv("REDIS_URL")

async def process_ai_job(job_data: dict, redis_client) -> None:
    """Process a single AI job and publish the result."""
    job_id = job_data.get("jobId")
    text = job_data.get("text")
    
    logger.info(f"Processing job {job_id}")
    
    try:
        # Stream the LLM response token by token
        stream = await openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert summarizer. Produce concise, accurate summaries."
                },
                {
                    "role": "user",
                    "content": f"Please summarize the following text:\n\n{text}"
                }
            ],
            stream=True,  # Stream tokens as they are generated
        )
        
        full_response = ""
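        async for chunk in stream:
            # Some chunks carry no choices or no text; skip them
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content or ""
            if not token:
                continue
            full_response += token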
            
            # Publish each token to the Redis channel in real-time
            # The Node.js WebSocket server is subscribed to this channel
            await redis_client.publish(
                f"job-result:{job_id}",
                json.dumps({"type": "token", "token": token, "jobId": job_id})
            )
        
        # Signal completion
        await redis_client.publish(
            f"job-result:{job_id}",
            json.dumps({"type": "complete", "result": full_response, "jobId": job_id})
        )
        
        logger.info(f"Job {job_id} completed successfully")
        
    except Exception as e:
        logger.error(f"Job {job_id} failed: {e}")
        await redis_client.publish(
            f"job-result:{job_id}",
            json.dumps({"type": "error", "error": str(e), "jobId": job_id})
        )

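async def poll_queue(redis_client) -> None:
    """Poll the BullMQ Redis queue for new jobs."""
    # BullMQ keeps waiting job IDs in a Redis list and each job's payload in a hash.
    # Reading them directly is a simplification: it bypasses BullMQ's locks, retries,
    # and completed/failed bookkeeping, and the key layout can differ between versions.
    # In production, use BullMQ's Python client or a dedicated queue reader
    logger.info("AI Worker started. Polling for jobs...")

    while True:
        # BRPOP blocks until a job ID is available or 5 seconds pass.
        # BullMQ pushes new jobs on the left, so popping from the right keeps FIFO order
        result = await redis_client.brpop("bull:ai-jobs:wait", timeout=5)

        if result:
            # decode_responses=True means the job ID is already a str, not bytes
            _, job_id = result
            job_data_raw = await redis_client.hget(f"bull:ai-jobs:{job_id}", "data")

            # Entries without a payload hash (for example internal markers) are skipped
            if job_data_raw:
                job_data = json.loads(job_data_raw)
                await process_ai_job(job_data, redis_client)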

async def main():
    redis_client = await aioredis.from_url(REDIS_URL, decode_responses=True)
    await poll_queue(redis_client)

if __name__ == "__main__":
    asyncio.run(main())

Step 4: WebSocket bridge in Node.js

The Node.js server subscribes to the Redis result channel and forwards tokens to the connected browser in real-time. This is the final bridge between the AI worker and the user's screen.

websocket.js — real-time result delivery

import { Server } from 'socket.io';
import { createClient } from 'redis';

export function initWebSocket(httpServer) {
  const io = new Server(httpServer, {
    cors: { origin: process.env.FRONTEND_URL }
  });

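  const subscriber = createClient({ url: process.env.REDIS_URL });
  // Without an error listener, a dropped Redis connection crashes the process
  subscriber.on('error', (err) => console.error('Redis subscriber error:', err));
  subscriber.connect().catch((err) => console.error('Redis connect failed:', err));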

  io.on('connection', (socket) => {
    const { jobId } = socket.handshake.query;
    
    if (!jobId) {
      socket.disconnect(true);
      return;
    }

    console.log(`Client connected, watching job: ${jobId}`);

    // Subscribe to this specific job's result channel
    const channel = `job-result:${jobId}`;

    // Keep a reference to the listener: the Redis subscriber is shared by
    // every socket, so only this socket's listener should be removed later
    const onMessage = (message) => {
      const data = JSON.parse(message);

      if (data.type === 'token') {
        // Forward each token to the browser as it arrives
        socket.emit('ai:token', { token: data.token });
      } else if (data.type === 'complete') {
        socket.emit('ai:complete', { result: data.result });
        socket.disconnect(); // triggers the cleanup in the disconnect handler
      } else if (data.type === 'error') {
        socket.emit('ai:error', { error: data.error });
        socket.disconnect();
      }
    };

    subscriber.subscribe(channel, onMessage);

    socket.on('disconnect', () => {
      subscriber.unsubscribe(channel, onMessage);
    });
  });
}

Key takeaway

"Your Express server now responds to every AI request in under 10 milliseconds. The AI processing happens in a completely separate Python process. Under any load, your server remains responsive."

Streaming Frontend

The React client that receives streamed tokens

The frontend connects to the WebSocket immediately after receiving the job ID, and streams tokens directly into the UI as they arrive — just like ChatGPT.

This is the complete React hook that manages the entire flow: submitting the job, connecting to the WebSocket, and streaming the response into state. It is fully reusable across any AI feature in your application.

Complete React Hook: useAIStream

import { useState, useCallback } from 'react';
import { io } from 'socket.io-client';

export function useAIStream() {
  const [result, setResult] = useState('');
  const [status, setStatus] = useState('idle'); // idle | queued | streaming | done | error

  const submit = useCallback(async (text) => {
    setResult('');
    setStatus('queued');

    // 1. Submit the job (returns immediately)
    const { jobId } = await fetch('/api/summarize', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text }),
    }).then(r => r.json());

    // 2. Open WebSocket connection to receive streamed result
    const socket = io(import.meta.env.VITE_API_URL, {
      query: { jobId }
    });

    setStatus('streaming');

    socket.on('ai:token', ({ token }) => {
      setResult(prev => prev + token);
    });

    socket.on('ai:complete', () => {
      setStatus('done');
      socket.disconnect();
    });

    socket.on('ai:error', ({ error }) => {
      console.error('AI error:', error);
      setStatus('error');
      socket.disconnect();
    });
  }, []);

  return { result, status, submit };
}

Observability

Knowing what your AI is doing in production

You cannot manage what you cannot see. These monitoring tools give you full visibility into your AI job pipeline without complex infrastructure.

BullMQ Board — Visual Queue Dashboard

Install @bull-board/express and mount it at /admin/queues. This gives you a real-time web dashboard showing active, waiting, completed, and failed jobs with full error logs. URL: https://github.com/felixmosh/bull-board

Track AI latency per model and prompt type

Log the start time and end time of each AI job in Redis. Surface the p50, p90, and p99 latency in your dashboard. If p99 exceeds 60 seconds, you need to investigate your prompt complexity or switch to a faster model.
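A minimal sketch of the percentile calculation, assuming job durations have already been collected as a list of seconds. It uses the nearest-rank definition of a percentile, which is one of several common definitions. The function names are hypothetical.

```python
from typing import Dict, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty list (pct in 1..100)."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def latency_summary(durations_s: List[float]) -> Dict[str, float]:
    """Return p50, p90, and p99 for a list of job durations in seconds."""
    ordered = sorted(durations_s)
    return {f"p{pct}": percentile(ordered, pct) for pct in (50, 90, 99)}


def needs_investigation(summary: Dict[str, float], threshold_s: float = 60.0) -> bool:
    """True when p99 latency exceeds the threshold (60 seconds by default)."""
    return summary["p99"] > threshold_s
```

Computing the summary per model and per prompt type, rather than over all jobs together, keeps one slow prompt from hiding behind many fast ones.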

Alert on failed job rate exceeding 5%

BullMQ tracks failed job counts automatically. Set up a simple cron job that checks the failed:total ratio every 5 minutes and sends a Slack alert if it exceeds 5%. This is typically a signal that OpenAI is rate-limiting your API key.
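The check itself is a few lines. This sketch assumes the failed and completed counts for the window have already been read from the queue by the cron job; fetching them and posting to Slack are left out, and the function names are hypothetical.

```python
def failure_rate(failed: int, completed: int) -> float:
    """Share of finished jobs that failed; 0.0 when nothing has finished yet."""
    total = failed + completed
    return failed / total if total else 0.0


def should_alert(failed: int, completed: int, threshold: float = 0.05) -> bool:
    """True when the failed share is strictly above the threshold (5% by default)."""
    return failure_rate(failed, completed) > threshold
```

Guarding the empty case matters in practice: a quiet five-minute window with zero finished jobs should not raise a division error or an alert.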

Monitor Redis memory usage

BullMQ jobs accumulate in Redis if not cleaned up. Configure the removeOnComplete and removeOnFail options in BullMQ to auto-delete jobs after they complete. For auditing, archive completed job metadata to Postgres before deletion.

Launch Checklist

Pre-launch production checklist

Every item here must be checked before you open AI features to paying customers.

Load balancer timeout is set to at least 35 seconds

With the dispatcher pattern, the load balancer only has to carry the initial 202 response (under 1 second), so the HTTP timeout is no longer the bottleneck. The WebSocket connection, however, must stay open until the job finishes. Ensure your LB has WebSocket passthrough enabled and an idle timeout longer than your slowest expected job.

BullMQ retry configuration is set correctly

Set attempts: 3 with exponential backoff. Configure removeOnComplete: 100 and removeOnFail: 500 to automatically clean up Redis while keeping recent job history for debugging.

Worker scales independently from the API server

Your Python worker should be deployed as a separate service (separate Dockerfile, separate Heroku Dyno or Railway service). This lets you add more worker instances when AI demand is high without touching the API server.

Dead-letter queue alert is configured

If a job fails all 3 retry attempts, it moves to the failed queue. Set up an alert so you are notified within 5 minutes of any job reaching this state, so you can investigate and manually retry if needed.

WebSocket reconnection logic is implemented in the frontend

Socket.io handles basic reconnection, but you should also implement a fallback: if the WebSocket connection fails, poll the API every 5 seconds for up to 3 minutes to check if the job completed.
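The polling fallback can be sketched as a small loop. The guide's frontend is React, so treat this Python version as a model of the logic (it would also suit a script or integration test). The `fetch_status` callable and the `state` field with `completed` and `failed` values are hypothetical, since the guide does not define a job status endpoint.

```python
import time
from typing import Callable, Dict, Optional


def poll_for_result(
    fetch_status: Callable[[], Dict],
    interval_s: float = 5.0,
    max_wait_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Dict]:
    """Poll until the job reaches a final state or max_wait_s has elapsed.

    Returns the final status payload, or None if the deadline passes first.
    """
    deadline = clock() + max_wait_s
    while True:
        status = fetch_status()
        if status.get("state") in ("completed", "failed"):
            return status
        if clock() + interval_s > deadline:
            return None  # the next poll would land after the deadline
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without waiting three real minutes; callers would normally leave them at their defaults.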

References

Tools, libraries & further reading

The complete toolkit behind this architecture. Each tool was chosen for reliability, simplicity, and excellent documentation.

BullMQ — Robust Job Queue for Node.js

The queue library used in this guide. Backed by Redis. Supports priorities, rate limiting, retries, and scheduled jobs, with visual dashboards available through companion tools such as bull-board. Battle-tested in production at major companies.

https://docs.bullmq.io/

Socket.io — Real-Time WebSocket Library

Used for the real-time token streaming between the Node.js server and the React frontend; the Python worker reaches the Node.js server through Redis Pub/Sub. Handles reconnection, rooms, and namespaces automatically.

https://socket.io/docs/v4/

Upstash Redis — Serverless Redis for Production

A managed Redis service with a generous free tier, ideal for job queues in early-stage startups. Fully compatible with BullMQ and ioredis.

https://upstash.com/

OpenAI Streaming API Reference

The official documentation for streaming completions from the OpenAI API in Python. Used in the Python worker to stream tokens one by one as they are generated.

https://platform.openai.com/docs/api-reference/streaming

FastAPI — Python Async Web Framework

The Python framework used for the AI worker service. Natively async, with automatic OpenAPI documentation generation and excellent performance.

https://fastapi.tiangolo.com/

Quick Answers

Frequently asked questions

The most common questions from engineers migrating from synchronous to event-driven AI architectures.

Why do synchronous OpenAI API calls cause 504 timeout errors?

Large Language Models take between 2 and 60 seconds to generate a response depending on output length. When a Node.js/Express server calls the OpenAI API inside the request handler and holds the client connection open until the response arrives, the request can outlive typical gateway timeout limits (usually 30 seconds), and the gateway returns a 504 error to the end user.

What is a message queue and why does it fix the AI timeout problem?

A message queue (like RabbitMQ or Redis Streams) is an asynchronous communication system. Instead of waiting for the AI to respond, your API server instantly pushes a job to the queue and returns a 202 Accepted response to the user. A separate background worker picks up the job, processes the LLM call at its own pace, and publishes the result back through a WebSocket when ready, completely decoupling the request from the response lifecycle.

What is the difference between REST API and WebSocket for AI streaming?

A REST API follows a request-response pattern where the client waits for the server to return a complete response. A WebSocket is a persistent, bidirectional connection that lets the server push data to the client in real-time as it becomes available. For AI applications, WebSockets enable streaming of tokens as the LLM generates them, creating a smooth typing-like experience rather than a long blank wait.

Should I use Redis Pub/Sub or RabbitMQ for AI job queues?

For early-stage startups, a Redis-backed queue is simpler to set up since you likely already use Redis for caching. Raw Redis Pub/Sub does not persist messages, so use Redis Streams or a library such as BullMQ for the job queue itself and keep Pub/Sub for streaming results. For production systems that need dead-letter queues and complex routing on top of durable job persistence (so jobs survive a server crash), RabbitMQ is the better choice. The architectural pattern described in this case study works with either.

Can I use this architecture with any LLM provider, not just OpenAI?

Yes. The event-driven pattern is completely provider-agnostic. The Python FastAPI worker simply needs an HTTP client pointed at your LLM provider's API. The same architecture works with Anthropic Claude, Google Gemini, open-source models via Ollama, or any other provider. Switching providers is confined to the worker service: for OpenAI-compatible endpoints it can be as small as changing the API key and endpoint URL, while other providers also need their own client or request format.
