API Design Guidance: Long-Running Background Jobs

The REST style is particularly suited for synchronous APIs, where requests generally return quickly (within a few seconds). For operations that take longer to complete, it may not be optimal for applications to block with an open connection while waiting for a response. Let’s look at how we can support long-running background jobs in our REST-based APIs by applying techniques and design patterns built on what the HTTP specification already provides.

The challenge of long-running API requests

The need for long-running API requests may not immediately present itself. During the API design process, we often assume that all requests will perform optimally. However, integration with legacy systems or external third-party services may degrade the performance of our operations. Over time, what used to take a few milliseconds can grow to require seconds or even hours.

Long-running API requests incur additional cloud resource costs and tie up valuable resources, such as our services and load balancers. And the longer a connection persists, the more likely it is to be lost due to network issues, forcing the client to re-issue the request. In these situations, we need to allow API requests to work in the background rather than tying up precious resources while risking a disconnect that forces the client to submit the same request all over again.

A long-running API design example

Let’s assume that we need to perform a bulk import request. This request may take only a few milliseconds if our backend systems aren’t overloaded, or it may take much longer during peak times.

First, we will need a bulk import operation:

POST /accounts/bulk-import

{ ... }
{ ... }
{ ... }
{ ... }

The API assumes that the bulk operation won’t complete quickly, so it returns a 202 Accepted:

HTTP/1.1 202 Accepted
Location: https://api.example.com/v1/import-jobs/7937

The 202 Accepted response code informs the client that their request was properly formed, but that the server is unable to provide the result of the request immediately. An additional Location header is used to point to the URL where a subsequent GET request may be used to check on the status of the request and obtain the result.
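On the server side, accepting the job and producing this response can be sketched as follows. This is a minimal Python sketch: the in-memory `jobs` store, the `start_bulk_import` name, and the generated job ID are illustrative, not part of any framework.

```python
import uuid

# In-memory job registry; a real service would use a durable store or queue.
jobs = {}

def start_bulk_import(records):
    """Accept a bulk import and respond with 202 Accepted plus a Location header."""
    job_id = uuid.uuid4().hex[:8]
    jobs[job_id] = {"jobId": job_id, "status": "InProgress", "records": records}
    # A background worker would pick the job up from here and process it.
    headers = {"Location": f"https://api.example.com/v1/import-jobs/{job_id}"}
    return 202, headers
```

The handler returns immediately after recording the job; all of the actual import work happens outside the request/response cycle.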

Below is an example of the client polling the provided URL for job completion:

GET /v1/import-jobs/7937

In this case, the client finds out that the status is still InProgress:

HTTP/1.1 200 OK

{
    "jobId": "7937",
    "status": "InProgress",
    "percentComplete": "25",
    "suggestedNextPollTime": "2018-10-02T11:00:00.00Z",
    "estimatedCompletionTime": "2018-10-02T14:00:00.00Z"
}

This polling may continue at regular intervals until the job is completed. The final response may contain the result directly, or, as in this case, a hypermedia link that points to the results at a separate URL:

HTTP/1.1 200 OK

{
    "jobId": "7937",
    "status": "Completed",
    "completionTime": "2018-10-02T14:00:00.00Z",
    "_links": {
      "results": { "href": "https://api.example.com/v1/import-jobs/7937/results" },
      ...
    }
}
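Put together, the client side of this exchange is a simple polling loop, sketched below in Python. Here `fetch_status` stands in for an HTTP GET against the job URL, and the fixed delay is a simplification; a real client would derive its wait from the `suggestedNextPollTime` field in each response.

```python
import time

def poll_until_complete(fetch_status, max_attempts=10, delay_seconds=1.0):
    """Poll the job-status endpoint until the job reports Completed.

    fetch_status is any callable returning the parsed status body;
    a real client would GET /v1/import-jobs/{id} over HTTP.
    """
    for _ in range(max_attempts):
        body = fetch_status()
        if body["status"] == "Completed":
            # Follow the hypermedia link to retrieve the results.
            return body["_links"]["results"]["href"]
        # A real client would wait until suggestedNextPollTime instead.
        time.sleep(delay_seconds)
    raise TimeoutError("job did not complete within the polling budget")
```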

Avoiding polling for status updates

Not all requests may require a background job. Perhaps the backend isn’t too busy and can complete the request quickly. Or perhaps the bulk import contains only a few records and won’t take long. In either case, it seems silly to require the client to poll for a result it could have obtained immediately.

When an immediate response is possible, consider returning a 200 OK if the job completes within a short deadline. For example, wait up to 50 milliseconds for the job to finish, returning the result immediately if it completes in time and a 202 Accepted if it will take longer.
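This hybrid approach can be sketched with a thread pool and a bounded wait. The 50 ms deadline, the `run_job` callable, and the polling URL below are illustrative assumptions, not a prescribed implementation.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def submit_with_deadline(run_job, deadline_seconds=0.050):
    """Run the job, returning 200 with the result if it finishes within
    the deadline, or 202 with a polling location if it will take longer."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_job)
    try:
        result = future.result(timeout=deadline_seconds)
        return 200, result
    except FutureTimeout:
        # The job keeps running in the background; hand back a polling URL.
        return 202, {"Location": "https://api.example.com/v1/import-jobs/7937"}
    finally:
        executor.shutdown(wait=False)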

Alternatively, the use of Webhooks would allow a system to be notified by the background job via a POST when it has completed. While this requires a bit of additional work by the client, it would prevent constant API polling to determine when the job is done. Alternatives such as server-sent events (SSE) or WebSockets may also work in this case.

Wrap-up

There are times when requests may take time to complete before the request can be provided back to the client. While HTTP is request/response, we can simulate an asynchronous API through the use of the long-running API design pattern. The client receives a 202 Accepted response with a URL that may be used to poll for the current status and obtain the result when it has been completed. Applying this pattern avoids long-running requests that can overburden load balancers and backend API servers.