API Design Guidance: Bulk and Batch Import

Some APIs need to import large amounts of data at once, rather than requiring clients to submit hundreds or thousands of individual POST requests (a slow and tedious process that results from a purely CRUD-based API design). There are two common use cases for bulk/batch processing: importing many resources via a single transaction or performing a background import of a large data set in an efficient manner. Let’s look at this design pattern and explore some variants that may help you with your next API design.

This is a multi-part series on API design guidance, where we take a look at tricks and hidden troubles in API designs and how to avoid them. While these articles may not be exhaustive, they will serve to identify common patterns and anti-patterns in API design.

Bulk vs. Batch Processing

There are two common terms used in this pattern: ‘bulk’ or ‘batch’-based processing. While these terms are sometimes used interchangeably, I differentiate them in the following way:

Bulk operations process each submitted record independently, allowing failures to be caught but the remainder of the import to succeed. This means that out of 2000 records, perhaps 76 of them failed while the remaining 1924 records were imported successfully.

Batch operations process all submitted records within a single pass-or-fail transaction. This means that if 76 records fail in a batch of 2000, none of the records were actually stored. The client will need to correct (or remove) the 76 failed records and resubmit the revised batch.

Bulk Import Design Example

The request provides individual JSON documents for each resource to bulk import (see details below):

POST /accounts/bulk-import
Content-Type: application/json-seq

{ "id":"12", "name":"...", ... }
{ "id":"13", "name":"...", ... }
{ "id":"14", "name":"...", ... }
{ "id":"15", "name":null, ... }

The response provides the success or failure status of each record, for processing and correction by the consuming app:

HTTP/1.1 207 Multi-Status
Content-Type: application/json

{
    "items": [
        { "id": "12", "status": 201, errors: [] },
        { "id": "13", "status": 201, errors: [] },
        { "id": "14", "status": 201, errors: [] },
        { "id": "15", "status": 400, errors: [ ... ] }
    ]
}

In this design, we return the result of each record individually, using a 207 Multi-Status response. This allows clients to automatically retry the records that failed or surface the failures to a human.
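
To make the consuming side of this interaction concrete, here is a minimal client sketch in Python using the requests library. The host name, record fields, and retry policy are illustrative assumptions rather than part of the design above:

import json
import requests

def bulk_import(accounts):
    # One JSON document per line rather than a single large JSON array
    body = "\n".join(json.dumps(account) for account in accounts)
    response = requests.post(
        "https://api.example.com/accounts/bulk-import",   # hypothetical host
        data=body,
        headers={"Content-Type": "application/json-seq"},
    )
    # The 207 Multi-Status response reports each record's outcome individually
    return [item for item in response.json()["items"] if item["status"] >= 400]

failed = bulk_import([{"id": "12", "name": "Acme"}, {"id": "15", "name": None}])

Each failed item carries its id and errors, so the client can correct just those records and resubmit them, or surface the errors to a person.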

Batch Import Design Example

Batch import operations are similar, but they either succeed or fail as a complete request, and may be required for some use cases. For example:

POST /accounts/batch-import
Content-Type: application/json-seq

{ "id":"12", "name":"...", ... }
{ "id":"13", "name":"...", ... }
{ "id":"14", "name":"...", ... }
{ "id":"15", "name":null, ... }

The response reports the failure of the batch, along with the status of each submitted record:

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
    "items": [
        { "id": "12", "status": 200, errors: [] },
        { "id": "13", "status": 200, errors: [] },
        { "id": "14", "status": 200, errors: [] },
        { "id": "15", "status": 400, errors: [ ... ] }
    ]
}

Note that since this is a batch import, the processing is all-or-nothing. However, we still need to tell the client what couldn’t be imported, so we can use the same response format as the bulk import. The primary difference is that records that succeeded don’t return a 201 Created status, but rather a 200 OK response, indicating that no errors were encountered even though nothing was actually created (the batch as a whole failed).
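
As a rough server-side sketch of these all-or-nothing semantics, the following Python function validates every record, rejects the whole batch if any record fails, and otherwise stores everything within a single transaction. The db session and validate_account helper are hypothetical stand-ins for your own persistence and validation layers:

def batch_import(records, db):
    # All-or-nothing: either every record is stored or none of them are
    items = []
    for record in records:
        errors = validate_account(record)        # hypothetical validation helper
        items.append({
            "id": record["id"],
            "status": 400 if errors else 200,    # 200 = valid, but not yet stored
            "errors": errors,
        })

    if any(item["status"] == 400 for item in items):
        db.rollback()                            # reject the entire batch
        return 400, {"items": items}

    for record in records:
        db.insert("accounts", record)            # hypothetical persistence call
    db.commit()                                  # one transaction for the whole batch
    return 200, {"items": items}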

A Word of Caution on Very Large JSON Import Requests

Requiring clients to construct a single, very large JSON document may demand large amounts of memory and CPU, both for the client to produce it and for the server to parse it.

To avoid this, each resource representation to be imported may be sent as an isolated and complete JSON document, using a record separator approach such as a linefeed (LF). For example:

{ "id":"12", "name":"...", ... }
{ "id":"13", "name":"...", ... }
{ "id":"14", "name":"...", ... }
{ "id":"15", "name":"...", ... }

Each record in the request may be parsed and processed individually, as if it were an incoming stream. This has three advantages:

  1. Avoiding the need for the server to wait for all of the content to be received before parsing
  2. Avoiding increased memory requirements in the consuming application to construct a single, properly formed JSON structure. Instead, the consuming application may construct individual records for resource creation and flush the request to the server, allowing the server to treat the incoming request as a stream and process records as they become available
  3. Preventing malformed JSON in a single record from causing the entire request to fail

The content type should be application/json-seq, as defined by RFC 7464, to differentiate between a single JSON document containing multiple records and record-separated JSON documents that may be parsed individually.
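
As an illustration, a server might consume such a request with a small streaming parser along these lines. This sketch splits on linefeeds, as in the examples above (a strict RFC 7464 parser would also strip the leading record separator character), and handle_record is a hypothetical per-record processor:

import json

def process_json_lines(stream):
    # Treat the request body as a stream: parse and handle one record at a time
    for line in stream:                          # file-like object or iterable of lines
        record_text = line.strip()
        if not record_text:
            continue
        try:
            record = json.loads(record_text)
        except json.JSONDecodeError as exc:
            yield {"status": 400, "errors": [str(exc)]}   # only this record fails
            continue
        yield handle_record(record)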

Note: The OpenAPI specification doesn’t currently support specifying this content type properly, since it isn’t an array. Therefore, use caution when choosing this approach as code generators, validators, and other automation tools may not be able to properly handle this content type.

Adaptive Sync vs. Async Processing Design

For some situations, the amount of data provided, and the backend I/O required to support the operation, may be small enough to process immediately. In this case, return a 207 Multi-Status response with the details right away.

For bulk imports that will take longer than a few seconds, the API should validate the request, queue the work to be done, then respond immediately with a 202 Accepted response along with a Location header that points to the resource instance representing the bulk import process details. Clients may then follow up to see whether the import has completed and obtain the results of the bulk import process, if so desired. This supports the requirement of offloading bulk/batch import operations to background workers.
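
A rough client-side sketch of this adaptive behaviour, again using Python's requests library, might look like the following. The status resource's field names and values are assumptions, since the design above only specifies the 202 Accepted response and the Location header:

import time
import requests

def import_and_wait(body):
    response = requests.post(
        "https://api.example.com/accounts/bulk-import",   # hypothetical host
        data=body,
        headers={"Content-Type": "application/json-seq"},
    )
    if response.status_code == 207:
        return response.json()               # small import: results returned inline

    # 202 Accepted: the work was queued; the Location header names the resource
    # that represents the import process
    status_url = response.headers["Location"]
    while True:
        status = requests.get(status_url).json()
        if status.get("state") in ("completed", "failed"):   # assumed field/values
            return status
        time.sleep(5)                        # poll at a modest interval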

Wrap-Up

Adding bulk or batch import support to your API can help you and your clients optimise their interactions. By applying the application/json-seq content type, each record can be parsed and processed independently while conserving memory and CPU, since the parsing effort required for very large documents is reduced. Finally, by using the different response codes in an appropriate manner, your API design can adapt to a variety of circumstances, including high-workload periods where resources are limited or when large datasets are submitted.