Mar 21, 2023

JSON Python libraries overview



JSON has become one of the most popular formats for encoding structured data, particularly when building RESTful APIs that exchange data with clients or between services. It can also be used in other contexts, such as data processing or machine learning, where structured data needs to be stored or transmitted in a standardized format. 

While Python's built-in json module provides a simple and reliable way to work with JSON data, it is useful to know that there are several other libraries available that offer better performance and additional features.

In this article, we'll compare some of the most popular JSON libraries for Python: `orjson`, `ujson`, `simplejson`, `python-rapidjson`, and `nujson`. We will briefly review each library's purpose and history, then compare their performance, features, and use cases to help you choose the best option for your needs.


First of all let’s take a look at JSON itself. JSON (JavaScript Object Notation) is a lightweight data interchange format. The main purpose of JSON is to be easily writable and readable for humans and easily parsable for machines at the same time.

JSON is suitable for a variety of purposes, including transferring data between a client and a server in web applications and APIs and storing data in databases or files (MongoDB, Elasticsearch, etc.).


Let’s do a quick syntax overview: 

  • The specification does not require the top-level value to be an object, although one is widely used. A JSON object is represented as a collection of key-value pairs, enclosed in curly braces. 
  • Each key-value pair consists of a key (enclosed in double quotes) followed by a colon and a value. Multiple key-value pairs are separated by commas.
  • Values can be of several types, including strings (enclosed in double quotes), numbers (integers or floating-point), booleans (true or false), null value (null), arrays (ordered collections of values enclosed in square brackets), and objects (collections of key-value pairs enclosed in curly braces).

Here is an example of a JSON object for the movie Shrek (we will use it later for our benchmark part):

    {
        "title": "Shrek",
        "year": 2001,
        "genres": ["Animation", "Comedy", "Family"],
        "director": "Andrew Adamson, Vicky Jenson",
        "writers": ["William Steig", "Ted Elliott", "Terry Rossio", "Joe Stillman"],
        "cast": [
            { "name": "Mike Myers", "character": "Shrek" },
            { "name": "Eddie Murphy", "character": "Donkey" },
            { "name": "Cameron Diaz", "character": "Princess Fiona" },
            { "name": "John Lithgow", "character": "Lord Farquaad" }
        ],
        "description": "A mean lord exiles fairytale creatures to the swamp of a grumpy ogre, who must go on a quest and rescue a princess for the lord in order to get his land back.",
        "poster": "https://www.example.com/shrek-poster.jpg",
        "imdb_rating": 7.9
    }

If you aren't familiar with the syntax of JSON or want to investigate how it is defined in detail, there are two main resources worth checking: json.org or ECMA-404.


Serialization of JSON is the process of converting structured data, such as Python objects or database records, into a JSON-formatted string that can be transmitted or stored. This serialized JSON data can then be deserialized, or converted back into its original format, when needed.

Serialization of JSON is a common task in web development, particularly when building RESTful APIs that exchange data with clients. It can also be used in other contexts, such as data processing or machine learning, where structured data needs to be stored or transmitted in a standardized format.

JSON serialization has several potential disadvantages, including limited support for data types, lack of schema validation (which can be solved by using JSON Schema), large file sizes, and performance overhead.
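For example, the limited type support surfaces as soon as you try to serialize a `datetime` with the built-in module; one common workaround (a sketch, not the only option) is to supply a `default` callback:

```python
import json
from datetime import datetime

record = {"title": "Shrek", "released": datetime(2001, 4, 22)}

# json.dumps(record) would raise TypeError: datetime is not JSON serializable.
# Passing default=str converts unsupported types to strings as a fallback.
encoded = json.dumps(record, default=str)
print(encoded)  # {"title": "Shrek", "released": "2001-04-22 00:00:00"}
```

The `default` callback is only invoked for values the encoder cannot handle natively, so standard types are unaffected.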


As we embark on this delightful journey through the world of Python JSON libraries, let's take a closer look at their unique flavors and characteristics:

json (built-in)

What was originally called simplejson, written by Bob Ippolito, was added to the Python standard library as json in version 2.6. While it may not be the most performance-oriented choice, it's perfect for simple use cases and quick JSON encoding and decoding tasks. For many developers, this library is the go-to choice for everyday JSON handling.


simplejson

simplejson follows its own development path, separate from the standard json module, while still maintaining compatibility. This allows simplejson to cater to the needs of users who require support for older Python versions.

ujson or “Ultra JSON”

ujson is a high-performance JSON library, written in pure C, designed for speed without sacrificing compatibility with the built-in json library's API. For applications that require rapid JSON encoding and decoding, particularly when dealing with large data sets, ujson is an excellent choice to elevate your JSON processing capabilities. It is compatible with Python 3.7 and above.


nujson

nujson is a fork of the high-performance `ujson` library that specializes in handling NumPy objects, particularly arrays, by providing native serialization support. It's an excellent choice for projects that involve scientific or numerical data and require efficient processing of NumPy objects alongside standard JSON data types.


rapidjson (python-rapidjson)

RapidJSON is a high-performance JSON library written in C++. It's fast, has no additional dependencies, is memory-friendly, and supports validation based on JSON Schema. While RapidJSON itself is a C++ library, Python bindings such as python-rapidjson let you use its functionality within Python projects. It's compatible with Python 3.
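The JSON Schema support mentioned above is exposed in python-rapidjson through its `Validator` class. Here is a hedged sketch (`is_valid` is a hypothetical helper; the code is guarded so it runs even without python-rapidjson installed, falling back to a minimal stdlib check of the schema's `required` keys):

```python
import json

SCHEMA = '{"type": "object", "required": ["title"]}'
DOC_OK = '{"title": "Shrek"}'
DOC_BAD = '{"year": 2001}'

def is_valid(doc, schema=SCHEMA):
    """Validate a JSON document string against a JSON Schema string.

    Uses python-rapidjson's Validator when installed; otherwise falls
    back to checking only the schema's 'required' keys with stdlib json.
    """
    try:
        import rapidjson
        try:
            rapidjson.Validator(schema)(doc)
            return True
        except rapidjson.ValidationError:
            return False
    except ImportError:
        required = json.loads(schema).get("required", [])
        return all(key in json.loads(doc) for key in required)

print(is_valid(DOC_OK), is_valid(DOC_BAD))  # True False
```

Note that the fallback is deliberately minimal; it is only there so the sketch runs anywhere, not a substitute for real schema validation.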


orjson

orjson is a high-performance JSON library implemented in Rust, offering Python bindings and focusing on speed and JSON specification compliance. Its unique features include serialization of non-string keys in dictionaries and improved handling of datetime objects. It's suitable for projects where performance is critical and non-traditional data types are frequently used.


The usage of these JSON libraries in Python is quite similar because they are largely compatible with the built-in json library's API. The primary functions for encoding and decoding JSON data, dumps() and loads(), are consistent across these libraries. This allows you to replace the built-in json library with an alternative simply by importing it as json.

# Built-in json
import json

# simplejson
import simplejson as json

# ujson
import ujson as json

# python-rapidjson
import rapidjson as json

# orjson
import orjson as json

# nujson
import nujson as json

Then, the usage of each of the mentioned libraries will look the same:

# Serialization
encoded_data = json.dumps(data)

# Deserialization
decoded_data = json.loads(encoded_data)
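One caveat with this drop-in pattern: `orjson.dumps()` returns `bytes` rather than `str`, so code that expects a string may break. A small wrapper can normalize the result (a sketch; `dumps_str` is a hypothetical helper name, and the example uses the built-in `json` so it runs anywhere):

```python
import json

def dumps_str(obj, library=json):
    """Serialize obj with the given library, always returning str.

    Libraries such as orjson return bytes from dumps(); decoding here
    keeps call sites uniform across all the libraries above.
    """
    result = library.dumps(obj)
    return result.decode() if isinstance(result, bytes) else result

print(dumps_str({"ok": True}))  # {"ok": true}
```

Passing `library=orjson` (if installed) would exercise the `bytes` branch.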


Since the usage of these libraries is so similar, let's proceed to the pinnacle of our exploration journey and evaluate their performance in terms of speed.

First of all, we need to install all of those libraries:

pip install numpy simplejson ujson python-rapidjson orjson nujson

We will use a previously mentioned sample JSON dataset derived from the movie Shrek. This dataset will be stored in a file named shrek.json for our testing purposes.

The code for our benchmarking function will appear as follows:

import timeit

def benchmark(library):
    # Encode once up front so the decode test has input to work with
    encoded_data = library.dumps(data)
    encoding = lambda: library.dumps(data)
    decoding = lambda: library.loads(encoded_data)

    encode_time = timeit.timeit(encoding, number=10000)
    decode_time = timeit.timeit(decoding, number=10000)

    return encode_time, decode_time

Here, `data` is the content of `shrek.json`, already loaded into a Python dictionary:

with open('shrek.json', 'r') as f:
    data = json.load(f)
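Putting the pieces together, a minimal end-to-end sketch might look like the following (self-contained: it uses only the built-in json and an inline stand-in for shrek.json, so it runs without the file or the third-party libraries; swap in the other modules to benchmark them):

```python
import json
import timeit

# Inline stand-in for the shrek.json record used in the article.
data = {"title": "Shrek", "year": 2001, "imdb_rating": 7.9}

def benchmark(library, number=1000):
    encoded_data = library.dumps(data)
    encode_time = timeit.timeit(lambda: library.dumps(data), number=number)
    decode_time = timeit.timeit(lambda: library.loads(encoded_data), number=number)
    return encode_time, decode_time

encode_time, decode_time = benchmark(json)
print(f"json: encode {encode_time:.6f}s, decode {decode_time:.6f}s")
```

Absolute timings will vary by machine and Python version; only the relative ordering between libraries is meaningful.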


Here are the results of the benchmark tests for the JSON libraries, displaying the time taken for encoding and decoding operations:

Library             Encoding time (s)    Decoding time (s)
Built-in json       0.094794             0.057344
simplejson          0.150766             0.063476
ujson               0.043851             0.043310
nujson              0.039100             0.042777
python-rapidjson    0.038320             0.053166
orjson              0.009230             0.022392

Based on these results, we can observe the following:

  • Built-in json: This library shows moderate performance in both encoding and decoding operations.
  • simplejson: The results indicate that this library is slower in encoding compared to the built-in json, and has approximately the same decoding times.
  • ujson: This library demonstrates better performance in both encoding and decoding operations than the built-in json and simplejson.
  • nujson: Tailored for handling NumPy objects, nujson demonstrates competitive performance in both encoding and decoding operations compared to ujson. While its primary focus is on NumPy serialization, it remains a viable option for standard JSON data types.
  • python-rapidjson: The performance of this library is quite similar to that of ujson and nujson, with slightly faster encoding times but marginally slower decoding times.
  • orjson: This library shows the best performance in both encoding and decoding operations among all the tested libraries.


As mentioned earlier, these results provide a general idea of the performance differences between the JSON libraries. However, some JSON libraries are better suited to handling NumPy objects, such as arrays, thanks to their specialized support for these data types. One such library is `nujson`. Let's perform a separate benchmark and check how these libraries handle NumPy data.

Instead of the Shrek record, our data will be generated with the NumPy library:

data = np.random.rand(1000, 1000).tolist()
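As a side note, `orjson` can serialize NumPy arrays directly, skipping the `.tolist()` conversion, via its `OPT_SERIALIZE_NUMPY` option. A guarded sketch (`encode_matrix` is a hypothetical helper; it falls back to the built-in json when orjson or NumPy is missing, so it runs anywhere):

```python
import json

def encode_matrix(matrix):
    """Encode a 2-D dataset to JSON bytes.

    Uses orjson's native NumPy support (OPT_SERIALIZE_NUMPY) when both
    orjson and numpy are available and the input is an ndarray;
    otherwise falls back to the built-in json on a plain list of lists.
    """
    try:
        import numpy as np
        import orjson
        if isinstance(matrix, np.ndarray):
            return orjson.dumps(matrix, option=orjson.OPT_SERIALIZE_NUMPY)
    except ImportError:
        pass
    return json.dumps(matrix).encode()

encoded = encode_matrix([[1.0, 2.0], [3.0, 4.0]])
print(encoded)  # b'[[1.0, 2.0], [3.0, 4.0]]'
```

In the benchmark below, all libraries receive the same `.tolist()` output so the comparison stays apples-to-apples.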


Library             Encoding time (s)    Decoding time (s)
Built-in json       0.017020             0.016064
simplejson          0.024024             0.016904
ujson               0.003151             0.004762
nujson              0.003559             0.002108
python-rapidjson    0.003829             0.003568
orjson              0.001916             0.001768

  • Built-in `json`: This library shows moderate performance in both encoding and decoding operations. It is suitable for simple use cases and general-purpose JSON handling, even when dealing with NumPy arrays.
  • `simplejson`: The results indicate that this library is slower in both encoding and decoding operations compared to other libraries.
  • `ujson`: This library demonstrates better performance in both encoding and decoding operations than the built-in json and simplejson. It is a great choice for applications that require faster JSON processing, including handling NumPy arrays.
  • `nujson`: Surprisingly, its encoding is slightly slower than `ujson`'s, but its decoding is roughly twice as fast.
  • `python-rapidjson`: The performance of this library is quite similar to that of `ujson` and `nujson`. It can be considered an alternative for handling JSON data, including NumPy arrays.
  • `orjson`: This library shows the best performance in both encoding and decoding operations among all the tested libraries. It is an excellent choice for projects that require high-performance JSON handling, especially when dealing with NumPy arrays.

To summarize, `orjson` stands out as the top performer, but the built-in json library remains a practical and convenient option for many projects. When choosing a JSON library, consider your specific requirements and weigh the benefits of each library against your needs.

Non-JSON alternatives

If you seek further optimization, non-JSON serialization alternatives can offer enhanced performance and storage efficiency. Understanding their benefits and use cases allows you to make informed decisions for your project's data handling needs. Let’s briefly take a look at some of them:

  • Protocol Buffers (protobuf): Developed by Google, Protocol Buffers are a language-agnostic binary serialization format. They are designed to be compact, efficient, and fast. The advantages of protobuf include smaller message sizes, faster encoding and decoding, and a strongly-typed schema. They are especially useful for communication between microservices or distributed systems.

  • MessagePack: This binary serialization format is an efficient alternative to JSON. MessagePack is designed to be compact and fast, with support for many programming languages. It offers smaller payload sizes compared to JSON and better performance in encoding and decoding. However, it might not be as human-readable as JSON.

  • Apache Avro: Avro is a binary serialization format developed by the Apache Software Foundation. It is especially suitable for big data and distributed systems like Hadoop. Avro is schema-based, which ensures compatibility between different versions of data structures. It offers compact data representation, fast encoding and decoding, and supports many programming languages.

  • Apache Thrift: Originally developed by Facebook, Apache Thrift is a binary serialization format that supports a wide range of programming languages. It provides a compact and efficient binary format, a strongly-typed schema, and enables cross-language communication. Thrift is suitable for microservices and distributed systems.

These alternatives offer various advantages, such as improved performance, smaller message sizes, and support for type safety and schema evolution. However, they may not be as human-readable or as widely supported as JSON. It's important to consider the specific requirements of your project when choosing a serialization format.

In this article, we have explored various Python JSON libraries, comparing their features and performance. While the built-in json library is suitable for many use cases, several alternative libraries offer unique characteristics that may cater to specific requirements; for example, nujson can be a good choice for working with NumPy data. We have also discussed non-JSON serialization alternatives, which can provide further performance improvements and storage efficiency. The choice of a serialization library or format will depend on your project's needs, and conducting thorough benchmark tests will help you determine the most suitable solution.
