2 simple techniques to reduce JSON's size

A 5-minute story written in April 2020 by Adrian B.G.


In this article, I want to play the devil’s advocate and showcase a few techniques to reduce the size of any stored JSON object. The use cases lean toward back-end web development, where JSONs are stored in databases and messaging queues and processed in bulk.

We will partially sacrifice the “human-friendly” factor of JSON, so the techniques do not make sense for public APIs or the front end, unless performance is a priority, in which case you should not be using JSON at all.

Single letter keys

I want to start by mentioning that this technique is old, probably as old as JSON itself, and it has a simple rationale: minimize the size of a JSON-encoded object by using single-letter keys, as shown in the following example (snippet from the “Designing Data-Intensive Applications” book):

{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming","hacking"]
}

By applying the trick, our serialized object becomes:

{
  "u": "Martin",
  "f": 1337,
  "i": ["daydreaming","hacking"]
}

As you can see, the verbosity that JSON spoiled us with has decreased, but in the context of a single object it is still usable. It remains a text (non-binary) format, the fields always appear in the same order (as long as you use the same encoder) and you can usually deduce the key from the value.

As a side note, I have used this technique only at internal scope, meaning that I kept 2 versions of the schema for the same object: the single-letter one for storage and back-end services, while the objects served through the public APIs used the long-key version, for readability.
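In Go, for example, this dual-schema approach can be sketched with two struct types that share the same fields but carry different JSON tags (a minimal sketch with hypothetical type names, based on the example above):

package main

import (
	"encoding/json"
	"fmt"
)

// UserProfile is the long-key schema, served through the public API.
type UserProfile struct {
	UserName       string   `json:"userName"`
	FavoriteNumber int      `json:"favoriteNumber"`
	Interests      []string `json:"interests"`
}

// userProfileCompact is the single-letter-key schema used for storage
// and internal back-end services.
type userProfileCompact struct {
	UserName       string   `json:"u"`
	FavoriteNumber int      `json:"f"`
	Interests      []string `json:"i"`
}

func main() {
	p := UserProfile{UserName: "Martin", FavoriteNumber: 1337, Interests: []string{"daydreaming", "hacking"}}

	// The two structs have identical fields, so a direct conversion copies the data.
	c := userProfileCompact(p)

	long, _ := json.Marshal(p)
	short, _ := json.Marshal(c)
	fmt.Println(string(long))  // {"userName":"Martin","favoriteNumber":1337,"interests":["daydreaming","hacking"]}
	fmt.Println(string(short)) // {"u":"Martin","f":1337,"i":["daydreaming","hacking"]}
}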

If you care about performance and storage size, I suggest switching to a binary encoding like FlatBuffers or Protocol Buffers, and using single-letter keys only as a “low-hanging fruit” task.

Storage

This optimization makes sense in areas where JSON is stored for a long period and/or developers do not read it manually very often (as happens with a REST API when debugging). The major use cases I can see it applied to (or have used it for successfully) are:

  • document-oriented databases (MongoDB)
  • columns that treat the JSON as a blob (meaning the values are not indexed; the column type is text or blob and contains one or more JSON-serialized objects). Examples: MySQL, Cassandra, Elasticsearch.
  • messaging queues, streams and data pipelines (Kafka, NiFi, StreamSets)
  • cold storage and data lakes (S3)

Size reduction

The single-letter keys are not a magic bullet; as you can quickly realize, they do not make sense when the keys are very small compared to the values. If an event has an average size of 32 KB and only 10 keys, the effect will be unnoticeable.

But luckily for us, the technique works for most real-world scenarios, and I have successfully applied it to a multi-TB Elasticsearch cluster and a MongoDB database, gaining an average of 30% less storage space.

In most cases we do not care about storage size because it is the cheapest component, but you have to think outside the box and realize that a smaller document size also benefits

  • the network traffic, which is the major bottleneck in large distributed systems
  • the RAM usage of the database (in MongoDB’s case it allowed larger aggregations) and of the storage clients (e.g. the micro-services)
  • unmarshal/decoding performance (CPU and latency)

For simplicity I will treat 1 character as 1 byte and use this example:

{"userName":"Martin","favoriteNumber":1337,"interests":["daydreaming","hacking"]}
  • current size: 81 bytes
  • single-letter keys version: 53 bytes (35% smaller)
  • Thrift CompactProtocol: 34 bytes (58% smaller)
  • Protocol Buffers: 33 bytes (59% smaller)
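The two JSON figures are easy to check with a few lines of code (a quick sketch; the Thrift and Protocol Buffers sizes are the ones reported in the book):

package main

import "fmt"

func main() {
	long := `{"userName":"Martin","favoriteNumber":1337,"interests":["daydreaming","hacking"]}`
	short := `{"u":"Martin","f":1337,"i":["daydreaming","hacking"]}`

	fmt.Println(len(long))  // 81 bytes
	fmt.Println(len(short)) // 53 bytes
	fmt.Printf("reduction: %.0f%%\n", 100*(1-float64(len(short))/float64(len(long)))) // reduction: 35%
}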

In conclusion, single-letter keys are a low-hanging-fruit optimization, easy to apply, but they still do not compare to a binary encoding format in terms of performance. But … if you can reduce the size of your cluster by hundreds of GBs with minimal code impact, I think that is a big win!


Compaction

Another technique that I want to mention is compaction, or more precisely compression. Web developers use minification and compression every day for JavaScript, HTML and CSS files, but less often on the back end. This functionality is often implemented in the storage or messaging systems themselves (Kafka, gRPC) and does not require developers to write any code.

“Manual” compression makes more sense when you handle or store a series/bulk of objects, when you are willing to tilt the trade-off from network traffic/storage size toward more CPU usage, and when you do not want to (or cannot) enable the built-in compression of the technologies you are using.

As a reminder, compression is more efficient when substrings occur more often, meaning that archiving many similar objects together gives better ratios.

User preferences are a random example: say we have a list of objects that are used by the client (front end), consisting of 10-20 custom page settings. They will probably be stored by the back end as a block of text, since they do not need to be indexed in separate columns of the database.

[{
  "view-page": "table1",
  "settings": [{
            "column1-size": "30%",
            "column2-size": "30%",
            "column2-hidden": true
          }]  
}]
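As a rough sketch (assuming Go and the standard compress/gzip package, with the hypothetical settings payload from above), compressing the whole bulk as a single blob also shows why similar objects compress so well together:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipSize compresses a payload with the default level and returns the compressed length.
func gzipSize(payload []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Len()
}

func main() {
	// One JSON-serialized settings object (the example above, minified).
	one := []byte(`{"view-page":"table1","settings":[{"column1-size":"30%","column2-size":"30%","column2-hidden":true}]}`)
	// A bulk of 20 similar objects stored together as a single blob.
	batch := bytes.Repeat(one, 20)

	fmt.Printf("single object: %d -> %d bytes\n", len(one), gzipSize(one))
	fmt.Printf("batch of 20:   %d -> %d bytes\n", len(batch), gzipSize(batch))
	// The batch compresses far better per object than a single document,
	// because the repeated keys and values become cheap back-references.
}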

Choosing which algorithm to use is not that straightforward. You will have to take the following criteria into consideration:

  • the programming languages you are using, and if you can find a well supported implementation
  • balance between compressed size vs CPU usage
  • balance between compression and decompression speed and CPU usage

You should run your own benchmark tests with your own data, and/or do some reading; here is a comparison made by the Brotli creators: Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms.
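A minimal benchmark sketch (assuming Go’s standard compress/gzip package and a hypothetical sample.json file holding a representative slice of your own data), comparing compression levels by output size and time:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"os"
	"time"
)

func main() {
	// Load a representative sample of your own data (hypothetical file name).
	data, err := os.ReadFile("sample.json")
	if err != nil {
		panic(err)
	}

	for _, level := range []int{gzip.BestSpeed, gzip.DefaultCompression, gzip.BestCompression} {
		var buf bytes.Buffer
		zw, err := gzip.NewWriterLevel(&buf, level)
		if err != nil {
			panic(err)
		}
		start := time.Now()
		zw.Write(data)
		zw.Close()
		fmt.Printf("level %d: %d -> %d bytes in %v\n", level, len(data), buf.Len(), time.Since(start))
	}
}

The same harness can be pointed at other codecs (snappy, zstd, Brotli bindings) to compare them on your own payloads.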

Thanks! 🤝

Please share the article, subscribe, or send me your feedback so I can improve future posts!
