Any library in Java or Python for storing lots of small objects in AWS S3 "intelligently"?

June 23, 2018

I need to store a lot of objects in S3, and I need to be able to address each object fairly quickly (a maximum of 2 GETs to find one object). I also don't want to be wasteful when it comes to the number of PUT requests, which are expensive at $10 per million PUTs.

Assuming that I could batch the PUTs, is there a library that could combine the objects into an index/payload format, where a list of keys is maintained in a separate object and RANGE requests are used to retrieve the payload of an individual object? (See the sketch below for the kind of layout I mean.)
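To make the question concrete, here is a minimal hand-rolled sketch of that index/payload layout using boto3. This is an illustration of the idea rather than an existing library; the bucket name, key layout, and JSON index format are all assumptions of mine:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cache-bucket"  # hypothetical bucket name

def put_batch(batch_id, objects):
    """Pack many small objects into one payload object plus one index object.

    `objects` maps key -> bytes. Writing a batch of N objects costs
    2 PUTs instead of N.
    """
    index, payload, offset = {}, bytearray(), 0
    for key, data in objects.items():
        index[key] = {"offset": offset, "length": len(data)}
        payload.extend(data)
        offset += len(data)
    s3.put_object(Bucket=BUCKET, Key=f"{batch_id}/payload", Body=bytes(payload))
    s3.put_object(Bucket=BUCKET, Key=f"{batch_id}/index",
                  Body=json.dumps(index).encode())

def get_one(batch_id, key):
    """Retrieve a single object in at most 2 GETs: the index, then a ranged read."""
    idx = json.loads(
        s3.get_object(Bucket=BUCKET, Key=f"{batch_id}/index")["Body"].read())
    entry = idx[key]
    start, end = entry["offset"], entry["offset"] + entry["length"] - 1
    resp = s3.get_object(Bucket=BUCKET, Key=f"{batch_id}/payload",
                         Range=f"bytes={start}-{end}")
    return resp["Body"].read()
```

If the index object is cached client-side, most lookups drop to a single ranged GET against the payload.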

The data is highly redundant across objects, so ideally it would be compressed with a compression dictionary shared across more than one object.
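For the shared-dictionary part, zstd's trained dictionaries do exactly this; they are exposed in Python by the third-party `zstandard` package (and in Java by zstd-jni). A minimal sketch, where the training samples are synthetic placeholders and a real dictionary would be trained on representative objects:

```python
import zstandard

# Train a shared dictionary on a corpus of small, mutually redundant objects.
# These samples are made up; in practice use a sample of the real data.
samples = [
    f'{{"id": {i}, "status": "active", "region": "us-east-1"}}'.encode()
    for i in range(5000)
]
shared_dict = zstandard.train_dictionary(1024, samples)

# Compress each object individually with the shared dictionary, so any
# single entry can still be decompressed on its own after a ranged GET.
compressor = zstandard.ZstdCompressor(dict_data=shared_dict)
decompressor = zstandard.ZstdDecompressor(dict_data=shared_dict)

blob = compressor.compress(b'{"id": 42, "status": "active", "region": "us-east-1"}')
assert decompressor.decompress(blob).startswith(b'{"id": 42')
```

The trained dictionary would itself have to be stored somewhere readers can find it (e.g., one more S3 object), and if entries are compressed before packing, the index offsets must refer to the compressed lengths.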

In my use case (YMMV): this library would be used in a persistent cache implementation, so data integrity and ACID properties are NOT a requirement. The data can be regenerated if some of it is lost during the update process... as long as that doesn't happen too often, since regenerating it would overwhelm the source system.

I have heard about the Parquet format and its supporting libraries, but on a first reading of the spec it was not at all clear whether it solves this particular problem.