I was recently working with the Reddit comment and submission dumps from PushShift (RIP). These are compressed in the Zstandard (.zst) format. Unfortunately, Python's extensive standard library has no native support for this format, and some of the files are quite large, so a streaming API is necessary.
After trying various third-party libraries, I finally found one that worked with a minimum of fuss: pyzstd, available from PyPI or Conda. It appears to use Meta's reference C implementation as the backend, but more importantly, it provides a streaming API like the familiar `gzip.open`, `bz2.open`, and `lzma.open` for `.gz`, `.bz2`, and `.xz` files, respectively. There's one nit: PushShift's Reddit dumps were compressed with an uncommonly large window size (2^31 bytes), and one has to inform the decompression backend of this. Without it, I was getting the following error:
```
_zstd.ZstdError: Unable to decompress zstd data: Frame requires too much memory for decoding.
```
All I had to do to fix this was pass the relevant parameter:

```python
import pyzstd

PARAMS = {pyzstd.DParameter.windowLogMax: 31}
with pyzstd.open(yourpath, "rt", level_or_option=PARAMS) as source:
    for line in source:
        ...
```
Each `line` is then a JSON message with the post (either a comment or a submission) and all its metadata.
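Each record can be parsed with the standard `json` module. A minimal sketch (the line below is illustrative, shaped like a PushShift comment record; real records carry many more fields):

```python
import json

# An illustrative line, shaped like a PushShift comment record
# (real records include many more metadata fields).
line = (
    '{"author": "someuser", "subreddit": "AskReddit", '
    '"body": "hello", "created_utc": 1609459200, "score": 3}'
)

post = json.loads(line)
print(post["subreddit"], post["score"])  # → AskReddit 3
```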