Parquet with Rust

Jorge C. Leitão
Aug 9, 2021


An update on an experimental implementation of Parquet in Rust

Over the past 5 months I have been experimenting with Parquet and Rust, and this post offers an update on that experiment. The main hypothesis I wanted to test is that it is possible to write an implementation that reads from and writes to Parquet and that is:

  • Safe
  • Fast
  • Interoperable
  • Portable
  • Easy to maintain

Some of these requirements traditionally compete with each other. For example, thread-safe code is usually difficult to write and maintain, and speed sometimes requires careful optimizations that often involve removing out-of-bounds checks on memory reads. Endianness independence is usually also difficult to achieve.

Parquet2 is a Rust library that, in my opinion, addresses the 5 points described above.

It is important to say that some of the crate's code base would not have been possible without the impressive contributions from Apache Arrow's contributors, which you can find here. Three people deserve particular mention: Sutou Kouhei, Chao Sun, and nevi-me, who spearheaded a lot of the original work on the parquet crate, on which I am basing this experiment.

With this said, let’s now go through each one of the items above in detail:

Safe

Rust is a systems programming language whose compiler is able to prove that certain code results in undefined behavior, and that fails compilation when that is the case. There are cases where the Rust compiler cannot prove that the code is sound, even though we know that it is. For these cases, the language has a special keyword, unsafe, that developers use to tell the compiler not to try to prove soundness. Using unsafe requires careful consideration, as it is a very common source of vulnerabilities in Rust.

Parquet2 does not use unsafe, so the crate's own code is proven to be sound by the compiler. Some of its dependencies do use unsafe, and thus an audit of them is still necessary.
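To illustrate what "no unsafe" looks like in practice, here is a minimal sketch of reading a length-prefixed byte array (the layout that Parquet's PLAIN encoding uses for BYTE_ARRAY values) with bounds-checked accessors only. The function name and signature are hypothetical and are not taken from parquet2; a truncated buffer yields None instead of an out-of-bounds read.

```rust
use std::convert::TryInto;

// Hypothetical helper for illustration only; not parquet2's API.
// The attribute makes the compiler reject any `unsafe` block in this function.
// A crate can enforce the same rule globally with `#![forbid(unsafe_code)]`
// at its root.
#[forbid(unsafe_code)]
fn read_byte_array(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    // A PLAIN-encoded BYTE_ARRAY is a 4-byte little-endian length
    // followed by that many bytes.
    let len_bytes: [u8; 4] = buf.get(..4)?.try_into().ok()?;
    let len = u32::from_le_bytes(len_bytes) as usize;

    // `get` returns `None` when the declared length exceeds the buffer,
    // so malformed input cannot cause an out-of-bounds read.
    let rest = &buf[4..];
    let value = rest.get(..len)?;

    // Return the value and the remaining, still undecoded bytes.
    Some((value, &rest[len..]))
}
```

The cost of this style is a few extra branches; the benefit is that malformed files surface as ordinary values the caller can handle.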

Fast

To see how fast parquet2 is, let's compare it against two implementations: the official C++ implementation, exposed in Python through pyarrow, and the official Apache implementation in Rust, the parquet crate. Below are the main results of reading a single parquet page, which corresponds to the smallest unit of work that you can do in parquet. We will discuss parallelism afterwards.

This includes serialization and encoding done by arrow2. More details here.

The second and equally important aspect of parquet2 is that it offers a complete separation of CPU-intensive tasks (work) and IO-intensive tasks (read/write). This allows users to parallelize read and work according to their individual setups. As examples, let’s go through two common use-cases that parquet2 supports:

  • reading from a local filesystem
  • reading over a network (e.g. s3)

When reading from a local filesystem on a multi-core configuration, it is usually advantageous to have a dedicated thread that only reads (IO), and separate workers that perform all the CPU-intensive work. Parquet2 enables this by offering in-memory compressed pages that can cross thread boundaries, thereby allowing them to be sent to worker threads.
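The single-reader, multi-worker setup can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not parquet2's API: CompressedPage, process, and read_and_process are hypothetical names, the "read" step just iterates over buffers already in memory, and the "work" step is a placeholder byte sum standing in for decompression and decoding.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a compressed page: it owns its buffer,
// so it is `Send` and can cross thread boundaries.
struct CompressedPage {
    buffer: Vec<u8>,
}

// Placeholder for the CPU-intensive work (decompress, deserialize, decode).
fn process(page: CompressedPage) -> u64 {
    page.buffer.iter().map(|&b| b as u64).sum()
}

fn read_and_process(pages: Vec<Vec<u8>>, n_workers: usize) -> u64 {
    let (page_tx, page_rx) = mpsc::channel::<CompressedPage>();
    // std's channel has a single consumer, so workers share it via a mutex.
    let page_rx = Arc::new(Mutex::new(page_rx));
    let (result_tx, result_rx) = mpsc::channel::<u64>();

    // Reader thread: IO only. Dropping `page_tx` at the end closes the channel.
    let reader = thread::spawn(move || {
        for buffer in pages {
            page_tx.send(CompressedPage { buffer }).unwrap();
        }
    });

    // Worker threads: CPU only.
    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let page_rx = Arc::clone(&page_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is released at the end of this statement,
                // so pages are processed in parallel.
                let page = page_rx.lock().unwrap().recv();
                match page {
                    Ok(page) => result_tx.send(process(page)).unwrap(),
                    Err(_) => break, // channel closed: no more pages
                }
            })
        })
        .collect();
    drop(result_tx); // only the workers' clones keep the result channel open

    reader.join().unwrap();
    for worker in workers {
        worker.join().unwrap();
    }
    result_rx.iter().sum()
}
```

For example, read_and_process(vec![vec![1, 2, 3], vec![4, 5]], 2) sends two pages through the channel and combines the per-page results. The same shape extends to several reader threads by cloning the page sender.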

When reading over a network (e.g. s3), it is usually advantageous to use a multi-reader, multi-worker setup whereby many readers can access s3 and slowly pull bytes (IO-intensive), which are then shipped over to workers that decompress, deserialize, and decode the pages to an in-memory format. Again, parquet2 allows this out of the box.

parquet2 supports async reading (example over s3) without committing to a runtime, thereby allowing the IO-intensive tasks to be non-blocking.

Interoperable

The parquet2 crate is tested in integration against pyarrow and pyspark, with round-trips between them on different configurations (e.g. types, encodings, compressions, parquet versions).

Portable

Being written in a systems programming language, this code can be compiled for any architecture that the Rust compiler supports. Furthermore, all type-to-byte conversions use explicit endianness conversions, ensuring compatibility with big-endian machines.
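As a small illustration of endianness-independent conversions, the sketch below encodes and decodes 32-bit integers as 4 little-endian bytes each, which is how Parquet's PLAIN encoding stores INT32 values. The function names are hypothetical and not taken from parquet2; the point is that to_le_bytes and from_le_bytes produce the same bytes on little-endian and big-endian hosts, unlike reinterpreting memory directly.

```rust
// Hypothetical helpers for illustration only; not parquet2's API.

// Encode each value as 4 little-endian bytes, regardless of host endianness.
fn encode_plain_i32(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

// Decode 4-byte little-endian chunks back into native integers.
// Trailing bytes that do not form a full chunk are ignored in this sketch.
fn decode_plain_i32(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

On a little-endian machine these conversions typically compile down to plain loads and stores, so portability does not have to cost speed.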

Easy to maintain

Parquet2 supports all the functionality that the official crate supports. Yet its source code is ~3x smaller and does not use unsafe. This results in a much smaller maintenance burden.

Closing remarks

Parquet2 is still in its early stages, especially when compared with the very mature implementations in C++ and Java.
