Safe Analytics with Rust and Arrow

An update over an experimental implementation of Arrow in Rust.

Over the past 4 months I have been working on arrow2. Arrow2 is a re-write of the official arrow implementation in Rust (that I have been heavily contributing to over the past 8 months). Arrow2 is aimed at testing the following hypothesis: it is possible to have an implementation of the arrow Arrow specification in Rust that is

  • Fast,
  • Maintainable,
  • Secure,
  • Interoperable, and
  • Intuitive.

Arrow is an in-memory data format, i.e. a specification of how data is laid in memory (e.g. in RAM) to leverage modern architectures and interoperability between programming languages. The Arrow specification is incredibly well designed, and I recommend anyone working in analytics to study it.

Arrow2 is an implementation of the specification in Rust that tries to leverage Arrow to its maximum potential without jeopardizing security nor user experience. It started off as an experiment to address many of the security issues that the current Rust implementation of Arrow has and that I was unable to address on its current design.

The arrow2 implementation is based on the incredible work and contributions of many, many contributors to the Arrow implementation in Rust. You can find all of contributors here and here. Arrow2 borrows a lot of the ideas and code from there, and it just have a different inner working that in my opinion makes it both safe by design and more likely to be more performant.

Fast

The Arrow format is highly optimized for numeric operations with optional null values. Let’s compare Arrow2 against two popular numeric libraries: numpy (1.19.5) and pyarrow (4.0.0). In my opinion, these are all comparable because

  • they are all written in a low-level system programming language
  • they are highly optimized for compute
  • their compute model is based on immutable data structures
  • their compute kernels are single threaded (i.e. decisions over multi-threading model is left to their users)
  • They are columnar in-memory formats

Benchmarks against numpy are only shown for arrays without null values, as numpy does not support null values. I also added the performance of the current Rust arrow implementation (4.0.0) for illustration purposes. All the benchmark code is available in the repository here and here. The benchmarks for pyarrow and numpy are in Python, but the benched code is just a call to C/C++, which has a residual cost. I did not try any complex stuff: just compile the code and compare against pre-compiled binaries available in pypi. My machine is commodity hardware: Standard B2s from Azure.

Lets start with a simple example; a typical row-based operation, filtering based on a predicate:

Secondly, let’s compute the sum of an array of floats, with and without nulls values, the hallmark of an horizontal operation

And finally, something more complex that requires multiple index-based accesses, sort an array:

An important aspect of these results, which again shows how powerful the Arrow specification is, is that the comparison of arrow2 vs arrow crate vs pyarrow is simply based on their compute kernels: the in-memory representation of these 3 crates is exactly the same (we can even share memory pointers between them). This makes it much easier to share knowledge between implementations, thereby allowing the Arrow project to benefit, as a whole, from the performance improvement of one specific implementation.

Maintainable

An important selling point of Rust is that it has powerful abstractions that makes code easy to read, which in turn makes it easy to develop and maintain.

To offer an example, above we showed that the arrow2 is 2x faster than numpy in summing non-null arrays. For such difference, we would expect that arrow2’s code is highly optimized and thus have complex optimizations. Below is the verbatim implementation of the sum for non-null arrays that results in the performance above:

fn nonnull_sum<T>(values: &[T]) -> T
where
T: NativeType + Simd,
T::Simd: Add<Output = T::Simd> + Sum<T>,
{
let mut chunks = values.chunks_exact(T::Simd::LANES);
let sum = chunks.by_ref().fold(T::Simd::default(), |acc, chunk| {
acc + T::Simd::from_chunk(chunk)
});
let remainder = T::Simd::from_incomplete_chunk(chunks.remainder(), T::default());
let reduced = sum + remainder;
reduced.simd_sum()
}

For those not familiar with Rust, this is roughly translated to: for any native type with a SIMD representation whose SIMD can be added and reduced by sum, compile the body. The body reads as: divide the `values` in chunks of LANES (16 for a 32-bit float), sum the chunks in lanes (sum), and then reduce the sum (i.e. sum each lane together), together with the remainder of the chunking. This is just a typical strategy to implement a sum. Under the hood, the Rust compiler will compile this for each type required (e.g. i32, f64), and use SIMD instructions for operations that the target architecture supports.

Maintenance and clean design is crucial in an open source project as it enables people to follow what is being done without having to spend a large amount of time reading the code.

Secure

The main selling point of Rust is that it offers compile-time protection against a set of programming patterns that are prone to security vulnerabilities. Rust is gathering a lot of attention because it addresses many of the common security challenges of developing in a system programming language.

With that said, it is possible to do everything in Rust that is possible in C or C++, which implies that Rust only offers an increased security if developers use its compiler effectively.

Arrow2 aimed at re-designing the official arrow crate so that the compiler can help much more than what it can currently help in the arrow crate. Arrow2 is a re-write of the arrow crate in a manner that lends the Arrow specification well to Rust’s programing paradigm. As a result, it addresses all known security vulnerabilities that exist in the current Rust Arrow implementation and passes all MIRI checks.

As an example of this, in the code above, all safety around memory management is proven by the compiler during compilation (in Rusts’ notation, no unsafe API is being used on it or any its calls, recursively).

In my opinion, this point is crucial in arrow2 because information security is kind of mandatory to any library that handles data in an enterprise setting.

Interoperable

The hallmark of the Arrow format is that its specification is designed to enable interoperability between implementations at multiple levels: within the same thread, across threads, and across processes. The Arrow format offers 2 main specifications for this:

  • IPC specification, including the flight protocol
  • FFI specification for in-memory data sharing

arrow2 supports both of then, and has integration tests against all official implementations (C++, Java, R, Go) for the IPC spec, and integration tests against the C++/pyarrow implementation for FFI spec.

Intuitive

An important aspect of any API is how intuitive it is to use. Arrow2 addresses this by identifying the main aspects of the Arrow specification and decomposing them in their natural constructs in Rust. The main building blocks of the specification are:

  • physical types (such as an i32)
  • Buffer
  • Bitmap
  • logical types (such as “date”, represented by a physical type)
  • array types (such as a Boolean array)

Physical types are converted to and from bytes based on endianness, and are usually necessary arguments for templating, as they generate different machine code based on their physical size. They usually also induce architecture-dependent SIMD, which needs to be compiled for each type.

A buffer (of physical type T) is a contiguous memory region holding elements of type T, represented in native endianness. These buffers can be written to parquet, ORC or anything that has an explicit endianness contract. individual values on a buffer can be accessed at constant time via pointer arithmetic. Contiguous regions enables SIMD instructions to be used on them, as multiple values can be loaded from a single memory region.

A bitmap is a special container to store a vector of booleans, where each boolean is an bit, not a byte, as per Arrow specification. This needs special treatment because pointer arithmetic does not work on these.

Finally, we have a logical type: a semantic construct whose each variant drives certain logical operations (e.g. sum an interval in days with a timestamp in seconds is not the usual sum of two numbers representing each).

Arrow2 declares each of these components, and derives all its public API from them. For example, this is the array of native types (e.g. of f32):

pub struct PrimitiveArray<T: NativeType> {  // physical type
data_type: DataType, // logical type
values: Buffer<T>,
validity: Option<Bitmap>,
}

which is intuitive to reason about: it has a logical datatype, its physical type induces a specific pointer arithmetic on the Buffer, and it has an optional validity to keep track of which elements are nulls and not.

Closing remarks

I embarked on this journey out of curiosity: I wanted to understand how much it is possible to leverage the Arrow format without jeopardizing security and maintainability. In my opinion, the hypothesis that I wanted to test has been demonstrated — we can achieve performance, maintainability, security, interoperability and an intuitive API.

In terms of feature set, arrow2 has all features of the official implementation apart from from being able to read and write nested types from parquet. It has a bunch of other major features, such as being able to read big-endian IPC, supporting nested types in FFI, and being 3–15x faster to read and write to and from parquet than the arrow crate, but that is for another story…

A natural step forward is to consider incorporating this implementation in Apache Arrow, in the hopes that we can gradually move to have arrow2 to be the official implementation of Arrow in Rust.

Leveraging information asymmetry