

Ismael Juma
@ijuma
Kafka, Scala, JVM, distributed systems, performance, machine learning, Haskell, @ConfluentInc.

🏎️ Made some good progress over the weekend improving the performance of the #Hardwood parser for Apache Parquet; 11 files from the 2025 NYC taxi ride data set (~720 MB) can now be fully parsed in ~1.9 sec.

Besides some decoder tweaks, I focused mostly on improving the parallelism of the parser this time. Which, as it turns out, is a surprisingly tricky problem. I'm still not really happy with where things are, but they are much better than before.

A naive approach would be to just parse separate column chunks in parallel. This can help a little, but it falls short quickly: your file might just not have many columns to begin with, or the columns could have different lengths (one column is repeatable, while others are not).

So I took a first stab at implementing page-level parallelism (the Parquet format organizes files into row groups, which are made up of column chunks, which in turn are made up of pages), which allows fanning out the work at a more fine-grained level. Once you have identified the page boundaries within a chunk (Parquet supports indexes for that, but not all files have them) and you have parsed its dictionary (if the column uses dictionary encoding), you can distribute the work of parsing pages to multiple threads, which increases CPU utilization a lot.

There's still a problem: how CPU-intensive the decoding of a given page is differs significantly depending on its encoding type, and with it the time it takes to parse the page; essentially, faster columns will wait on slower columns. The way I'm currently tackling this is adaptive page pre-fetching: slower columns build up a deeper page pre-fetch queue over time, so more threads can pick up their parsing tasks. Eventually, whenever a new page is needed while iterating through the Parquet file, that page should already be decoded, no matter its value or encoding type.
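To make the page-level fan-out concrete, here's a minimal sketch of decoding one chunk's pages on a thread pool once the page boundaries are known. All names here (`DecodedPage`, `decodePage`, `decodeChunk`) are my own illustration, not the actual Hardwood API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PageParallelDecode {

    // Stand-in for a decoded page; in reality this would hold column values.
    record DecodedPage(int pageIndex, int byteCount) {}

    // Stand-in for the CPU-intensive per-page work (RLE/dictionary/plain decoding).
    static DecodedPage decodePage(byte[] pageBytes, int pageIndex) {
        return new DecodedPage(pageIndex, pageBytes.length);
    }

    // Decode all pages of a chunk in parallel, assuming page boundaries are
    // already known (e.g. from the page index) and the chunk's dictionary
    // (if any) has been parsed up front.
    static List<DecodedPage> decodeChunk(List<byte[]> pages) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<DecodedPage>> futures = new ArrayList<>();
            for (int i = 0; i < pages.size(); i++) {
                final int idx = i;
                final byte[] bytes = pages.get(i);
                futures.add(pool.submit(() -> decodePage(bytes, idx)));
            }
            List<DecodedPage> result = new ArrayList<>();
            for (Future<DecodedPage> f : futures) {
                result.add(f.get()); // blocks in order, preserving page order
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> pages = List.of(new byte[100], new byte[200], new byte[50]);
        System.out.println(decodeChunk(pages).size()); // 3
    }
}
```

Collecting the futures in submission order keeps the reader seeing pages in file order, even though they were decoded out of order.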
This gets me to a CPU utilization of ~800%, which is a significant improvement over single-threaded parsing or basic column-level parallelism. In wall-clock profiling, I'm still seeing decoder threads idle about half of the time, but we're getting there, step by step 🤓. 👉 github.com/hardwood-hq/ha…
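The adaptive pre-fetching policy described above could be sketched roughly like this; the class and its tuning constants are my own illustration, not Hardwood's actual implementation:

```java
// Per-column target pre-fetch depth: when the reader stalls on a column
// (the next page wasn't decoded yet), that column gets a deeper queue,
// so more decode threads pick up its pages ahead of time.
public class AdaptiveDepth {
    private int depth;
    private final int maxDepth;

    public AdaptiveDepth(int initialDepth, int maxDepth) {
        this.depth = initialDepth;
        this.maxDepth = maxDepth;
    }

    // Reader asked for a page that wasn't ready: this column is slow,
    // double the pre-fetch depth (capped).
    public void onStall() {
        depth = Math.min(depth * 2, maxDepth);
    }

    // Page was already decoded on request: shrink gently so fast columns
    // don't hog decode threads.
    public void onHit() {
        depth = Math.max(depth - 1, 1);
    }

    public int targetDepth() {
        return depth;
    }
}
```

With a policy like this, slow columns converge on a deep queue while fast columns stay shallow, which is one way to keep all decoder threads busy without hand-tuning per-column depths.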