polars data

438 posts

polars data

@DataPolars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust.

Amsterdam · Joined July 2022
6 Following · 7.2K Followers
polars data@DataPolars·
Quoting Jensen: "All of these platforms are processing DataFrames. This is the ground truth of business. This is the ground truth of enterprise computing. Now we will have AI use structured data. And we are going to accelerate the living daylights out of it." Polars DataFrames are at the core of the AI revolution. youtube.com/watch?v=jw_o0x…
polars data@DataPolars·
We've released Python Polars 1.39. Some of the highlights:

• Streaming AsOf join
join_asof() is now supported in the streaming engine, enabling memory-efficient time-series joins.

• sink_iceberg() for writing to Iceberg tables
A new LazyFrame sink that writes directly to Apache Iceberg tables. Combined with the existing scan_iceberg(), Polars now supports full read/write workflows for Iceberg-based data lakehouses.

• Streaming cloud downloads
scan_csv(), scan_ndjson(), and scan_lines() can now stream data directly from cloud storage instead of downloading the full file first.

Link to the complete changelog: github.com/pola-rs/polars…
polars data@DataPolars·
A one-liner will route every .collect() call through the streaming engine: pl.Config.set_engine_affinity("streaming"). Put it at the top of your script and all subsequent .collect() calls will prefer the streaming engine. You can also pass engine="streaming" directly to a single .collect() call if you only want to opt in for one query.

The streaming engine processes data in chunks rather than loading everything into memory at once. It's 3-7x faster than the in-memory engine, and for workloads that exceed available RAM it's the only viable option. We will soon make the streaming engine the default, but this way you can already enjoy its benefits.
polars data@DataPolars·
pl.from_repr() constructs a DataFrame or Series directly from its printed string representation. This can be useful in unit tests: instead of rebuilding expected DataFrames through dictionaries with typecasting, the schema is encoded in the header and the values are right there in the table. You can see at a glance what the test is asserting.
polars data@DataPolars·
Easily scale Polars queries from @ApacheAirflow

Our latest blog post walks through different patterns to run distributed Polars queries using Airflow: fire-and-forget execution, parallel queries, multi-stage pipelines, and manual cluster shutdowns.

Read more here: pola.rs/posts/airflow-…
polars data@DataPolars·
Polars exposes two ways to measure string length: str.len_bytes() and str.len_chars(). The difference matters more than you'd think.

In terms of precision: len_bytes counts raw UTF-8 bytes, while len_chars counts Unicode code points. For pure ASCII text they return the same number, but the moment you have accented characters, CJK text, or emoji, they diverge. For example, Japanese characters take 3 bytes each; emoji take 4.

In terms of performance: on a dataset with 5 million rows, len_bytes runs about 20x faster than len_chars. That's because determining the number of bytes is a single metadata lookup on the underlying buffer, with no traversal needed (complexity: O(1)). len_chars has to walk every string byte-by-byte to find code point boundaries (complexity: O(n)).

So which one should you use?
• len_bytes: when you're working with guaranteed ASCII data (such as hashes, IDs, standard codes), when an approximation of the length is close enough, or when you need to know how many bytes the string takes in memory.
• len_chars: when your data contains user-generated text, names, addresses, or anything multilingual, or when you need the precise character count.

Benchmark code: gist.github.com/TNieuwdorp/75e…
polars data@DataPolars·
We've released Python Polars 1.38. Some of the highlights:

• (De)compression support on text-based sources and sinks
zstd and gzip are now supported for write_csv(), sink_csv(), scan_ndjson(), and sink_ndjson().

• scan_lines() to read text files
This new function constructs a LazyFrame by scanning lines from a file into a string column. This is particularly useful for working with (compressed) log files.

• Merge join in the streaming engine
When join columns are sorted in both DataFrames, we now use a merge join, which can improve performance 2-4x and in some cases even up to 10x. To unlock these performance gains, use the lazy API and apply set_sorted(col) to let Polars know the data is sorted.

Link to the complete changelog: github.com/pola-rs/polars…
polars data@DataPolars·
The early design decisions for the Categorical type were under strain because of our streaming engine. Every data chunk carried its own mapping between the categories and their underlying physical values, forcing constant re-encoding. The global StringCache we built to solve it caused lock contention and wasn't designed for a distributed architecture.

The new Categories object, released in 1.31, solves this and gives you:
• Control over the physical type (UInt8/16/32)
• Named categories with namespaces
• Parallel updates without locks
• Automatic garbage collection

When you know the categories up front you can use Enums. They're faster because of their immutability, and they let you define the sorting order of values.

The StringCache is now a no-op, but existing code will keep working as before (with global Categories). You can also migrate by replacing it with explicit Categories where needed.

The result is a Categorical data type that works well on the streaming engine without performance degradation, and is compatible with a distributed architecture. Read the full deep dive: pola.rs/posts/categori…
polars data reposted
Ritchie Vink@RitchieVink·
In 1-2 weeks we land live query profiling in Polars Cloud. See exactly how many rows are consumed and produced per operation. Which operation takes most runtime, and watch the data flow through live, like water. 😍 If your query takes too long, you can see why that happens and act upon it. However, we're fast so you gotta be quick ;)
polars data reposted
Ritchie Vink@RitchieVink·
ClickBench now runs the Polars streaming engine. Polars is the fastest solution on that benchmark on Parquet file(s) 😎 The speed is there. This year, we will tackle out of core (spill to disk) and distributed to truly tackle scale. benchmark.clickhouse.com/#system=-ahi|A…
polars data@DataPolars·
We just released Polars 1.37, here are the highlights:

Performance Improvements
- Streaming Sinks: New streaming pipelines for NDJSON, CSV, and IPC sinks (1.14x-1.88x speedup with as little as ~10% of the original memory usage, depending on the use case), plus optimized memory usage for grouping and Parquet scanning.
- Streaming Compressed CSVs: Polars now supports streaming compressed CSV files, allowing you to process large files without loading them into memory all at once.
- Faster SQL Ordering: Significant speedups for ORDER BY clauses in the SQL interface.

New Features & APIs
- pl.PartitionBy: A unified and cleaner API for defining partition strategies when sinking datasets.
- min_by / max_by: New aggregation expressions to easily find values corresponding to the min/max of another column.
- Series.sql(): You can now run SQL queries directly against a Series object for quick analysis.

Technical Updates
- Free-Threading Support: Polars is now marked as safe for free-threading (Python 3.13+), paving the way for true parallelism without the GIL.
- Python 3.9 Support Dropped: This Python version is end-of-life. Please upgrade to Python 3.10+ to use the latest versions.
- musl Builds: Polars now offers pre-built versions for musl systems like Alpine Linux, so you no longer have to compile it yourself, reducing build times.

Find the full release notes here: github.com/pola-rs/polars…
polars data@DataPolars·
We are kicking off the year by growing our engineering team in Amsterdam. As we continue to build the next generation of DataFrames, we are looking for engineers to join us onsite (hybrid) to help scale our platform:

- Platform Engineer (Rust): To help us build our distributed engine and cloud platform. You need deep knowledge of async Rust (Tokio) and distributed systems architecture.
- Full Stack Engineer: To build the frontend architecture of Polars Cloud. You need strong experience with React, TypeScript, and modern tooling (Next.js, Tailwind, ShadCN).

Do you know someone who fits, or are you that person? Apply at pola.rs/careers or find our other positions.
polars data@DataPolars·
Did you know about pl.corr()? We modeled a simple scenario: remote work vs. productivity. (Disclaimer: the productivity metrics are... questionable.)

The problem is that data aggregation can hide what's really going on. This is Simpson's Paradox: a phenomenon where a trend appears in different groups of data but disappears or reverses when these groups are combined.

If you look at the global correlation in our data, it suggests remote work has a slightly negative impact (-0.09). But when you check the specific groups, the picture changes:
• Sales suffer from remote work (-0.80)
• Devs flourish (+0.74)

With pl.corr(), you don't need complex loops to find these discrepancies. You can calculate global and grouped correlations side-by-side for a quick comparison. Sometimes the devil really is in the details.
polars data reposted
Daniel van Strien@vanstriendaniel·
Is the web getting more educational? @DataPolars can stream (some would say blazingly fast) from @huggingface Hub. I analysed 736GB from finepdfs-edu without downloading locally. 50M rows spanning 2013-2025. Answer: Yes. High-quality educational content is up ~22%. ~14 min on HF Jobs. No local storage needed. For smaller languages, it's a few seconds.
polars data@DataPolars·
"We adopted Polars to meet strict technical requirements, but the result went beyond simple optimization. The 30x performance improvement gave us the unexpected opportunity to do more." Read about how Rabobank successfully deployed Polars in a critical enterprise production environment. When processing billions of transactions to provide customers with insights on recurring costs and subscriptions, performance is a requirement. Freerk Venhuizen details how the move to Polars did more than just speed up their "Periodicity engine" by ~30x. The rewrite resulted in code that is not only more readable and maintainable, but also efficient enough to unlock new possibilities. The performance gains allowed the team to process complexities that were previously impossible, turning a technical migration into a major step forward for their customer insights. Read everything about it here: pola.rs/posts/case-rab…
🇮🇱 Uriah Finkel@FinkelUriah·
@DataPolars What are the benefits of custom data types? I thought it might be good for data validation, but I guess it's more about interoperability with other ecosystems. Am I right?
polars data@DataPolars·
We've just released 1.36.0 with a couple of big features. Here are the highlights:

🧩 Extension Types: allows for custom data types within the Polars ecosystem.
🛟 Float16 support: first-class support for half-precision floating point data, such as model parameters.
↪️ Lazy pivot: LazyFrame.pivot() is finally here, allowing for query optimization on reshape operations.
👀 show(): easily preview the first rows of a DataFrame or LazyFrame.
🗄️ SQL parity: added window functions (ROW_NUMBER, RANK, DENSE_RANK) and CROSS JOIN UNNEST to the SQL API.

Performance:
⏱️ Parquet writer improvement: 2.2x runtime improvement with a 20% peak memory usage reduction, which rises to 39% for partitioned sinks (on a synthetic benchmark).
🚀 Support for group_by_dynamic and sorted group-by on the streaming engine.

Find the full release notes here: github.com/pola-rs/polars…