polars data

446 posts

@DataPolars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust.

Amsterdam · Joined July 2022
6 Following · 7.3K Followers

polars data @DataPolars
Polars supports a full Iceberg roundtrip on the streaming engine. You can scan an Iceberg table with scan_iceberg(), transform it lazily, and write the result back with sink_iceberg(). Useful for workflows like data redaction or compliance cleanup: scan the table, redact the matching user's PII, and overwrite the table with the cleaned result. That overwrite is committed as a new Iceberg snapshot, and after you validate it you can expire older snapshots as part of your cleanup workflow.

polars data @DataPolars
Handling schema changes in Polars. Our latest blog post maps the four shapes of schema change (a new column appears, an expected column disappears, a type drifts, or a type breaks) to the Polars solution that handles each, across CSV, multi-file Parquet, Delta Lake, and Apache Iceberg. Read the full breakdown here: pola.rs/posts/schema-e…

polars data @DataPolars
We've released Python Polars 1.40. Some of the highlights:
• Streaming grouped AsOf join: AsOf joins with a `by` argument are now supported in the streaming engine, extending last release's streaming AsOf support to grouped time-series joins.
• Basic over() in the streaming engine: elementwise window expressions using over() can now run in the streaming engine.
• More expressions lowered to streaming: cov(), corr(), interpolate(), skew(), kurtosis(), and entropy() are now natively supported in the streaming engine.
Link to the complete changelog: github.com/pola-rs/polars…

polars data @DataPolars
We've been busy in Q1 2026. 12 releases. 778 PRs. 95 contributors (thank you!). The streaming engine now covers more join types, all major formats have a streaming scan implementation, Delta and Iceberg both have full read/write support, and Polars Cloud gained a query profiler that helped us run a TPC-H benchmark 54% faster at 64% lower cost. Read all the highlights in the latest Polars in Aggregate: pola.rs/posts/polars-i…

polars data @DataPolars
Polars loves sorted data! If your data is already sorted, you can get a performance boost of up to 18x when joining your datasets. Read all about it in our latest blog post: pola.rs/posts/streamin…

polars data @DataPolars
Realtime query profiling of Polars

In this post we use the query profiler in Polars Cloud to optimize the infrastructure configuration for a specific query, making it 54% faster and 64% cheaper in only five runs. Read all about it here: pola.rs/posts/query-pr…

polars data @DataPolars
We've released Polars Cloud client 0.6.0. Some of the highlights:
• Improved UX for query profiling: data skew is now included in the metrics, showing how long workers take to execute a stage and the size of partitions. You can now also see resource metrics per stage.
• Compute Scratchpad (alpha): a new interactive scratchpad for ad-hoc computation that runs on your Polars Cloud cluster.
• Improved distributed query planning: various improvements to distributed query planning for better stability and performance.
• Breaking: `LazyFrameRemote.execute` is now blocking by default. Previously fire-and-forget, `.execute()` now blocks until the query completes. Pass `blocking=False` to get the old behavior.

polars data @DataPolars
Quoting Jensen: "All of these platforms are processing DataFrames. This is the ground truth of business. This is the ground truth of enterprise computing. Now we will have AI use structured data. And we are going to accelerate the living daylights out of it." Polars DataFrames are at the core of the AI revolution. youtube.com/watch?v=jw_o0x…

polars data @DataPolars
We've released Python Polars 1.39. Some of the highlights:
• Streaming AsOf join: join_asof() is now supported in the streaming engine, enabling memory-efficient time-series joins.
• sink_iceberg() for writing to Iceberg tables: a new LazyFrame sink that writes directly to Apache Iceberg tables. Combined with the existing scan_iceberg(), Polars now supports full read/write workflows for Iceberg-based data lakehouses.
• Streaming cloud downloads: scan_csv(), scan_ndjson(), and scan_lines() can now stream data directly from cloud storage instead of downloading the full file first.
Link to the complete changelog: github.com/pola-rs/polars…

polars data @DataPolars
A one-liner routes every .collect() call through the streaming engine: pl.Config.set_engine_affinity("streaming"). Put it at the top of your script and all subsequent .collect() calls will prefer the streaming engine. You can also pass engine="streaming" directly to a single .collect() call if you want to opt in for only one query.

The streaming engine processes data in chunks rather than loading everything into memory at once. It's 3-7x faster than the in-memory engine, and for workloads that exceed available RAM it's the only viable option. We will soon make the streaming engine the default, but this way you can already enjoy its benefits.

polars data @DataPolars
pl.from_repr() constructs a DataFrame or Series directly from its printed string representation. This can be useful in unit tests: instead of rebuilding expected DataFrames through dictionaries with typecasting, the schema is encoded in the header and the values are right there in the table. You can see at a glance what the test is asserting.

polars data @DataPolars
Easily scale Polars queries from @ApacheAirflow Our latest blog post walks through different patterns to run distributed Polars queries using Airflow: fire-and-forget execution, parallel queries, multi-stage pipelines, and manual cluster shutdowns. Read more here: pola.rs/posts/airflow-…

polars data @DataPolars
Polars exposes two ways to measure string length: str.len_bytes() and str.len_chars(). The difference matters more than you'd think.

Precision: len_bytes counts raw UTF-8 bytes; len_chars counts Unicode code points. For pure ASCII text they return the same number, but the moment you have accented characters, CJK text, or emoji, they diverge. For example, Japanese characters take 3 bytes each; emoji take 4.

Performance: on a dataset with 5 million rows, len_bytes runs about 20x faster than len_chars. That's because determining the number of bytes is a single metadata lookup on the underlying buffer, with no traversal (complexity: O(1)), while len_chars has to walk every string byte by byte to find code point boundaries (complexity: O(n)).

So which one should you use?
• len_bytes: when you're working with guaranteed ASCII data (such as hashes, IDs, standard codes), when an approximation of the length is close enough, or when you need to know how many bytes the string takes in memory.
• len_chars: when your data contains user-generated text, names, addresses, or anything multilingual, or when you need the precise, correct length.

Benchmark code: gist.github.com/TNieuwdorp/75e…

polars data @DataPolars
We've released Python Polars 1.38. Some of the highlights:
• (De)compression support for text-based sources and sinks: zstd and gzip are now supported for write_csv(), sink_csv(), scan_ndjson(), and sink_ndjson().
• scan_lines() to read text files: this new function constructs a LazyFrame by scanning lines from a file into a string column. This is particularly useful for working with (compressed) log files.
• Merge join in the streaming engine: when the join columns are sorted in both DataFrames, we now use a merge join, which can improve performance 2-4x and in some cases even up to 10x. To unlock these performance gains, use the lazy API and apply set_sorted(col) to let Polars know the data is sorted.
Link to the complete changelog: github.com/pola-rs/polars…

polars data @DataPolars
The early design decisions for the Categorical type were under strain because of our streaming engine. Every data chunk carried its own mapping between the categories and their underlying physical values, forcing constant re-encoding. The global StringCache we built to solve this caused lock contention and wasn't designed for a distributed architecture.

The new Categories object, released in 1.31, solves this and gives you:
• Control over the physical type (UInt8/16/32)
• Named categories with namespaces
• Parallel updates without locks
• Automatic garbage collection

When you know the categories up front, you can use Enums. They're faster because of their immutability, and they let you define the sort order of the values.

The StringCache is now a no-op, but existing code keeps working as it used to (with global Categories). You can also migrate by replacing it with explicit Categories where needed.

The result is a Categorical data type that works well on the streaming engine without performance degradation and is compatible with a distributed architecture. Read the full deep dive: pola.rs/posts/categori…

polars data reposted
Ritchie Vink @RitchieVink
In 1-2 weeks we land live query profiling in Polars Cloud. See exactly how many rows are consumed and produced per operation, which operation takes the most runtime, and watch the data flow through live, like water. 😍 If your query takes too long, you can see why and act on it. However, we're fast, so you gotta be quick ;)