Polars is a DataFrame library written in Rust that uses the Apache Arrow columnar format. polars-ruby is the Ruby binding for Polars, created by Andrew Kane.
Several members of the Ruby community have been deeply involved in the development of Apache Arrow.
Fast data processing with Ruby and Apache Arrow - rubykaigi2022
So while the Arrow C bindings for Ruby are relatively well developed, polars-df is not an Arrow C binding but a binding to Polars, which is implemented in Rust. The magnus crate is used to connect Ruby and Rust. There is also a Ruby data frame library that uses the Arrow bindings, called RedAmber, but that is not the topic here.
Please note that this post is incomplete and polars-df is still in the development phase, so the API is subject to change.
Documentation
Chapter 1 Getting started in Ruby
Installation
Ruby gem
gem install polars-df
From source code
git clone https://github.com/ankane/polars-ruby
cd polars-ruby
bundle
bundle exec rake compile
bundle exec rake install
Quick start
Below is a short snippet that reads a CSV file, filters it, and finishes with a groupby operation. It uses the "eager" API and is written in Ruby.
require 'polars'
require 'uri'
df = Polars.read_csv(URI('https://j.mp/iriscsv'))
df.filter(Polars.col('sepal_length') > 5)
  .groupby('species')
  .agg(Polars.all.sum)
The snippet above will output:
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.
Lazy quick start
If we want to run this query in lazy Polars we'd write:
require 'polars'
require 'uri'
Polars.read_csv(URI('https://j.mp/iriscsv'))
  .lazy
  .filter(Polars.col('sepal_length') > 5)
  .groupby('species')
  .agg(Polars.all.sum)
  .collect
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
Chapter 2 Polars cheat sheet in Ruby
Creating / reading DataFrames
Create DataFrame
df = Polars::DataFrame.new({
  nrs: [1, 2, 3, nil, 5],
  names: ["foo", "ham", "spam", "egg", nil],
  random: [0.3, 0.7, 0.1, 0.9, 0.6],
  groups: %w[A A B C B],
})
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Read CSV
iris = Polars.read_csv(URI('https://j.mp/iriscsv'),
                       has_header: true)
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
Read parquet
Polars.read_parquet('file.parquet')
Expressions
df.filter(Polars.col('nrs') < 4) # symbols do not seem to work here?
  .groupby('groups')
  .agg(Polars.all.sum)
shape: (2, 4)
┌────────┬─────┬───────┬────────┐
│ groups ┆ nrs ┆ names ┆ random │
│ ---    ┆ --- ┆ ---   ┆ ---    │
│ str    ┆ i64 ┆ str   ┆ f64    │
╞════════╪═════╪═══════╪════════╡
│ A      ┆ 3   ┆ null  ┆ 1.0    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B      ┆ 3   ┆ null  ┆ 0.1    │
└────────┴─────┴───────┴────────┘
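To make clear what this expression chain computes, here is a plain-Ruby sketch of the same filter, group, and sum steps. It does not use Polars at all. ROWS and filter_group_sum are names invented for this illustration, and the string column is simply left out, whereas Polars reports null for it.

```ruby
# Plain-Ruby sketch (no Polars): filter rows, group them, sum each group.
# ROWS mirrors the example DataFrame; all names here are illustrative only.
ROWS = [
  { nrs: 1,   names: 'foo',  random: 0.3, groups: 'A' },
  { nrs: 2,   names: 'ham',  random: 0.7, groups: 'A' },
  { nrs: 3,   names: 'spam', random: 0.1, groups: 'B' },
  { nrs: nil, names: 'egg',  random: 0.9, groups: 'C' },
  { nrs: 5,   names: nil,    random: 0.6, groups: 'B' }
].freeze

def filter_group_sum(rows)
  # Like filter(Polars.col('nrs') < 4): a nil never satisfies the comparison.
  kept = rows.select { |r| !r[:nrs].nil? && r[:nrs] < 4 }

  # Like groupby('groups') followed by a per-group sum of the numeric columns.
  kept.group_by { |r| r[:groups] }.transform_values do |group|
    {
      nrs: group.sum { |r| r[:nrs] },
      random: group.sum { |r| r[:random] }
    }
  end
end

p filter_group_sum(ROWS)
```

Group C disappears because its only row has a null nrs and is removed by the filter before grouping.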
Subset Observations - rows
Filter: Extract rows that meet logical criteria
df.filter(Polars.col('random') > 0.5)
df.filter(
  (Polars.col('groups') == 'B') &
  (Polars.col('random') > 0.5)
)
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 5   ┆ null  ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Randomly select fraction of rows.
df.sample(frac: 0.5)
# Results are random.
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
└─────┴───────┴────────┴────────┘
Randomly select n rows.
df.sample(n: 2)
# Results are random.
shape: (2, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
└──────┴───────┴────────┴────────┘
Select first n rows.
df.head(2)
shape: (2, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
└─────┴───────┴────────┴────────┘
Select last n rows.
df.tail(2)
shape: (2, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Subset Observations - columns
Select multiple columns with specific names
df.select(["nrs", "names"])
shape: (5, 2)
┌──────┬───────┐
│ nrs  ┆ names │
│ ---  ┆ ---   │
│ i64  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ foo   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  │
└──────┴───────┘
Select columns whose name matches regex
df.select(Polars.col("^n.*$"))
shape: (5, 2)
┌──────┬───────┐
│ nrs  ┆ names │
│ ---  ┆ ---   │
│ i64  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ foo   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  │
└──────┴───────┘
Subsets - rows and columns
Select rows 2-4
? # Yet Range support appears to be limited
Select columns in positions 1 and 3 (first column is 0)
?
Select rows meeting logical condition, and only the specific columns
?
Reshaping Data – Change layout, sorting, renaming
Append rows of DataFrames
Polars.concat([df, df2])
Append columns of DataFrames
Polars.concat([df, df3], how: "horizontal")
Gather columns into rows
df.melt(
  id_vars: 'nrs',
  value_vars: %w[names groups]
)
shape: (10, 3)
┌──────┬──────────┬───────┐
│ nrs  ┆ variable ┆ value │
│ ---  ┆ ---      ┆ ---   │
│ i64  ┆ str      ┆ str   │
╞══════╪══════════╪═══════╡
│ 1    ┆ names    ┆ foo   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ names    ┆ ham   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ names    ┆ spam  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ names    ┆ egg   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ...  ┆ ...      ┆ ...   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ groups   ┆ A     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ groups   ┆ B     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ groups   ┆ C     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ groups   ┆ B     │
└──────┴──────────┴───────┘
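If the wide-to-long reshape is unfamiliar, the following plain-Ruby sketch shows the idea behind melt. It is not how Polars implements it; ROWS and the melt helper are hypothetical names used only here.

```ruby
# Plain-Ruby sketch of a wide-to-long reshape ("melt"); illustrative only.
ROWS = [
  { nrs: 1,   names: 'foo',  groups: 'A' },
  { nrs: 2,   names: 'ham',  groups: 'A' },
  { nrs: 3,   names: 'spam', groups: 'B' },
  { nrs: nil, names: 'egg',  groups: 'C' },
  { nrs: 5,   names: nil,    groups: 'B' }
].freeze

# For each value column, emit one output row per input row, keeping the id
# column and recording which column the value came from.
def melt(rows, id_var:, value_vars:)
  value_vars.flat_map do |var|
    rows.map do |r|
      { id_var => r[id_var], variable: var.to_s, value: r[var] }
    end
  end
end

long = melt(ROWS, id_var: :nrs, value_vars: %i[names groups])
puts long.length
long.each { |row| p row }
```

Five input rows and two value columns give the ten output rows reported in the shape above.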
Spread rows into columns
df.pivot(values: 'nrs', index: 'groups',
         columns: 'names')
shape: (3, 6)
┌────────┬──────┬──────┬──────┬──────┬──────┐
│ groups ┆ foo  ┆ ham  ┆ spam ┆ egg  ┆ null │
│ ---    ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str    ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞════════╪══════╪══════╪══════╪══════╪══════╡
│ A      ┆ 1    ┆ 2    ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B      ┆ null ┆ null ┆ 3    ┆ null ┆ 5    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ C      ┆ null ┆ null ┆ null ┆ null ┆ null │
└────────┴──────┴──────┴──────┴──────┴──────┘
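The pivot result can look surprising at first, especially the column named null and the all-null row for group C. A plain-Ruby sketch of the same reshape may help. The pivot helper below is hypothetical and assumes at most one row per index/column pair; it is not the Polars implementation.

```ruby
# Plain-Ruby sketch of a long-to-wide reshape ("pivot"); illustrative only.
ROWS = [
  { nrs: 1,   names: 'foo',  groups: 'A' },
  { nrs: 2,   names: 'ham',  groups: 'A' },
  { nrs: 3,   names: 'spam', groups: 'B' },
  { nrs: nil, names: 'egg',  groups: 'C' },
  { nrs: 5,   names: nil,    groups: 'B' }
].freeze

# One output row per distinct `index` value and one output column per
# distinct `columns` value. Cells without a matching input row stay nil.
def pivot(rows, values:, index:, columns:)
  column_keys = rows.map { |r| r[columns] }.uniq
  rows.group_by { |r| r[index] }.map do |key, group|
    cells = column_keys.to_h do |col|
      match = group.find { |r| r[columns] == col }
      [col, match && match[values]]
    end
    { index => key }.merge(cells)
  end
end

pivot(ROWS, values: :nrs, index: :groups, columns: :names).each { |row| p row }
```

The nil key corresponds to the null column: the row whose names value is missing still contributes its nrs value (5) to group B. Group C has a matching cell under egg, but the value stored there is itself null.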
Order rows by values of a column
# low to high
df.sort("random")
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
└──────┴───────┴────────┴────────┘
# high to low
df.sort("random", reverse: true)
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
└──────┴───────┴────────┴────────┘
Rename the columns of a DataFrame
df.rename({"nrs" => "idx"})
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ idx  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Drop columns from DataFrame
df.drop(["names", "random"])
shape: (5, 2)
┌──────┬────────┐
│ nrs  ┆ groups │
│ ---  ┆ ---    │
│ i64  ┆ str    │
╞══════╪════════╡
│ 1    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ B      │
└──────┴────────┘
Summarize Data
Count number of rows with each unique value of variable
df["groups"].value_counts
shape: (3, 2)
┌────────┬────────┐
│ groups ┆ counts │
│ ---    ┆ ---    │
│ str    ┆ u32    │
╞════════╪════════╡
│ B      ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ A      ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C      ┆ 1      │
└────────┴────────┘
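For readers coming from plain Ruby, value_counts is close in spirit to Enumerable#tally. The sketch below is only an analogy using the standard library (Ruby 2.7 or later); GROUPS, value_counts, and n_unique are names defined here, not calls into Polars.

```ruby
# Plain-Ruby analogy for value_counts and n_unique; illustrative only.
GROUPS = %w[A A B C B].freeze

# Count each distinct value, most frequent first.
# The relative order of ties (A and B here) is not guaranteed.
def value_counts(values)
  values.tally.sort_by { |_value, count| -count }
end

# Number of distinct values.
def n_unique(values)
  values.uniq.length
end

p value_counts(GROUPS)
p n_unique(GROUPS)
```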
Number of rows in DataFrame
df.height
# => 5
Number of rows and number of columns in the DataFrame, returned as an array
df.shape
# => [5, 4]
Number of distinct values in a column
df["groups"].n_unique
# => 3
Basic descriptive statistics for each column
df.describe
shape: (7, 5)
┌────────────┬──────────┬───────┬──────────┬────────┐
│ describe   ┆ nrs      ┆ names ┆ random   ┆ groups │
│ ---        ┆ ---      ┆ ---   ┆ ---      ┆ ---    │
│ str        ┆ f64      ┆ str   ┆ f64      ┆ str    │
╞════════════╪══════════╪═══════╪══════════╪════════╡
│ count      ┆ 5.0      ┆ 5     ┆ 5.0      ┆ 5      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null_count ┆ 1.0      ┆ 1     ┆ 0.0      ┆ 0      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ mean       ┆ 2.75     ┆ null  ┆ 0.52     ┆ null   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ std        ┆ 1.707825 ┆ null  ┆ 0.319374 ┆ null   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ min        ┆ 1.0      ┆ egg   ┆ 0.1      ┆ A      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ max        ┆ 5.0      ┆ spam  ┆ 0.9      ┆ C      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ median     ┆ 2.5      ┆ null  ┆ 0.6      ┆ null   │
└────────────┴──────────┴───────┴──────────┴────────┘
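The mean, std, and median rows can be cross-checked by hand. The sketch below assumes that nulls are skipped and that std is the sample standard deviation (dividing by n - 1). Both assumptions agree with the numbers in the table above, but the helpers are written for this post and are not part of Polars.

```ruby
# Plain-Ruby cross-check of the describe output; illustrative only.
NRS = [1, 2, 3, nil, 5].freeze
RANDOM = [0.3, 0.7, 0.1, 0.9, 0.6].freeze

# Arithmetic mean, ignoring nil values.
def mean(values)
  present = values.compact
  present.sum.to_f / present.length
end

# Sample standard deviation (divides by n - 1), ignoring nil values.
def sample_std(values)
  present = values.compact
  m = mean(present)
  Math.sqrt(present.sum { |v| (v - m)**2 } / (present.length - 1))
end

# Median, ignoring nil values; averages the two middle values when needed.
def median(values)
  sorted = values.compact.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

[NRS, RANDOM].each do |column|
  puts format('mean=%.6f std=%.6f median=%.6f',
              mean(column), sample_std(column), median(column))
end
```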
Aggregation functions
sum min max std median mean quantile first
df.select(
  [
    # Sum values
    Polars.sum('random').alias('sum'),
    # Minimum value
    Polars.min('random').alias('min'),
    # Maximum value
    Polars.max('random').alias('max'),
    # or
    Polars.col('random').max.alias('other_max'),
    # Standard deviation
    Polars.std('random').alias('std dev'),
    # Variance
    Polars.var('random').alias('variance'),
    # Median
    Polars.median('random').alias('median'),
    # Mean
    Polars.mean('random').alias('mean'),
    # Quantile
    Polars.quantile('random', 0.75).alias('quantile_0.75'),
    # or
    Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
    # First value
    Polars.first('random').alias('first')
  ]
)
┌─────┬─────┬─────┬───────────┬─────┬──────┬────────────┬──────────────┬───────┐
│ sum ┆ min ┆ max ┆ other_max ┆ ... ┆ mean ┆ quantile_0 ┆ other_quanti ┆ first │
│ --- ┆ --- ┆ --- ┆ ---       ┆     ┆ ---  ┆ .75        ┆ le_0.75      ┆ ---   │
│ f64 ┆ f64 ┆ f64 ┆ f64       ┆     ┆ f64  ┆ ---        ┆ ---          ┆ f64   │
│     ┆     ┆     ┆           ┆     ┆      ┆ f64        ┆ f64          ┆       │
╞═════╪═════╪═════╪═══════════╪═════╪══════╪════════════╪══════════════╪═══════╡
│ 2.6 ┆ 0.1 ┆ 0.9 ┆ 0.9       ┆ ... ┆ 0.52 ┆ 0.7        ┆ 0.7          ┆ 0.3   │
└─────┴─────┴─────┴───────────┴─────┴──────┴────────────┴──────────────┴───────┘
Group Data
Group by values in the column named "groups", returning a GroupBy object
df.groupby("groups")
All of the aggregation functions from above can be applied to a group as well
df.groupby('groups').agg(
  [
    # Sum values
    Polars.sum('random').alias('sum'),
    # Minimum value
    Polars.min('random').alias('min'),
    # Maximum value
    Polars.max('random').alias('max'),
    # or
    Polars.col('random').max.alias('other_max'),
    # Standard deviation
    Polars.std('random').alias('std_dev'),
    # Variance
    Polars.var('random').alias('variance'),
    # Median
    Polars.median('random').alias('median'),
    # Mean
    Polars.mean('random').alias('mean'),
    # Quantile
    Polars.quantile('random', 0.75).alias('quantile_0.75'),
    # or
    Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
    # First value
    Polars.first('random').alias('first')
  ]
)
shape: (3, 12)
┌────────┬─────┬─────┬─────┬─────┬──────┬───────────────┬──────────────┬───────┐
│ groups ┆ sum ┆ min ┆ max ┆ ... ┆ mean ┆ quantile_0.75 ┆ other_quanti ┆ first │
│ ---    ┆ --- ┆ --- ┆ --- ┆     ┆ ---  ┆ ---           ┆ le_0.75      ┆ ---   │
│ str    ┆ f64 ┆ f64 ┆ f64 ┆     ┆ f64  ┆ f64           ┆ ---          ┆ f64   │
│        ┆     ┆     ┆     ┆     ┆      ┆               ┆ f64          ┆       │
╞════════╪═════╪═════╪═════╪═════╪══════╪═══════════════╪══════════════╪═══════╡
│ C      ┆ 0.9 ┆ 0.9 ┆ 0.9 ┆ ... ┆ 0.9  ┆ 0.9           ┆ 0.9          ┆ 0.9   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A      ┆ 1.0 ┆ 0.3 ┆ 0.7 ┆ ... ┆ 0.5  ┆ 0.7           ┆ 0.7          ┆ 0.3   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B      ┆ 0.7 ┆ 0.1 ┆ 0.6 ┆ ... ┆ 0.35 ┆ 0.6           ┆ 0.6          ┆ 0.1   │
└────────┴─────┴─────┴─────┴─────┴──────┴───────────────┴──────────────┴───────┘
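The quantile column is worth a second look: for group A, with values 0.3 and 0.7, the 0.75 quantile is reported as 0.7 rather than an interpolated 0.6. That is what a nearest-rank rule would give. The sketch below recomputes a few of the grouped aggregates in plain Ruby under that assumption; nearest_quantile and summarize are hypothetical helpers, and the exact tie-breaking used by Polars may differ.

```ruby
# Plain-Ruby sketch of per-group aggregates; illustrative only.
RANDOM_BY_GROUP = {
  'A' => [0.3, 0.7],
  'B' => [0.1, 0.6],
  'C' => [0.9]
}.freeze

# Nearest-rank quantile (an assumption): take the sorted element whose index
# is closest to q * (n - 1), without interpolating between neighbours.
def nearest_quantile(values, q)
  sorted = values.sort
  sorted[(q * (sorted.length - 1)).round]
end

def summarize(values)
  {
    sum: values.sum,
    min: values.min,
    max: values.max,
    mean: values.sum / values.length,
    quantile_075: nearest_quantile(values, 0.75),
    first: values.first
  }
end

RANDOM_BY_GROUP.each do |group, values|
  puts "#{group}: #{summarize(values)}"
end
```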
Additional GroupBy functions
??
Handling Missing Data
Drop rows with any column having a null value
df.drop_nulls
shape: (3, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ spam  ┆ 0.1    ┆ B      │
└─────┴───────┴────────┴────────┘
Replace null values with given value
df.fill_null(42)
shape: (5, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 42  ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ 42    ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Replace null values using forward strategy
df.fill_null(strategy: "forward")
shape: (5, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 1   ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ egg   ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Other fill strategies are "backward", "min", "max", "mean", "zero" and "one"
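As a rough mental model of the "forward" and "backward" strategies, here is a plain-Ruby sketch. The helper names are made up for this post and it says nothing about how Polars implements the fill internally.

```ruby
# Plain-Ruby sketch of forward and backward fill; illustrative only.

# Replace each nil with the most recent non-nil value seen so far.
# A leading nil has nothing before it and therefore stays nil.
def forward_fill(values)
  last_seen = nil
  values.map { |v| v.nil? ? last_seen : (last_seen = v) }
end

# Backward fill is a forward fill over the reversed sequence.
def backward_fill(values)
  forward_fill(values.reverse).reverse
end

p forward_fill([1, 2, 3, nil, 5])
p forward_fill(['foo', 'ham', 'spam', 'egg', nil])
p backward_fill([1, 2, 3, nil, 5])
```

The first two calls reproduce the nrs and names columns of the forward-filled table above.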
Replace floating point NaN values with given value
dfn = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13]
  }
)
dfn.fill_nan(99)
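No output is shown for this call, so here is a plain-Ruby sketch of the distinction that matters: NaN is a floating-point value, while null (nil in Ruby) means the value is missing. The fill_nan helper below is written for illustration and is not the Polars method.

```ruby
# Plain-Ruby sketch: NaN and nil are different things; illustrative only.
COLUMN_A = [1.5, 2, Float::NAN, 4].freeze

# Replace NaN entries and leave everything else, including nil, untouched.
def fill_nan(values, replacement)
  values.map { |v| v.is_a?(Float) && v.nan? ? replacement : v }
end

p fill_nan(COLUMN_A, 99)
p fill_nan([nil, Float::NAN], 0)

# NaN never compares equal to itself, so use nan? rather than == to detect it.
p Float::NAN == Float::NAN
```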
Make New Columns
Add a new column to the DataFrame
df.with_column(
  (Polars.col('random') * Polars.col('nrs'))
    .alias('product')
)
shape: (5, 5)
┌──────┬───────┬────────┬────────┬─────────┐
│ nrs  ┆ names ┆ random ┆ groups ┆ product │
│ ---  ┆ ---   ┆ ---    ┆ ---    ┆ ---     │
│ i64  ┆ str   ┆ f64    ┆ str    ┆ f64     │
╞══════╪═══════╪════════╪════════╪═════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      ┆ 0.3     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      ┆ 1.4     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      ┆ 0.3     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      ┆ null    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      ┆ 3.0     │
└──────┴───────┴────────┴────────┴─────────┘
Add several new columns to the DataFrame
df.with_columns(
  [
    (Polars.col('random') * Polars.col('nrs'))
      .alias('product'),
    Polars.col('names').str.lengths
      .alias('names_lengths')
  ]
)
shape: (5, 6)
┌──────┬───────┬────────┬────────┬─────────┬───────────────┐
│ nrs  ┆ names ┆ random ┆ groups ┆ product ┆ names_lengths │
│ ---  ┆ ---   ┆ ---    ┆ ---    ┆ ---     ┆ ---           │
│ i64  ┆ str   ┆ f64    ┆ str    ┆ f64     ┆ u32           │
╞══════╪═══════╪════════╪════════╪═════════╪═══════════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      ┆ 0.3     ┆ 3             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      ┆ 1.4     ┆ 3             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      ┆ 0.3     ┆ 4             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.9    ┆ C      ┆ null    ┆ 3             │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.6    ┆ B      ┆ 3.0     ┆ null          │
└──────┴───────┴────────┴────────┴─────────┴───────────────┘
Add a column at index 0 that counts the rows
df.with_row_count
shape: (5, 5)
┌────────┬──────┬───────┬────────┬────────┐
│ row_nr ┆ nrs  ┆ names ┆ random ┆ groups │
│ ---    ┆ ---  ┆ ---   ┆ ---    ┆ ---    │
│ u32    ┆ i64  ┆ str   ┆ f64    ┆ str    │
╞════════╪══════╪═══════╪════════╪════════╡
│ 0      ┆ 1    ┆ foo   ┆ 0.3    ┆ A      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 2    ┆ ham   ┆ 0.7    ┆ A      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 3    ┆ spam  ┆ 0.1    ┆ B      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ null ┆ egg   ┆ 0.9    ┆ C      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 5    ┆ null  ┆ 0.6    ┆ B      │
└────────┴──────┴───────┴────────┴────────┘
Rolling Functions
The following rolling functions are available
df.select(
  [
    # Rolling maximum value
    Polars.col('random')
      .rolling_max(2)
      .alias('rolling_max'),
    # Rolling mean value
    Polars.col('random')
      .rolling_mean(2)
      .alias('rolling_mean'),
    # Rolling median value
    Polars.col('random')
      .rolling_median(2, min_periods: 2)
      .alias('rolling_median'),
    # Rolling minimum value
    Polars.col('random')
      .rolling_min(2)
      .alias('rolling_min'),
    # Rolling standard deviation
    Polars.col('random')
      .rolling_std(2)
      .alias('rolling_std'),
    # Rolling sum values
    Polars.col('random')
      .rolling_sum(2)
      .alias('rolling_sum'),
    # Rolling variance
    Polars.col('random')
      .rolling_var(2)
      .alias('rolling_var'),
    # Rolling quantile
    Polars.col('random')
      .rolling_quantile(
        0.75,
        window_size: 2,
        min_periods: 2
      )
      .alias('rolling_quantile'),
    # Rolling skew
    Polars.col('random')
      .rolling_skew(2)
      .alias('rolling_skew')
    # Rolling custom function
    # Polars.col('random')
    #   .rolling_apply(
    #     function = np.nanstd, window_size = 2
    #   )
    #   .alias('rolling_apply')
  ]
)
shape: (5, 9)
┌───────────┬────────────┬────────────┬───────────┬─────┬───────────┬───────────┬────────────┬────────────┐
│ rolling_m ┆ rolling_me ┆ rolling_me ┆ rolling_m ┆ ... ┆ rolling_s ┆ rolling_v ┆ rolling_qu ┆ rolling_sk │
│ ax        ┆ an         ┆ dian       ┆ in        ┆     ┆ um        ┆ ar        ┆ antile     ┆ ew         │
│ ---       ┆ ---        ┆ ---        ┆ ---       ┆     ┆ ---       ┆ ---       ┆ ---        ┆ ---        │
│ f64       ┆ f64        ┆ f64        ┆ f64       ┆     ┆ f64       ┆ f64       ┆ f64        ┆ f64        │
╞═══════════╪════════════╪════════════╪═══════════╪═════╪═══════════╪═══════════╪════════════╪════════════╡
│ null      ┆ null       ┆ null       ┆ null      ┆ ... ┆ null      ┆ null      ┆ null       ┆ null       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.7       ┆ 0.5        ┆ 0.5        ┆ 0.3       ┆ ... ┆ 1.0       ┆ 0.08      ┆ 0.7        ┆ -4.3368e-1 │
│           ┆            ┆            ┆           ┆     ┆           ┆           ┆            ┆ 6          │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.7       ┆ 0.4        ┆ 0.4        ┆ 0.1       ┆ ... ┆ 0.8       ┆ 0.18      ┆ 0.7        ┆ 3.8549e-16 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.9       ┆ 0.5        ┆ 0.5        ┆ 0.1       ┆ ... ┆ 1.0       ┆ 0.32      ┆ 0.9        ┆ 0.0        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.9       ┆ 0.75       ┆ 0.75       ┆ 0.6       ┆ ... ┆ 1.5       ┆ 0.045      ┆ 0.9        ┆ 0.0        │
└───────────┴────────────┴────────────┴───────────┴─────┴───────────┴───────────┴────────────┴────────────┘
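To see where the null in the first row and the other numbers come from, the following plain-Ruby sketch recomputes a few of the rolling columns with a window of 2. It assumes right-aligned windows and that a result is only produced once the window is full, which matches the table above. The rolling helper is hypothetical, not a Polars call.

```ruby
# Plain-Ruby sketch of rolling (sliding window) aggregates; illustrative only.
RANDOM = [0.3, 0.7, 0.1, 0.9, 0.6].freeze

# Yield each full window of `window_size` consecutive values to the block.
# The first window_size - 1 positions have no full window and stay nil.
def rolling(values, window_size)
  padding = Array.new(window_size - 1)
  padding + values.each_cons(window_size).map { |window| yield window }
end

rolling_max  = rolling(RANDOM, 2, &:max)
rolling_min  = rolling(RANDOM, 2, &:min)
rolling_sum  = rolling(RANDOM, 2, &:sum)
rolling_mean = rolling(RANDOM, 2) { |w| w.sum / w.length }

p rolling_max
p rolling_min
p rolling_sum
p rolling_mean
```

Because of floating-point rounding, the printed sums and means may differ from the table in the last digits.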
Window Functions
Window functions let you group by several columns at once within a single select
df.select(
  [
    'names',
    'groups',
    Polars.col('random').sum.over('names')
      .alias('sum_by_names'),
    Polars.col('random').sum.over('groups')
      .alias('sum_by_groups')
  ]
)
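No output is listed for this query, so here is a plain-Ruby sketch of what an aggregation "over" a column means: the group total is computed once per group and then written back onto every row of that group, so the number of rows does not change. ROWS and sum_over are hypothetical names for this illustration.

```ruby
# Plain-Ruby sketch of a window aggregation (sum over a key); illustrative only.
ROWS = [
  { names: 'foo',  random: 0.3, groups: 'A' },
  { names: 'ham',  random: 0.7, groups: 'A' },
  { names: 'spam', random: 0.1, groups: 'B' },
  { names: 'egg',  random: 0.9, groups: 'C' },
  { names: nil,    random: 0.6, groups: 'B' }
].freeze

# For every row, return the sum of `value` across all rows sharing its `key`.
def sum_over(rows, value:, key:)
  totals = rows.group_by { |r| r[key] }
               .transform_values { |group| group.sum { |r| r[value] } }
  rows.map { |r| totals[r[key]] }
end

p sum_over(ROWS, value: :random, key: :groups)
p sum_over(ROWS, value: :random, key: :names)
```

Unlike groupby followed by agg, the result has one value per input row. Every name is unique here, so the sum by names is just the random column itself.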
Combine Data Sets
df_1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df_2 = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
shape: (3, 2)
┌───────┬─────┐
│ apple ┆ ham │
│ ---   ┆ --- │
│ str   ┆ str │
╞═══════╪═════╡
│ x     ┆ a   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ y     ┆ b   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ z     ┆ d   │
└───────┴─────┘
Inner Join
Retains only rows with a match in the other set.
df_1.join(df_2, on: "ham")
df_1.join(df_2, on: "ham", how: "inner")
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
Left Join
Retains each row from the "left" set (df_1).
df_1.join(df_2, on: "ham", how: "left")
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
Outer Join
Retains each row, even if no other matching row exists.
df_1.join(df_2, on: "ham", how: "outer")
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ 7.0  ┆ b   ┆ y     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ null ┆ d   ┆ z     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘
Anti Join
Contains all rows from df_1 that do not have a match in df_2
df_1.join(df_2, on: "ham", how: "anti")
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
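To compare the four join types side by side, here is a plain-Ruby sketch that reproduces the row sets shown above using arrays of hashes. It uses simple nested loops for readability and is not how Polars joins data. LEFT, RIGHT, and the four helper methods are hypothetical names for this illustration, and the row order of the outer join may differ from the Polars output.

```ruby
# Plain-Ruby sketch of inner, left, outer, and anti joins; illustrative only.
LEFT = [
  { foo: 1, bar: 6.0, ham: 'a' },
  { foo: 2, bar: 7.0, ham: 'b' },
  { foo: 3, bar: 8.0, ham: 'c' }
].freeze

RIGHT = [
  { apple: 'x', ham: 'a' },
  { apple: 'y', ham: 'b' },
  { apple: 'z', ham: 'd' }
].freeze

# Hash with every column of `rows` except the join key, all set to nil.
def blank_row(rows, on)
  (rows.flat_map(&:keys).uniq - [on]).to_h { |column| [column, nil] }
end

# Inner join: only rows whose key appears on both sides.
def inner_join(left, right, on)
  left.flat_map do |l|
    right.select { |r| r[on] == l[on] }.map { |r| l.merge(r) }
  end
end

# Left join: every left row; right-hand columns are nil without a match.
def left_join(left, right, on)
  blank = blank_row(right, on)
  left.flat_map do |l|
    matches = right.select { |r| r[on] == l[on] }
    matches.empty? ? [l.merge(blank)] : matches.map { |r| l.merge(r) }
  end
end

# Anti join: left rows whose key has no match on the right.
def anti_join(left, right, on)
  right_keys = right.map { |r| r[on] }
  left.reject { |l| right_keys.include?(l[on]) }
end

# Outer join: the left join plus the right rows that matched nothing.
def outer_join(left, right, on)
  blank = blank_row(left, on)
  left_join(left, right, on) + anti_join(right, left, on).map { |r| blank.merge(r) }
end

%i[inner_join left_join outer_join anti_join].each do |kind|
  puts kind
  send(kind, LEFT, RIGHT, :ham).each { |row| p row }
end
```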