import polars as plPlaying with MoMA datasets
An introduction to Polars DataFrames
The Museum of Modern Art (MoMA) in New York created a collection of GitHub repositories to access information about their collections. We will play with some of these to learn how to deal with tabular data in Python.
Tabular data
Many fields of data science and the humanities rely on tabular data where:
- columns hold variables and are homogeneous (same data type),
- rows contain observations and can be heterogeneous.
Tabular data is the kind of data you would store in spreadsheets (e.g. Microsoft Excel).
DataFrames (data frames or dataframes) are two dimensional objects that bring tabular data to programming.
Polars
If you read a little about Python, you will hear about pandas. It is everywhere (including in the Anaconda set of packages). That’s because pandas was, for the longest time, the only package to deal with DataFrames in Python.
In recent years however, a much better library called Polars has been developed. It is faster, has a better syntax, handles missing data much better, and allows to work with larger datasets on the same hardware. In this section, we will learn how to use it.
Reading in data
Load the Polars package as the alias pl (this is the standard way to load Polars):
Create a DataFrame that we call artists with data from the MoMA collection project on artists:
artists = pl.read_csv("https://github.com/MuseumofModernArt/collection/raw/refs/heads/main/Artists.csv")As you can see, Polars can read in data from a URL. To get that URL however, I had to click on the Artists.csv link in the main collection project, right-click on Raw, and copy the link. That’s the link that I used in the code above.
Of course Polars can also read in files that you have in your computer.
Here, the file is a CSV file, so I am using the function polars.read_csv. There are other functions for other file formats (e.g. polars.read_json, polars.read_parquet).
Exploring the data
Now, I can print the DataFrame:
artists| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 1 | "Robert Arneson" | "American, 1930–1992" | "American" | "male" | 1930 | 1992 | null | null |
| 2 | "Doroteo Arnaiz" | "Spanish, born 1936" | "Spanish" | "male" | 1936 | 0 | null | null |
| 3 | "Bill Arnold" | "American, born 1941" | "American" | "male" | 1941 | 0 | null | null |
| 4 | "Charles Arnoldi" | "American, born 1946" | "American" | "male" | 1946 | 0 | "Q1063584" | 500027998 |
| 5 | "Per Arnoldi" | "Danish, born 1941" | "Danish" | "male" | 1941 | 0 | null | null |
| … | … | … | … | … | … | … | … | … |
| 141956 | "Julian Pałka" | "Polish, 1923 – 2002" | "Polish" | "male" | 1923 | 2002 | null | null |
| 141957 | "A. Vedeneev" | "Russian, 1891 – 1950" | "Russian" | null | 1891 | 1950 | null | null |
| 141958 | "Yakov Zinovievich Shtoffer" | "Russian, 1906 – 1951" | "Russian" | "male" | 1906 | 1951 | null | null |
| 142033 | "M/M" | "French" | "French" | null | 0 | 0 | null | null |
| 142034 | "Klaus Ferentschik" | "German, born 1957" | "German" | null | 1957 | 0 | null | null |
It has 15,908 rows and 9 columns. This means that there are 15,908 different artists in this dataset and that there are 9 variables (or distinct types of information) for each artist.
You can also see the data type for each column (variable).
I can print the first 5 rows:
artists.head()| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 1 | "Robert Arneson" | "American, 1930–1992" | "American" | "male" | 1930 | 1992 | null | null |
| 2 | "Doroteo Arnaiz" | "Spanish, born 1936" | "Spanish" | "male" | 1936 | 0 | null | null |
| 3 | "Bill Arnold" | "American, born 1941" | "American" | "male" | 1941 | 0 | null | null |
| 4 | "Charles Arnoldi" | "American, born 1946" | "American" | "male" | 1946 | 0 | "Q1063584" | 500027998 |
| 5 | "Per Arnoldi" | "Danish, born 1941" | "Danish" | "male" | 1941 | 0 | null | null |
Or the first 3 rows:
artists.head(3)| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 1 | "Robert Arneson" | "American, 1930–1992" | "American" | "male" | 1930 | 1992 | null | null |
| 2 | "Doroteo Arnaiz" | "Spanish, born 1936" | "Spanish" | "male" | 1936 | 0 | null | null |
| 3 | "Bill Arnold" | "American, born 1941" | "American" | "male" | 1941 | 0 | null | null |
Or the last 5 rows:
artists.tail()| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 141956 | "Julian Pałka" | "Polish, 1923 – 2002" | "Polish" | "male" | 1923 | 2002 | null | null |
| 141957 | "A. Vedeneev" | "Russian, 1891 – 1950" | "Russian" | null | 1891 | 1950 | null | null |
| 141958 | "Yakov Zinovievich Shtoffer" | "Russian, 1906 – 1951" | "Russian" | "male" | 1906 | 1951 | null | null |
| 142033 | "M/M" | "French" | "French" | null | 0 | 0 | null | null |
| 142034 | "Klaus Ferentschik" | "German, born 1957" | "German" | null | 1957 | 0 | null | null |
Your turn:
Print the last 15 rows of this DataFrame.
I can get a random sample of rows. This is very useful if you want to have an idea of what is going on in the data and not focus on the first or last elements which are likely to look similar.
For this, I first need to load the random module from Python’s standard library:
import randomThen I can draw a single random samples with:
artists.sample()| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 6507 | "Katsuro Yoshida" | "Japanese" | "Japanese" | "male" | 0 | 0 | null | null |
Or I can draw any number of samples:
artists.sample(10)| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 129 | "Harold Altman" | "American, 1924–2003" | "American" | "male" | 1924 | 2003 | null | 500028706 |
| 5154 | "Alberto Sani" | "Italian, 1900–1964" | "Italian" | "male" | 1900 | 1964 | null | null |
| 6233 | "Bernard Walsh" | "American, 1912–2004" | "American" | "male" | 1912 | 2004 | null | null |
| 7435 | "Donald Moffett" | "American, born 1955" | "American" | "male" | 1955 | 0 | null | null |
| 7929 | "Joseph Grigely" | "American, born 1956" | "American" | "male" | 1956 | 0 | null | null |
| 7973 | "Yevgeny Khaldei" | "Russian, 1917–1998" | "Russian" | "male" | 1917 | 1998 | null | null |
| 8039 | "Bradford Bissell" | "American, born 1957" | "American" | "male" | 1957 | 0 | null | null |
| 11119 | "Columbian Bronze Corp., Freepo… | null | null | null | 0 | 0 | null | null |
| 34619 | "El Anatsui" | "Ghanaian, born 1944" | "Ghanaian" | "male" | 1944 | 0 | "Q1323876" | 500034780 |
| 75665 | "Fernanda Gomes" | "Brazilian, born 1960" | "Brazilian" | "female" | 1960 | 0 | null | null |
I can get a list of columns with:
artists.columns['ConstituentID',
'DisplayName',
'ArtistBio',
'Nationality',
'Gender',
'BeginDate',
'EndDate',
'Wiki QID',
'ULAN']
And a list of data types with:
artists.dtypes[Int64, String, String, String, String, Int64, Int64, String, Int64]
I can get the schema (sets the names of the variables (columns) and their data types) with:
artists.schemaSchema([('ConstituentID', Int64),
('DisplayName', String),
('ArtistBio', String),
('Nationality', String),
('Gender', String),
('BeginDate', Int64),
('EndDate', Int64),
('Wiki QID', String),
('ULAN', Int64)])
Finally, I can get some summary statistics for all the variables with:
artists.describe()| statistic | ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|---|
| str | f64 | str | str | str | str | f64 | f64 | str | f64 |
| "count" | 15916.0 | "15916" | "13746" | "13434" | "12636" | 15916.0 | 15916.0 | "3248" | 2931.0 |
| "null_count" | 0.0 | "0" | "2170" | "2482" | "3280" | 0.0 | 0.0 | "12668" | 12985.0 |
| "mean" | 25439.306861 | null | null | null | null | 1509.16518 | 731.405253 | null | 5.0007e8 |
| "std" | 31769.08116 | null | null | null | null | 798.986602 | 956.051559 | null | 86599.942895 |
| "min" | 1.0 | ""a.r." group" | "1858–ca. 1910" | "Afghan" | "female" | 0.0 | 0.0 | "Q1000203" | 5.00000006e8 |
| "25%" | 4443.0 | null | null | null | null | 1864.0 | 0.0 | null | 5.00017574e8 |
| "50%" | 9752.0 | null | null | null | null | 1924.0 | 0.0 | null | 5.00033033e8 |
| "75%" | 36070.0 | null | null | null | null | 1949.0 | 1971.0 | null | 5.00114576e8 |
| "max" | 142034.0 | "…XYZ Dot Dot Dot Ex Why Zed De… | "Łódź, Poland, est. 1929–1936" | "Zimbabwean" | "transgender woman" | 2020.0 | 2026.0 | "Q993400" | 5.00356571e8 |
Depending on the data, this can be useful or perfectly meaningless (as is the case here) if the variables are not numerical. It does show us something useful however: the number of missing data for each variables. Missing data in Polars are of the type polars.Null and you can see their numbers in the statistic called null_count.
Subsetting data
Selecting subsets of data is very useful to run analyses, create graphs, get some statistics, create new datasets, etc.
Selecting columns
In Polars, you can select a column with the function select:
artists.select(pl.col("Gender"))| Gender |
|---|
| str |
| "male" |
| "male" |
| "male" |
| "male" |
| "male" |
| … |
| "male" |
| null |
| "male" |
| null |
| null |
Note that this prints the result, but it does not save the new DataFrame. If you want to create a new DataFrame with the result you need to assign that result to a new variable. Let’s call this new variable artists_gender:
artists_gender = artists.select(pl.col("Gender"))As you can see, when you do that, it does not print the new DataFrame. If you want to print it, you now need to call it:
artists_gender| Gender |
|---|
| str |
| "male" |
| "male" |
| "male" |
| "male" |
| "male" |
| … |
| "male" |
| null |
| "male" |
| null |
| null |
Now, if you want to see the unique entries for this column you can run:
artists.select(pl.col("Gender")).unique()| Gender |
|---|
| str |
| "female" |
| "transgender woman" |
| "non-binary" |
| "gender non-conforming" |
| "female (transwoman)" |
| null |
| "male" |
| "male (trans? ftm?)" |
Your turn:
Create a DataFrame with the unique values of the column Nationality.
If we want to see how many artists are represented for each nationality, we can do:
artists.select(
pl.col("Nationality").unique(maintain_order=True).alias("Nationality"),
pl.col("Nationality").unique_counts().alias("Count"),
)| Nationality | Count |
|---|---|
| str | u64 |
| "American" | 5435 |
| "Spanish" | 160 |
| "Danish" | 123 |
| "Italian" | 550 |
| "French" | 876 |
| … | … |
| "Hmong" | 1 |
| "Spirit Lake Dakota/Cheyenne Ri… | 1 |
| "Okinawan" | 1 |
| "Dominican" | 1 |
| "Kalaaleq" | 1 |
And if we want to sort them by decreasing count:
artists.select(
pl.col("Nationality").unique(maintain_order=True).alias("Nationality"),
pl.col("Nationality").unique_counts().alias("Count"),
).sort("Count", descending=True)| Nationality | Count |
|---|---|
| str | u64 |
| "American" | 5435 |
| null | 2482 |
| "German" | 999 |
| "French" | 876 |
| "British" | 876 |
| … | … |
| "Hmong" | 1 |
| "Spirit Lake Dakota/Cheyenne Ri… | 1 |
| "Okinawan" | 1 |
| "Dominican" | 1 |
| "Kalaaleq" | 1 |
You can of course select multiple columns:
artists.select(
pl.col("Gender"),
pl.col("Nationality")
)| Gender | Nationality |
|---|---|
| str | str |
| "male" | "American" |
| "male" | "Spanish" |
| "male" | "American" |
| "male" | "American" |
| "male" | "Danish" |
| … | … |
| "male" | "Polish" |
| null | "Russian" |
| "male" | "Russian" |
| null | "French" |
| null | "German" |
Selecting row
You can select all the rows matching some condition with the function filter:
artists.filter(pl.col("Gender") == "transgender woman")| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 135972 | "Evelyn Taocheng Wang" | "Dutch, born China 1981" | "Dutch" | "transgender woman" | 1981 | 0 | null | null |
Not a very represented group!
You can combine conditions:
artists.filter(
pl.col("Gender") == "female",
pl.col("Nationality") == "American",
)| ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | i64 | i64 | str | i64 |
| 10 | "Irene Aronson" | "American, born Germany 1918" | "American" | "female" | 1918 | 0 | "Q19748568" | 500042413 |
| 21 | "Ruth Asawa" | "American, 1926–2013" | "American" | "female" | 1926 | 2013 | "Q7382874" | 500077806 |
| 31 | "Dana Atchley" | "American, 1941–2000" | "American" | "female" | 1941 | 2000 | null | null |
| 41 | "Berenice Abbott" | "American, 1898–1991" | "American" | "female" | 1898 | 1991 | "Q231861" | 500020631 |
| 61 | "Alice Adams" | "American, born 1930" | "American" | "female" | 1930 | 0 | "Q4725709" | 500033075 |
| … | … | … | … | … | … | … | … | … |
| 141044 | "Elizabeth Browning Jackson" | "American, born 1948" | "American" | "female" | 1948 | 0 | null | null |
| 141278 | "Allison McCann" | "American, born 1989" | "American" | "female" | 1989 | 0 | null | null |
| 141280 | "Yuliya Parshina-Kottas" | "Russian-American, born 1982" | "American" | "female" | 1982 | 0 | null | null |
| 141284 | "Audra D.S. Burch" | "American, born 1966" | "American" | "female" | 1966 | 0 | null | null |
| 141618 | "Sarah Friedland" | "American, born 1992" | "American" | "female" | 1992 | 0 | null | null |
Your turn:
How many Afghan artists are there in this dataset?
Based on what we learnt, there are 2 different ways to get to that info. Can you come up with both of them?
Switching to a new dataset
Looking at this repository from the MoMA again, try to load in Python the data for the artworks and print it.
How many nationalities are represented in this dataset?
Create a new DataFrame showing the years and the number of artworks for each year.