Playing with MoMA datasets

An introduction to Polars DataFrames

Author

Marie-Hélène Burle

The Museum of Modern Art (MoMA) in New York created a collection of GitHub repositories to access information about their collections. We will play with some of these to learn how to deal with tabular data in Python.

Tabular data

Many fields of data science and the humanities rely on tabular data where:

columns hold variables and are homogeneous (same data type),
rows contain observations and can be heterogeneous.

Tabular data is the kind of data you would store in spreadsheets (e.g. Microsoft Excel).

DataFrames (data frames or dataframes) are two dimensional objects that bring tabular data to programming.

Polars

If you read a little about Python, you will hear about pandas. It is everywhere (including in the Anaconda set of packages). That’s because pandas was, for the longest time, the only package to deal with DataFrames in Python.

In recent years however, a much better library called Polars has been developed. It is faster, has a better syntax, handles missing data much better, and allows to work with larger datasets on the same hardware. In this section, we will learn how to use it.

Reading in data

Load the Polars package as the alias pl (this is the standard way to load Polars):

import polars as pl

Create a DataFrame that we call artists with data from the MoMA collection project on artists:

artists = pl.read_csv("https://github.com/MuseumofModernArt/collection/raw/refs/heads/main/Artists.csv")

As you can see, Polars can read in data from a URL. To get that URL however, I had to click on the Artists.csv link in the main collection project, right-click on Raw, and copy the link. That’s the link that I used in the code above.

Of course Polars can also read in files that you have in your computer.

Here, the file is a CSV file, so I am using the function polars.read_csv. There are other functions for other file formats (e.g. polars.read_json, polars.read_parquet).

Exploring the data

Now, you can print the DataFrame:

artists

shape: (15_916, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
1	"Robert Arneson"	"American, 1930–1992"	"American"	"male"	1930	1992	null	null
2	"Doroteo Arnaiz"	"Spanish, born 1936"	"Spanish"	"male"	1936	0	null	null
3	"Bill Arnold"	"American, born 1941"	"American"	"male"	1941	0	null	null
4	"Charles Arnoldi"	"American, born 1946"	"American"	"male"	1946	0	"Q1063584"	500027998
5	"Per Arnoldi"	"Danish, born 1941"	"Danish"	"male"	1941	0	null	null
…	…	…	…	…	…	…	…	…
141956	"Julian Pałka"	"Polish, 1923 – 2002"	"Polish"	"male"	1923	2002	null	null
141957	"A. Vedeneev"	"Russian, 1891 – 1950"	"Russian"	null	1891	1950	null	null
141958	"Yakov Zinovievich Shtoffer"	"Russian, 1906 – 1951"	"Russian"	"male"	1906	1951	null	null
142033	"M/M"	"French"	"French"	null	0	0	null	null
142034	"Klaus Ferentschik"	"German, born 1957"	"German"	null	1957	0	null	null

It has 15,908 rows and 9 columns. This means that there are 15,908 different artists in this dataset and that there are 9 variables (or distinct types of information) for each artist.

You can also see the data type for each column (variable).

You can print the first 5 rows:

artists.head()

shape: (5, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
1	"Robert Arneson"	"American, 1930–1992"	"American"	"male"	1930	1992	null	null
2	"Doroteo Arnaiz"	"Spanish, born 1936"	"Spanish"	"male"	1936	0	null	null
3	"Bill Arnold"	"American, born 1941"	"American"	"male"	1941	0	null	null
4	"Charles Arnoldi"	"American, born 1946"	"American"	"male"	1946	0	"Q1063584"	500027998
5	"Per Arnoldi"	"Danish, born 1941"	"Danish"	"male"	1941	0	null	null

Or the first 3 rows:

artists.head(3)

shape: (3, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
1	"Robert Arneson"	"American, 1930–1992"	"American"	"male"	1930	1992	null	null
2	"Doroteo Arnaiz"	"Spanish, born 1936"	"Spanish"	"male"	1936	0	null	null
3	"Bill Arnold"	"American, born 1941"	"American"	"male"	1941	0	null	null

Or the last 5 rows:

artists.tail()

shape: (5, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
141956	"Julian Pałka"	"Polish, 1923 – 2002"	"Polish"	"male"	1923	2002	null	null
141957	"A. Vedeneev"	"Russian, 1891 – 1950"	"Russian"	null	1891	1950	null	null
141958	"Yakov Zinovievich Shtoffer"	"Russian, 1906 – 1951"	"Russian"	"male"	1906	1951	null	null
142033	"M/M"	"French"	"French"	null	0	0	null	null
142034	"Klaus Ferentschik"	"German, born 1957"	"German"	null	1957	0	null	null

Your turn:

Print the last 15 rows of this DataFrame.

You can get a random sample of rows. This is very useful if you want to have an idea of what is going on in the data and not focus on the first or last elements which are likely to look similar:

artists.sample()

shape: (1, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
11786	"Max Dalang A.G., Zürich"	"1920–1940"	"Swiss"	null	1920	1940	null	null

Or you can draw any number of samples:

artists.sample(10)

shape: (10, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
2273	"Emmet Gowin"	"American, born 1941"	"American"	"male"	1941	0	"Q5373484"	500037046
5429	"Thomas Sills"	"American, 1914–2000"	"American"	"male"	1914	2000	null	null
8584	"Richard Rothman"	"American, born 1956"	"American"	"male"	1956	0	null	null
27763	"Vessel, Inc."	"USA, est. 2001"	"American"	null	2001	0	null	null
28830	"Ena Swansea"	"American, born 1966"	"American"	"female"	1966	0	null	null
33291	"David Reinfurt"	"American, born 1971"	"American"	"male"	1971	0	null	null
35022	"Virgin Records"	"founded 1972"	null	null	1972	0	null	null
39280	"B. Ferrari"	null	null	null	0	0	null	null
47632	"Annihilation Time"	"USA, est. 2001"	null	null	0	0	null	null
134119	"Barbara Rossi"	"American, 1940–2023"	"American"	"female"	1940	2023	null	null

If you want to make the sample reproducible, you need to set a seed with:

artists.sample(10, seed=123)

shape: (10, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
1000	"Henri Cartier-Bresson"	"French, 1908–2004"	"French"	"male"	1908	2004	"Q156567"	500115484
10666	"Mallory Industries, Inc., Farm…	"American, established 1956"	"American"	null	1956	0	null	null
26153	"Dyson Ltd., England"	"established 1993"	null	null	1993	0	null	null
28667	"Shelby Hughes"	"American, 1981–2014"	"American"	"female"	1981	2014	null	null
28730	"Sandeep Mukherjee"	"American, born India 1964"	"American"	"male"	1964	0	null	null
29817	"Xu Bing"	"Chinese, born 1955"	"Chinese"	"male"	1955	0	"Q1932124"	500125592
37694	"Jean Tschumi"	"Swiss, 1904–1962"	"Swiss"	"male"	1904	1962	"Q117124"	500021463
45221	"Xian Yeagan"	null	null	null	0	0	null	null
67872	"Joseph-Philibert Girault de Pr…	"French, 1804–1892"	"French"	"male"	1804	1892	null	null
131945	"Alexandre Wollner"	"Brazilian, 1928–2018"	"Brazilian"	"male"	1928	2018	null	null

You can get a list of columns with:

artists.columns

['ConstituentID',
 'DisplayName',
 'ArtistBio',
 'Nationality',
 'Gender',
 'BeginDate',
 'EndDate',
 'Wiki QID',
 'ULAN']

And a list of data types with:

artists.dtypes

[Int64, String, String, String, String, Int64, Int64, String, Int64]

You can get the schema (sets the names of the variables (columns) and their data types) with:

artists.schema

Schema([('ConstituentID', Int64),
        ('DisplayName', String),
        ('ArtistBio', String),
        ('Nationality', String),
        ('Gender', String),
        ('BeginDate', Int64),
        ('EndDate', Int64),
        ('Wiki QID', String),
        ('ULAN', Int64)])

Finally, you can get some summary statistics for all the variables with:

artists.describe()

shape: (9, 10)

statistic	ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
str	f64	str	str	str	str	f64	f64	str	f64
"count"	15916.0	"15916"	"13746"	"13434"	"12636"	15916.0	15916.0	"3248"	2931.0
"null_count"	0.0	"0"	"2170"	"2482"	"3280"	0.0	0.0	"12668"	12985.0
"mean"	25439.306861	null	null	null	null	1509.16518	731.405253	null	5.0007e8
"std"	31769.08116	null	null	null	null	798.986602	956.051559	null	86599.942895
"min"	1.0	""a.r." group"	"1858–ca. 1910"	"Afghan"	"female"	0.0	0.0	"Q1000203"	5.00000006e8
"25%"	4443.0	null	null	null	null	1864.0	0.0	null	5.00017574e8
"50%"	9752.0	null	null	null	null	1924.0	0.0	null	5.00033033e8
"75%"	36070.0	null	null	null	null	1949.0	1971.0	null	5.00114576e8
"max"	142034.0	"…XYZ Dot Dot Dot Ex Why Zed De…	"Łódź, Poland, est. 1929–1936"	"Zimbabwean"	"transgender woman"	2020.0	2026.0	"Q993400"	5.00356571e8

Depending on the data, this can be useful or perfectly meaningless (as is the case here) if the variables are not numerical. It does show us something useful however: the number of missing data for each variables. Missing data in Polars are of the type polars.Null and you can see their numbers in the statistic called null_count.

Subsetting data

Selecting subsets of data is very useful to run analyses, create graphs, get some statistics, create new datasets, etc.

Selecting columns

In Polars, you can select a column with the function select:

artists.select(pl.col("Gender"))

shape: (15_916, 1)

Gender
str
"male"
"male"
"male"
"male"
"male"
…
"male"
null
"male"
null
null

Note that this prints the result, but it does not save the new DataFrame. If you want to create a new DataFrame with the result you need to assign that result to a new variable. Let’s call this new variable artists_gender:

artists_gender = artists.select(pl.col("Gender"))

As you can see, when you do that, it does not print the new DataFrame. If you want to print it, you now need to call it:

artists_gender

shape: (15_916, 1)

Gender
str
"male"
"male"
"male"
"male"
"male"
…
"male"
null
"male"
null
null

Now, if you want to see the unique entries for this column you can run:

artists.select(pl.col("Gender")).unique()

shape: (8, 1)

Gender
str
"transgender woman"
null
"male (trans? ftm?)"
"male"
"female"
"non-binary"
"gender non-conforming"
"female (transwoman)"

Your turn:

Create a DataFrame with the unique values of the column Nationality.

If we want to see how many artists are represented for each nationality, we can do:

artists.select(
    pl.col("Nationality").unique(maintain_order=True).alias("Nationality"),
    pl.col("Nationality").unique_counts().alias("Count"),
)

shape: (141, 2)

Nationality	Count
str	u64
"American"	5435
"Spanish"	160
"Danish"	123
"Italian"	550
"French"	876
…	…
"Hmong"	1
"Spirit Lake Dakota/Cheyenne Ri…	1
"Okinawan"	1
"Dominican"	1
"Kalaaleq"	1

And if we want to sort them by decreasing count:

artists.select(
    pl.col("Nationality").unique(maintain_order=True).alias("Nationality"),
    pl.col("Nationality").unique_counts().alias("Count"),
).sort("Count", descending=True)

shape: (141, 2)

Nationality	Count
str	u64
"American"	5435
null	2482
"German"	999
"French"	876
"British"	876
…	…
"Hmong"	1
"Spirit Lake Dakota/Cheyenne Ri…	1
"Okinawan"	1
"Dominican"	1
"Kalaaleq"	1

You can of course select multiple columns:

artists.select(
    pl.col("Gender"),
    pl.col("Nationality")
)

shape: (15_916, 2)

Gender	Nationality
str	str
"male"	"American"
"male"	"Spanish"
"male"	"American"
"male"	"American"
"male"	"Danish"
…	…
"male"	"Polish"
null	"Russian"
"male"	"Russian"
null	"French"
null	"German"

Selecting row

You can select all the rows matching some condition with the function filter:

artists.filter(pl.col("Gender") == "transgender woman")

shape: (1, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
135972	"Evelyn Taocheng Wang"	"Dutch, born China 1981"	"Dutch"	"transgender woman"	1981	0	null	null

Not a very represented group!

You can combine conditions:

artists.filter(
    pl.col("Gender") == "female",
    pl.col("Nationality") == "American",
)

shape: (1_186, 9)

ConstituentID	DisplayName	ArtistBio	Nationality	Gender	BeginDate	EndDate	Wiki QID	ULAN
i64	str	str	str	str	i64	i64	str	i64
10	"Irene Aronson"	"American, born Germany 1918"	"American"	"female"	1918	0	"Q19748568"	500042413
21	"Ruth Asawa"	"American, 1926–2013"	"American"	"female"	1926	2013	"Q7382874"	500077806
31	"Dana Atchley"	"American, 1941–2000"	"American"	"female"	1941	2000	null	null
41	"Berenice Abbott"	"American, 1898–1991"	"American"	"female"	1898	1991	"Q231861"	500020631
61	"Alice Adams"	"American, born 1930"	"American"	"female"	1930	0	"Q4725709"	500033075
…	…	…	…	…	…	…	…	…
141044	"Elizabeth Browning Jackson"	"American, born 1948"	"American"	"female"	1948	0	null	null
141278	"Allison McCann"	"American, born 1989"	"American"	"female"	1989	0	null	null
141280	"Yuliya Parshina-Kottas"	"Russian-American, born 1982"	"American"	"female"	1982	0	null	null
141284	"Audra D.S. Burch"	"American, born 1966"	"American"	"female"	1966	0	null	null
141618	"Sarah Friedland"	"American, born 1992"	"American"	"female"	1992	0	null	null

Your turn:

How many Afghan artists are there in this dataset?

Based on what we learnt, there are 2 different ways to get to that info. Can you come up with both of them?

Switching to a new dataset

Looking at this repository from the MoMA again, try to load in Python the data for the artworks and print it.

How many nationalities are represented in this dataset?

Create a new DataFrame showing the years and the number of artworks for each year.