Playing with MoMA datasets

An introduction to Polars DataFrames

Author

Marie-Hélène Burle

The Museum of Modern Art (MoMA) in New York created a collection of GitHub repositories to access information about their collections. We will play with some of these to learn how to deal with tabular data in Python.

Tabular data

Many fields of data science and the humanities rely on tabular data where:

  • columns hold variables and are homogeneous (same data type),
  • rows contain observations and can be heterogeneous.

Tabular data is the kind of data you would store in spreadsheets (e.g. Microsoft Excel).

DataFrames (data frames or dataframes) are two dimensional objects that bring tabular data to programming.

Polars

If you read a little about Python, you will hear about pandas. It is everywhere (including in the Anaconda set of packages). That’s because pandas was, for the longest time, the only package to deal with DataFrames in Python.

In recent years however, a much better library called Polars has been developed. It is faster, has a better syntax, handles missing data much better, and allows to work with larger datasets on the same hardware. In this section, we will learn how to use it.

Reading in data

Load the Polars package as the alias pl (this is the standard way to load Polars):

import polars as pl

Create a DataFrame that we call artists with data from the MoMA collection project on artists:

artists = pl.read_csv("https://github.com/MuseumofModernArt/collection/raw/refs/heads/main/Artists.csv")

As you can see, Polars can read in data from a URL. To get that URL however, I had to click on the Artists.csv link in the main collection project, right-click on Raw, and copy the link. That’s the link that I used in the code above.

Of course Polars can also read in files that you have in your computer.

Here, the file is a CSV file, so I am using the function polars.read_csv. There are other functions for other file formats (e.g. polars.read_json, polars.read_parquet).

Exploring the data

Now, I can print the DataFrame:

artists
shape: (15_916, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
1 "Robert Arneson" "American, 1930–1992" "American" "male" 1930 1992 null null
2 "Doroteo Arnaiz" "Spanish, born 1936" "Spanish" "male" 1936 0 null null
3 "Bill Arnold" "American, born 1941" "American" "male" 1941 0 null null
4 "Charles Arnoldi" "American, born 1946" "American" "male" 1946 0 "Q1063584" 500027998
5 "Per Arnoldi" "Danish, born 1941" "Danish" "male" 1941 0 null null
141956 "Julian Pałka" "Polish, 1923 – 2002" "Polish" "male" 1923 2002 null null
141957 "A. Vedeneev" "Russian, 1891 – 1950" "Russian" null 1891 1950 null null
141958 "Yakov Zinovievich Shtoffer" "Russian, 1906 – 1951" "Russian" "male" 1906 1951 null null
142033 "M/M" "French" "French" null 0 0 null null
142034 "Klaus Ferentschik" "German, born 1957" "German" null 1957 0 null null


It has 15,908 rows and 9 columns. This means that there are 15,908 different artists in this dataset and that there are 9 variables (or distinct types of information) for each artist.

You can also see the data type for each column (variable).

I can print the first 5 rows:

artists.head()
shape: (5, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
1 "Robert Arneson" "American, 1930–1992" "American" "male" 1930 1992 null null
2 "Doroteo Arnaiz" "Spanish, born 1936" "Spanish" "male" 1936 0 null null
3 "Bill Arnold" "American, born 1941" "American" "male" 1941 0 null null
4 "Charles Arnoldi" "American, born 1946" "American" "male" 1946 0 "Q1063584" 500027998
5 "Per Arnoldi" "Danish, born 1941" "Danish" "male" 1941 0 null null


Or the first 3 rows:

artists.head(3)
shape: (3, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
1 "Robert Arneson" "American, 1930–1992" "American" "male" 1930 1992 null null
2 "Doroteo Arnaiz" "Spanish, born 1936" "Spanish" "male" 1936 0 null null
3 "Bill Arnold" "American, born 1941" "American" "male" 1941 0 null null


Or the last 5 rows:

artists.tail()
shape: (5, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
141956 "Julian Pałka" "Polish, 1923 – 2002" "Polish" "male" 1923 2002 null null
141957 "A. Vedeneev" "Russian, 1891 – 1950" "Russian" null 1891 1950 null null
141958 "Yakov Zinovievich Shtoffer" "Russian, 1906 – 1951" "Russian" "male" 1906 1951 null null
142033 "M/M" "French" "French" null 0 0 null null
142034 "Klaus Ferentschik" "German, born 1957" "German" null 1957 0 null null

Your turn:

Print the last 15 rows of this DataFrame.

I can get a random sample of rows. This is very useful if you want to have an idea of what is going on in the data and not focus on the first or last elements which are likely to look similar.

For this, I first need to load the random module from Python’s standard library:

import random

Then I can draw a single random samples with:

artists.sample()
shape: (1, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
6507 "Katsuro Yoshida" "Japanese" "Japanese" "male" 0 0 null null


Or I can draw any number of samples:

artists.sample(10)
shape: (10, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
129 "Harold Altman" "American, 1924–2003" "American" "male" 1924 2003 null 500028706
5154 "Alberto Sani" "Italian, 1900–1964" "Italian" "male" 1900 1964 null null
6233 "Bernard Walsh" "American, 1912–2004" "American" "male" 1912 2004 null null
7435 "Donald Moffett" "American, born 1955" "American" "male" 1955 0 null null
7929 "Joseph Grigely" "American, born 1956" "American" "male" 1956 0 null null
7973 "Yevgeny Khaldei" "Russian, 1917–1998" "Russian" "male" 1917 1998 null null
8039 "Bradford Bissell" "American, born 1957" "American" "male" 1957 0 null null
11119 "Columbian Bronze Corp., Freepo… null null null 0 0 null null
34619 "El Anatsui" "Ghanaian, born 1944" "Ghanaian" "male" 1944 0 "Q1323876" 500034780
75665 "Fernanda Gomes" "Brazilian, born 1960" "Brazilian" "female" 1960 0 null null


I can get a list of columns with:

artists.columns
['ConstituentID',
 'DisplayName',
 'ArtistBio',
 'Nationality',
 'Gender',
 'BeginDate',
 'EndDate',
 'Wiki QID',
 'ULAN']

And a list of data types with:

artists.dtypes
[Int64, String, String, String, String, Int64, Int64, String, Int64]

I can get the schema (sets the names of the variables (columns) and their data types) with:

artists.schema
Schema([('ConstituentID', Int64),
        ('DisplayName', String),
        ('ArtistBio', String),
        ('Nationality', String),
        ('Gender', String),
        ('BeginDate', Int64),
        ('EndDate', Int64),
        ('Wiki QID', String),
        ('ULAN', Int64)])

Finally, I can get some summary statistics for all the variables with:

artists.describe()
shape: (9, 10)
statistic ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
str f64 str str str str f64 f64 str f64
"count" 15916.0 "15916" "13746" "13434" "12636" 15916.0 15916.0 "3248" 2931.0
"null_count" 0.0 "0" "2170" "2482" "3280" 0.0 0.0 "12668" 12985.0
"mean" 25439.306861 null null null null 1509.16518 731.405253 null 5.0007e8
"std" 31769.08116 null null null null 798.986602 956.051559 null 86599.942895
"min" 1.0 ""a.r." group" "1858–ca. 1910" "Afghan" "female" 0.0 0.0 "Q1000203" 5.00000006e8
"25%" 4443.0 null null null null 1864.0 0.0 null 5.00017574e8
"50%" 9752.0 null null null null 1924.0 0.0 null 5.00033033e8
"75%" 36070.0 null null null null 1949.0 1971.0 null 5.00114576e8
"max" 142034.0 "…XYZ Dot Dot Dot Ex Why Zed De… "Łódź, Poland, est. 1929–1936" "Zimbabwean" "transgender woman" 2020.0 2026.0 "Q993400" 5.00356571e8


Depending on the data, this can be useful or perfectly meaningless (as is the case here) if the variables are not numerical. It does show us something useful however: the number of missing data for each variables. Missing data in Polars are of the type polars.Null and you can see their numbers in the statistic called null_count.

Subsetting data

Selecting subsets of data is very useful to run analyses, create graphs, get some statistics, create new datasets, etc.

Selecting columns

In Polars, you can select a column with the function select:

artists.select(pl.col("Gender"))
shape: (15_916, 1)
Gender
str
"male"
"male"
"male"
"male"
"male"
"male"
null
"male"
null
null

Note that this prints the result, but it does not save the new DataFrame. If you want to create a new DataFrame with the result you need to assign that result to a new variable. Let’s call this new variable artists_gender:

artists_gender = artists.select(pl.col("Gender"))

As you can see, when you do that, it does not print the new DataFrame. If you want to print it, you now need to call it:

artists_gender
shape: (15_916, 1)
Gender
str
"male"
"male"
"male"
"male"
"male"
"male"
null
"male"
null
null


Now, if you want to see the unique entries for this column you can run:

artists.select(pl.col("Gender")).unique()
shape: (8, 1)
Gender
str
"female"
"transgender woman"
"non-binary"
"gender non-conforming"
"female (transwoman)"
null
"male"
"male (trans? ftm?)"

Your turn:

Create a DataFrame with the unique values of the column Nationality.

If we want to see how many artists are represented for each nationality, we can do:

artists.select(
    pl.col("Nationality").unique(maintain_order=True).alias("Nationality"),
    pl.col("Nationality").unique_counts().alias("Count"),
)
shape: (141, 2)
Nationality Count
str u64
"American" 5435
"Spanish" 160
"Danish" 123
"Italian" 550
"French" 876
"Hmong" 1
"Spirit Lake Dakota/Cheyenne Ri… 1
"Okinawan" 1
"Dominican" 1
"Kalaaleq" 1


And if we want to sort them by decreasing count:

artists.select(
    pl.col("Nationality").unique(maintain_order=True).alias("Nationality"),
    pl.col("Nationality").unique_counts().alias("Count"),
).sort("Count", descending=True)
shape: (141, 2)
Nationality Count
str u64
"American" 5435
null 2482
"German" 999
"French" 876
"British" 876
"Hmong" 1
"Spirit Lake Dakota/Cheyenne Ri… 1
"Okinawan" 1
"Dominican" 1
"Kalaaleq" 1


You can of course select multiple columns:

artists.select(
    pl.col("Gender"),
    pl.col("Nationality")
)
shape: (15_916, 2)
Gender Nationality
str str
"male" "American"
"male" "Spanish"
"male" "American"
"male" "American"
"male" "Danish"
"male" "Polish"
null "Russian"
"male" "Russian"
null "French"
null "German"

Selecting row

You can select all the rows matching some condition with the function filter:

artists.filter(pl.col("Gender") == "transgender woman")
shape: (1, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
135972 "Evelyn Taocheng Wang" "Dutch, born China 1981" "Dutch" "transgender woman" 1981 0 null null


Not a very represented group!

You can combine conditions:

artists.filter(
    pl.col("Gender") == "female",
    pl.col("Nationality") == "American",
)
shape: (1_186, 9)
ConstituentID DisplayName ArtistBio Nationality Gender BeginDate EndDate Wiki QID ULAN
i64 str str str str i64 i64 str i64
10 "Irene Aronson" "American, born Germany 1918" "American" "female" 1918 0 "Q19748568" 500042413
21 "Ruth Asawa" "American, 1926–2013" "American" "female" 1926 2013 "Q7382874" 500077806
31 "Dana Atchley" "American, 1941–2000" "American" "female" 1941 2000 null null
41 "Berenice Abbott" "American, 1898–1991" "American" "female" 1898 1991 "Q231861" 500020631
61 "Alice Adams" "American, born 1930" "American" "female" 1930 0 "Q4725709" 500033075
141044 "Elizabeth Browning Jackson" "American, born 1948" "American" "female" 1948 0 null null
141278 "Allison McCann" "American, born 1989" "American" "female" 1989 0 null null
141280 "Yuliya Parshina-Kottas" "Russian-American, born 1982" "American" "female" 1982 0 null null
141284 "Audra D.S. Burch" "American, born 1966" "American" "female" 1966 0 null null
141618 "Sarah Friedland" "American, born 1992" "American" "female" 1992 0 null null

Your turn:

How many Afghan artists are there in this dataset?

Based on what we learnt, there are 2 different ways to get to that info. Can you come up with both of them?

Switching to a new dataset

Looking at this repository from the MoMA again, try to load in Python the data for the artworks and print it.

How many nationalities are represented in this dataset?

Create a new DataFrame showing the years and the number of artworks for each year.