Let the Machines Learn

How to pivot large tables in BigQuery?

Pivoting a table is a very common operation in data processing. But there is no direct function in BigQuery to perform such operation. To solve this problem I have written a Python module, BqPivot. It generates a SQL query to pivot a table that can then be run in BigQuery. In this blog post, I will introduce you to this module. You can also find the code in my GitHub repository.

Let’s go through how a typical pivot operation looks like –

        Pclass   Survived  Sex
0	3	 0	   male
1	1	 1	   female
2	3	 1	   female
3	1	 1	   female
4	3	 0	   male
5	3	 0	   male
6	1	 0	   male
7	3	 0	   male
8	3	 1	   female
9	2	 1	   male

You have an index column – Pclass, a values column – Survived, and the column on which you need to perform the pivot – Sex. After the pivot operation, the resulting table would look like –

	Pclass	female_survived	male_survived
0	1	2	        0
1	2	1	        1
2	3	2	        0

In the above example, we can get the desired pivoted table using the following query –

select Pclass, 
sum(case when Sex = "male" then Survived else 0 end) as male_survived,
sum(case when Sex = "female" then Survived else 0 end) as female_survived
from `table_name`
group by 1

Writing case statements for each unique value in the pivot column works well if the number of unique values is small. But, imagine writing these queries for a column that has hundreds of unique values. It seems that we will have to either –

After some initial struggle, I found a middle ground. As you can notice the case statements are mostly redundant. If we can get a list of all the unique values, we can simply loop over the list to generate a SQL query. Hence, I started writing loops in Python to generate the repetitive case statements which I then used to run in BiqQuery.

But soon I realized there is still some room for automation. Hence, I came up with the Python module BqPivot. It takes the names of the various columns and the data as input and outputs a SQL query that can then be used in BigQuery to pivot your table. Let me give you a step-by-step introduction –

In order to run this, you need to have Python 3 and pandas installed on your system. Next, you need to copy this file in your working directory. You can find it here. After this, you need to import the BqPivot class.

from bq_pivot import BqPivot

Then you need to create an object of this class by passing the required arguments.

import pandas as pd

# The following example uses the kaggle's titanic data. It can be found here -
# `https://www.kaggle.com/c/titanic/data`
data = pd.read_csv("titanic.csv")

# Using only the top 20 rows for demonstration -
data = data.head(20)

gen = BqPivot(data=data,
              index_col=["Pclass", "Sex"],
              pivot_col="Cabin",
              values_col="Survived",
              agg_fun="sum")

Let’s go through the input arguments –

select Cabin from `titanic_data` group by 1

Let’s also go through the optional arguments –

sum(case when Cabin = C85 then Survived else 0)

Similarly, if the aggregation function is min then the not_eq_default should be positive infinity. This argument defaults to 0.

Now that we have covered all the possible inputs let’s run the command to generate the SQL query. We can run the write_query method to save the generated query in a text file.

gen.write_query("query.txt")

On opening the query.txt file you can find the following query –

select Pclass, Sex, 
sum(case when Cabin = "nan" then Survived else 0 end) as nan_survived,
sum(case when Cabin = "C85" then Survived else 0 end) as c85_survived,
sum(case when Cabin = "C123" then Survived else 0 end) as c123_survived,
sum(case when Cabin = "E46" then Survived else 0 end) as e46_survived,
sum(case when Cabin = "G6" then Survived else 0 end) as g6_survived,
sum(case when Cabin = "C103" then Survived else 0 end) as c103_survived
from <--insert-table-name-here-->
group by 1,2

I hope this will help you save some time while pivoting tables in BigQuery. If you have any questions, let me know in the comments section. I will do my best to answer. Before I sign out, here are some pointers on when and when not to use this module.

When to use this module?

When not to use this module?

Thank You. 🙂