How to pivot large tables in BigQuery?

Pivoting a table is a very common operation in data processing, but there is no direct function in BigQuery to perform such an operation. To solve this problem, I have written a Python module, BqPivot. It generates a SQL query to pivot a table, which can then be run in BigQuery. In this blog post, I will introduce you to this module. You can also find the code in my GitHub repository.

Let’s go through what a typical pivot operation looks like –

   Pclass  Survived  Sex
0  3       0         male
1  1       1         female
2  3       1         female
3  1       1         female
4  3       0         male
5  3       0         male
6  1       0         male
7  3       0         male
8  3       1         female
9  2       1         female

You have an index column – Pclass, a values column – Survived, and the column on which you need to perform the pivot – Sex. After the pivot operation, the resulting table would look like –

   Pclass  female_survived  male_survived
0  1       2                0
1  2       1                0
2  3       2                0

In the above example, we can get the desired pivoted table using the following query –

select Pclass, 
sum(case when Sex = "male" then Survived else 0 end) as male_survived,
sum(case when Sex = "female" then Survived else 0 end) as female_survived
from `table_name`
group by 1
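As a quick local sanity check, the same pivot can be reproduced with pandas. The sketch below uses the ten sample rows from above; pivot_table mirrors the SQL (group by Pclass, sum Survived per Sex):

```python
import pandas as pd

# The ten sample rows from the table above.
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
})

# Group by Pclass and sum Survived within each Sex; fill_value=0 plays
# the role of the `else 0` in the CASE statements.
pivoted = df.pivot_table(index="Pclass", columns="Sex",
                         values="Survived", aggfunc="sum", fill_value=0)
pivoted.columns = [f"{sex}_survived" for sex in pivoted.columns]
print(pivoted)
```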

Writing case statements for each unique value in the pivot column works well if the number of unique values is small. But imagine writing these queries for a column that has hundreds of unique values. It seems that we will have to either –

  • Write these annoying case statements again and again for each unique value of the column.
  • Or switch to Python or R and use a package like pandas to pivot the table. But then we would lose the “BigQuery advantage” of almost real-time analysis when the table is large.

After some initial struggle, I found a middle ground. As you can see, the case statements are mostly redundant. If we can get a list of all the unique values, we can simply loop over the list to generate a SQL query. Hence, I started writing loops in Python to generate the repetitive case statements, which I then ran in BigQuery.
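The loop approach can be sketched in a few lines of plain Python: generate one CASE expression per unique value of the pivot column, then stitch them into a full query –

```python
# Unique values of the pivot column, e.g. data["Sex"].unique().tolist()
unique_values = ["male", "female"]

# One CASE expression per unique value.
case_lines = [
    f'sum(case when Sex = "{value}" then Survived else 0 end) '
    f"as {value}_survived"
    for value in unique_values
]

# Stitch the pieces into the full pivot query.
query = ("select Pclass,\n"
         + ",\n".join(case_lines)
         + "\nfrom `table_name`\ngroup by 1")
print(query)
```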

But soon I realized there is still some room for automation. Hence, I came up with the Python module BqPivot. It takes the names of the various columns and the data as input and outputs a SQL query that can then be used in BigQuery to pivot your table. Let me give you a step-by-step introduction –

In order to run this, you need to have Python 3 and pandas installed on your system. Next, you need to copy the bq_pivot.py file into your working directory. You can find it in the GitHub repository mentioned above. After this, you need to import the BqPivot class.

from bq_pivot import BqPivot

Then you need to create an object of this class by passing the required arguments.

import pandas as pd

# The following example uses Kaggle's Titanic data. It can be found here -
# `https://www.kaggle.com/c/titanic/data`
data = pd.read_csv("titanic.csv")

# Using only the top 20 rows for demonstration -
data = data.head(20)

gen = BqPivot(data=data,
              index_col=["Pclass", "Sex"],
              pivot_col="Cabin",
              values_col="Survived",
              agg_fun="sum")

Let’s go through the input arguments –

  • data (the string path to the CSV file or a pandas data frame object) – The required data is quite flexible. It can work with the original table from BigQuery. But if that table is huge and you do not want to load the entire data onto your machine, you can also provide only the unique values of the pivot column in your data. Therefore, data generated from a GROUP BY on the pivot column would also work perfectly fine –
select Cabin from `titanic_data` group by 1
  • index_col (a Python list or string) – The names of the index columns in the query on which the group by needs to be performed. Again, these columns may or may not be present in the input data. The only requirement is that these names must be consistent with the original data on which the output query would be run.
  • pivot_col (string) – The name of the column on which the pivot needs to be performed. This column must be present in the input data.
  • values_col (string) – The name of the column on which the aggregation needs to be performed. Just like index_col, the name only needs to be consistent with the original data and may not be present in the data argument.
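To illustrate the flexibility of the data argument: in pandas, the local equivalent of that GROUP BY is simply deduplicating the pivot column, so a tiny one-column frame is all that is needed (a sketch with made-up Cabin values):

```python
import pandas as pd

# Suppose the full table is too large to load; only the distinct values
# of the pivot column are needed to generate the CASE statements.
data = pd.DataFrame({"Cabin": ["C85", "C123", "C85", "E46", "C123"]})

# Local equivalent of: select Cabin from `titanic_data` group by 1
unique_cabins = data[["Cabin"]].drop_duplicates().reset_index(drop=True)
print(unique_cabins)
```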

Let’s also go through the optional arguments –

  • agg_fun (string, optional) – The name of the SQL aggregation function. It defaults to the sum function. Examples – “avg”, “min”, “count” etc.
  • table_name (string, optional) – The name of the table in the query. If not provided, it is replaced by a placeholder which can later be replaced manually with the actual table name.
  • not_eq_default (numeric, optional) – The value used when the case condition is not satisfied, i.e. the else branch. For example, if one is doing a sum aggregation on the values column, then not_eq_default should be equal to 0, because the case statement part of the SQL query would look like –
sum(case when Cabin = "C85" then Survived else 0 end)

Similarly, if the aggregation function is min then the not_eq_default should be positive infinity. This argument defaults to 0.

  • add_col_nm_suffix (boolean, optional) – If True, then the original values column name will be added as a suffix in the new pivoted columns. It defaults to True.
  • custom_agg_fun (string, optional) – This can be used if one wants to give a customized aggregation function. The values column name should be replaced with {}. For example, if we want an aggregation function like sum(coalesce(Survived, 0)), then the custom_agg_fun argument would be sum(coalesce({}, 0)). It defaults to None. If provided, it overrides the agg_fun parameter.
  • prefix (string, optional) – A fixed string to add as a prefix in the pivoted column names. It defaults to None.
  • suffix (string, optional) – A fixed string to add as a suffix in the pivoted column names. It defaults to None.
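To make the interplay of these options concrete, here is a rough sketch of how a single pivoted column expression could be assembled from them. The build_column helper is hypothetical and only illustrates the idea, not the module's actual internals:

```python
def build_column(pivot_col, value, values_col, agg_fun="sum",
                 not_eq_default=0, custom_agg_fun=None,
                 add_col_nm_suffix=True, prefix="", suffix=""):
    """Hypothetical sketch: assemble one pivoted column expression."""
    # The CASE keeps the values column for matching rows, otherwise
    # falls back to not_eq_default.
    case = (f'case when {pivot_col} = "{value}" '
            f'then {values_col} else {not_eq_default} end')
    # custom_agg_fun, if given, overrides agg_fun; "{}" marks where the
    # expression is substituted.
    agg = (custom_agg_fun.format(case) if custom_agg_fun
           else f"{agg_fun}({case})")
    # Build the output column name from the optional naming arguments.
    name = value.lower()
    if add_col_nm_suffix:
        name += f"_{values_col.lower()}"
    return f"{agg} as {prefix}{name}{suffix}"

print(build_column("Cabin", "C85", "Survived"))
```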

Now that we have covered all the possible inputs, let’s run the command to generate the SQL query. We can use the write_query method to save the generated query to a text file.

gen.write_query("query.txt")

On opening the query.txt file you can find the following query –

select Pclass, Sex, 
sum(case when Cabin = "nan" then Survived else 0 end) as nan_survived,
sum(case when Cabin = "C85" then Survived else 0 end) as c85_survived,
sum(case when Cabin = "C123" then Survived else 0 end) as c123_survived,
sum(case when Cabin = "E46" then Survived else 0 end) as e46_survived,
sum(case when Cabin = "G6" then Survived else 0 end) as g6_survived,
sum(case when Cabin = "C103" then Survived else 0 end) as c103_survived
from <--insert-table-name-here-->
group by 1,2
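From here, the placeholder can be swapped for the real table path before running the query. A minimal sketch (the project and dataset names below are hypothetical, and the query string stands in for the contents of query.txt):

```python
# Contents of query.txt, abbreviated to one pivoted column.
query = ('select Pclass, Sex,\n'
         'sum(case when Cabin = "C85" then Survived else 0 end)'
         ' as c85_survived\n'
         'from <--insert-table-name-here-->\n'
         'group by 1,2')

# Substitute the placeholder with the actual table path.
query = query.replace("<--insert-table-name-here-->",
                      "`my-project.my_dataset.titanic`")

# With google-cloud-bigquery installed and credentials configured,
# the query can then be run directly:
# from google.cloud import bigquery
# df = bigquery.Client().query(query).to_dataframe()
print(query)
```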

I hope this will help you save some time while pivoting tables in BigQuery. If you have any questions, let me know in the comments section. I will do my best to answer. Before I sign out, here are some pointers on when and when not to use this module.

When to use this module?

  • When you want a lot of pivot columns and you do not have an intern who would write the query for you.
  • When your data set is very large and pivoting in Python or R would not give you real-time results.

When not to use this module?

  • When the number of unique values in the pivot column is fewer than 10.
  • When the number of unique values in the pivot column is more than 10,000. This would result in more than 10,000 columns in the resulting table, but BigQuery only supports tables with up to 10,000 columns.
  • When you are willing to work hard but not smart.

Thank You. 🙂
