In [1]:

import seaborn as sns
%matplotlib inline

In [8]:

import pandas as pd
pd.options.display.max_rows =10
data = sns.load_dataset("flights")

GroupBy¶

We are going to explore grouping data in pandas and how we can use the veriety of methods designed for the GroupBy object. I am going to use a very basic example data of flights which has year , month and number of passengers for that year and month.

What is a GroupBy¶

before we continue with how the groupby works let's try to understand the GroupBy object in Pandas. GroupBy is an object in Pandas, which means it has it's own methods and properties which we can work it.
GroupBy objects are returned when you call a groupby method to a DataFrame or a Series which we will see below.

In [9]:

data

Out[9]:

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
...	...	...	...
139	1960	August	606
140	1960	September	508
141	1960	October	461
142	1960	November	390
143	1960	December	432

144 rows × 3 columns

In [15]:

data.groupby("month").passengers.sum()

Out[15]:

month
January      2901
February     2820
March        3242
April        3205
May          3262
             ... 
August       4213
September    3629
October      3199
November     2794
December     3142
Name: passengers, Length: 12, dtype: int64

Let's walk through the code above step by step.¶

We have a DataFrame data as per above.
we called the groupby() method to the DataFrame data.
we passed in "month" as a parameter to our groupby method, this tells the method that I wish to group my data based on the column month, we are grouping by one column in this example but we will see in the next example that we can have as many columns as we want in the groupby method.
data.groupby("month) produces a Groupby object
data.groupby("month")
pandas.core.groupby.DataFrameGroupBy object at 0x00000229658F9F98
Now that we have a GroupBy object we can chain other methods, in this case I want to sum the number of passengers for each group, since our data is grouped by month, I can call the column passengers and used the method sum to get the result.

In [16]:

data.groupby("month")

Out[16]:

<pandas.core.groupby.DataFrameGroupBy object at 0x00000229658F9F98>

Let's look an example with more columns in the groupby¶

As I menthoned above you can group your data with as many columns as you want in the groupby method. The key thing is to remember whenever you are using more than one column in any method including groupby in pandas you need to pass those columns into a list, by using [ ].

In [17]:

data.groupby(['year','month']).passengers.sum()

Out[17]:

year  month    
1949  January      112
      February     118
      March        132
      April        129
      May          121
                  ... 
1960  August       606
      September    508
      October      461
      November     390
      December     432
Name: passengers, Length: 144, dtype: int64

Let's try to understand the above result, above we have a data which has been grouped based on year and month, thus it will return the number of passenger for every month and year, we have two indexes, the first being the year and the second the month in that year. Of course this data has already been grouped so the result will be the same as the raw data, but I just done this to show how the groupby method works with multiple columns.

Some of the GroupBy Methods, some kind of a cheat sheet for GroupBy if you will.¶

In [20]:

my_group = data.groupby(['month'])

In [21]:

my_group.groups

Out[21]:

{'April': Int64Index([3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135], dtype='int64'),
 'August': Int64Index([7, 19, 31, 43, 55, 67, 79, 91, 103, 115, 127, 139], dtype='int64'),
 'December': Int64Index([11, 23, 35, 47, 59, 71, 83, 95, 107, 119, 131, 143], dtype='int64'),
 'February': Int64Index([1, 13, 25, 37, 49, 61, 73, 85, 97, 109, 121, 133], dtype='int64'),
 'January': Int64Index([0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132], dtype='int64'),
 'July': Int64Index([6, 18, 30, 42, 54, 66, 78, 90, 102, 114, 126, 138], dtype='int64'),
 'June': Int64Index([5, 17, 29, 41, 53, 65, 77, 89, 101, 113, 125, 137], dtype='int64'),
 'March': Int64Index([2, 14, 26, 38, 50, 62, 74, 86, 98, 110, 122, 134], dtype='int64'),
 'May': Int64Index([4, 16, 28, 40, 52, 64, 76, 88, 100, 112, 124, 136], dtype='int64'),
 'November': Int64Index([10, 22, 34, 46, 58, 70, 82, 94, 106, 118, 130, 142], dtype='int64'),
 'October': Int64Index([9, 21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141], dtype='int64'),
 'September': Int64Index([8, 20, 32, 44, 56, 68, 80, 92, 104, 116, 128, 140], dtype='int64')}

In the example above we called the groups method, which should return the groups and the correspinding indexes. for example for April, April appear in the index of 3,15,27,39, ...,135. let's prove that this is indeed a fact by looking at those indexes.

I am going to use the loc method, if you are unsure abot loc I have covered it in my filtering and indexing tutorials.

In [33]:

data.loc[[3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135],:]

Out[33]:

	year	month	passengers
3	1949	April	129
15	1950	April	135
27	1951	April	163
39	1952	April	181
51	1953	April	235
...	...	...	...
87	1956	April	313
99	1957	April	348
111	1958	April	348
123	1959	April	396
135	1960	April	461

12 rows × 3 columns

In [41]:

#we can rank our columns in the groups
my_group.rank(ascending = False)

Out[41]:

	passengers	year
0	12.0	12.0
1	12.0	12.0
2	12.0	12.0
3	12.0	12.0
4	12.0	12.0
...	...	...
139	1.0	1.0
140	1.0	1.0
141	1.0	1.0
142	1.0	1.0
143	1.0	1.0

144 rows × 2 columns

In [42]:

#we can use mathematical functions such as sum
my_group.sum()

Out[42]:

	year	passengers
month
January	23454	2901
February	23454	2820
March	23454	3242
April	23454	3205
May	23454	3262
...	...	...
August	23454	4213
September	23454	3629
October	23454	3199
November	23454	2794
December	23454	3142

12 rows × 2 columns

In [43]:

my_group.count()

Out[43]:

	year	passengers
month
January	12	12
February	12	12
March	12	12
April	12	12
May	12	12
...	...	...
August	12	12
September	12	12
October	12	12
November	12	12
December	12	12

12 rows × 2 columns

In [47]:

my_group.median()

Out[47]:

	year	passengers
month
January	1954.5	223.0
February	1954.5	214.5
March	1954.5	251.5
April	1954.5	252.0
May	1954.5	252.0
...	...	...
August	1954.5	320.0
September	1954.5	285.5
October	1954.5	251.5
November	1954.5	220.0
December	1954.5	253.5

12 rows × 2 columns