In [1]:
import seaborn as sns
%matplotlib inline
In [8]:
import pandas as pd
pd.options.display.max_rows =10
data = sns.load_dataset("flights")

GroupBy

We are going to explore grouping data in pandas and how we can use the veriety of methods designed for the GroupBy object. I am going to use a very basic example data of flights which has year , month and number of passengers for that year and month.

What is a GroupBy

before we continue with how the groupby works let's try to understand the GroupBy object in Pandas. GroupBy is an object in Pandas, which means it has it's own methods and properties which we can work it.
GroupBy objects are returned when you call a groupby method to a DataFrame or a Series which we will see below.

In [9]:
data
Out[9]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
... ... ... ...
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432

144 rows × 3 columns

In [15]:
data.groupby("month").passengers.sum()
Out[15]:
month
January      2901
February     2820
March        3242
April        3205
May          3262
             ... 
August       4213
September    3629
October      3199
November     2794
December     3142
Name: passengers, Length: 12, dtype: int64
Let's walk through the code above step by step.
  1. We have a DataFrame data as per above.
  2. we called the groupby() method to the DataFrame data.
  3. we passed in "month" as a parameter to our groupby method, this tells the method that I wish to group my data based on the column month, we are grouping by one column in this example but we will see in the next example that we can have as many columns as we want in the groupby method.

  4. data.groupby("month) produces a Groupby object
    data.groupby("month")
    pandas.core.groupby.DataFrameGroupBy object at 0x00000229658F9F98

  5. Now that we have a GroupBy object we can chain other methods, in this case I want to sum the number of passengers for each group, since our data is grouped by month, I can call the column passengers and used the method sum to get the result.

In [16]:
data.groupby("month")
Out[16]:
<pandas.core.groupby.DataFrameGroupBy object at 0x00000229658F9F98>

Let's look an example with more columns in the groupby

As I menthoned above you can group your data with as many columns as you want in the groupby method. The key thing is to remember whenever you are using more than one column in any method including groupby in pandas you need to pass those columns into a list, by using [ ].

In [17]:
data.groupby(['year','month']).passengers.sum()
Out[17]:
year  month    
1949  January      112
      February     118
      March        132
      April        129
      May          121
                  ... 
1960  August       606
      September    508
      October      461
      November     390
      December     432
Name: passengers, Length: 144, dtype: int64

Let's try to understand the above result, above we have a data which has been grouped based on year and month, thus it will return the number of passenger for every month and year, we have two indexes, the first being the year and the second the month in that year. Of course this data has already been grouped so the result will be the same as the raw data, but I just done this to show how the groupby method works with multiple columns.

Some of the GroupBy Methods, some kind of a cheat sheet for GroupBy if you will.

In [20]:
my_group = data.groupby(['month'])
In [21]:
my_group.groups
Out[21]:
{'April': Int64Index([3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135], dtype='int64'),
 'August': Int64Index([7, 19, 31, 43, 55, 67, 79, 91, 103, 115, 127, 139], dtype='int64'),
 'December': Int64Index([11, 23, 35, 47, 59, 71, 83, 95, 107, 119, 131, 143], dtype='int64'),
 'February': Int64Index([1, 13, 25, 37, 49, 61, 73, 85, 97, 109, 121, 133], dtype='int64'),
 'January': Int64Index([0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132], dtype='int64'),
 'July': Int64Index([6, 18, 30, 42, 54, 66, 78, 90, 102, 114, 126, 138], dtype='int64'),
 'June': Int64Index([5, 17, 29, 41, 53, 65, 77, 89, 101, 113, 125, 137], dtype='int64'),
 'March': Int64Index([2, 14, 26, 38, 50, 62, 74, 86, 98, 110, 122, 134], dtype='int64'),
 'May': Int64Index([4, 16, 28, 40, 52, 64, 76, 88, 100, 112, 124, 136], dtype='int64'),
 'November': Int64Index([10, 22, 34, 46, 58, 70, 82, 94, 106, 118, 130, 142], dtype='int64'),
 'October': Int64Index([9, 21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141], dtype='int64'),
 'September': Int64Index([8, 20, 32, 44, 56, 68, 80, 92, 104, 116, 128, 140], dtype='int64')}

In the example above we called the groups method, which should return the groups and the correspinding indexes. for example for April, April appear in the index of 3,15,27,39, ...,135. let's prove that this is indeed a fact by looking at those indexes.

I am going to use the loc method, if you are unsure abot loc I have covered it in my filtering and indexing tutorials.

In [33]:
data.loc[[3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135],:]
Out[33]:
year month passengers
3 1949 April 129
15 1950 April 135
27 1951 April 163
39 1952 April 181
51 1953 April 235
... ... ... ...
87 1956 April 313
99 1957 April 348
111 1958 April 348
123 1959 April 396
135 1960 April 461

12 rows × 3 columns

In [41]:
#we can rank our columns in the groups
my_group.rank(ascending = False)
Out[41]:
passengers year
0 12.0 12.0
1 12.0 12.0
2 12.0 12.0
3 12.0 12.0
4 12.0 12.0
... ... ...
139 1.0 1.0
140 1.0 1.0
141 1.0 1.0
142 1.0 1.0
143 1.0 1.0

144 rows × 2 columns

In [42]:
#we can use mathematical functions such as sum
my_group.sum()
Out[42]:
year passengers
month
January 23454 2901
February 23454 2820
March 23454 3242
April 23454 3205
May 23454 3262
... ... ...
August 23454 4213
September 23454 3629
October 23454 3199
November 23454 2794
December 23454 3142

12 rows × 2 columns

In [43]:
my_group.count()
Out[43]:
year passengers
month
January 12 12
February 12 12
March 12 12
April 12 12
May 12 12
... ... ...
August 12 12
September 12 12
October 12 12
November 12 12
December 12 12

12 rows × 2 columns

In [47]:
my_group.median()
Out[47]:
year passengers
month
January 1954.5 223.0
February 1954.5 214.5
March 1954.5 251.5
April 1954.5 252.0
May 1954.5 252.0
... ... ...
August 1954.5 320.0
September 1954.5 285.5
October 1954.5 251.5
November 1954.5 220.0
December 1954.5 253.5

12 rows × 2 columns