import seaborn as sns %matplotlib inline
import pandas as pd pd.options.display.max_rows =10 data = sns.load_dataset("flights")
We are going to explore grouping data in pandas and how we can use the veriety of methods designed for the GroupBy object. I am going to use a very basic example data of flights which has year , month and number of passengers for that year and month.
before we continue with how the groupby works let's try to understand the GroupBy object in Pandas. GroupBy is an object in Pandas, which means it has it's own methods and properties which we can work it. GroupBy objects are returned when you call a groupby method to a DataFrame or a Series which we will see below.
data
144 rows × 3 columns
data.groupby("month").passengers.sum()
month January 2901 February 2820 March 3242 April 3205 May 3262 ... August 4213 September 3629 October 3199 November 2794 December 3142 Name: passengers, Length: 12, dtype: int64
we passed in "month" as a parameter to our groupby method, this tells the method that I wish to group my data based on the column month, we are grouping by one column in this example but we will see in the next example that we can have as many columns as we want in the groupby method.
data.groupby("month) produces a Groupby object data.groupby("month") pandas.core.groupby.DataFrameGroupBy object at 0x00000229658F9F98
Now that we have a GroupBy object we can chain other methods, in this case I want to sum the number of passengers for each group, since our data is grouped by month, I can call the column passengers and used the method sum to get the result.
data.groupby("month")
<pandas.core.groupby.DataFrameGroupBy object at 0x00000229658F9F98>
As I menthoned above you can group your data with as many columns as you want in the groupby method. The key thing is to remember whenever you are using more than one column in any method including groupby in pandas you need to pass those columns into a list, by using [ ].
data.groupby(['year','month']).passengers.sum()
year month 1949 January 112 February 118 March 132 April 129 May 121 ... 1960 August 606 September 508 October 461 November 390 December 432 Name: passengers, Length: 144, dtype: int64
Let's try to understand the above result, above we have a data which has been grouped based on year and month, thus it will return the number of passenger for every month and year, we have two indexes, the first being the year and the second the month in that year. Of course this data has already been grouped so the result will be the same as the raw data, but I just done this to show how the groupby method works with multiple columns.
my_group = data.groupby(['month'])
my_group.groups
{'April': Int64Index([3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135], dtype='int64'), 'August': Int64Index([7, 19, 31, 43, 55, 67, 79, 91, 103, 115, 127, 139], dtype='int64'), 'December': Int64Index([11, 23, 35, 47, 59, 71, 83, 95, 107, 119, 131, 143], dtype='int64'), 'February': Int64Index([1, 13, 25, 37, 49, 61, 73, 85, 97, 109, 121, 133], dtype='int64'), 'January': Int64Index([0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132], dtype='int64'), 'July': Int64Index([6, 18, 30, 42, 54, 66, 78, 90, 102, 114, 126, 138], dtype='int64'), 'June': Int64Index([5, 17, 29, 41, 53, 65, 77, 89, 101, 113, 125, 137], dtype='int64'), 'March': Int64Index([2, 14, 26, 38, 50, 62, 74, 86, 98, 110, 122, 134], dtype='int64'), 'May': Int64Index([4, 16, 28, 40, 52, 64, 76, 88, 100, 112, 124, 136], dtype='int64'), 'November': Int64Index([10, 22, 34, 46, 58, 70, 82, 94, 106, 118, 130, 142], dtype='int64'), 'October': Int64Index([9, 21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141], dtype='int64'), 'September': Int64Index([8, 20, 32, 44, 56, 68, 80, 92, 104, 116, 128, 140], dtype='int64')}
In the example above we called the groups method, which should return the groups and the correspinding indexes. for example for April, April appear in the index of 3,15,27,39, ...,135. let's prove that this is indeed a fact by looking at those indexes.
I am going to use the loc method, if you are unsure abot loc I have covered it in my filtering and indexing tutorials.
data.loc[[3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135],:]
12 rows × 3 columns
#we can rank our columns in the groups my_group.rank(ascending = False)
144 rows × 2 columns
#we can use mathematical functions such as sum my_group.sum()
12 rows × 2 columns
my_group.count()
my_group.median()