In [1]:

import seaborn as sns
data = sns.load_dataset('tips')

Filtering Data based on conditions in Pandas.¶

The ease of filtering in pandas is why I love pandas so much, we are going to take a look at single and multiple conditions for our filters.¶

We will use the tip data sets to show how this is done. This should be fun , let's begin.¶

In [3]:

import pandas as pd
%matplotlib inline
pd.options.display.max_rows = 10

In [4]:

data

Out[4]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

244 rows × 7 columns

In Pandas we have two main choices of methods for filtering the data.¶

1.loc which stands for location expects the location of the labels of your rows and/or columns¶

2.iloc which stands for index location expects the numeric index of your rows and/or columns¶

The general Syntax for both are¶

data.loc[ : , : ] where you have the location of your rows separated by a comma and followed by the location of your columns. if you do not pass in any specific location to rows or/and columns it will just return everything.¶

To be extra clear here data.loc[rowslices, colslices] and either one of the slices can take in a boolan too which we will cover in more depth with examples below.

data.iloc has a similar behavior but it can't take a boolan to filter your data and accepts only numeric index of your rows and/or columns¶

data.iloc[ 0:5 , 0:2 ] will return the first 5 rows and the first 2 columns from your data.¶

I want to filter my data to show me only the tips from the Female customers.
As you can see below we done this in a single line of code, but let's try to understand what exactly is happening

In [5]:

data.loc[data.sex == "Female"]

Out[5]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
11	35.26	5.00	Female	No	Sun	Dinner	4
14	14.83	3.02	Female	No	Sun	Dinner	2
16	10.33	1.67	Female	No	Sun	Dinner	3
...	...	...	...	...	...	...	...
226	10.09	2.00	Female	Yes	Fri	Lunch	2
229	22.12	2.88	Female	Yes	Sat	Dinner	2
238	35.83	4.67	Female	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

87 rows × 7 columns

lets break the code above into components,

loc
When we call the loc method, we are asking for a location in our data set, so the loc method expects an address if you will. An address for the rows and an address for the columns, But we can also pass in a boolan of True and False Values and it will only return all the rows which qualified as the True Values. additionally we can pass in a boolan for columns too, but if you just pass in a single boolan statment it will return all the columns and the rows with True values.
data.sex == "Female"
If you are familiar with Python you know that == is a conditional statement , it simply means if something is equal to something else, return True other wise return False.
We are asking our data to see if the sex column data.sex is equal to Female and if it is return True, and Pandas is smart enough to do this for every row in your data.
Puting it all together data.loc[data.sex == "Female"]
Now that we have a boolan as per the 2nd Point above, we can use that boolan to filter our data to only show the True values of the boolean which is the Female Customer. And thats how we can filter our data to show specific part of the data.

In [6]:

data.sex == "Female"

Out[6]:

0       True
1      False
2      False
3      False
4       True
       ...  
239    False
240     True
241    False
242    False
243     True
Name: sex, Length: 244, dtype: bool

Now that we understand the concept of the boolan , you should know we can make use of the and and or conditional statements too as refreser.

statment1 and statement2 will return True if both are True and False otherwise.
statement1 or statement2 will return True if either of the statements are True.

In [7]:

#example
1==1 and 2==2

Out[7]:

True

In [8]:

1==1 and 2==1

Out[8]:

False

In [9]:

1==1 or 2==1

Out[9]:

True

Pandas allows us to use and and or conditions in our conditionals in a loc method but the syntax is slightly different.
& represents and
| represents or
And we need to put each staement inside a bracket() for it to work, lets look at an example below

In [10]:

# I want to filter my data for female who gave tips during a dinner meal
data.loc[(data.sex=="Female") & (data.time == "Dinner")]

Out[10]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
11	35.26	5.00	Female	No	Sun	Dinner	4
14	14.83	3.02	Female	No	Sun	Dinner	2
16	10.33	1.67	Female	No	Sun	Dinner	3
...	...	...	...	...	...	...	...
219	30.14	3.09	Female	Yes	Sat	Dinner	4
229	22.12	2.88	Female	Yes	Sat	Dinner	2
238	35.83	4.67	Female	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

52 rows × 7 columns

So just to recap what's happening above we are using an and conditional with two statement:

is the sex of the customer Female?
and
is the time of the meal Dinner?

If they are both True then Return True, and filter my data based on the final True values which is females at dinner.

Let's use the same example with an or statment which should return any row with female as sex, or dinner as time.

Note The Sample Method just shuffles the data so we can see we are indeed returning either female or dinnner.

In [11]:

# I want to filter my data for female who gave tips during a dinner meal
data.loc[(data.sex=="Female") | (data.time == "Dinner")].sample(10)

Out[11]:

	total_bill	tip	sex	smoker	day	time	size
13	18.43	3.00	Male	No	Sun	Dinner	4
203	16.40	2.50	Female	Yes	Thur	Lunch	2
9	14.78	3.23	Male	No	Sun	Dinner	2
119	24.08	2.92	Female	No	Thur	Lunch	4
174	16.82	4.00	Male	Yes	Sun	Dinner	2
60	20.29	3.21	Male	Yes	Sat	Dinner	2
185	20.69	5.00	Male	No	Sun	Dinner	5
152	17.26	2.74	Male	No	Sun	Dinner	3
219	30.14	3.09	Female	Yes	Sat	Dinner	4
55	19.49	3.51	Male	No	Sun	Dinner	2

Some advance filtering for the curious ones, as I mentioned above we can use boolans to filter our data.¶

The complicity of our boolan and our statemnt is entirely up to us to decide so let's do some math in our boolan and filter our data based on that.¶

Let's say I want to see if the total_bill - tip == 11 or not. 11 is completley random, I'm just doing this to prove what can be done.¶

In [12]:

data.loc[(data.total_bill - data.tip) ==11.0]

Out[12]:

	total_bill	tip	sex	smoker	day	time	size
110	14.0	3.0	Male	No	Sat	Dinner	2
198	13.0	2.0	Female	Yes	Thur	Lunch	2
202	13.0	2.0	Female	Yes	Thur	Lunch	2

Let's do one which is more realistic, I want to filter all the tips that were more than 40% of the total_bill.¶

In [13]:

data.loc[(data.tip/ data.total_bill ) > 0.4]

Out[13]:

	total_bill	tip	sex	smoker	day	time	size
172	7.25	5.15	Male	Yes	Sun	Dinner	2
178	9.60	4.00	Female	Yes	Sun	Dinner	2

There we have it, it seems like people are in a better mood at Sunday Dinners. :)

In [14]:

data.iloc[pd.np.r_[0:5,7,9],[1]]

Out[14]:

	tip
0	1.01
1	1.66
2	3.50
3	3.31
4	3.61
7	3.12
9	3.23