In [1]:
import seaborn as sns
data = sns.load_dataset('tips')

Filtering Data based on conditions in Pandas.

The ease of filtering in pandas is why I love pandas so much, we are going to take a look at single and multiple conditions for our filters.

We will use the tip data sets to show how this is done. This should be fun , let's begin.
In [3]:
import pandas as pd
%matplotlib inline
pd.options.display.max_rows = 10
In [4]:
data
Out[4]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

In Pandas we have two main choices of methods for filtering the data.

1.loc which stands for location expects the location of the labels of your rows and/or columns

2.iloc which stands for index location expects the numeric index of your rows and/or columns

The general Syntax for both are

data.loc[ : , : ] where you have the location of your rows separated by a comma and followed by the location of your columns. if you do not pass in any specific location to rows or/and columns it will just return everything.

To be extra clear here data.loc[rowslices, colslices] and either one of the slices can take in a boolan too which we will cover in more depth with examples below.

data.iloc has a similar behavior but it can't take a boolan to filter your data and accepts only numeric index of your rows and/or columns

data.iloc[ 0:5 , 0:2 ] will return the first 5 rows and the first 2 columns from your data.

I want to filter my data to show me only the tips from the Female customers.
As you can see below we done this in a single line of code, but let's try to understand what exactly is happening

In [5]:
data.loc[data.sex == "Female"]
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
... ... ... ... ... ... ... ...
226 10.09 2.00 Female Yes Fri Lunch 2
229 22.12 2.88 Female Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

87 rows × 7 columns

lets break the code above into components,

  1. loc
    When we call the loc method, we are asking for a location in our data set, so the loc method expects an address if you will. An address for the rows and an address for the columns, But we can also pass in a boolan of True and False Values and it will only return all the rows which qualified as the True Values. additionally we can pass in a boolan for columns too, but if you just pass in a single boolan statment it will return all the columns and the rows with True values.

  2. data.sex == "Female"
    If you are familiar with Python you know that == is a conditional statement , it simply means if something is equal to something else, return True other wise return False.
    We are asking our data to see if the sex column data.sex is equal to Female and if it is return True, and Pandas is smart enough to do this for every row in your data.

  3. Puting it all together data.loc[data.sex == "Female"]
    Now that we have a boolan as per the 2nd Point above, we can use that boolan to filter our data to only show the True values of the boolean which is the Female Customer. And thats how we can filter our data to show specific part of the data.

In [6]:
data.sex == "Female"
Out[6]:
0       True
1      False
2      False
3      False
4       True
       ...  
239    False
240     True
241    False
242    False
243     True
Name: sex, Length: 244, dtype: bool

Now that we understand the concept of the boolan , you should know we can make use of the and and or conditional statements too as refreser.

statment1 and statement2 will return True if both are True and False otherwise.
statement1 or statement2 will return True if either of the statements are True.

In [7]:
#example
1==1 and 2==2
Out[7]:
True
In [8]:
1==1 and 2==1
Out[8]:
False
In [9]:
1==1 or 2==1
Out[9]:
True

Pandas allows us to use and and or conditions in our conditionals in a loc method but the syntax is slightly different.
& represents and
| represents or
And we need to put each staement inside a bracket() for it to work, lets look at an example below

In [10]:
# I want to filter my data for female who gave tips during a dinner meal
data.loc[(data.sex=="Female") & (data.time == "Dinner")]
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
... ... ... ... ... ... ... ...
219 30.14 3.09 Female Yes Sat Dinner 4
229 22.12 2.88 Female Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

52 rows × 7 columns

So just to recap what's happening above we are using an and conditional with two statement:

  1. is the sex of the customer Female?
    and
  2. is the time of the meal Dinner?

If they are both True then Return True, and filter my data based on the final True values which is females at dinner.

Let's use the same example with an or statment which should return any row with female as sex, or dinner as time.

Note The Sample Method just shuffles the data so we can see we are indeed returning either female or dinnner.

In [11]:
# I want to filter my data for female who gave tips during a dinner meal
data.loc[(data.sex=="Female") | (data.time == "Dinner")].sample(10)
Out[11]:
total_bill tip sex smoker day time size
13 18.43 3.00 Male No Sun Dinner 4
203 16.40 2.50 Female Yes Thur Lunch 2
9 14.78 3.23 Male No Sun Dinner 2
119 24.08 2.92 Female No Thur Lunch 4
174 16.82 4.00 Male Yes Sun Dinner 2
60 20.29 3.21 Male Yes Sat Dinner 2
185 20.69 5.00 Male No Sun Dinner 5
152 17.26 2.74 Male No Sun Dinner 3
219 30.14 3.09 Female Yes Sat Dinner 4
55 19.49 3.51 Male No Sun Dinner 2

Some advance filtering for the curious ones, as I mentioned above we can use boolans to filter our data.

The complicity of our boolan and our statemnt is entirely up to us to decide so let's do some math in our boolan and filter our data based on that.
Let's say I want to see if the total_bill - tip == 11 or not. 11 is completley random, I'm just doing this to prove what can be done.
In [12]:
data.loc[(data.total_bill - data.tip) ==11.0]
Out[12]:
total_bill tip sex smoker day time size
110 14.0 3.0 Male No Sat Dinner 2
198 13.0 2.0 Female Yes Thur Lunch 2
202 13.0 2.0 Female Yes Thur Lunch 2

Let's do one which is more realistic, I want to filter all the tips that were more than 40% of the total_bill.

In [13]:
data.loc[(data.tip/ data.total_bill ) > 0.4]
Out[13]:
total_bill tip sex smoker day time size
172 7.25 5.15 Male Yes Sun Dinner 2
178 9.60 4.00 Female Yes Sun Dinner 2

There we have it, it seems like people are in a better mood at Sunday Dinners. :)

In [14]:
data.iloc[pd.np.r_[0:5,7,9],[1]]
Out[14]:
tip
0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
7 3.12
9 3.23