Missing values

Removing Missing Values and filling Missing values

Let's import pandas and seaborn, seaborn is used to load sample data for this example.
Take a look at the data below, we can already see some "NaN" values in the deck column so we know we have missing values.

However we do need a more robust way of identifying missing values.

In [1]:
import seaborn as sns
import pandas as pd
pd.options.display.max_rows = 15
%matplotlib inline

df = sns.load_dataset("titanic")
df.head()
Out[1]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

We can either useisna() method or isnull() method, they both do the same thing, however I preffer the isna() method, since it's more inline with the dropna() method which we will look at down the line.
but it all comes down to your preferences, just know you have these two method to identify missing or null values.

isna()

  • when you call isna to a DataFrame it will go over all your columns and return True for the cell which has a missing value.

  • Of course this boolean is used so you can filter your data and see the missing values.

  • But if you wanted some quic number of missing values for each of your columns, you can chain the sum method to your isna() and that will return the total number of missing values in each column.
In [2]:
df.isna()
Out[2]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 False False False False False False False False False False False True False False False
1 False False False False False False False False False False False False False False False
2 False False False False False False False False False False False True False False False
3 False False False False False False False False False False False False False False False
4 False False False False False False False False False False False True False False False
5 False False False True False False False False False False False True False False False
6 False False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 False False False False False False False False False False False True False False False
885 False False False False False False False False False False False True False False False
886 False False False False False False False False False False False True False False False
887 False False False False False False False False False False False False False False False
888 False False False True False False False False False False False True False False False
889 False False False False False False False False False False False False False False False
890 False False False False False False False False False False False True False False False

891 rows × 15 columns

The sum method for isna can be very handy, I now know, the age column has 177 rows of missing values and the deck column has 688 rows of missing values etc...

In [3]:
df.isna().sum()
Out[3]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Just to prove a point both isnull and isna do the same thing.

In [4]:
df.isnull()
Out[4]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 False False False False False False False False False False False True False False False
1 False False False False False False False False False False False False False False False
2 False False False False False False False False False False False True False False False
3 False False False False False False False False False False False False False False False
4 False False False False False False False False False False False True False False False
5 False False False True False False False False False False False True False False False
6 False False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 False False False False False False False False False False False True False False False
885 False False False False False False False False False False False True False False False
886 False False False False False False False False False False False True False False False
887 False False False False False False False False False False False False False False False
888 False False False True False False False False False False False True False False False
889 False False False False False False False False False False False False False False False
890 False False False False False False False False False False False True False False False

891 rows × 15 columns

and this one liner is a solid proof for this.

In [5]:
(df.isna() == df.isnull()).all().all()
Out[5]:
True
Great, now we know we have missing values, what should we do with them?
  1. We can either remove them but keep in mind we might lose valuable information.
  2. We can fill them with some value, this can be any value but it's worth thinking about this value and how it could impact your final data, using some of the more common methods of filling missing values, such as the mean of the column or median could be a good idea but it all depends on the probel you are trying to solve.

Let's take a look at filling missing values...

  • In the example below I first accessed my age column using the dot notation.
  • I then used the fillna() method, and this method will fill the missing values with whatever value you pass into it.
  • I used 0 in this example to fill the missing values in the age column with 0.
In [6]:
df.age.fillna(0)
Out[6]:
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       0.0
6      54.0
       ... 
884    25.0
885    39.0
886    27.0
887    19.0
888     0.0
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64


Note:
By default this does not over write the age column.
To overwrite the age column you can do the one of the followings:

  • df.age.fillna(0 , inplace=True)
    or have:
  • df.age = df.age.fillna(0)

    I did briefly mention that we could use a statistical integer to fill the missing values, this is specially usefull in Machine Learning Models where you could benefit from these missing values, rather than removing them or filling them with random values.

    In the exmaple below I filled the missing values of age with the mean of age column
In [7]:
df.age.fillna(df.age.mean())
Out[7]:
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
5      29.699118
6      54.000000
         ...    
884    25.000000
885    39.000000
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: age, Length: 891, dtype: float64

Column Data Types and fillna() method.

It's worth remembering that when you do use fillna() you can pretty much pass in any value or symbol and it will replace the missing values with the value you passed in, ***except for Categorical columns***.

However you should be carefull because:

  • Let's say your column was of type Integer and you passed in a value of String this will convert that column into an object type which is used for String Columns, thus every value in that column will become of type object.

Categorical types and fillna():

Take a look at the example below, I want to fill the column deck's missing values with the value missing

In [8]:
df.deck.fillna("missing")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-42e213f52202> in <module>()
----> 1 df.deck.fillna("missing")

~\Anaconda3\lib\site-packages\pandas\core\series.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   2687                                           axis=axis, inplace=inplace,
   2688                                           limit=limit, downcast=downcast,
-> 2689                                           **kwargs)
   2690 
   2691     @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)

~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4327                 new_data = self._data.fillna(value=value, limit=limit,
   4328                                              inplace=inplace,
-> 4329                                              downcast=downcast)
   4330 
   4331             elif isinstance(value, (dict, ABCSeries)):

~\Anaconda3\lib\site-packages\pandas\core\internals.py in fillna(self, **kwargs)
   3454 
   3455     def fillna(self, **kwargs):
-> 3456         return self.apply('fillna', **kwargs)
   3457 
   3458     def downcast(self, **kwargs):

~\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 

~\Anaconda3\lib\site-packages\pandas\core\internals.py in fillna(self, value, limit, inplace, downcast, mgr)
   2379         values = self.values if inplace else self.values.copy()
   2380         values = self._try_coerce_result(values.fillna(value=value,
-> 2381                                                        limit=limit))
   2382         return [self.make_block(values=values)]
   2383 

~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~\Anaconda3\lib\site-packages\pandas\core\categorical.py in fillna(self, value, method, limit)
   1680 
   1681             if not isna(value) and value not in self.categories:
-> 1682                 raise ValueError("fill value must be in categories")
   1683 
   1684             mask = values == -1

ValueError: fill value must be in categories

The error i'm getting is fill value must be in categories.
This means because we have a categorical column, we can't pass in any value we want,
the replacement of the missing value must be a category so we need to add it as a category fist.

  • below I used the add_categories to add "missing" as a category.
  • Now that missing is a category observed by the deck column I can replace the missing values with missing.
In [9]:
df.deck = df.deck.cat.add_categories("missing")
df.deck.fillna("missing")
Out[9]:
0      missing
1            C
2      missing
3            C
4      missing
5      missing
6            E
        ...   
884    missing
885    missing
886    missing
887          B
888    missing
889          C
890    missing
Name: deck, Length: 891, dtype: category
Categories (8, object): [A, B, C, D, E, F, G, missing]

Data Types and missing values example:

  • Take a look at the age column Data Type.
  • It's a "float64" type. which is a decimal number.
In [10]:
df.age.dtype
Out[10]:
dtype('float64')
  • I am now going to fill the missing values of age column with a Text which says filling_missing.
  • I want to overwrite the age column so I use inplace = True.
In [11]:
df.age.fillna("filling_missing",inplace = True)
  • If I look at the Column type for age now, what do you think is going to show?
In [12]:
df.age.dtype
Out[12]:
dtype('O')
  • it is of type 'O' which stands for Object, so it's always good to think about what your filling your missing values with before facing this problem.

    #### You should:
  • fill the missing values of a float column with a float.
  • fill the missing values of a text column with text.
  • fill the missing values of a categorical column with a category which we showed above.

Deleting Missing Values

To delete missing values, we will make use of dropna() method.

dropna() Parameters:

  • axis: 0 or 'index, 1 or 'columns'
    when you have axis = 0 or axis = 'index' dropna() will remove the rows that contains Missing Values.
    When you have axis = 1 or axis = 'columns' dropna() will remove Columns that contains Missing Values.

  • how : 'any' or 'all'
    When you have how = "any" it will drop the corresponding row/column if there is any missing value in the row/column.
    When you have how = "all" it will drop the corresponding row/column if every value is a missing value in the row/column.

  • subset:
    by default dropna() will look at all the columns/rows when discovering Missing values, you can use subset to pass in a column name or a list of column names, and it will only focus on those columns or rows.

  • thresh: a minimum threshold of missing values. for example let's say I had 3 rows of data and they all contained missing values, but I wanted to only select the row that had more than 10 missing value
    I could then use thresh = 10

  • inplace: True or False, by default it's set to False, if you set this to True it will overwrite the Column with the updated results.

Examples of dropna()

if you just call dropna it will use the default parameters, which is how = "any" and axis = 0 or 'index'

In [13]:
df.dropna()
Out[13]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
10 1 3 female 4 1 1 16.7000 S Third child False G Southampton yes False
11 1 1 female 58 0 0 26.5500 S First woman False C Southampton yes True
21 1 2 male 34 0 0 13.0000 S Second man True D Southampton yes True
23 1 1 male 28 0 0 35.5000 S First man True A Southampton yes True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
862 1 1 female 48 0 0 25.9292 S First woman False D Southampton yes True
867 0 1 male 31 0 0 50.4958 S First man True A Southampton no True
871 1 1 female 47 1 1 52.5542 S First woman False D Southampton yes False
872 0 1 male 33 0 0 5.0000 S First man True B Southampton no True
879 1 1 female 56 0 1 83.1583 C First woman False C Cherbourg yes False
887 1 1 female 19 0 0 30.0000 S First woman False B Southampton yes True
889 1 1 male 26 0 0 30.0000 C First man True C Cherbourg yes True

201 rows × 15 columns

dropna(1) will drop the columns that contain any missing value.

In [14]:
df.dropna(1)
Out[14]:
survived pclass sex age sibsp parch fare class who adult_male alive alone
0 0 3 male 22 1 0 7.2500 Third man True no False
1 1 1 female 38 1 0 71.2833 First woman False yes False
2 1 3 female 26 0 0 7.9250 Third woman False yes True
3 1 1 female 35 1 0 53.1000 First woman False yes False
4 0 3 male 35 0 0 8.0500 Third man True no True
5 0 3 male filling_missing 0 0 8.4583 Third man True no True
6 0 1 male 54 0 0 51.8625 First man True no True
... ... ... ... ... ... ... ... ... ... ... ... ...
884 0 3 male 25 0 0 7.0500 Third man True no True
885 0 3 female 39 0 5 29.1250 Third woman False no False
886 0 2 male 27 0 0 13.0000 Second man True no True
887 1 1 female 19 0 0 30.0000 First woman False yes True
888 0 3 female filling_missing 1 2 23.4500 Third woman False no False
889 1 1 male 26 0 0 30.0000 First man True yes True
890 0 3 male 32 0 0 7.7500 Third man True no True

891 rows × 12 columns

dropna(how = "all") will drop the rows which have all their records as null or missing value.

In [15]:
df.dropna(how = "all")
Out[15]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35 0 0 8.0500 S Third man True NaN Southampton no True
5 0 3 male filling_missing 0 0 8.4583 Q Third man True NaN Queenstown no True
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 0 3 male 25 0 0 7.0500 S Third man True NaN Southampton no True
885 0 3 female 39 0 5 29.1250 Q Third woman False NaN Queenstown no False
886 0 2 male 27 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female filling_missing 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

you can use subset to pass in which column(s) you wish to look at when discovering missing values. wheather you are using one column or number of columns you should always pass it as a list.

In [16]:
df.dropna(subset = ["age"])
Out[16]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35 0 0 8.0500 S Third man True NaN Southampton no True
5 0 3 male filling_missing 0 0 8.4583 Q Third man True NaN Queenstown no True
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 0 3 male 25 0 0 7.0500 S Third man True NaN Southampton no True
885 0 3 female 39 0 5 29.1250 Q Third woman False NaN Queenstown no False
886 0 2 male 27 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female filling_missing 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

In [17]:
df.dropna(subset = ["age","deck"], how = "all")
Out[17]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35 0 0 8.0500 S Third man True NaN Southampton no True
5 0 3 male filling_missing 0 0 8.4583 Q Third man True NaN Queenstown no True
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 0 3 male 25 0 0 7.0500 S Third man True NaN Southampton no True
885 0 3 female 39 0 5 29.1250 Q Third woman False NaN Queenstown no False
886 0 2 male 27 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female filling_missing 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

Not Null Values.

We learnt how to discover missing values using isna() and isnull() methods, both of these return a boolean.
If you are familiar with Python, you know that it's very easy to reverse a boolan value to show the opposite. you would simply use not key word, however since we are dealing with columns operation we can't use not isna().

in pandas to reverse a boolean or a method which produces a boolean that we wish to use as a filter we can use ~ as a prefix.

Example:

In [19]:
df[~df.age.isna()] # or df.loc[~df.age.isna()]]
Out[19]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35 0 0 8.0500 S Third man True NaN Southampton no True
5 0 3 male filling_missing 0 0 8.4583 Q Third man True NaN Queenstown no True
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 0 3 male 25 0 0 7.0500 S Third man True NaN Southampton no True
885 0 3 female 39 0 5 29.1250 Q Third woman False NaN Queenstown no False
886 0 2 male 27 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female filling_missing 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

The function above will return all the non null values

But if that's too much work for you Pandas does provide notnull() and notna() methods.

In [20]:
df.age.notna()
Out[20]:
0      True
1      True
2      True
3      True
4      True
5      True
6      True
       ... 
884    True
885    True
886    True
887    True
888    True
889    True
890    True
Name: age, Length: 891, dtype: bool
In [21]:
df[df.age.notna()]
Out[21]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35 0 0 8.0500 S Third man True NaN Southampton no True
5 0 3 male filling_missing 0 0 8.4583 Q Third man True NaN Queenstown no True
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 0 3 male 25 0 0 7.0500 S Third man True NaN Southampton no True
885 0 3 female 39 0 5 29.1250 Q Third woman False NaN Queenstown no False
886 0 2 male 27 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female filling_missing 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns