Let's import pandas and seaborn, seaborn is used to load sample data for this example. Take a look at the data below, we can already see some "NaN" values in the deck column so we know we have missing values. However we do need a more robust way of identifying missing values.
import seaborn as sns import pandas as pd pd.options.display.max_rows = 15 %matplotlib inline df = sns.load_dataset("titanic") df.head()
We can either useisna() method or isnull() method, they both do the same thing, however I preffer the isna() method, since it's more inline with the dropna() method which we will look at down the line. but it all comes down to your preferences, just know you have these two method to identify missing or null values.
when you call isna to a DataFrame it will go over all your columns and return True for the cell which has a missing value.
Of course this boolean is used so you can filter your data and see the missing values.
df.isna()
891 rows × 15 columns
df.isna().sum()
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64
Just to prove a point both isnull and isna do the same thing.
df.isnull()
and this one liner is a solid proof for this.
(df.isna() == df.isnull()).all().all()
True
Let's take a look at filling missing values...
df.age.fillna(0)
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 5 0.0 6 54.0 ... 884 25.0 885 39.0 886 27.0 887 19.0 888 0.0 889 26.0 890 32.0 Name: age, Length: 891, dtype: float64
Note: By default this does not over write the age column. To overwrite the age column you can do the one of the followings:
df.age.fillna(df.age.mean())
0 22.000000 1 38.000000 2 26.000000 3 35.000000 4 35.000000 5 29.699118 6 54.000000 ... 884 25.000000 885 39.000000 886 27.000000 887 19.000000 888 29.699118 889 26.000000 890 32.000000 Name: age, Length: 891, dtype: float64
Take a look at the example below, I want to fill the column deck's missing values with the value missing
df.deck.fillna("missing")
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-8-42e213f52202> in <module>() ----> 1 df.deck.fillna("missing") ~\Anaconda3\lib\site-packages\pandas\core\series.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs) 2687 axis=axis, inplace=inplace, 2688 limit=limit, downcast=downcast, -> 2689 **kwargs) 2690 2691 @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs) ~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast) 4327 new_data = self._data.fillna(value=value, limit=limit, 4328 inplace=inplace, -> 4329 downcast=downcast) 4330 4331 elif isinstance(value, (dict, ABCSeries)): ~\Anaconda3\lib\site-packages\pandas\core\internals.py in fillna(self, **kwargs) 3454 3455 def fillna(self, **kwargs): -> 3456 return self.apply('fillna', **kwargs) 3457 3458 def downcast(self, **kwargs): ~\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs) 3327 3328 kwargs['mgr'] = self -> 3329 applied = getattr(b, f)(**kwargs) 3330 result_blocks = _extend_blocks(applied, result_blocks) 3331 ~\Anaconda3\lib\site-packages\pandas\core\internals.py in fillna(self, value, limit, inplace, downcast, mgr) 2379 values = self.values if inplace else self.values.copy() 2380 values = self._try_coerce_result(values.fillna(value=value, -> 2381 limit=limit)) 2382 return [self.make_block(values=values)] 2383 ~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs) 116 else: 117 kwargs[new_arg_name] = new_arg_value --> 118 return func(*args, **kwargs) 119 return wrapper 120 return _deprecate_kwarg ~\Anaconda3\lib\site-packages\pandas\core\categorical.py in fillna(self, value, method, limit) 1680 1681 if not isna(value) and value not in self.categories: -> 1682 raise ValueError("fill value must be in categories") 1683 1684 mask = values == -1 ValueError: fill value must be in categories
The error i'm getting is fill value must be in categories. This means because we have a categorical column, we can't pass in any value we want, the replacement of the missing value must be a category so we need to add it as a category fist.
df.deck = df.deck.cat.add_categories("missing") df.deck.fillna("missing")
0 missing 1 C 2 missing 3 C 4 missing 5 missing 6 E ... 884 missing 885 missing 886 missing 887 B 888 missing 889 C 890 missing Name: deck, Length: 891, dtype: category Categories (8, object): [A, B, C, D, E, F, G, missing]
df.age.dtype
dtype('float64')
df.age.fillna("filling_missing",inplace = True)
dtype('O')
To delete missing values, we will make use of dropna() method.
dropna() Parameters:
if you just call dropna it will use the default parameters, which is how = "any" and axis = 0 or 'index'
df.dropna()
201 rows × 15 columns
dropna(1) will drop the columns that contain any missing value.
df.dropna(1)
891 rows × 12 columns
dropna(how = "all") will drop the rows which have all their records as null or missing value.
df.dropna(how = "all")
you can use subset to pass in which column(s) you wish to look at when discovering missing values. wheather you are using one column or number of columns you should always pass it as a list.
df.dropna(subset = ["age"])
df.dropna(subset = ["age","deck"], how = "all")
We learnt how to discover missing values using isna() and isnull() methods, both of these return a boolean. If you are familiar with Python, you know that it's very easy to reverse a boolan value to show the opposite. you would simply use not key word, however since we are dealing with columns operation we can't use not isna(). in pandas to reverse a boolean or a method which produces a boolean that we wish to use as a filter we can use ~ as a prefix.
Example:
df[~df.age.isna()] # or df.loc[~df.age.isna()]]
The function above will return all the non null values
df.age.notna()
0 True 1 True 2 True 3 True 4 True 5 True 6 True ... 884 True 885 True 886 True 887 True 888 True 889 True 890 True Name: age, Length: 891, dtype: bool
df[df.age.notna()]