You have to think about your data and what you mean by duplicated rows.
import seaborn as sns
import pandas as pd
pd.options.display.max_rows = 10
Take a look at the data we are going to work with.
df = sns.load_dataset("planets")
df
1035 rows × 6 columns
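I used the duplicated() method below, which by default looks at every column. With keep = False, as here, it returns True for every row that is identical to at least one other row in the data and False for every unique row. With the default keep = "first", the first occurrence of each duplicated row is marked False instead.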
df.duplicated(keep = False)
0       False
1       False
2       False
3       False
4       False
        ...
1030    False
1031    False
1032    False
1033    False
1034    False
Length: 1035, dtype: bool
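To make the effect of the keep parameter concrete, here is a minimal sketch on a small hand-made DataFrame. The toy data and the variable names are assumptions for illustration only and are not part of the planets dataset.

```python
import pandas as pd

# Hypothetical toy data (not the planets dataset): rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "method": ["Transit", "Imaging", "Transit", "Transit"],
    "year": [2010, 2012, 2010, 2010],
})

first = toy.duplicated()             # keep="first" is the default
last = toy.duplicated(keep="last")   # the last occurrence is not flagged
every = toy.duplicated(keep=False)   # every member of a duplicate group is flagged

print(first.tolist())  # [False, False, True, True]
print(last.tolist())   # [True, False, True, False]
print(every.tolist())  # [True, False, True, True]
```

With keep = False nothing is spared, which is why it is the convenient choice when we want to inspect all copies of a duplicated row side by side.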
As we have seen in the previous tutorials, when we want to filter our data we pass a boolean Series inside square brackets or to the loc accessor.
df[df.duplicated(keep = False)].sort_values("method")
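The two filtering styles mentioned above select the same rows. The sketch below shows this on a hypothetical toy DataFrame (the data and the names mask, with_brackets and with_loc are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "method": ["Transit", "Imaging", "Transit"],
    "year": [2010, 2012, 2010],
})

# Boolean Series: True for every row that has at least one identical twin.
mask = toy.duplicated(keep=False)

with_brackets = toy[mask]   # filter with square brackets
with_loc = toy.loc[mask]    # filter with the loc accessor

print(with_brackets.index.tolist())     # [0, 2]
print(with_brackets.equals(with_loc))   # True
```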
By default, the duplicated method looks at every column. To compare only certain columns, pass them to the subset parameter.
df.duplicated(subset = ["distance","year"])
df[df.duplicated(subset = ["distance","year"],keep= False)].sort_values("distance")
466 rows × 6 columns
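The difference between comparing every column and comparing only a subset can be seen on a small example. This is a sketch on hypothetical toy data, not on the planets dataset: the first two rows agree on distance and year but differ in method.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share distance and year, but not method.
toy = pd.DataFrame({
    "method": ["Transit", "Imaging", "Radial Velocity"],
    "distance": [10.5, 10.5, 77.4],
    "year": [2010, 2010, 2010],
})

# Comparing all columns: no two rows are fully identical.
all_cols = toy.duplicated(keep=False)

# Comparing only distance and year: rows 0 and 1 now count as duplicates.
by_subset = toy.duplicated(subset=["distance", "year"], keep=False)

print(all_cols.tolist())   # [False, False, False]
print(by_subset.tolist())  # [True, True, False]
```

This is why the choice of subset depends on what a duplicate means for your data: the narrower the subset, the more rows are treated as duplicates.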
duplicated() tells us which rows are duplicated. If we want to delete these rows, we should use drop_duplicates(), which identifies duplicated rows the same way duplicated() does.
Note: By default, all columns are considered if we don't pass any argument to the subset parameter.
df.drop_duplicates()
1031 rows × 6 columns
We can drop duplicates based on a subset of columns, as below:
df.drop_duplicates(subset= ["distance","year"])
679 rows × 6 columns
Note that after removing the duplicated rows we ended up with 679 rows, as shown above. We can verify this with a simple one-line calculation that subtracts the number of rows flagged as duplicates from the total number of rows:
len(df) - df.duplicated(subset =["distance","year"]).sum()
679
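The same check can be written as a general identity: with matching subset and keep arguments, the rows left by drop_duplicates() are exactly the rows that duplicated() did not flag. A minimal sketch on hypothetical toy data (the column list cols and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 share the same distance and year.
toy = pd.DataFrame({
    "method": ["Transit", "Imaging", "Transit", "Microlensing"],
    "distance": [10.5, 10.5, 77.4, 10.5],
    "year": [2010, 2010, 2011, 2010],
})
cols = ["distance", "year"]

# Rows that survive drop_duplicates (default keep="first").
n_kept = len(toy.drop_duplicates(subset=cols))

# Total rows minus rows flagged as duplicates (True counts as 1 in the sum).
n_expected = len(toy) - toy.duplicated(subset=cols).sum()

print(n_kept, n_expected)  # 2 2
```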
As we have seen with the duplicated method, the keep parameter controls which of the duplicated rows to keep and which to discard. To keep the first occurrence of each duplicated row we use keep = "first" (the default), as below:
df.drop_duplicates(subset = ['number', 'orbital_period', 'mass'], keep = "first")
995 rows × 6 columns
And to keep the last occurrence of each duplicated record (row), we use keep = "last".
df.drop_duplicates(subset = ['number', 'orbital_period', 'mass'], keep = "last")
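Because both calls return the same number of rows, the difference between keep = "first" and keep = "last" is easiest to see in the index labels of the rows that survive. The sketch below uses hypothetical toy data with made-up values; it also shows keep = False, which drop_duplicates() accepts as well and which discards every member of a duplicate group.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 share the same number and mass.
toy = pd.DataFrame({
    "number": [1, 1, 2, 1],
    "mass": [7.1, 7.1, 2.2, 7.1],
    "year": [2006, 2008, 2009, 2011],
})
cols = ["number", "mass"]

kept_first = toy.drop_duplicates(subset=cols, keep="first")
kept_last = toy.drop_duplicates(subset=cols, keep="last")
kept_none = toy.drop_duplicates(subset=cols, keep=False)

print(kept_first.index.tolist())  # [0, 2]
print(kept_last.index.tolist())   # [2, 3]
print(kept_none.index.tolist())   # [2]
```

Which option is appropriate depends on the columns outside the subset: here the surviving row carries year 2006 with keep = "first" and year 2011 with keep = "last".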