How to find duplicated rows and how to remove them.

First, think about your data and what you mean by a duplicated row.

  • Do you consider a row duplicated based on the values in every column, or only in a selection of your columns?

pandas gives us two methods to work with here.

  • duplicated() tells you which rows are duplicated. You can also choose which columns to look at when identifying the duplicated rows.
  • drop_duplicates() removes duplicated rows. It also lets you choose which columns to look at when identifying the duplicated rows.

    By default, both duplicated() and drop_duplicates() look at all the columns when identifying duplicated values.

  • If you wish to look at only a subset of your columns, use the subset parameter of these methods, as in the sketch below.
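Here is a minimal sketch of both methods on a small made-up DataFrame (the toy data below is purely for illustration, not from the dataset used in this tutorial):

import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

toy.duplicated()                 # False, True, False, False: row 1 repeats row 0
toy.drop_duplicates()            # drops row 1, keeps rows 0, 2 and 3
toy.duplicated(subset="a")       # False, True, False, True: only column "a" is compared
toy.drop_duplicates(subset="a")  # keeps rows 0 and 2 only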
In [1]:
import seaborn as sns
import pandas as pd
pd.options.display.max_rows = 10

Take a look at the data we are going to work with.

In [2]:
df = sns.load_dataset("planets")
df
Out[2]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

1035 rows × 6 columns

I used the duplicated() method below. By default it looks at every column and returns True if the row is identical to another row in the data and False if the row is unique.

  • The keep=False argument tells the method to flag every occurrence of a duplicated row.
  • You can also use keep="first", which treats the first occurrence as the one to keep and flags only the later occurrences.
  • Or use keep="last" to treat the last occurrence as the one to keep and flag the earlier ones.

    The keep parameter works the same way in both the duplicated() and drop_duplicates() methods; a sketch follows below.
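Here is a minimal sketch of the three keep values, again on made-up toy data:

import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2]})  # rows 0 and 1 are duplicates

toy.duplicated(keep="first")  # False, True, False: the first occurrence is not flagged
toy.duplicated(keep="last")   # True, False, False: the last occurrence is not flagged
toy.duplicated(keep=False)    # True, True, False: every occurrence is flagged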
In [3]:
df.duplicated(keep=False)
Out[3]:
0       False
1       False
2       False
3       False
4       False
        ...  
1030    False
1031    False
1032    False
1033    False
1034    False
Length: 1035, dtype: bool
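The printed Series is truncated, so every visible value is False. A quick way to confirm that something was flagged is to sum the boolean Series; judging by the eight duplicated rows shown in the next output, this should return 8:

df.duplicated(keep=False).sum()  # should be 8: all occurrences of the four duplicate pairs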

As we have seen in the previous tutorials, when we want to filter our data we pass a boolean Series inside square brackets or into the loc indexer (an equivalent loc form is sketched after the output below).

  • The data will be filtered down to the rows where the value is True, which in this case means the duplicated rows.
  • I used the sort_values() method to put the duplicated rows next to each other so we can see them more easily.
In [4]:
df[df.duplicated(keep=False)].sort_values("method")
Out[4]:
           method  number  orbital_period  mass  distance  year
33        Imaging       1             NaN   NaN       NaN  2008
957       Imaging       1             NaN   NaN       NaN  2008
902  Microlensing       1             NaN   NaN       NaN  2008
903  Microlensing       1             NaN   NaN       NaN  2008
904  Microlensing       1             NaN   NaN       NaN  2009
923  Microlensing       1             NaN   NaN       NaN  2009
925  Microlensing       2             NaN   NaN    4080.0  2012
926  Microlensing       2             NaN   NaN    4080.0  2012
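As mentioned above, loc works just as well for this kind of boolean filtering; a quick sketch of the equivalent form:

# equivalent filtering with loc: same mask, same rows
mask = df.duplicated(keep=False)
df.loc[mask].sort_values("method")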

By default the duplicated() method looks at every column.

  • What if we wanted to look at only some of the columns, such as distance and year, when identifying duplicates?
  • In this case we can use the subset parameter, which takes either a single column name or a list of column names, and duplicates will be identified based on the given column(s) only (see the sketch below this list).
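For example, a single column name can be passed as a plain string:

# flags rows whose year value already appeared in an earlier row
df.duplicated(subset="year")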
In [5]:
df.duplicated(subset=["distance", "year"])
Out[5]:
0       False
1       False
2       False
3       False
4       False
        ...  
1030    False
1031    False
1032    False
1033    False
1034    False
Length: 1035, dtype: bool
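Summing this boolean Series counts the rows flagged as duplicates; based on the 1035 total rows and the 679 unique (distance, year) combinations we arrive at later, it should return 356:

df.duplicated(subset=["distance", "year"]).sum()  # should be 1035 - 679 = 356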
In [6]:
df[df.duplicated(subset=["distance", "year"], keep=False)].sort_values("distance")
Out[6]:
               method  number  orbital_period     mass  distance  year
262   Radial Velocity       3       18.315000  0.00850      6.06  2011
264   Radial Velocity       3       90.309000  0.01510      6.06  2011
263   Radial Velocity       3       40.114000  0.00755      6.06  2011
128   Radial Velocity       4        3.149420  0.00600      6.27  2005
127   Radial Velocity       4       66.800000  0.02200      6.27  2005
...               ...     ...             ...      ...       ...   ...
989           Transit       1        3.161575      NaN       NaN  2011
997           Transit       1        2.423804      NaN       NaN  2011
998           Transit       1        3.126088      NaN       NaN  2011
999           Transit       1        1.430370      NaN       NaN  2011
1001          Transit       1        2.143634      NaN       NaN  2011

466 rows × 6 columns

duplicated() tells us which rows are duplicated. If we want to delete those rows, we should use drop_duplicates(), which identifies duplicated rows in exactly the same way as duplicated().

Note:
If we don't pass anything to the subset parameter, all the columns are considered by default.

In [7]:
df.drop_duplicates()
Out[7]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

1031 rows × 6 columns
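A side note: drop_duplicates() also accepts keep=False, which removes every occurrence of a duplicated row instead of keeping one copy. Since we saw eight fully duplicated rows earlier, this should leave 1027 of the 1035 rows:

# drop all occurrences of duplicated rows, keeping none of them
df.drop_duplicates(keep=False)  # should leave 1035 - 8 = 1027 rows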

We can also drop duplicates based on a subset of the columns, as shown below:

In [8]:
df.drop_duplicates(subset=["distance", "year"])
Out[8]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

679 rows × 6 columns

Note that after removing the duplicated rows we ended up with 679 rows, as shown above.
We can verify this with a simple one-line calculation:

  1. Find the total number of rows in our data using the len() function. ---> len(df)
  2. Find the number of duplicated rows based on the distance and year columns. ---> df.duplicated(subset=["distance", "year"]).sum()
  3. Subtract the duplicated rows from the total rows and, sure enough, we get 679, the same as above.
In [9]:
len(df) - df.duplicated(subset=["distance", "year"]).sum()
Out[9]:
679
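An equivalent cross-check is to count the rows that survive drop_duplicates() directly:

# should also return 679, matching the calculation above
len(df.drop_duplicates(subset=["distance", "year"]))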

As we saw with the duplicated() method, we have control over which of the duplicated rows to keep and which to discard via the keep parameter. To keep the first occurrence of each duplicated row we use keep="first", as shown below:

In [10]:
df.drop_duplicates(subset=["number", "orbital_period", "mass"], keep="first")
Out[10]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

995 rows × 6 columns

And to keep the last occurrence of each duplicated record (row), we use keep="last".

In [11]:
df.drop_duplicates(subset=["number", "orbital_period", "mass"], keep="last")
Out[11]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

995 rows × 6 columns
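One last tip: notice the results above keep their original index labels (995 rows, but labels running up to 1034), so dropping rows leaves gaps in the index. If you want a fresh 0..n-1 index on the result, drop_duplicates() accepts ignore_index=True (available since pandas 1.0), or you can chain reset_index(drop=True):

# renumber the index after dropping duplicates
df.drop_duplicates(ignore_index=True)

# an equivalent approach
df.drop_duplicates().reset_index(drop=True)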