How to find duplicated rows and how to remove them.

First, think about your data and what you mean by a duplicated row.

  • Do you consider a row duplicated based on the values in every column, or only in a selection of your columns?

pandas gives us two methods to work with here.

  • duplicated() tells you which rows are duplicated. You can also choose which columns to look at when identifying the duplicated rows.
  • drop_duplicates() removes duplicated rows. It also lets you choose which columns to look at when identifying the duplicated rows.

    By default, both duplicated() and drop_duplicates() look at all the columns when identifying duplicated values.

  • If you wish to look at only a subset of your columns, use the subset parameter of these methods, as in the sketch below.
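Here is a minimal sketch of both methods on a small made-up DataFrame (the toy data below is purely for illustration, not from the dataset used in this tutorial):

import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

toy.duplicated()                 # False, True, False, False: row 1 repeats row 0
toy.drop_duplicates()            # drops row 1, keeps rows 0, 2 and 3
toy.duplicated(subset="a")       # False, True, False, True: only column "a" is compared
toy.drop_duplicates(subset="a")  # keeps rows 0 and 2 only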
In [1]:
import seaborn as sns
import pandas as pd
pd.options.display.max_rows = 10

Take a look at the data we are going to work with.

In [2]:
df = sns.load_dataset("planets")
df
Out[2]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

1035 rows × 6 columns

I used the duplicated() method below. By default it looks at every column and returns True if the row is identical to another row in the data and False if the row is unique.

  • The keep=False argument tells the method to flag every occurrence of a duplicated row.
  • You can also use keep="first", which treats the first occurrence as the one to keep and flags only the later occurrences.
  • Or use keep="last" to treat the last occurrence as the one to keep and flag the earlier ones.

    The keep parameter works the same way in both the duplicated() and drop_duplicates() methods; a sketch follows below.
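Here is a minimal sketch of the three keep values, again on made-up toy data:

import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2]})  # rows 0 and 1 are duplicates

toy.duplicated(keep="first")  # False, True, False: the first occurrence is not flagged
toy.duplicated(keep="last")   # True, False, False: the last occurrence is not flagged
toy.duplicated(keep=False)    # True, True, False: every occurrence is flagged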
In [3]:
df.duplicated(keep=False)
Out[3]:
0       False
1       False
2       False
3       False
4       False
        ...  
1030    False
1031    False
1032    False
1033    False
1034    False
Length: 1035, dtype: bool
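The printed Series is truncated, so every visible value is False. A quick way to confirm that something was flagged is to sum the boolean Series; judging by the eight duplicated rows shown in the next output, this should return 8:

df.duplicated(keep=False).sum()  # should be 8: all occurrences of the four duplicate pairs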

As we have seen in the previous tutorials, when we want to filter our data we pass a boolean Series inside square brackets or into the loc indexer (an equivalent loc form is sketched after the output below).

  • The data will be filtered down to the rows where the value is True, which in this case means the duplicated rows.
  • I used the sort_values() method to put the duplicated rows next to each other so we can see them more easily.
In [4]:
df[df.duplicated(keep=False)].sort_values("method")
Out[4]:
           method  number  orbital_period  mass  distance  year
33        Imaging       1             NaN   NaN       NaN  2008
957       Imaging       1             NaN   NaN       NaN  2008
902  Microlensing       1             NaN   NaN       NaN  2008
903  Microlensing       1             NaN   NaN       NaN  2008
904  Microlensing       1             NaN   NaN       NaN  2009
923  Microlensing       1             NaN   NaN       NaN  2009
925  Microlensing       2             NaN   NaN    4080.0  2012
926  Microlensing       2             NaN   NaN    4080.0  2012
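As mentioned above, loc works just as well for this kind of boolean filtering; a quick sketch of the equivalent form:

# equivalent filtering with loc: same mask, same rows
mask = df.duplicated(keep=False)
df.loc[mask].sort_values("method")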

By default the duplicated() method looks at every column.

  • What if we wanted to look at only some of the columns, such as distance and year, when identifying duplicates?
  • In this case we can use the subset parameter, which takes either a single column name or a list of column names, and duplicates will be identified based on the given column(s) only (see the sketch below this list).
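For example, a single column name can be passed as a plain string:

# flags rows whose year value already appeared in an earlier row
df.duplicated(subset="year")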
In [5]:
df.duplicated(subset=["distance", "year"])
Out[5]:
0       False
1       False
2       False
3       False
4       False
        ...  
1030    False
1031    False
1032    False
1033    False
1034    False
Length: 1035, dtype: bool
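Summing this boolean Series counts the rows flagged as duplicates; based on the 1035 total rows and the 679 unique (distance, year) combinations we arrive at later, it should return 356:

df.duplicated(subset=["distance", "year"]).sum()  # should be 1035 - 679 = 356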
In [6]:
df[df.duplicated(subset=["distance", "year"], keep=False)].sort_values("distance")
Out[6]:
               method  number  orbital_period     mass  distance  year
262   Radial Velocity       3       18.315000  0.00850      6.06  2011
264   Radial Velocity       3       90.309000  0.01510      6.06  2011
263   Radial Velocity       3       40.114000  0.00755      6.06  2011
128   Radial Velocity       4        3.149420  0.00600      6.27  2005
127   Radial Velocity       4       66.800000  0.02200      6.27  2005
...               ...     ...             ...      ...       ...   ...
989           Transit       1        3.161575      NaN       NaN  2011
997           Transit       1        2.423804      NaN       NaN  2011
998           Transit       1        3.126088      NaN       NaN  2011
999           Transit       1        1.430370      NaN       NaN  2011
1001          Transit       1        2.143634      NaN       NaN  2011

466 rows × 6 columns

duplicated() tells us which rows are duplicated. If we want to delete those rows, we should use drop_duplicates(), which identifies duplicated rows in exactly the same way as duplicated().

Note:
If we don't pass anything to the subset parameter, all the columns are considered by default.

In [7]:
df.drop_duplicates()
Out[7]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

1031 rows × 6 columns
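A side note: drop_duplicates() also accepts keep=False, which removes every occurrence of a duplicated row instead of keeping one copy. Since we saw eight fully duplicated rows earlier, this should leave 1027 of the 1035 rows:

# drop all occurrences of duplicated rows, keeping none of them
df.drop_duplicates(keep=False)  # should leave 1035 - 8 = 1027 rows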

We can also drop duplicates based on a subset of the columns, as shown below:

In [8]:
df.drop_duplicates(subset=["distance", "year"])
Out[8]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

679 rows × 6 columns

Note that after removing the duplicated rows we ended up with 679 rows, as shown above.
We can verify this with a simple one-line calculation:

  1. Find the total number of rows in our data using the len() function. ---> len(df)
  2. Find the number of duplicated rows based on the distance and year columns. ---> df.duplicated(subset=["distance", "year"]).sum()
  3. Subtract the duplicated rows from the total rows and, sure enough, we get 679, the same as above.
In [9]:
len(df) - df.duplicated(subset=["distance", "year"]).sum()
Out[9]:
679
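An equivalent cross-check is to count the rows that survive drop_duplicates() directly:

# should also return 679, matching the calculation above
len(df.drop_duplicates(subset=["distance", "year"]))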

As we saw with the duplicated() method, we have control over which of the duplicated rows to keep and which to discard via the keep parameter. To keep the first occurrence of each duplicated row we use keep="first", as shown below:

In [10]:
df.drop_duplicates(subset=["number", "orbital_period", "mass"], keep="first")
Out[10]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

995 rows × 6 columns

And to keep the last occurrence of each duplicated record (row), we use keep="last".

In [11]:
df.drop_duplicates(subset=["number", "orbital_period", "mass"], keep="last")
Out[11]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

995 rows × 6 columns
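One last tip: notice the results above keep their original index labels (995 rows, but labels running up to 1034), so dropping rows leaves gaps in the index. If you want a fresh 0..n-1 index on the result, drop_duplicates() accepts ignore_index=True (available since pandas 1.0), or you can chain reset_index(drop=True):

# renumber the index after dropping duplicates
df.drop_duplicates(ignore_index=True)

# an equivalent approach
df.drop_duplicates().reset_index(drop=True)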