How to get Unique values of my column.¶

In [1]:

import seaborn as sns
%matplotlib inline

In [2]:

import pandas as pd
pd.options.display.max_rows = 10

We are going to use the diamond data for this tutorial.

In [3]:

df = sns.load_dataset("diamonds")
df

Out[3]:

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75
...	...	...	...	...	...	...	...	...	...	...
53935	0.72	Ideal	D	SI1	60.8	57.0	2757	5.75	5.76	3.50
53936	0.72	Good	D	SI1	63.1	55.0	2757	5.69	5.75	3.61
53937	0.70	Very Good	D	SI1	62.8	60.0	2757	5.66	5.68	3.56
53938	0.86	Premium	H	SI2	61.0	58.0	2757	6.15	6.12	3.74
53939	0.75	Ideal	D	SI2	62.2	55.0	2757	5.83	5.87	3.64

53940 rows × 10 columns

I want to see all the unique values in the color column.¶

I first select my column color by using a dot notation
I could have also used the bracket notation, which I will show next.
Once I have my color column selected, I now have a Series and I can use the unique method to get all the unique values in the column color, this will return a numpy array, which is kind of like a list but it's not.

In [4]:

df.color.unique()

Out[4]:

array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)

In [5]:

df["color"].unique()

Out[5]:

array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)

To convert it into a list I can put the whole thing inside of a list method. such as below:

In [6]:

list(df.color.unique())

Out[6]:

['E', 'I', 'J', 'H', 'F', 'G', 'D']

Another method which I use quite a lot with the unique method is nunique which will return the number of unique records in your column which can be very handy in analysis.

In [7]:

df.color.nunique()

Out[7]:

unique method is a Series method, and it will not work on a dataframe, look at the error below.

In [8]:

df.unique()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-84bc62f43960> in <module>()
----> 1 df.unique()

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'unique'

But what about unique rows for my table?¶

As we saw in the error above the unique method will not work in a DataFrame(Table). which makes sense, it makes more sense for a table to have unique row rather than a unique value, to get the unique rows we can make use of the duplicated and drop_duplicates() methods.

if you want a quick answer and you want to see the unique rows of your data just use df.drop_duplicates() to get unique rows based on all the columns of your data.

I will cover duplicated rows in it's own tutorial in my site.

In [9]:

# no two rows will have the same values after calling this function based on all the columns available.
df.drop_duplicates()

Out[9]:

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75
...	...	...	...	...	...	...	...	...	...	...
53935	0.72	Ideal	D	SI1	60.8	57.0	2757	5.75	5.76	3.50
53936	0.72	Good	D	SI1	63.1	55.0	2757	5.69	5.75	3.61
53937	0.70	Very Good	D	SI1	62.8	60.0	2757	5.66	5.68	3.56
53938	0.86	Premium	H	SI2	61.0	58.0	2757	6.15	6.12	3.74
53939	0.75	Ideal	D	SI2	62.2	55.0	2757	5.83	5.87	3.64

53794 rows × 10 columns

duplicated() method returns a boolean rows which can be used in filtering our DataFrame.
I find myself using any() and all() quite alot when I am running analysis in my Table.

df.duplicated().any() will return True if There any duplicated rows and False if there isn't any duplicated row.
df.duplicated().all() will return if all the rows are duplicated, otherwise it will return False.

From the method below I can tell that my data does contain duplicated rows.

In [10]:

df.duplicated().any()

Out[10]:

True