How to get Unique values of my column.

In [1]:
import seaborn as sns
%matplotlib inline
In [2]:
import pandas as pd
pd.options.display.max_rows = 10

We are going to use the diamond data for this tutorial.

In [3]:
df = sns.load_dataset("diamonds")
df
Out[3]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

I want to see all the unique values in the color column.


  • I first select my column color by using a dot notation
  • I could have also used the bracket notation, which I will show next.
  • Once I have my color column selected, I now have a Series and I can use the unique method to get all the unique values in the column color, this will return a numpy array, which is kind of like a list but it's not.
In [4]:
df.color.unique()
Out[4]:
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
In [5]:
df["color"].unique()
Out[5]:
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)

To convert it into a list I can put the whole thing inside of a list method. such as below:

In [6]:
list(df.color.unique())
Out[6]:
['E', 'I', 'J', 'H', 'F', 'G', 'D']

Another method which I use quite a lot with the unique method is nunique which will return the number of unique records in your column which can be very handy in analysis.

In [7]:
df.color.nunique()
Out[7]:
7

unique method is a Series method, and it will not work on a dataframe, look at the error below.

In [8]:
df.unique()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-84bc62f43960> in <module>()
----> 1 df.unique()

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'unique'

But what about unique rows for my table?

As we saw in the error above the unique method will not work in a DataFrame(Table). which makes sense, it makes more sense for a table to have unique row rather than a unique value, to get the unique rows we can make use of the duplicated and drop_duplicates() methods.

if you want a quick answer and you want to see the unique rows of your data just use df.drop_duplicates() to get unique rows based on all the columns of your data.

I will cover duplicated rows in it's own tutorial in my site.

In [9]:
# no two rows will have the same values after calling this function based on all the columns available.
df.drop_duplicates()
Out[9]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53794 rows × 10 columns

duplicated() method returns a boolean rows which can be used in filtering our DataFrame.
I find myself using any() and all() quite alot when I am running analysis in my Table.

  1. df.duplicated().any() will return True if There any duplicated rows and False if there isn't any duplicated row.
  2. df.duplicated().all() will return if all the rows are duplicated, otherwise it will return False.

From the method below I can tell that my data does contain duplicated rows.

In [10]:
df.duplicated().any()
Out[10]:
True