import seaborn as sns
%matplotlib inline
import pandas as pd
pd.options.display.max_rows = 10
We are going to use seaborn's diamonds dataset for this tutorial.
df = sns.load_dataset("diamonds")
df
53940 rows × 10 columns
df.color.unique()
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
df["color"].unique()
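The two calls above are the attribute and bracket ways of selecting the same column, so they should give the same result. Here is a minimal sketch, assuming a small hand-built DataFrame named `toy` (hypothetical data, not the diamonds table) so that it runs without downloading anything:

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds table.
toy = pd.DataFrame({"color": ["E", "I", "E", "J", "I"]})

attr_style = toy.color.unique()        # attribute access
bracket_style = toy["color"].unique()  # bracket access

# Both return the distinct values in order of first appearance.
print(list(attr_style))
print(list(bracket_style))
```

Bracket access is the safer habit when a column name contains spaces or clashes with an existing DataFrame attribute such as `count`.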
To convert the result into a list, I can pass the whole expression to the built-in list() function, as shown below:
list(df.color.unique())
['E', 'I', 'J', 'H', 'F', 'G', 'D']
Another method I use quite a lot alongside unique is nunique, which returns the number of unique values in a column. This can be very handy in analysis.
df.color.nunique()
7
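One detail worth knowing is how the two methods treat missing values: unique() includes a missing value in its result, while nunique() skips it unless told otherwise. A small sketch, assuming a hypothetical Series `s` with one missing entry (the diamonds color column itself may not contain any):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series(["E", "I", np.nan, "E"])

print(len(s.unique()))          # unique() keeps the missing value
print(s.nunique())              # nunique() ignores it by default
print(s.nunique(dropna=False))  # count the missing value as its own entry
```

If the two counts disagree on your own data, that is usually a sign the column contains missing values.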
The unique method is a Series method, so it will not work on a DataFrame. Look at the error below.
df.unique()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-84bc62f43960> in <module>()
----> 1 df.unique()

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'unique'
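If what you actually need is the unique values of every column, one possible workaround is to go column by column. The sketch below assumes a hypothetical two-column DataFrame named `toy`. DataFrame.nunique() does exist and gives the per-column counts, while the values themselves can be collected with a dictionary comprehension over Series.unique().

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds table.
toy = pd.DataFrame({
    "color": ["E", "I", "E"],
    "cut": ["Ideal", "Ideal", "Good"],
})

# DataFrame.nunique() counts the distinct values in each column.
counts = toy.nunique()
print(counts)

# Apply Series.unique() column by column to collect the values themselves.
uniques = {col: list(toy[col].unique()) for col in toy.columns}
print(uniques)
```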
As the error above shows, the unique method does not work on a DataFrame (table). That makes sense: for a table, it is more natural to ask for unique rows than for unique values. To get the unique rows, we can use the duplicated() and drop_duplicates() methods.
If you want a quick answer and just need to see the unique rows of your data, use df.drop_duplicates(), which returns the unique rows based on all the columns of your data.
I will cover duplicated rows in its own tutorial on my site.
# No two rows will have the same values after calling this function, based on all the columns available.
df.drop_duplicates()
53794 rows × 10 columns
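By default drop_duplicates() compares every column, but it also accepts `subset` and `keep` arguments to narrow the comparison or change which copy survives. A minimal sketch, assuming a hypothetical four-row DataFrame named `toy` rather than the real diamonds data:

```python
import pandas as pd

# Hypothetical toy data; rows 0 and 1 are identical in every column.
toy = pd.DataFrame({
    "color": ["E", "E", "I", "E"],
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "price": [326, 326, 334, 400],
})

# Compare all columns: only the exact repeat (row 1) is dropped.
all_cols = toy.drop_duplicates()

# Compare only "color": keep the first row seen for each color.
by_color = toy.drop_duplicates(subset=["color"])

# Same comparison, but keep the last row seen for each color.
last_by_color = toy.drop_duplicates(subset=["color"], keep="last")

print(all_cols)
print(by_color)
print(last_by_color)
```

Note that the original index labels are preserved, so the result may have gaps in its index; chain `.reset_index(drop=True)` if you want a fresh one.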
The duplicated() method returns a boolean Series, with one value per row, which can be used to filter our DataFrame. I find myself using any() and all() quite a lot when I am running analysis on my table.
The call below tells me that my data does contain duplicated rows.
df.duplicated().any()
True
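To make the filtering idea concrete, here is a small sketch of how the boolean result of duplicated() can be used as a mask. It assumes a hypothetical three-row DataFrame named `toy`. The same calls should work on the diamonds data, though the counts will differ.

```python
import pandas as pd

# Hypothetical toy data; row 1 repeats row 0.
toy = pd.DataFrame({
    "color": ["E", "E", "I"],
    "price": [326, 326, 334],
})

# True for every row that repeats an earlier row.
mask = toy.duplicated()

print(mask.any())  # is there at least one duplicated row?
print(mask.all())  # is every row a duplicate of an earlier one?
print(mask.sum())  # how many rows are repeats?

# Use the mask to look at the repeated rows only.
repeats = toy[mask]

# keep=False marks every copy, including the first occurrence.
every_copy = toy[toy.duplicated(keep=False)]

print(repeats)
print(every_copy)
```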