What is a DataFrame?

At the core of pandas is the DataFrame object, which is like a sheet in Excel: it has columns and rows. Columns are called columns, and the row labels are known as the index of the DataFrame.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame
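
As a rough illustration of the input kinds listed above, the sketch below builds the same small table several ways. The variable names and sample values are made up for this example.

```python
import numpy as np
import pandas as pd

# Dict of lists (a dict of ndarrays or Series works the same way)
from_dict = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})

# 2-D numpy.ndarray; column names are supplied separately
from_2d = pd.DataFrame(np.array([[1, 4], [2, 5], [3, 6]]), columns=["one", "two"])

# Structured ndarray: the field names become the column names
records = np.array([(1, 4), (2, 5), (3, 6)], dtype=[("one", "i8"), ("two", "i8")])
from_records = pd.DataFrame(records)

# A single named Series becomes a one-column DataFrame
from_series = pd.DataFrame(pd.Series([1, 2, 3], name="one"))

# Another DataFrame
from_frame = pd.DataFrame(from_dict)
```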

A DataFrame is made up of Series objects; each Series can be thought of as a vector or a column.


To take advantage of pandas, we need to convert our data into either a DataFrame or a Series so that we can use the methods and APIs designed for each object. There are many methods and properties for data analysis, quick visualisations, filtering, working with dates and times, pivoting, custom formatting and more. I will cover some of the important ones in my tutorials and cover the less important ones in my projects where they are used.



Data Flows with DataFrames

Note: pandas provides functions to read data from various sources, such as CSV, tables, HTML, JSON, text files and dictionaries, and convert it into a DataFrame.

Usually, if there is a function to read a format, there is a matching method to write the data back to the same format, so you get a flow such as the one below:

Your File Format >> DataFrame Object >> Analysis and Manipulation of data >> Save as Your original File Format

And of course you can mix formats. For example, you can read a CSV file into a DataFrame using pd.read_csv(), analyse and manipulate it, and save the result as JSON or HTML using to_json() or to_html().
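
A minimal sketch of that flow, assuming a tiny made-up CSV held in memory (io.StringIO stands in for a file on disk, and the column names are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file path
csv_text = "name,score\nalice,90\nbob,75\n"
df = pd.read_csv(io.StringIO(csv_text))

# A small manipulation step: keep rows with a score above 80
high = df[df["score"] > 80]

# Called without a path, these return strings instead of writing files
as_json = high.to_json(orient="records")
as_html = high.to_html(index=False)
```

Passing a file path as the first argument to to_json() or to_html() would write to disk instead.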

As a rule of thumb, when dealing with input and output in pandas, you will have a top-level function for input, pd.read_format(), and a counterpart DataFrame method for output, df.to_format(), where format is a placeholder such as csv or json.
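
The pairing can be seen in a small round trip. This is only a sketch: it uses an in-memory buffer rather than a real file, and the sample data is invented.

```python
import io
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})

# Writers are DataFrame methods: df.to_csv(), df.to_json(), df.to_html(), ...
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Readers are top-level functions: pd.read_csv(), pd.read_json(), pd.read_html(), ...
buffer.seek(0)
restored = pd.read_csv(buffer)
```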

Creating a DataFrame object manually

Dictionary of lists/ndarrays

We can create a DataFrame using the pd.DataFrame() constructor.

You can make a DataFrame from a dictionary, where the keys become the column names and the values become the column data. The values will most likely be lists or ndarrays. I have personally used this approach quite a lot in my work, e.g. by populating lists dynamically in a loop based on some conditions and turning the final result into a DataFrame, which is super handy for a quick or an in-depth analysis.
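
The loop pattern described above might look like the following sketch. The condition (keep even numbers) and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical condition: keep only even numbers and record their squares
numbers = []
squares = []
for x in range(10):
    if x % 2 == 0:
        numbers.append(x)
        squares.append(x ** 2)

# Both lists grow together, so they always have the same length
df = pd.DataFrame({"number": numbers, "square": squares})
```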

Let's look at a basic example.
Note that in the example below I am not using lists or ndarrays as my dictionary values, so I can't just call pd.DataFrame(d): pandas has no idea how many rows there are or where to put 1 and 2. We need to explicitly pass an index argument to pd.DataFrame(), as below:

In [12]:
import pandas as pd  # skip if already imported earlier

# basic example of scalar values as dictionary values
d = {'one': 1, 'two': 2}
pd.DataFrame(d, index=[0])
Out[12]:
one two
0 1 2
In [13]:
# basic example of scalar dictionary values repeated across multiple rows
d = {'one': 1,'two': 2}
pd.DataFrame(d, index = [0,1,2,3,4,5])
Out[13]:
one two
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
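
To see what happens without the index argument, here is a small sketch that catches the error pandas raises. Wrapping each scalar in a one-element list is shown as one possible alternative, not the only one.

```python
import pandas as pd

d = {"one": 1, "two": 2}

try:
    pd.DataFrame(d)  # all values are scalars and no index is given
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Wrapping each scalar in a list is an alternative to passing index=[0]
df = pd.DataFrame({key: [value] for key, value in d.items()})
```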

But if you use lists as your dictionary values, which is the more common way, you don't need to pass the index argument when creating your DataFrame: pandas assumes each element of your lists is a row. Keep in mind that all the lists must have the same length, or you will get an error.

In [14]:
###example of correct dictionary with lists and no error
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
pd.DataFrame(d)
Out[14]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

In the example below the list lengths do not match: key "one" has a list with 4 values, whereas key "two" has a list with 3 values.

In [15]:
###example of incorrect dictionary with lists and error
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2.]}
pd.DataFrame(d)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-9d9707d4e0d9> in <module>()
      1 ###example of incorrect dictionary with lists and error
      2 d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2.]}
----> 3 pd.DataFrame(d)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460 
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462 
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6161     # figure out the index, if necessary
   6162     if index is None:
-> 6163         index = extract_index(arrays)
   6164     else:
   6165         index = _ensure_index(index)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in extract_index(data)
   6209             lengths = list(set(raw_lengths))
   6210             if len(lengths) > 1:
-> 6211                 raise ValueError('arrays must all be same length')
   6212 
   6213             if have_dicts:

ValueError: arrays must all be same length
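
Two possible ways to deal with lists of unequal length are sketched below: checking the lengths up front, or wrapping each list in a Series so that pandas pads the shorter column with NaN. Whether padding is appropriate depends on your data, so treat this as an illustration only.

```python
import pandas as pd

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0]}

# Option 1: check the lengths before building the DataFrame
lengths = {key: len(values) for key, values in d.items()}
same_length = len(set(lengths.values())) == 1

# Option 2: wrap each list in a Series; pandas aligns the columns on the
# index and fills the missing position with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in d.items()})
```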

As mentioned above, each column in a DataFrame is a Series, and you can work with those Series individually or create your own. Let's revisit the same basic example to see what a Series is.

Below, I select the column named 'one' by using square brackets. To select multiple columns, pass a list of column names inside the same brackets.

In [20]:
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
data_frame = pd.DataFrame(d)  
#selecting a single column which is a series
data_frame['one']
Out[20]:
0    1.0
1    2.0
2    3.0
3    4.0
Name: one, dtype: float64
In [22]:
type(data_frame['one'])
Out[22]:
pandas.core.series.Series
In [21]:
# Selecting a list of columns. Here we only have two columns, but if we had more,
# only the ones in the list would show.
data_frame[['one','two']]
Out[21]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
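
One detail worth checking for yourself is the type each kind of selection returns. The short sketch below reuses the same data.

```python
import pandas as pd

data_frame = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]})

single = data_frame["one"]      # a string key returns a Series
as_frame = data_frame[["one"]]  # a list key returns a DataFrame, even for one column
```
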
An example of creating a Series from a list through a loop
  1. Create an empty list and call it series_1.
  2. Loop through the values 0 to 100.
  3. Append each value of x in the loop to the list.
  4. Pass the list to pd.Series() to get a Series.
  5. Now you have a Series with an index, which is handy on its own for statistical analysis, as we will see below.
In [35]:
series_1 = []
for x in range(0,101):
    series_1.append(x)
    
my_series = pd.Series(series_1)
my_series.head()
Out[35]:
0    0
1    1
2    2
3    3
4    4
dtype: int64
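
When the loop only appends consecutive integers, as it does here, a range can be passed to pd.Series() directly. The sketch below compares the two approaches; the names from_range and from_loop are made up for this example.

```python
import pandas as pd

# Same values as the loop: 0 through 100 inclusive
from_range = pd.Series(range(0, 101))

# The loop version, for comparison
series_1 = []
for x in range(0, 101):
    series_1.append(x)
from_loop = pd.Series(series_1)
```

The loop remains the better fit when values are added conditionally.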

Here I am using describe(), a method available on both Series and DataFrame, to get some summary statistics about the data.

In [29]:
my_series.describe()
Out[29]:
count    101.000000
mean      50.000000
std       29.300171
min        0.000000
25%       25.000000
50%       50.000000
75%       75.000000
max      100.000000
dtype: float64
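
Since describe() also works on a DataFrame, here is a brief sketch using the small two-column example from earlier. Each numeric column gets its own column of statistics.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]})

summary = df.describe()      # one column of statistics per numeric column
mean_one = df["one"].mean()  # the same figure describe() reports in its "mean" row
```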

Let's create another Series so we can then create a DataFrame from the two of them.

In [36]:
series_2 = []
for x in range(100,201):
    series_2.append(x)
    
my_series = pd.Series(series_2)
my_series.head()
Out[36]:
0    100
1    101
2    102
3    103
4    104
dtype: int64

We can create the DataFrame by passing the two lists in a dictionary, where the keys of that dictionary become the column names, as we have seen above.
If you are curious about head(): it shows the first 5 rows by default, but you can pass it any integer x and it will show the first x rows.

In [38]:
df = pd.DataFrame({'one':series_1,'two':series_2})
df.head()
Out[38]:
one two
0 0 100
1 1 101
2 2 102
3 3 103
4 4 104
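
A short sketch of head() with and without an argument, rebuilding the same two columns from ranges so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"one": list(range(0, 101)), "two": list(range(100, 201))})

first_five = df.head()    # default: first 5 rows
first_three = df.head(3)  # first 3 rows
shape = df.shape          # (number of rows, number of columns)
```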