At the core of pandas is the DataFrame object, which is like a sheet in Excel with columns and rows. Columns are called columns, and the row labels are known as the index of a DataFrame.
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input, including dictionaries of lists, ndarrays, or Series.
A DataFrame is made up of Series objects; each Series can be thought of as a vector or a column.
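To make the "dict of Series" idea concrete, here is a minimal sketch. The Series values and the index labels "a", "b", "c" are made up purely for illustration.

```python
import pandas as pd

# Two Series sharing the same row labels (illustrative values only)
one = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
two = pd.Series([4.0, 5.0, 6.0], index=["a", "b", "c"])

# Each dictionary key becomes a column name, each Series becomes a column
frame = pd.DataFrame({"one": one, "two": two})

print(frame.columns.tolist())  # the column labels
print(frame.index.tolist())    # the row labels (the index)
print(type(frame["one"]))      # pulling out one column gives back a Series
```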
To take advantage of pandas, we first need to convert our data into either a DataFrame or a Series so that we can use the methods and APIs designed for each object. There are many methods and properties for data analysis, quick visualisations, filtering, working with dates and times, pivoting, custom formatting, and more. I will cover some of the important ones in my tutorials and the less important ones in my projects where they are used.
Note: pandas provides functions to read data into a DataFrame from various sources such as CSV, tables, HTML, JSON, text files, and dictionaries.
Usually, if there is a method to read a format, there is a similar method to write the data back to the same format, which gives a flow like the one below:
Your File Format >> DataFrame Object >> Analysis and Manipulation of data >> Save as Your original File Format
And of course you can mix formats. For example, you can read a CSV file into a DataFrame using pd.read_csv(), analyse and manipulate it, and save the result as JSON or HTML using to_json() or to_html().
As a rule of thumb, when dealing with input and output in pandas, you have a function for input, pd.read_format(), and a counterpart method on the DataFrame itself for output, df.to_format(), where "format" stands for the file format (csv, json, html and so on).
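The read, manipulate, write flow can be sketched as follows. The CSV text, the column names "name" and "score", and the filter threshold are all hypothetical; the data is kept in memory with io.StringIO so the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so no file is needed
csv_text = "name,score\nann,90\nbob,75\n"

# Input: a top-level pandas function
df = pd.read_csv(io.StringIO(csv_text))

# Manipulation: keep only the rows with a score above 80
high = df[df["score"] > 80]

# Output: methods on the DataFrame itself, here to two different formats
json_text = high.to_json(orient="records")
csv_out = high.to_csv(index=False)

print(json_text)
print(csv_out)
```

With no path argument, to_json() and to_csv() return the result as a string; passing a file path instead writes it to disk.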
We can create a DataFrame using the pd.DataFrame() constructor.
You can make a DataFrame from a dictionary, where the keys become the columns and the values become the DataFrame values. It's likely that your values will be lists or ndarrays. I personally use this approach quite a lot in my work, e.g. by writing a loop that populates the lists dynamically based on some conditions and then turning the final result into a DataFrame, which is super handy for a quick or an in-depth analysis.
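Here is a rough sketch of that loop pattern. The condition (keeping even numbers) and the column names "value" and "square" are invented for illustration; in real work they would depend on your data.

```python
import pandas as pd

# One list per future column
values = []
squares = []

for n in range(10):
    if n % 2 == 0:  # some condition deciding what gets recorded
        values.append(n)
        squares.append(n * n)

# Both lists have the same length, so they can become columns directly
result = pd.DataFrame({"value": values, "square": squares})
print(result)
```

Appending to every list inside the same branch keeps the lists the same length, which the DataFrame constructor requires.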
Let's look at a basic example. Note that in the example below I am not using lists or ndarrays as my dictionary values, so I can't just call pd.DataFrame(d): pandas has no idea how many rows there are or where to put 1 and 2, and it raises a ValueError asking for an index. We need to explicitly pass an index argument to pd.DataFrame(), as below:
import pandas as pd

# basic example of scalar values for dictionaries
d = {'one': 1, 'two': 2}
pd.DataFrame(d, index=[0])
# basic example of scalar dictionary values repeated across multiple rows
d = {'one': 1, 'two': 2}
pd.DataFrame(d, index=[0, 1, 2, 3, 4, 5])
But if you use lists as your dictionary values, which is the more common way, you don't need to pass the index argument when creating your DataFrame: pandas treats each element of a list as a row and builds a default integer index. Keep in mind that all the lists must be the same length, or you will get an error.
###example of correct dictionary with lists and no error
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
pd.DataFrame(d)
In the example below the lists do not match: key "one" has a list with 4 values, whereas key "two" has a list with 3 values.
###example of incorrect dictionary with lists and error
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2.]}
pd.DataFrame(d)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-9d9707d4e0d9> in <module>()
      1 ###example of incorrect dictionary with lists and error
      2 d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2.]}
----> 3 pd.DataFrame(d)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6161     # figure out the index, if necessary
   6162     if index is None:
-> 6163         index = extract_index(arrays)
   6164     else:
   6165         index = _ensure_index(index)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in extract_index(data)
   6209             lengths = list(set(raw_lengths))
   6210             if len(lengths) > 1:
-> 6211                 raise ValueError('arrays must all be same length')
   6212
   6213     if have_dicts:

ValueError: arrays must all be same length
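If your lists really do have different lengths, one possible workaround (not part of the example above, just a sketch) is to wrap each list in a Series first. pandas then aligns the columns on the index and fills the missing positions with NaN instead of raising.

```python
import pandas as pd

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2.]}

# Plain lists of unequal length are rejected with a ValueError
try:
    pd.DataFrame(d)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# Wrapping each list in a Series lets pandas align on the index
# and pad the shorter column with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in d.items()})
print(padded)
```

Whether padding with NaN is acceptable depends on your data; often a length mismatch is a sign of a bug upstream that is better fixed at the source.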
As mentioned above, each column in a DataFrame is a Series, and you can treat those Series individually or create your own. Let's look at the same basic example to see what a Series is.
Below I select the column named 'one' by using brackets. If I wanted multiple columns, I would pass a list inside the same brackets.
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
data_frame = pd.DataFrame(d)
# selecting a single column, which is a Series
data_frame['one']
0    1.0
1    2.0
2    3.0
3    4.0
Name: one, dtype: float64
type(data_frame['one'])
pandas.core.series.Series
# Selecting a list of columns. In this case we only have two columns, but if we had
# more than two, only the ones in the list would show.
data_frame[['one', 'two']]
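To see the list selection actually drop a column, here is a small sketch with a hypothetical third column named 'three' added to the same data.

```python
import pandas as pd

# Same data as above plus a made-up third column for illustration
data_frame = pd.DataFrame({
    'one': [1., 2., 3., 4.],
    'two': [4., 3., 2., 1.],
    'three': [10., 20., 30., 40.],
})

# A list inside the brackets returns a DataFrame with only those columns
subset = data_frame[['one', 'three']]
print(subset.columns.tolist())

# A single name gives a Series, a one-element list gives a DataFrame
print(type(data_frame['one']))
print(type(data_frame[['one']]))
```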
series_1 = []
for x in range(0, 101):
    series_1.append(x)
my_series = pd.Series(series_1)
my_series.head()
0    0
1    1
2    2
3    3
4    4
dtype: int64
Below I am using describe(), a method available on both Series and DataFrame, to get summary statistics about the data.
my_series.describe()
count    101.000000
mean      50.000000
std       29.300171
min        0.000000
25%       25.000000
50%       50.000000
75%       75.000000
max      100.000000
dtype: float64
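Since describe() works on both objects, here is a short sketch of the difference in what comes back, reusing the small two-column example from earlier: a Series gives a Series of statistics, while a DataFrame gives one column of statistics per numeric column.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})

# On a Series the result is itself a Series, indexed by statistic name
series_stats = frame['one'].describe()
print(series_stats['mean'])

# On a DataFrame the result is a DataFrame: statistics as rows, columns as columns
frame_stats = frame.describe()
print(frame_stats.loc['mean', 'two'])
```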
Let's create another series so we can then create a DataFrame from the two of them.
series_2 = []
for x in range(100, 201):
    series_2.append(x)
my_series = pd.Series(series_2)
my_series.head()
0    100
1    101
2    102
3    103
4    104
dtype: int64
We can create the DataFrame by passing a dictionary to pd.DataFrame(), where the keys of that dictionary become the column names, as we saw above. If you are curious about head(): it shows the first 5 rows by default, but you can pass it a positive integer n and it will show the first n rows.
df = pd.DataFrame({'one': series_1, 'two': series_2})
df.head()
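As a quick sketch of passing an argument to head(), the snippet below rebuilds the same two lists (using list(range(...)) as a shorter equivalent of the loops above) and asks for the first three rows only.

```python
import pandas as pd

# Same values as the loops above, built in one step
series_1 = list(range(0, 101))
series_2 = list(range(100, 201))

df = pd.DataFrame({'one': series_1, 'two': series_2})

# Default: first 5 rows
print(df.head())

# Explicit: first 3 rows
first_three = df.head(3)
print(first_three)
```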