****************** **Session 02** ****************** Pandas - python data analysis tool ============================================== Brief introduction into Pandas ------------------------------- Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables. In comparison with Microsoft Excel, Pandas has no real limit and handles millions of data points. It is easier to create and use complex equations and calculations on your data. As opposed to Excel, it is completely free to download and use. Pandas and Matplotlib installation ---------------------------------- At first, we should install Pandas and Matplotlib libraries directly to our Python IDE. Open PyCharm, go to ``Settings --> Project(name_of_your_project) --> Project Interpreter``. There you can add the required packages by pressing **+** symbol and install *matplotlib* and *pandas* from the available packages respectively. Sometimes there is a problem with the installation of those packages when you have Python version 3.8. This version is relatively new, there might be incompatibilities with installing particular packages. We can avoid such an issue. During the installation of the required packages, install *matplotlib* first and then *pandas*, in a consecutive way. Do not forget to update your **pip** as well. You can either do it in the **Project Interpreter** menu or by typing ``python -m pip install --upgrade pip`` in the terminal. If these methods are unsuccessful for you, then you should install Python version 3.7 instead of Python 3.8. First example: IMU data ------------------------------------- All *.csv* files for this course can be found under `GIT FH Aachen `_. Log in with your FH credentials and search for **it1-unterlagen**, go to folder **Praktikum 2**. Open the **intro to pandas** folder and download the *.csv* files. Move them directly to that folder where you created your python file. First example is about data from the two different IMUs (inertial measurement unit), **ID IMU** and **MT IMU** respectively. In the scope of tutorial, we will learn how to read a simple *.csv file* and plot the data. We will start with the data. The files are called **ID_IMU.csv** and **MT_IMU.csv**. Our data is in **.csv** (comma-separated values) format. It is a quite popular format and actually quite simple. Here will be shwon how to work with the **ID_IMU.csv** file. You can work with the file **MT_IMU.csv** in parallel. The steps will be very similar. This is what the data looks like in any editor: .. image:: images//7.png :width: 600 Open up your PyCharm. First, we have to import ``pandas``, which is the main library: .. code-block:: python #! /usr/bin/python # import [module] as [another_name] import pandas as pd We import ``matplotlib`` for graphing: .. code-block:: python import matplotlib.pyplot as plt We should get our data loaded in. We do it using ``read_csv()`` function: .. code-block:: python dataframe = pd.read_csv("ID_IMU.csv") # in case of MT_IMU: #dataframe = pd.read_csv("MT_IMU.csv", sep=";") print(dataframe) We can see all data in **Run Window** now. If you do not want to load in all of the data, use ``head()`` or ``tail()`` functions to print certain number of rows from bottom or top: .. code-block:: python print (dataframe.head(5)) We can see now only first top three rows in **Run Window**. Pandas figures out that the first line of the file is the **header**. We can see we have a header at the top, that gives us the ten columns we have: *DIAG_STAT*, *X_GYRO (16 bit)*, etc. .. code-block:: pycon DIAG_STAT X_GYRO (16 bit) ... TIME_STAMP BM_CHECKSUM 0 0 0.100 ... 0 382 1 0 -0.275 ... 0 382 2 0 -0.025 ... 0 382 3 0 -0.050 ... 0 382 4 0 -0.175 ... 0 382 [5 rows x 10 columns] What if your *.csv* file does not have a header? We can still read the file, however Pandas will manually provide the header. There is another file called *ID_IMU_no_headers.csv*. It is the same file as before, without the headers. We can see data in **Run Window** now. **0**, **0.1**, **-0.1**, etc. are headers of columns now: .. code-block:: pycon 0 0.1 -0.1 0.15 0.02125 0.07875 -1.19875 38.2 0 382 0 0 -0.275 0.000 0.150 -0.0025 0.03000 -1.10125 38.2 0 382 1 0 -0.025 -0.125 0.075 0.0000 0.01375 -1.18750 38.2 0 382 2 0 -0.050 -0.300 0.025 0.0325 0.02875 -1.12500 38.2 0 382 3 0 -0.175 -0.275 -0.050 0.0200 0.00500 -1.01625 38.2 0 382 4 0 0.125 -0.425 -0.025 0.0425 0.01375 -1.08375 38.2 0 382 But this looks not good. Let us declare our own *headers*. We would like to name them *diagonal*, *x-axis*, *y-axis*, etc: .. code-block:: python headers = ["diagonal","x-gyros","y-gyros","z-gyros","x-accel","y-accel","z_accel","temp_out","time_stamp","checksum"] dataframe_no = pd.read_csv("ID_IMU_no_headers.csv", names = headers) print(dataframe_no.head(5)) We can see this result: .. code-block:: pycon diagonal x-gyros y-gyros z-gyros ... z_accel temp_out time_stamp checksum 0 0 0.100 -0.100 0.150 ... -1.19875 38.2 0 382 1 0 -0.275 0.000 0.150 ... -1.10125 38.2 0 382 2 0 -0.025 -0.125 0.075 ... -1.18750 38.2 0 382 3 0 -0.050 -0.300 0.025 ... -1.12500 38.2 0 382 4 0 -0.175 -0.275 -0.050 ... -1.01625 38.2 0 382 [5 rows x 10 columns] As we have data now, it is time to plot it. We start by removing the index from the file, as Pandas by default puts an index. If we see our data structure, we can observe the values going *0,1,2,3,4...*. We remove these indexes column by replacing them with *diagonal* column. So, how do we replace one with another? .. code-block:: python dataframe = pd.read_csv("ID_IMU.csv") dataframe.set_index("DIAG_STAT",inplace=True) print(dataframe.head(5)) ``inplace`` makes the changes in the dataframe if true. We see this result: .. code-block:: pycon DIAG_STAT X_GYRO (16 bit) Y_GYRO (16 bit) ... TIME_STAMP BM_CHECKSUM ... 0 0.100 -0.100 ... 0 382 0 -0.275 0.000 ... 0 382 0 -0.025 -0.125 ... 0 382 0 -0.050 -0.300 ... 0 382 0 -0.175 -0.275 ... 0 382 [5 rows x 9 columns] Before plotting the graph, be sure that you commented (#dataframe.set_index...) the dataframe, where you have made changes with index. Plotting with changes in index might lead to wrong graph: .. code-block:: python dataframe.plot() plt.show() We get this image: .. image:: images//01.png :width: 450 Second example: Wage hours -------------------------- Let us look one more example. This time we will deal with wages data and learn how to add our own headers, deal with tab separated files and extract columns from the data. We start as before: .. code-block:: python #! /usr/bin/python import pandas as pd import matplotlib.pyplot as plt dataframe = pd.read_csv("wages_hours.csv") print(dataframe.head(5)) It seems to be a bit messy: .. code-block:: pycon HRS\tRATE\tERSP\tERNO\tNEIN\tASSET\tAGE\tDEP\tRACE\tSCHOOL 0 2157\t2.905\t1121\t291\t380\t7250\t38.5\t2.340... 1 2174\t2.970\t1128\t301\t398\t7744\t39.3\t2.335... 2 2062\t2.350\t1214\t326\t185\t3068\t40.1\t2.851... 3 2111\t2.511\t1203\t49\t117\t1632\t22.4\t1.159\... 4 2134\t2.791\t1013\t594\t730\t12710\t57.7\t1.22... If we open *wages_hours.csv*, we can notice that values in data are separated not with **comma (,)**. Even though the name is *comma separated values - csv*, they can be separated by anything, including **tabs**. The ``\t`` in the run window means *tab*. Pandas was not able to parse the file, as it was expecting **commas**, not **tabs**. By default, it expects **commas**. We will read now the file again. This time we tell to Pandas the separator is **tabs**, not **commas** using ``sep = \t`` argument: .. code-block:: python dataframe = pd.read_csv("wages_hours.csv", sep = "\t") print (dataframe.head(5)) We can see organized data now: .. code-block:: pycon HRS RATE ERSP ERNO NEIN ASSET AGE DEP RACE SCHOOL 0 2157 2.905 1121 291 380 7250 38.5 2.340 32.1 10.5 1 2174 2.970 1128 301 398 7744 39.3 2.335 31.2 10.5 2 2062 2.350 1214 326 185 3068 40.1 2.851 * 8.9 3 2111 2.511 1203 49 117 1632 22.4 1.159 27.5 11.5 4 2134 2.791 1013 594 730 12710 57.7 1.229 32.5 8.8 What if there are data that more than sufficient to us. Let us extract only two columns from the dateframe, namely **AGE** and **RATE**: .. code-block:: python dataframe2 = dataframe[["AGE","RATE"]] print(dataframe2.head(5)) We can see data only for **AGE** and **RATE**: .. code-block:: pycon AGE RATE 0 38.5 2.905 1 39.3 2.970 2 40.1 2.350 3 22.4 2.511 4 57.7 2.791 In order to plot a proper graph, we should sort the data for **AGE** in either *ascending* or *descending* order. Also, we should again replace **AGE** with index, as we did previously: .. code-block:: python dataframe_sorted = dataframe2.sort_values(["AGE"], ascending=True) dataframe_sorted.set_index("AGE", inplace=True) print(dataframe_sorted.head(5)) We get: .. code-block:: pycon AGE RATE 22.4 2.511 37.2 3.015 37.4 1.901 37.5 1.899 37.5 3.009 We are ready to plot the data: .. code-block:: python dataframe_sorted.plot() plt.show() We get: .. image:: images//03.png :width: 450 Merging in Pandas ----------------- Now, we will learn merging in Pandas. All *.csv* data for this course can be found under the URL previously mentioned. Go there and select the subfolder **merging in pandas** now. Download the *.csv* files and move these files directly to that folder where you created your python file. We will start with two files: *visitors.csv* and *visitors-new.csv*. Both files are similar. The first contains all visitors to an unknown website, second contains the new visitors only, who are visiting the website for the first time. If you open these files, you can see that the first five lines are a comment. The actual data starts below and in the format: *date*, *visitors*. So, on 9 February, 2015, we had 59 visitors. Our first task is to remove first five lines of comments. We start by importing Python libraries: .. code-block:: python import pandas as pd import matplotlib.pyplot as plt The data does not contain headers, so we have to define them on our own: .. code-block:: python headers = ["date", "visitors"] data = pd.read_csv("visitors.csv", skiprows=4, names = headers) print(data.head(5)) We used the function ``skiprows``. This means we skipped the first four rows, as they contained comments only. You may ask why 4 rows instead of 5? Because, Pandas assumes that the 5th row is for the header. So, we get: .. code-block:: pycon date visitors 0 2015-02-09 59 1 2015-02-08 79 2 2015-02-07 73 3 2015-02-06 89 4 2015-02-05 80 Let us now open another file, *visitors-new.csv*: .. code-block:: pycon headers = ["date", "visitors_new"] data_new = pd.read_csv("visitors-new.csv", skiprows=4, names = headers) print(data_new.head(5)) We get: .. code-block:: pycon date visitors-new 0 2015-02-09 55 1 2015-02-08 64 2 2015-02-07 61 3 2015-02-06 79 4 2015-02-05 60 We opened both files and now we can see that they contain a common field: *date*. The dates are actually same. Now we will merge these two dataframes, using ``merge`` function. Pandas is smart enough to figure out which date is common between them, so it merges on **date**. You can also manually specify which field you want to merge on. .. code-block:: python data_combined = pd.merge(data, data_new) print(data_combined.head(5)) We now get a combined data: .. code-block:: pycon date visitors visitors-new 0 2015-02-09 59 55 1 2015-02-08 79 64 2 2015-02-07 73 61 3 2015-02-06 89 79 4 2015-02-05 80 60 Now, let us sort the **date** in ascending order and remove the index: .. code-block:: python data_combined.sort_values(["date"], inplace=True) data_combined.set_index("date", inplace=True) print(data_combined.head(5)) We get: .. code-block:: pycon date visitors visitors-new 2014-07-14 5 4 2014-07-15 58 55 2014-07-16 18 15 2014-07-17 14 10 2014-07-18 11 9 We plot it: .. code-block:: python data_combined.plot() plt.show() We get the following image: .. image:: images//04.png :width: 350 ----------------------------------------------------------------------- Data Analysis with Pandas ------------------------- We will work with obesity data in England. The *.xls* file for this chapter can be found under the URL previously mentioned. Go there and select the subfolder **data analysis** now. Download the *.xls* file and move it directly to that folder where you created your python file. We start with importing **pandas** and **matplotlib**. We should also import the **xlrd** dependency. It is a utility that helps to extract data from Microsoft Excel spreadsheet files. Go to ``Settings --> Project(name_of_your_project) --> Project Interpreter``. There you should add the required package by pressing **+** symbol and install *xlrd* from available packages respectively. .. code-block:: python #! /usr/bin/python import pandas as pd import matplotlib.pyplot as plt import xlrd We will open now an *.xls* file. The great thing about pandas is that we can open Excel files as well. Normally, most libraries can only work with *.csv* files. We print all sheet names first: .. code-block:: python data = pd.ExcelFile("obesity-2014.xls") print(data.sheet_names) We get: .. code-block:: pycon ['Chapter 7', '7.1', '7.2', '7.3', '7.4', '7.5', '7.6', '7.7', '7.8', '7.9', '7.10'] Let us have a look at sheet 1. If you open *obesity-2014.xls*, sheet 7.1, you can see four columns: *year*, *total*, *males*, *females*. The *year* column does not have a header. The header for year is empty. Let us define the header columns ourselves: .. code-block:: python # Define the columns to be read columns = ['year', 'total', 'males', 'females'] Now we read sheet 7.1. from excel file. If we look carefully, the top 4 and bottom 14 rows contain useless info, so we should skip them, using ``skiprows`` and ``skipfooter`` functions. The former function deals with rows in the beginning, the latter function deals with last rows. Finally, we tell pandas to name the column headers using our list of *names*: .. code-block:: python data = data.parse('7.1', skiprows=4, skipfooter=14, names=columns) print(data) By using **data.parse** we are analysing a string of symbols of data structures. We get: .. code-block:: pycon year total males females 0 NaN NaN NaN NaN 1 2002/03 1275.0 427.0 848.0 2 2003/04 1711.0 498.0 1213.0 3 2004/05 2035.0 589.0 1442.0 4 2005/06 2564.0 746.0 1786.0 5 2006/07 3862.0 1047.0 2807.0 6 2007/08 5018.0 1405.0 3613.0 7 2008/09 7988.0 2077.0 5910.0 8 2009/10 10571.0 2495.0 8074.0 9 2010/11 11574.0 2919.0 8654.0 10 2011/12 11736.0 2993.0 8740.0 11 2012/13 10957.0 2950.0 8007.0 If we pay attention to the result we got, we see *NaN* (Not A Number) values. If we compare it with the original spreadsheet, we can see that there is an empty space on line 4. Since it is empty, it is read as NaN by pandas. We should get rid of it. We use inbuilt ``dropna`` function. It allows the user to analyze and drop Rows/Columns with Null values: .. code-block:: python # Remove the N/A from the data data.dropna(inplace = True) We get: .. code-block:: pycon year total males females 1 2002/03 1275.0 27.0 848.0 2 2003/04 1711.0 498.0 1213.0 3 2004/05 2035.0 589.0 1442.0 4 2005/06 2564.0 746.0 1786.0 5 2006/07 3862.0 1047.0 2807.0 6 2007/08 5018.0 1405.0 3613.0 7 2008/09 7988.0 2077.0 5910.0 8 2009/10 10571.0 2495.0 8074.0 9 2010/11 11574.0 2919.0 8654.0 10 2011/12 11736.0 2993.0 8740.0 11 2012/13 10957.0 2950.0 8007.0 We do not see *NaN* values any more. Let us get rid of the **index** numbers now, as we were doing since the beginning: .. code-block:: python data.set_index('year', inplace=True) print(data) We get: .. code-block:: pycon year total males females 2002/03 1275.0 427.0 848.0 2003/04 1711.0 498.0 1213.0 2004/05 2035.0 589.0 1442.0 2005/06 2564.0 746.0 1786.0 2006/07 3862.0 1047.0 2807.0 2007/08 5018.0 1405.0 3613.0 2008/09 7988.0 2077.0 5910.0 2009/10 10571.0 2495.0 8074.0 2010/11 11574.0 2919.0 8654.0 2011/12 11736.0 2993.0 8740.0 2012/13 10957.0 2950.0 8007.0 Let us plot all: .. image:: images//05.png :width: 350 We can see that obesity for men has gone up more strongly than for women. ------------------------------------------------------------------ Let us have a look now at sheet 2. Last time we defined the headers ourselves. This time, we will let pandas pick them up: .. code-block:: python # Read 2nd section, by age data2 = data.parse('7.2', skiprows=4, skipfooter=14) print(data2) We get: .. code-block:: pycon Unnamed: 0 Total Under 16 16-24 ... 45-54 55-64 65-74 75 and over 0 NaN NaN NaN NaN ... NaN NaN NaN NaN 1 2002/03 1275.0 400.0 65.0 ... 216.0 94.0 52.0 23.0 2 2003/04 1711.0 579.0 67.0 ... 273.0 151.0 52.0 24.0 3 2004/05 2035.0 547.0 107.0 ... 364.0 174.0 36.0 32.0 4 2005/06 2564.0 583.0 96.0 ... 554.0 258.0 72.0 20.0 5 2006/07 3862.0 656.0 184.0 ... 872.0 459.0 118.0 43.0 6 2007/08 5018.0 747.0 228.0 ... 1198.0 598.0 157.0 53.0 If you remember, the **year** column did not have a header, that is why pandas named it *Unnamed*. Let us rename it: .. code-block:: python # Rename unames to year data2.rename(columns={'Unnamed: 0': 'Year'}, inplace=True) print(data2) We get: .. code-block:: pycon Year Total Under 16 16-24 ... 45-54 55-64 65-74 75 and over 0 NaN NaN NaN NaN ... NaN NaN NaN NaN 1 2002/03 1275.0 400.0 65.0 ... 216.0 94.0 52.0 23.0 2 2003/04 1711.0 579.0 67.0 ... 273.0 151.0 52.0 24.0 3 2004/05 2035.0 547.0 107.0 ... 364.0 174.0 36.0 32.0 4 2005/06 2564.0 583.0 96.0 ... 554.0 258.0 72.0 20.0 5 2006/07 3862.0 656.0 184.0 ... 872.0 459.0 118.0 43.0 Let us drop **NaN** again. Also we replace again **index** with **Year**: .. code-block:: python # Drop empties and reset index data2.dropna(inplace=True) data2.set_index('Year', inplace=True) print(data2) We get: .. code-block:: pycon Year Total Under 16 16-24 25-34 ... 45-54 55-64 65-74 75 and over ... 2002/03 1275.0 400.0 65.0 136.0 ... 216.0 94.0 52.0 23.0 2003/04 1711.0 579.0 67.0 174.0 ... 273.0 151.0 52.0 24.0 2004/05 2035.0 547.0 107.0 287.0 ... 364.0 174.0 36.0 32.0 2005/06 2564.0 583.0 96.0 341.0 ... 554.0 258.0 72.0 20.0 Let us plot: .. code-block:: python data2.plot() plt.show() We get: .. image:: images//06.png :width: 350 We can observe from the graph, that **Total** column is huge and overrides everything. We should get rid of it in order to make the graph clearer: .. code-block:: python # Drop the total column and plot data2_minus_total = data2.drop('Total', axis = 1) We used ``drop()`` function. It drops entries from a table. Now we plot the graph, dropping the **Total** column: .. code-block:: python data2_minus_total.plot() plt.show() We get: .. image:: images//07.png :width: 350 What if we want to compare age groups? How do children compare to adults? In pandas, we can review any column by using ``data[column_name]``. We can plot any age group this way: .. code-block:: python data2['Under 16'].plot() data2['25-34'].plot() plt.show() We get: .. image:: images//08.png :width: 350 If the legend is missing for you, you can make it visible again by adding: .. code-block:: python plt.legend() We see that while children’s obesity has gone down, that for adults has jumped. ----------------------------------------------------------------------- **Congratulations, you have learnt how to use the Pandas library in Python programming.**