As the number of novel coronavirus cases keeps rising around the world, I thought, why not use data analysis to look at the COVID-19 cases in different countries? This project uses two datasets to see whether there is any relationship between the spread of the virus in a country and how happy the people living in that country are.
First things first: the tools and libraries we are going to use in this project are NumPy, pandas, Seaborn and Matplotlib. We are also going to use two datasets. The first is the COVID-19 dataset published by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which contains the cumulative number of confirmed cases per day in each country (up to July 21, 2020). The second is the World Happiness Report 2019 dataset, published by the Sustainable Development Solutions Network, which contains various life factors scored by the people living in each country around the globe. Both datasets are available in my GitHub repository; the link is at the end of the post.
We'll start by importing all the libraries. If any of them isn't installed on your machine, install it with pip or conda first and then import it.
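The imports for the four libraries named above would look roughly like this. The aliases (np, pd, sns, plt) are the usual community conventions, not something this project requires.

```python
# If a library is missing, install it first, for example:
#   pip install numpy pandas seaborn matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Quick sanity check that the imports worked
print(np.__version__, pd.__version__)
```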
We then import our first dataset, the confirmed COVID-19 cases, using the read_csv function in pandas. We also check the shape of the dataset, which turns out to be (266, 186), i.e. 266 rows and 186 columns.
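Here is a minimal sketch of the loading step. To keep it runnable without the real file, it reads a tiny made-up CSV from memory; the column layout (Province/State, Country/Region, Lat, Long, then one column per date) is my assumption about the JHU CSSE format, and the numbers are toy values, not real case counts.

```python
import io
import pandas as pd

# Toy stand-in for the confirmed-cases file (values are made up).
# With the real file you would pass its path to pd.read_csv instead.
sample_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20\n"
    ",India,21.0,78.0,0,1,3\n"
    "Hubei,China,30.9,112.2,10,15,40\n"
    "Beijing,China,40.1,116.4,2,4,5\n"
)

corona_df = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple
print(corona_df.shape)
```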
Our next task is to clean the data. First, we delete columns such as Lat and Long, which are of no use to us. After that, we aggregate the rows by country, because our other dataset is organized by country. After aggregating, we check the shape of the new dataset, and it is now (188, 182).
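The cleaning step can be sketched as below, again on a toy frame with made-up numbers. The column names are assumed to match the JHU CSSE layout. Passing numeric_only=True is my choice so that the text column Province/State is left out of the sum, which also matches the drop from 186 to 182 columns described above (Lat, Long and Province/State are gone, and Country/Region becomes the index).

```python
import pandas as pd

# Toy frame in the assumed JHU CSSE layout (values are made up)
corona_df = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing"],
    "Country/Region": ["India", "China", "China"],
    "Lat": [21.0, 30.9, 40.1],
    "Long": [78.0, 112.2, 116.4],
    "1/22/20": [0, 10, 2],
    "1/23/20": [1, 15, 4],
})

# Drop the coordinate columns we do not need
corona_df = corona_df.drop(columns=["Lat", "Long"])

# One row per country: sum the provinces; numeric_only skips Province/State
by_country = corona_df.groupby("Country/Region").sum(numeric_only=True)

print(by_country.shape)
```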
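After this, we visualize the data for a few countries such as India, China and the US, and find the maximum infection rate for every country by taking the first derivative of the cumulative counts (the day-to-day difference, i.e. the number of new cases per day) and then its maximum with the max method. Then we create another dataframe that holds only the Maximum Infection Rate for each country.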
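One way to compute this, sketched on toy numbers: for cumulative daily counts, the first derivative is just the difference between consecutive days, which pandas provides as diff. This version works on all countries at once and may differ from the code in the repository; the column label "Maximum Infection Rate" is my assumption.

```python
import pandas as pd

# Toy cumulative counts, one row per country (values are made up)
by_country = pd.DataFrame(
    {"1/22/20": [10, 0], "1/23/20": [15, 1], "1/24/20": [40, 3]},
    index=pd.Index(["China", "India"], name="Country/Region"),
)

# Day-over-day difference along the date columns = new cases per day
daily_new = by_country.diff(axis=1)

# Largest single-day increase per country (the leading NaN is skipped)
max_rate = daily_new.max(axis=1)

max_infection_df = pd.DataFrame({"Maximum Infection Rate": max_rate})
print(max_infection_df)
```

To look at a single country, something like by_country.loc["India"].plot() draws its cumulative curve.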
At this point, we are done cleaning the first dataset of confirmed COVID-19 cases and can move on to the second one, which we also import with read_csv. We first drop some columns we don't need, such as Overall Rank, Score, Generosity and Perceptions of corruption. We also set the index to Country or region using the set_index method.
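A sketch of this step on a two-row toy file follows. The header names are how I remember the 2019 report CSV (note the lowercase "rank"), so treat them as an assumption and adjust them to your copy. The countries and scores are placeholders.

```python
import io
import pandas as pd

# Toy stand-in for the World Happiness Report 2019 file (values are made up)
sample_csv = io.StringIO(
    "Overall rank,Country or region,Score,GDP per capita,Social support,"
    "Healthy life expectancy,Freedom to make life choices,Generosity,"
    "Perceptions of corruption\n"
    "1,Country A,7.0,1.3,1.5,0.9,0.6,0.2,0.4\n"
    "2,Country B,5.0,0.7,1.1,0.6,0.4,0.1,0.1\n"
)
happiness_df = pd.read_csv(sample_csv)

# Columns we will not use in the comparison
useless_cols = ["Overall rank", "Score", "Generosity", "Perceptions of corruption"]
happiness_df = happiness_df.drop(columns=useless_cols)

# Index by country so it lines up with the COVID-19 frame
happiness_df = happiness_df.set_index("Country or region")
print(happiness_df.head())
```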
At this point both datasets are cleaned, indexed by country, and ready to be merged and visualized.
Now we join the two datasets using the join method and compute the correlation matrix. A correlation matrix is a table of correlation coefficients between variables: each cell shows the correlation between two variables.
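The join and the correlation matrix can be sketched like this on toy data. Using how="inner" is my assumption: it keeps only the countries present in both frames, which is what we want when the two sources do not list exactly the same countries.

```python
import pandas as pd

# Toy frames, both indexed by country (values are made up)
max_infection_df = pd.DataFrame(
    {"Maximum Infection Rate": [25.0, 2.0, 9.0, 4.0]},
    index=["Country A", "Country B", "Country C", "Country D"],
)
happiness_df = pd.DataFrame(
    {"GDP per capita": [1.3, 0.4, 0.9], "Social support": [1.5, 0.8, 1.2]},
    index=["Country A", "Country B", "Country C"],
)

# Inner join on the index: Country D has no happiness data and is dropped
data = max_infection_df.join(happiness_df, how="inner")

# Pairwise Pearson correlation between all numeric columns
corr = data.corr()
print(corr)
```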
Our analysis isn't finished until we visualize the results as figures and graphs, so that everyone can understand what we got out of it. Here I show GDP per capita vs. Maximum Infection Rate; the other plots can be found in my GitHub repository. For the visualizations, we used the Seaborn library.
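A plot of this kind can be produced roughly as follows, using toy numbers. Two things here are my assumptions rather than something stated above: taking the logarithm of the infection rate (it keeps a few very large values from dominating the plot) and using regplot, which draws the scatter points together with a fitted regression line.

```python
import matplotlib
matplotlib.use("Agg")  # lets the sketch run without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy joined data (values are made up)
data = pd.DataFrame({
    "GDP per capita": [0.4, 0.7, 0.9, 1.1, 1.3],
    "Maximum Infection Rate": [20.0, 150.0, 400.0, 900.0, 5000.0],
})

x = data["GDP per capita"]
y = np.log(data["Maximum Infection Rate"])  # log scale is an assumption

# Scatter plot plus a fitted regression line
ax = sns.regplot(x=x, y=y)
ax.set_ylabel("log(Maximum Infection Rate)")
plt.tight_layout()
```

The same call works for the other happiness columns by swapping the x variable.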
The graph of GDP per capita vs. Maximum Infection Rate has a clearly positive slope. So we came to the conclusion that people living in more developed countries are more prone to getting infected by the novel coronavirus than those living in less developed countries. However, this may simply be due to a lack of testing in the less developed countries. To check whether that is the case, we can perform a similar analysis on the dataset of cumulative deaths.
GitHub repository:
If you need any more help, do visit my repo at: https://github.com/ashutoshkrris/COVID-19-Analysis
If you liked my project, please support it by clapping on Medium or starring my GitHub repository. Thanks in advance.
You can contact me through my personal website: http://ashutoshkrris.herokuapp.com/