This dataset actually reveals a lot of interesting stuff.
Starting a data project can be (1) proactive where we formulate questions around a problem then find datasets to analyze to get the answers, or (2) retroactive where we analyze a dataset to get something out of it.
In this retroactive project, I followed the Data Analysis roadmap by Google which enumerates the 5 steps to complete a data project:
I first listed down the column names and read the metadata so I could formulate questions that are simple yet meaningful.
This dataset includes information on each crash event (date, time, location), registered under a unique ‘COLLISION ID’. The total headcount of those who sustained an injury (‘NUMBER OF PERSONS INJURED’) or died (‘NUMBER OF PERSONS KILLED’) is further broken down into casualty types: pedestrian, cyclist, or motorist.
When reported, each crash event is tagged with up to 5 contributing factors, with ‘CONTRIBUTING FACTOR VEHICLE 1’ as the main cause, followed by ‘CONTRIBUTING FACTOR VEHICLE 2’, and so on. Lastly, the vehicle/s involved in the incident are described.
In this newbie-friendly project, I wanted to limit my questions to the following:
I wanted to know the ratio of injured and killed, and the breakdown of casualty types (pedestrian, cyclist, motorist) based on outcomes (killed/injured), etc.
Using .shape, it was revealed that the dataset has 1,612,178 rows and 29 columns. Before the analysis, I needed to prepare the data to improve its quality.
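A quick sketch of that first size check, using a tiny toy frame standing in for the real CSV (the column names here follow the NYC Open Data export and are my assumption):

```python
import pandas as pd

# Toy stand-in for the collisions CSV (the real file has 1,612,178 rows x 29 columns).
df = pd.DataFrame({
    'COLLISION_ID': [1, 2, 2, 3],
    'NUMBER OF PERSONS INJURED': [0.0, 1.0, 1.0, None],
    'NUMBER OF PERSONS KILLED': [0.0, 0.0, 0.0, 1.0],
})

# .shape gives (rows, columns) directly; .describe() only summarizes numeric columns.
rows, cols = df.shape
print(rows, cols)  # 4 3 for this toy frame
```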
My steps in cleaning the data are the following:
Note that the duplicate count also revealed that the identical IDs are limited to 2 instances (i.e. no collision ID existed in 3 or more copies).
There are a total of 394,221 duplicates, which is 24.5% of the original dataset.
Before deciding whether to drop these rows, I did a spot check on 3 random IDs just in case there was just a mistype of the collision ID that led to its duplication.
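The duplicate check can be sketched like this, on toy data with an assumed ID column name:

```python
import pandas as pd

# Toy frame with one duplicated collision ID (column name is an assumption).
df = pd.DataFrame({
    'COLLISION_ID': [101, 102, 102, 103],
    'BOROUGH': ['QUEENS', 'BRONX', 'BRONX', None],
})

# How many rows repeat an earlier collision ID?
n_dupes = df.duplicated(subset='COLLISION_ID').sum()

# Confirm no ID appears 3 or more times, then drop the extra copies.
max_copies = df['COLLISION_ID'].value_counts().max()
deduped = df.drop_duplicates(subset='COLLISION_ID', keep='first')
```

`value_counts().max()` is what reveals the "limited to 2 instances" observation: if it returns 2, no ID exists in triplicate.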
The next step is to set the right data types, which makes analysis easier later on. Ideally, the date and time should be a datetime type, the head counts should be int64 (whole numbers, not decimals), and the rest objects.
My next thoughts involved asking questions like, “Should I include all the data about the collision?”, “Are there data that I can combine to make my dataframe more compact and easier to handle?”, and “What other parameters have inconsistencies and how do I approach them?”
I realized that the only way to answer these is to explore my dataset even more. My next plan of action is as follows:
At this point, the question that helped me was, “How specific do I want to be in terms of location?” Since this is a New York (NY) dataset, I thought that clustering the data per borough is more logical than per street.
Besides, subdividing data into hundreds of categories will lead to very low sample count per category which would make the trend hard to see.
Therefore, I decided to keep the ‘BOROUGH’ column and drop the rest, except for ‘LOCATION’ (a pair of coordinates), in case it comes in handy during map visualization.
2. There are 5 contributing factor columns, but after totaling the empty rows for each, it becomes obvious that factors 3, 4, and 5 are mostly missing. Remember that at this point there are 823,736 rows, and by the 3rd factor, 93.6% is already missing.
Therefore, I chose to keep only ‘CONTRIBUTING FACTOR VEHICLE 1’ and ‘CONTRIBUTING FACTOR VEHICLE 2’.
I made a new dataframe for the columns that I decided to retain.
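Selecting the retained columns into a new dataframe might look like this (a sketch on toy data; the exact column set is my assumption):

```python
import pandas as pd

# Toy frame with a few of the 29 original columns (names assumed).
df = pd.DataFrame({
    'COLLISION_ID': [1],
    'BOROUGH': ['BROOKLYN'],
    'LOCATION': ['(40.7, -73.9)'],
    'ON STREET NAME': ['MAIN ST'],             # location detail -> dropped
    'CONTRIBUTING FACTOR VEHICLE 1': ['Unspecified'],
    'CONTRIBUTING FACTOR VEHICLE 2': [None],
    'CONTRIBUTING FACTOR VEHICLE 3': [None],   # mostly empty -> dropped
})

# Keep only the columns decided on above; .copy() gives an independent frame.
keep = ['COLLISION_ID', 'BOROUGH', 'LOCATION',
        'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2']
subset = df[keep].copy()
```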
3. Combine accident date and time into a single datetime column with the correct data type. All head counts must be set to integer type (counting numbers) instead of float (decimals).
You can see that changing our 2 head count columns from float to int64 throws an error, so we need to fill in the missing values first. In this case, I replaced NaN with 0 and then changed the type to int64.
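A minimal sketch of both fixes, assuming date/time column names and formats like the NYC export:

```python
import pandas as pd

df = pd.DataFrame({
    'CRASH DATE': ['09/11/2021', '03/26/2022'],   # assumed column names/format
    'CRASH TIME': ['09:35', '14:05'],
    'NUMBER OF PERSONS INJURED': [1.0, None],      # float64 with a missing value
})

# Combine date and time strings into one proper datetime column.
df['DATETIME'] = pd.to_datetime(df['CRASH DATE'] + ' ' + df['CRASH TIME'],
                                format='%m/%d/%Y %H:%M')

# .astype('int64') raises on NaN, so fill the missing counts with 0 first.
df['NUMBER OF PERSONS INJURED'] = (df['NUMBER OF PERSONS INJURED']
                                   .fillna(0)
                                   .astype('int64'))
```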
I now have properly formatted data!
Now that my data is pre-cleaned, I will process some of it to create derivative columns that will help reveal patterns later on.
The newly created columns were added to the end of the dataframe so I moved them into better positions, while dropping the old date and time columns (since I already combined them under datetime).
Placement of columns is very important to facilitate easy viewing (e.g. Collision ID 4247732 occurred on a Tuesday night).
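Deriving the day-of-week and time-of-day columns could look like this sketch (the bucket edges and labels are my assumption, not necessarily the ones used in the post):

```python
import pandas as pd

df = pd.DataFrame({'DATETIME': pd.to_datetime(['2021-09-14 21:30',
                                               '2021-09-15 08:10'])})

# Day of the week straight from the datetime accessor.
df['DAY'] = df['DATETIME'].dt.day_name()

# Coarse time-of-day buckets via the hour (edges chosen for illustration).
df['TIME OF DAY'] = pd.cut(df['DATETIME'].dt.hour,
                           bins=[-1, 5, 11, 17, 23],
                           labels=['Night', 'Morning', 'Afternoon', 'Evening'])
```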
At this point, I realized that my pre-cleaning was not enough, since I saw more missing values in other columns (e.g. borough). I investigated whether I should drop them, since I didn’t want to be left with almost nothing.
I breathed a sigh of relief knowing my datetime column was complete. The missing values for the borough (256,517 rows) comprised 31% of the dataset, so I decided to replace the NaN with ‘unspecified’ instead of dropping those rows.
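The NaN replacement is a one-liner; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'BOROUGH': ['BROOKLYN', None, 'QUEENS', None]})

# Count the missing boroughs, then keep the rows but flag them as
# 'unspecified' instead of dropping 31% of the data.
missing = df['BOROUGH'].isna().sum()
df['BOROUGH'] = df['BOROUGH'].fillna('unspecified')
```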
Next is the head count-related information.
Ideally, the total number of killed and injured is the sum of the casualties for each category; but in some cases, the casualties do not fall under pedestrian, cyclist, and motorist classes.
According to the metadata: The Motor Vehicle Collisions person table contains details for people involved in the crash. Each row represents a person (driver, occupant, pedestrian, bicyclist,..) involved in a crash.
To show you my point:
There are 822,918 and 680,969 rows where the total killed and total injured count, respectively, are NOT the sum of the casualty types. This is more than 50% of the working data.
Therefore, I took a mental note that I should not take the number of persons injured/killed as the total number of casualties per collision ID.
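The consistency check behind those counts can be sketched like this (category column names are my assumption from the NYC export):

```python
import pandas as pd

# Toy frame: row 0 has a casualty that falls outside the three categories.
df = pd.DataFrame({
    'NUMBER OF PERSONS INJURED':     [2, 1],
    'NUMBER OF PEDESTRIANS INJURED': [1, 1],
    'NUMBER OF CYCLIST INJURED':     [0, 0],
    'NUMBER OF MOTORIST INJURED':    [0, 0],
})

# Sum of the three casualty types per row.
parts = (df['NUMBER OF PEDESTRIANS INJURED']
         + df['NUMBER OF CYCLIST INJURED']
         + df['NUMBER OF MOTORIST INJURED'])

# Rows where the reported total disagrees with the category sum.
mismatch = (df['NUMBER OF PERSONS INJURED'] != parts).sum()
```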
My last area of concern was the contributing factors of each crash event. My plan was to list and combine all the unique values of the factor 1 and 2 columns, then check for misspellings that mean the same thing, to be merged later on.
There are 62 unique values for the factors, and printing them in full revealed that some terms were indeed misspelled (e.g. ‘Illnes’ and ‘Ilness’), which I remediated as follows:
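Merging misspelled variants boils down to a replacement map; a sketch (the ‘Illnes’/‘Ilness’ variants are from the post, the canonical spelling is my choice):

```python
import pandas as pd

s = pd.Series(['Illnes', 'Ilness', 'Illness', 'Fatigued/Drowsy'])

# Map each known misspelling onto one canonical label.
fixes = {'Illnes': 'Illness', 'Ilness': 'Illness'}
s = s.replace(fixes)
```

After the replacement, `s.nunique()` drops accordingly, which is an easy way to verify the merge worked.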
Using this reference, I clustered my data and listed the factors under each category for checking (see a portion of my output):
I prepared a subset dataframe for the new classification for faster processing (less data = faster run time). The collision ID was retained to easily merge it back with the previously cleaned dataset.
Lastly, the columns were renamed to better describe their values. The dataset was saved as a CSV file.
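The merge-rename-export sequence might look like this (all names here are illustrative, not the post's actual ones):

```python
import pandas as pd

main = pd.DataFrame({'COLLISION_ID': [1, 2],
                     'BOROUGH': ['QUEENS', 'BRONX']})
factors = pd.DataFrame({'COLLISION_ID': [1, 2],
                        'FACTOR_CATEGORY': ['Human', 'Vehicle']})

# Merge the reclassified factors back on the shared ID, rename for
# readability, then export for Tableau.
final = main.merge(factors, on='COLLISION_ID', how='left')
final = final.rename(columns={'FACTOR_CATEGORY': 'MAIN FACTOR CATEGORY'})
final.to_csv('collisions_clean.csv', index=False)
```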
My final dataset ready for analysis:
Getting a general sense of our dataset is important and it involves knowing the minimum and maximum values, or the data distribution across different parameters to uncover some trends.
Instead of continuing the analysis in Python, I uploaded my final dataset straight into Tableau, where data analysis and visualization can be done simultaneously. Note that this works for me since I’m only doing very simple clustering of the data to reveal trends.
For us not to get side-tracked, let’s recall my questions:
I wanted to know the ratio of injured and killed, and the breakdown of casualty types (pedestrian, cyclist, motorist) based on outcomes (killed/injured), etc.
A. Ratio of injured to killed
For every killed casualty, there are 236 injured casualties.
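The ratio itself is just a division of column sums; a sketch, with toy numbers chosen to reproduce the 236:1 figure:

```python
import pandas as pd

# Toy counts (not the real data) that sum to a 236:1 ratio.
df = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [472, 0, 236],
    'NUMBER OF PERSONS KILLED':  [1, 2, 0],
})

injured = df['NUMBER OF PERSONS INJURED'].sum()  # 708
killed = df['NUMBER OF PERSONS KILLED'].sum()    # 3
ratio = injured / killed                         # 236.0
```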
B. Injured and killed count (2012–2018, with 5-year forecast)
The max injured head count is 35,575 while for killed, 140. Interestingly, the peaks were both in 2017.
Notice that I did not stack the two trend lines into one graph since there is a stark difference in their magnitudes: the injured count runs into the tens of thousands while the killed count stays in the hundreds. Merging them into a single graph would make the ‘killed’ trend difficult to see.
Based on the forecast, the next high is in 2019. Since fit-checking of forecasted data is not within the scope of this project, I stopped here.
This part allowed me to experiment with labelling so that I can highlight the important points on my graph.
C. Density of casualties per borough
Density maps are excellent when dealing with locations and case counts where the colors correspond to intensity. I learned here that floating legends can only be used in dashboards.
Likewise, I inverted the colors so that green marks the lowest case count and red the highest, since red is usually associated with ‘danger’. The colors are also stepped to create more distinction between case count volumes.
Based on this visual, Brooklyn consistently has the highest killed and injured counts, while Staten Island has the lowest.
D. Injured and killed count at different times per day
Using stacked bars is good when dealing with multiple categories for a given parameter — in this case, the different times of a given day. Layered visuals work as long as they are properly sorted, too.
When interpreting this graph, note that the tallest segment of each color represents the highest head count.
Notice the importance of small details such as ordering within the graph: the injured count always comes first and killed second. Moreover, I arranged the stacks as Morning -> Afternoon -> Evening -> Night.
I highlighted the time of the day with the highest reported case for each outcome. The floating legends are re-positioned away from the highlighted stacks.
Based on this graph, the following observations were deduced:
E. Injured and Killed count based on casualty type
The pie chart is a good visual if dealing with parts of a whole.
We can see here that for the majority of collisions, most injuries are sustained by motorists while most deaths are observed among pedestrians.
F. Top contributing factors across all collision cases
The majority of cases did not specify the factors, so they were excluded from the graph to improve the length ratio of the bars.
A vertical bar is appropriate when displaying lengthy descriptions. Sorting revealed the top causes.
For factor 1:
The main contributing factor for most cases is (1) driver inattention or distraction, followed by (2) failure to yield right-of-way, and (3) following too closely.
For factor 2: This is like saying, “Factor 1 is the main reason, but besides that, there’s Factor 2.”
Driver inattention/distraction is still the top secondary cause, which means that when a person identifies their main reason as anything else (e.g. glare, unsafe speed, etc.), their next reason is often driver inattention/distraction.
If you think this is invalid, look again: the values for the primary factor scale up to 160K, while the secondary factor only reaches 40K. Meaning, the reporters mostly identified only the main factor.
Lastly, it’s quite obvious that the human category of causes comes out on top, since it has the most components and the top identified causes fall under it.
The ‘share’ part is this article! To wrap it up, here are the important takeaways from this dataset based on how I analyzed it: