Simple analysis of the New York motor vehicle collision dataset

This dataset actually reveals a lot of interesting stuff.

Starting a data project can be (1) proactive, where we formulate questions around a problem and then find datasets to answer them, or (2) retroactive, where we analyze an existing dataset to see what we can get out of it.

In this retroactive project, I followed the Data Analysis roadmap by Google which enumerates the 5 steps to complete a data project:

I first listed down the column names and read the metadata so I could formulate questions that are simple yet meaningful.

This dataset includes information on each crash event (date, time, location) registered with a unique ‘COLLISION ID’. The total head counts for those who sustained an injury (‘NUMBER OF PERSONS INJURED’) or died (‘NUMBER OF PERSONS KILLED’) are further categorized into casualty types: pedestrian, cyclist, or motorist.

When reported, each crash event is tagged with up to 5 contributing factors, with CONTRIBUTING FACTOR VEHICLE 1 as the main cause, followed by CONTRIBUTING FACTOR VEHICLE 2, and so on. Lastly, the vehicle(s) involved in the incident are described.

In this newbie-friendly project, I wanted to limit my questions to the following:

I wanted to know the ratio of injured to killed, the breakdown of casualty types (pedestrian, cyclist, motorist) based on outcomes (killed/injured), and so on.

Using .shape and .describe(), I found that the dataset has 1,612,178 rows and 29 columns. Before the analysis, I needed to prepare the data to improve its quality.
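For readers following along in pandas, a minimal sketch of this first look might be the following (the CSV filename is only an example; use whatever name your NYC Open Data export has):

```python
import pandas as pd

# Load the crashes export; the filename here is an assumption.
df = pd.read_csv("Motor_Vehicle_Collisions_Crashes.csv", low_memory=False)

print(df.shape)                    # (rows, columns)
print(df.columns.tolist())         # column names, useful for framing questions
print(df.describe(include="all"))  # quick summary statistics
```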

My steps in cleaning the data are the following:

There are 394,221 duplicate collision IDs.

Note that the duplicate count also revealed that the identical IDs are limited to 2 instances (i.e. no collision ID existed in 3 or more copies).

There are a total of 394,221 duplicates, which is 24.5% of the original dataset.

Before deciding whether to drop these rows, I did a spot check on 3 random IDs in case a mistyped collision ID was the reason for the duplication.

Spot checking to validate the duplication
Dropping the duplicate cases
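In pandas, the duplicate count, spot check, and drop might look roughly like this. The ‘COLLISION_ID’ name comes from the dataset’s schema (some exports spell it ‘COLLISION ID’), and keep=False assumes both copies of each duplicated ID are removed, which matches the ~823K rows remaining later in the post:

```python
# Count how many collision IDs occur more than once, and how often.
id_counts = df["COLLISION_ID"].value_counts()
print((id_counts > 1).sum())   # number of duplicated IDs
print(id_counts.max())         # 2 -> no ID appears 3 or more times

# Spot check a few duplicated IDs to confirm the rows really are copies
# rather than different crashes with a mistyped ID.
dup_ids = id_counts[id_counts > 1].index[:3]
print(df[df["COLLISION_ID"].isin(dup_ids)].T)

# Drop every row whose collision ID is duplicated.
df = df.drop_duplicates(subset="COLLISION_ID", keep=False)
```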

The next step is to set the right data types, which makes analysis easier later on. Ideally, the date and time should be a datetime type, the head counts should be int64 (not decimals), and the rest should be objects.

The different datatypes in my dataframe

My next thoughts involved asking questions like, “Should I include all the data about the collision?”, “Are there data that I can combine to make my dataframe more compact and easier to handle?”, and “What other parameters have inconsistencies and how do I approach them?”

I realized that the only way to answer these is to explore my dataset even more. My next plan of action is as follows:

At this point, the question that helped me was, “How specific do I want to be in terms of location?” Since this is a New York (NY) dataset, I thought that clustering the data per borough is more logical than per street.

Besides, subdividing the data into hundreds of categories would lead to a very low sample count per category, which would make any trend hard to see.

Columns related to the location of the collision

Therefore, I decided to keep the ‘BOROUGH’ column and drop the rest, except for ‘LOCATION’, which is a pair of coordinates, in case it comes in handy during map visualization.
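That step is a simple drop in pandas; the dropped column names follow the dataset’s documented headers:

```python
# Keep BOROUGH (for per-borough clustering) and LOCATION (for possible
# map visuals); drop the street-level location columns.
df = df.drop(columns=[
    "ZIP CODE", "LATITUDE", "LONGITUDE",
    "ON STREET NAME", "CROSS STREET NAME", "OFF STREET NAME",
])
```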

2. There are 5 contributing factor columns available, but after totaling the empty rows for each, it becomes obvious that factors 3, 4, and 5 are mostly missing. Remember that at this point there are 823,736 rows, and by the 3rd factor, 93.6% of the values are already missing.

Therefore, I chose to keep only CONTRIBUTING FACTOR VEHICLE 1 and 2.

Count of missing values for all the contributing factors

I made a new dataframe for the columns that I decided to retain.
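A sketch of the missing-value tally and the trimmed dataframe; the retained column list is illustrative and follows the dataset’s documented headers:

```python
# Count missing values in each of the five contributing-factor columns.
factor_cols = [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in range(1, 6)]
print(df[factor_cols].isna().sum())

# Retain only the columns decided on so far.
keep_cols = [
    "COLLISION_ID", "CRASH DATE", "CRASH TIME", "BOROUGH", "LOCATION",
    "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED",
    "NUMBER OF PEDESTRIANS INJURED", "NUMBER OF PEDESTRIANS KILLED",
    "NUMBER OF CYCLIST INJURED", "NUMBER OF CYCLIST KILLED",
    "NUMBER OF MOTORIST INJURED", "NUMBER OF MOTORIST KILLED",
    "CONTRIBUTING FACTOR VEHICLE 1", "CONTRIBUTING FACTOR VEHICLE 2",
]
df_clean = df[keep_cols].copy()
```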

3. Combine accident date and time into a datetime column with the correct data type. All head counts must be set to integer type (counting numbers) instead of float (with decimal).

Combining the date and time then changing it into a date-time format
There’s an error when converting the # of persons to integer…

You can see that changing our 2 head count columns from float to int64 throws an error. Hence, we need to fill in the missing values first. In this case, I replaced NaN with 0 and then changed the type to int64.

Filling the missing values before converting to int64 type
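In code, those two fixes might look roughly like this (DATETIME is just the name I use here for the combined column):

```python
# Combine the crash date and time strings into one proper datetime column.
df_clean["DATETIME"] = pd.to_datetime(
    df_clean["CRASH DATE"] + " " + df_clean["CRASH TIME"], errors="coerce"
)

# Casting a float column that still contains NaN to int64 raises an error,
# so fill the missing head counts with 0 first, then convert.
count_cols = ["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]
df_clean[count_cols] = df_clean[count_cols].fillna(0).astype("int64")
```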

I now have properly formatted data!

Updated and proper data types

Now that my data is pre-cleaned, I will process some of the columns to create derivatives that will help reveal patterns later on.

Dates are categorized into days of the week
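A sketch of how these derivatives could be built; the new column names (WEEKDAY, TIME CATEGORY) and the exact hour cut-offs are assumptions based on the categories described later in the post:

```python
# Day of the week from the combined datetime column.
df_clean["WEEKDAY"] = df_clean["DATETIME"].dt.day_name()

# Bucket the hour into Morning / Afternoon / Evening / Night.
def time_category(hour):
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"

df_clean["TIME CATEGORY"] = df_clean["DATETIME"].dt.hour.apply(time_category)
```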

The newly created columns were added to the end of the dataframe, so I moved them into better positions while dropping the old date and time columns (since I had already combined them into the datetime column).

Moving columns to where they should be

Placement of columns is very important to facilitate easy viewing (e.g., Collision ID 4247732 occurred on a Tuesday night).

Arranged columns where Collision ID comes first, and weekday + time category follows DateTime
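The reordering itself is just plain column indexing, roughly:

```python
# Drop the original date and time columns (now merged into DATETIME),
# then reorder so COLLISION_ID leads and the derived columns follow DATETIME.
df_clean = df_clean.drop(columns=["CRASH DATE", "CRASH TIME"])

first_cols = ["COLLISION_ID", "DATETIME", "WEEKDAY", "TIME CATEGORY"]
df_clean = df_clean[first_cols + [c for c in df_clean.columns if c not in first_cols]]
```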

At this point, I realized that my pre-cleaning was not enough, since I saw more missing values in other columns (e.g., borough). I investigated whether I should drop them, since I didn’t want to be left with almost nothing.

Count of missing info on borough and datetime columns

I breathed a sigh of relief knowing my datetime column was complete. The missing borough values (256,517 rows) comprised 31% of the dataset, so I decided to replace the NaN with ‘unspecified’ instead of dropping those rows.

Filling NaN with ‘unspecified’ to salvage sample size
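The check and the fill are both one-liners in pandas:

```python
# How many rows are missing a borough or a datetime?
print(df_clean[["BOROUGH", "DATETIME"]].isna().sum())

# Keep the rows: tag missing boroughs as 'unspecified' instead of dropping them.
df_clean["BOROUGH"] = df_clean["BOROUGH"].fillna("unspecified")
```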

Next is the head count-related information.

Ideally, the total number of killed and injured is the sum of the casualties for each category, but in some cases the casualties do not fall under the pedestrian, cyclist, or motorist classes.

According to the metadata: “The Motor Vehicle Collisions person table contains details for people involved in the crash. Each row represents a person (driver, occupant, pedestrian, bicyclist, ..) involved in a crash.”

To show you my point:

There are 822,918 and 680,969 rows where the total killed and total injured counts, respectively, are NOT the sum of the casualty types. This is more than 50% of the working data.

Therefore, I took a mental note that I should not take the number of persons injured/killed as the total number of casualties per collision ID.
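One way to run that comparison in pandas (column names follow the dataset’s documented headers):

```python
# Compare the reported totals against the sum of the per-type columns.
injured_types = ["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF CYCLIST INJURED",
                 "NUMBER OF MOTORIST INJURED"]
killed_types = ["NUMBER OF PEDESTRIANS KILLED", "NUMBER OF CYCLIST KILLED",
                "NUMBER OF MOTORIST KILLED"]

injured_mismatch = (df_clean["NUMBER OF PERSONS INJURED"]
                    != df_clean[injured_types].fillna(0).sum(axis=1))
killed_mismatch = (df_clean["NUMBER OF PERSONS KILLED"]
                   != df_clean[killed_types].fillna(0).sum(axis=1))
print(killed_mismatch.sum(), injured_mismatch.sum())
```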

My last area of concern is the contributing factors of each crash event. My plan was to list and combine all the unique values for the factor 1 and 2 columns, then check whether there are misspellings of the same term…which would be merged later on.

Listing down all the factors 1 and 2, then extracting the unique values.

There are 62 unique values for the factors, and printing them in full revealed that there were indeed misspellings of some terms (e.g., ‘Illnes’ and ‘Ilness’), which were remediated as follows:

Misspellings that can be merged into one term
Correcting the misspellings for the factor columns
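The gist of that cleanup in pandas; the corrections dictionary below only shows the ‘Illness’ example from the post, while the real mapping covered every misspelled pair found in the printout:

```python
# Gather the unique values across both factor columns.
factor_cols = ["CONTRIBUTING FACTOR VEHICLE 1", "CONTRIBUTING FACTOR VEHICLE 2"]
unique_factors = pd.unique(df_clean[factor_cols].values.ravel())
print(len(unique_factors))  # the post found 62 unique values

# Merge spelling variants into one canonical term.
corrections = {"Illnes": "Illness", "Ilness": "Illness"}
df_clean[factor_cols] = df_clean[factor_cols].replace(corrections)
```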

Using this reference, I clustered my data and then listed the factors under each category for checking (see a portion of my output):

A snippet of the dictionary to classify the factors.
This code lists down the factors per category
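In text form, a toy version of that dictionary and listing might look like this; only the ‘Human’ category is named in the post, so the other category labels and the factor-to-category assignments here are placeholders:

```python
# Sample of the classification dictionary (the real one covers all 62 factors).
factor_categories = {
    "Driver Inattention/Distraction": "Human",
    "Failure to Yield Right-of-Way": "Human",
    "Following Too Closely": "Human",
    "Pavement Slippery": "Environment",
    "Brakes Defective": "Vehicle",
}

# List the factors under each category for a quick sanity check.
for category in sorted(set(factor_categories.values())):
    members = [f for f, c in factor_categories.items() if c == category]
    print(f"{category}: {members}")
```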

I prepared a subset dataframe for the new classification for faster processing (less data = faster run time). The collision ID was retained to easily merge it with the previously cleaned dataset.

Lastly, the columns were renamed to better describe their values. The dataset was then saved as a CSV file.
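Putting these last steps together, a sketch might look like the following; the renamed column names and the output filename are illustrative, not the ones actually used:

```python
factor_cols = ["CONTRIBUTING FACTOR VEHICLE 1", "CONTRIBUTING FACTOR VEHICLE 2"]

# Classify the factors on a small subset (using the factor_categories
# dictionary from the previous step), then merge back by collision ID.
subset = df_clean[["COLLISION_ID"] + factor_cols].copy()
subset["FACTOR 1 CATEGORY"] = subset[factor_cols[0]].map(factor_categories)
subset["FACTOR 2 CATEGORY"] = subset[factor_cols[1]].map(factor_categories)
df_final = df_clean.merge(subset.drop(columns=factor_cols),
                          on="COLLISION_ID", how="left")

# Rename columns to be more descriptive, then save the result for Tableau.
df_final = df_final.rename(columns={
    factor_cols[0]: "PRIMARY FACTOR",
    factor_cols[1]: "SECONDARY FACTOR",
})
df_final.to_csv("collisions_cleaned.csv", index=False)
```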

My final dataset ready for analysis:

Getting a general sense of our dataset is important, and it involves knowing the minimum and maximum values, or the data distribution across different parameters, to uncover trends.

Instead of analyzing in Python, I uploaded my final dataset straight into Tableau, where data analysis and visualization can be done simultaneously. Note that this works for me since I’m only handling very simple clustering of data to reveal trends.

So we don’t get side-tracked, let’s recall my questions:

I wanted to know the ratio of injured to killed, the breakdown of casualty types (pedestrian, cyclist, motorist) based on outcomes (killed/injured), and so on.

A. Ratio of injured to killed

For every killed casualty, there are 236 injured casualties.

B. Injured and killed count (2012–2018, with 5-year forecast)

The maximum injured head count is 35,575, while for killed it is 140. Interestingly, both peaks were in 2017.

Notice that I did not stack the two trend lines into one graph since there is a stark difference in their magnitudes: the former is in the thousands while the latter is in the hundreds. Merging them into a single graph would make the ‘killed’ trend difficult to see.

Based on the forecast, the next high is in 2019. Since fit-checking of forecasted data is not within the scope of this project, I stopped here.

This part allowed me to experiment with labelling so that I could highlight the important points on my graph.

C. Density of casualties per borough

Density maps are excellent when dealing with locations and case counts where the colors correspond to intensity. I learned here that floating legends can only be used in dashboards.

Likewise, I inverted the colors such that green corresponds to the lowest count and red to the highest, since red is usually associated with ‘danger’. The colors are also set to stepped to create more distinction between case count volumes.

Based on this visual, Brooklyn consistently has the highest killed and injured counts, while Staten Island has the lowest.

D. Injured and killed count at different times per day

Using stacked bars is good when dealing with multiple categories for a given parameter — in this case, the different times of a given day. Layered visuals work as long as they are properly sorted, too.

When interpreting this graph, note that the color with the tallest segment has the highest head count.

Notice the importance of small details such as the order of the graphs: the injured count always comes first and killed comes second. Moreover, I rearranged the stacks in the order Morning -> Afternoon -> Evening -> Night.

Morning (5 am to 12 noon), Afternoon (12:01 pm to 5 pm), Evening (5:01 pm to 9 pm), Night (9:01 pm to 4:59 am)

I highlighted the time of the day with the highest reported case for each outcome. The floating legends are re-positioned away from the highlighted stacks.

Based on this graph, the following observations were deduced:

E. Injured and Killed count based on casualty type

The pie chart is a good visual if dealing with parts of a whole.

We can see here that for the majority of collisions, most injuries are sustained by motorists, while most deaths are observed among pedestrians.

F. Top contributing factors for all collision cases

The majority of cases did not specify the factors, so they were excluded from the graph to improve the relative lengths of the bars.

A vertical bar is appropriate when displaying lengthy descriptions. Sorting revealed the top causes.

For factor 1:

Driver inattention/Distraction is the main cause of collision

The main contributing factor for most cases is (1) driver inattention or distraction, followed by (2) failure to yield right-of-way, and (3) following too closely.

For factor 2: this is like saying, “Factor 1 is the main reason, but besides that, there is Factor 2.”

Driver inattention/Distraction is still the other main cause of collision

Driver inattention/distraction is still the top secondary cause, which means that when a person identifies their main reason as anything besides driver inattention (e.g., glare, unsafe speed), their next reason is driver inattention/distraction.

If you think this is invalid, look again: the axis for the primary factor goes up to 160K, while for the secondary factor it only reaches 40K. This means the reporters mostly identified only the main factor.

Lastly, it’s quite obvious that the human category of causes comes out on top, since it has the most components and the top identified causes fall under it.

The ‘share’ part is this article! To wrap up, here are the important takeaways from this dataset based on how I analyzed it:
