New York subway, one of the main public transportations for New Yorkers, provides super convenience for local citizens, at the same time, brings potential danger to passengers, where criminals are attracted to busier subway stations for certain kinds of crime like pickpocketing, grand larceny, and assault. This closest place will trigger evil.
Wordcloud using victims description
On November 21, around 12:00 AM, at 34th Street-Penn Station in Manhattan, Alkeem Loney, a 32-year-old male, was stabbed in the neck during an unprovoked attack and was pronounced dead later as NYPD stated. The deadly incident is the latest in a pate of violence underground that comes as the MTA tries to get commuters back on mass transit. The horrible crime event raised lots of public concern about the safety at subway stations, the safety is tightly related to almost every citizen who is living, working, and studying in New York City.
As students who are living here in New York City, most of us will almost take the subway to the campus in the early morning and back to the apartment in the night on weekdays, and hang out with friends on weekends. However, some of my friends experienced uncompleted crimes. Keeping away from danger at subway stations is closely related to ourselves. We hope we are able to help citizens to find comparatively safe and reliable routes when taking subways.
The orginal subway crime data has two parts.The first one contains all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department— NYPD. The second one includes similar crimes. We join these two data frames and only analyze crimes which happen in subway, NYC.
The variables we use are(some useless variable’s meaning can be found in the link above):
column name | description | type |
---|---|---|
CMPLNT_NUM | Randomly generated persistent ID for each complaint | Number |
CMPLNT_FR_DT | Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) | Date & Time |
CMPLNT_FR_TM | Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) | Plain Text |
CMPLNT_TO_DT | Ending date of occurrence for the reported event, if exact time of occurrence is unknown | Date & Time |
CMPLNT_TO_TM | Ending time of occurrence for the reported event, if exact time of occurrence is unknown | Plain Text |
OFNS_DESC | Description of offense corresponding with key code | Plain Text |
LAW_CAT_CD | Level of offense: felony, misdemeanor, violation | Plain Text |
SUSP_AGE_GROUP | Suspect’s Age Group | Plain Text |
SUSP_RACE | Suspect’s Race Description | Plain Text |
SUSP_SEX | Suspect’s Sex Description | Plain Text |
Latitude | Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) | Number |
Longitude | Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) | Number |
STATION_NAME | Transit station name | Plain Text |
VIC_AGE_GROUP | Victim’s Age Group | Plain Text |
VIC_RACE | Victim’s Race Description | Plain Text |
VIC_SEX | Victim’s Sex Description | Plain Text |
The orginal Subway passenger data is from MTA(Metropolitan Transportation Authority). The orginal data contains total entries and exits in each station in every 4 hours from 2010 to now. Data is not in a readable format, they are seperated by time in different htmls, we read and process passenger data with GenerateSubwayPassengerData.rmd
The variables we use are:
colum name | description | type | ||
---|---|---|---|---|
STATION | station name | Character | ||
LINENAME | lines in this station, there can be more than one lines in one station | Character | ||
DATE | format MM/DD/YYYY | Date | ||
TIME | format HH:MM:SS | Date | ||
ENTRIES | cumulative entries | Intergar | ||
EXITS | cumulative exits | Intergar |
In order to compare crime and subway passengers’ data, we find that we need to transfer to the same subway line and station name.(Different stations have different abbreviation.) We use the crime data’s latitude and longitude to match the subway’s data. The station in the subway information closet to the each row of crime data will be matched. (which has information about all the station’s name, line and location.) Some crime data who have deviant longitude and latitude will be excluded.
The reason why we do not want to use the orignal boro (Manhatten, Brooklyn, Queens and Bronx) is that there is too few categories based on our analysis on crime rate, making the model to be less predictive. We want to use Kmeans to produce more boros/clusters so that it better shows the pattern of the crime distribution. For instance, crime rate in upper Manatten > mid Manatten> lower Manatten. And we use Kmeans to better seperate these parts so that we can have a better model.
We set the number of clusters to be 8 and use Kmeans to cluster latitudes and longitudes. After K-means we have 8 clusters of locations instead of the original 4 boro, making it closer to reality (for instance we have lower, middle and upper Manhatten in the clusters)and better for model classification. The kmeans code is in PassengerEDA.Rmd
Some missing data from passenger’s exit and enter count, we use mean of former values to impute them. The imputation code we use is FutherCleanPassenger.py
We want to get coordinates of each station for the following reasons
However, how to get the correct coordinates is tricky, there are open datas about NYC subway stations infomation and all of them have different naming system with ours. In addition, the station names contain lots of dupilicates. For instance, there are 2 86 st stations in middle Manhattan and another one in Brooklyn. We can get the correct coordinates of stations by using both station names and line names. Therefore, our solution is to use Google Maps Api. The code we use is Subway_info.py
According to the New York City Subway instruction, there are several different transfer between lines. The first is the inside transfer, where you can transfer from one line to other line inside the station. For example, 14 St-Union Sq is a station of Line LNQRW456. We don’t need to some adjustment for these stations. The second one is free subway transfer and free out-of-way-system. This transfer is different from the inside transfer, passengers need to move from one station to other station for transfer. The data of these transfers has some problems. For example, there are free subway transfer between Court ST-23 ST(EM) and Court Sq(G7). However, the dataset shows the station and line is Court ST-23 ST:EGM, Court Sq:EGM, Court Sq:7. To deal with problem like this, we reassigned the line of station with free subway transfer or free out-of-way-system according to the New York City Subway instruction. In this case, we only consider the insider transfer station.
For each station and given time, We got the actual entries and exits by calculating the difference of cumulative entries and exits between current time and last time. However, final results contains some outliers, some entries and exits are negative or extremely large. For these outliers, we replaced them with the mean of last two observations at the same time and station. We did this by FutherCleanPassenger.py.
We conduct EDA to find the trends of data and provide insights for model. In the first section, we analyze subway crime data and produce an interactive Shiny Dashboard about subway crime, people can look up crime rate in each location, distribution of each crime type. For the other one, we analyze passenger data and create a new variable cluster using Kmeans. Additionally, we build a shiny app for Subway passenger flow animation and info lookup and a more detailed app on each line
New York City can be a dangerous place and crime from above ground will often extend into the NYC Subway. We mainly focus on the recent crime data on subway in NYC in this year, and there are 124439 complaints from 2006 to now.
From this map, you can check where the crime happened frequently.
From this map, you can check each crime’s location, type, victim, and suspects’ information and time.