Introduction

The New York subway, one of the main modes of public transportation for New Yorkers, is extremely convenient for local residents; at the same time, it brings potential danger to passengers, because criminals are attracted to busier subway stations for crimes such as pickpocketing, grand larceny, and assault. The places closest to our daily lives can also be where danger strikes.

Word cloud of victim descriptions

On November 21, around 12:00 AM, at 34th Street-Penn Station in Manhattan, Alkeem Loney, a 32-year-old man, was stabbed in the neck in an unprovoked attack and was later pronounced dead, according to the NYPD. The deadly incident is the latest in a spate of violence underground that comes as the MTA tries to get commuters back on mass transit. This horrific crime raised widespread public concern about safety at subway stations, a concern that touches almost everyone who lives, works, or studies in New York City.

As students living in New York City, most of us take the subway to campus in the early morning and back to our apartments at night on weekdays, and to meet friends on weekends. However, some of our friends have experienced attempted crimes. Staying away from danger at subway stations is therefore directly relevant to us. We hope to help citizens find comparatively safe and reliable routes when taking the subway.

Data

Data Introduction

Subway Crime

The original subway crime data has two parts. The first contains all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 through 2020. The second contains the same kinds of complaints from 2021 onward. We join these two data frames and analyze only the crimes that occurred in the NYC subway.

The variables we use are listed below (the meanings of the variables we do not use can be found in the data source linked above):

CMPLNT_NUM (Number): Randomly generated persistent ID for each complaint
CMPLNT_FR_DT (Date & Time): Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists)
CMPLNT_FR_TM (Plain Text): Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists)
CMPLNT_TO_DT (Date & Time): Ending date of occurrence for the reported event, if exact time of occurrence is unknown
CMPLNT_TO_TM (Plain Text): Ending time of occurrence for the reported event, if exact time of occurrence is unknown
OFNS_DESC (Plain Text): Description of offense corresponding with key code
LAW_CAT_CD (Plain Text): Level of offense: felony, misdemeanor, violation
SUSP_AGE_GROUP (Plain Text): Suspect’s Age Group
SUSP_RACE (Plain Text): Suspect’s Race Description
SUSP_SEX (Plain Text): Suspect’s Sex Description
Latitude (Number): Midblock latitude coordinate, Global Coordinate System WGS 1984, decimal degrees (EPSG 4326)
Longitude (Number): Midblock longitude coordinate, Global Coordinate System WGS 1984, decimal degrees (EPSG 4326)
STATION_NAME (Plain Text): Transit station name
VIC_AGE_GROUP (Plain Text): Victim’s Age Group
VIC_RACE (Plain Text): Victim’s Race Description
VIC_SEX (Plain Text): Victim’s Sex Description

Subway Passenger

The original subway passenger data comes from the MTA (Metropolitan Transportation Authority). It contains total entries and exits at each station for every 4-hour period from 2010 to the present. The data is not in a readable format: it is split by time period across separate HTML files, so we read and process the passenger data with GenerateSubwayPassengerData.rmd.

The variables we use are:

STATION (Character): station name
LINENAME (Character): lines served at this station; a station can serve more than one line
DATE (Date): format MM/DD/YYYY
TIME (Time): format HH:MM:SS
ENTRIES (Integer): cumulative entries
EXITS (Integer): cumulative exits

Data Cleaning

Subway Crime

The Least Distance

In order to compare the crime data with the subway passenger data, we need to map both to the same subway line and station names (different sources abbreviate station names differently).
We use the latitude and longitude in the crime data to match against the subway station information (which contains every station's name, lines, and location): each crime record is matched to the station closest to it.
Crime records with deviant longitude and latitude are excluded.
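
A minimal sketch of this nearest-station matching, assuming a crime_df with Latitude/Longitude columns and a station_df with station, linename, lat, and long columns (names and the distance cutoff are illustrative, not the exact code we ran):

library(dplyr)

match_nearest_station <- function(crime_df, station_df) {
  crime_df %>%
    filter(!is.na(Latitude), !is.na(Longitude)) %>%   # drop records with missing coordinates
    rowwise() %>%
    mutate(
      # a planar squared distance is enough to find the *closest* station within NYC
      nearest_idx = which.min((station_df$lat - Latitude)^2 + (station_df$long - Longitude)^2),
      station     = station_df$station[nearest_idx],
      linename    = station_df$linename[nearest_idx],
      min_dist    = sqrt((station_df$lat[nearest_idx] - Latitude)^2 +
                         (station_df$long[nearest_idx] - Longitude)^2)
    ) %>%
    ungroup() %>%
    # exclude records whose coordinates are far from any station (deviant locations)
    filter(min_dist < 0.01)
}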

Subway Passenger

K-Means

The reason we do not want to use the original boroughs (Manhattan, Brooklyn, Queens, and the Bronx) is that they give too few categories for our analysis of crime rates, which makes the model less predictive. We use K-means to produce more boroughs/clusters so that the pattern of the crime distribution shows up more clearly; for instance, the crime rate in upper Manhattan > mid Manhattan > lower Manhattan, and K-means lets us separate these parts and build a better model.

We set the number of clusters to 8 and use K-means to cluster the latitudes and longitudes. After K-means we have 8 location clusters instead of the original 4 boroughs, which is closer to reality (for instance, the clusters distinguish lower, middle, and upper Manhattan) and better for model classification. The K-means code is in PassengerEDA.Rmd.

Imputation

Some entry and exit counts in the passenger data are missing; we impute them with the mean of the preceding values. The imputation code we use is FutherCleanPassenger.py.

Google Map Api to find station coordinates

We want the coordinates of each station for the following reasons:

  • location-based data visualization and analysis
  • More location-based features for the model
  • The station names in the crime data and the passenger data do not match; we can use coordinates to match them

However, getting the correct coordinates is tricky: there are open datasets with NYC subway station information, but all of them use naming systems different from ours. In addition, the station names contain many duplicates; for instance, there are two 86 St stations in middle Manhattan and another one in Brooklyn. We can get the correct coordinates only by using both the station name and the line name. Therefore, our solution is to use the Google Maps API. The code we use is Subway_info.py.
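
A minimal sketch of this geocoding step, calling the Google Maps Geocoding API from R (the helper name, query wording, and lack of error handling are illustrative; the actual lookup lives in Subway_info.py):

library(httr)
library(jsonlite)

geocode_station <- function(station, line, api_key) {
  # query the station together with its line so duplicated names resolve correctly
  query <- paste(station, "subway station, line", line, ", New York, NY")
  resp <- GET(
    "https://maps.googleapis.com/maps/api/geocode/json",
    query = list(address = query, key = api_key)
  )
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  loc <- parsed$results$geometry$location[1, ]   # keep the top match
  data.frame(station = station, linename = line, lat = loc$lat, long = loc$lng)
}

# e.g. geocode_station("86 St", "456", api_key = "YOUR_KEY")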

Add service column

There are many subway lines and some of them share most of their tracks, so it is not reasonable to conduct analysis or build models on the raw line name. Therefore, we created a new variable called service based on the MTA's definitions; for instance, lines A, C, and E together form the '8 Avenue' service.
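
A rough sketch of the service mapping, assuming a linename column as in the passenger table above; only the services that appear in this report are listed, and stations serving several lines fall into the first matching group here, so this is an approximation rather than the exact assignment:

library(dplyr)

add_service <- function(df) {
  df %>%
    mutate(service = case_when(
      grepl("[ACE]", linename)  ~ "8 Avenue(ACE)",
      grepl("[BDFM]", linename) ~ "6 Avenue(BDFM)",
      grepl("G", linename)      ~ "Brooklyn-Queens Crosstown(G)",
      grepl("L", linename)      ~ "14 St-Canarsie(L)",
      grepl("[NQRW]", linename) ~ "Broadway(NQRW)",
      grepl("[123]", linename)  ~ "7 Avenue(123)",
      grepl("[456]", linename)  ~ "Lexington Av(456)",
      grepl("7", linename)      ~ "Flushing(7)",
      grepl("S", linename)      ~ "Shuttle(S)",
      TRUE                      ~ "Other"
    ))
}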


Correct subway line

According to the New York City Subway guide, there are several kinds of transfers between lines. The first is the in-station transfer, where you can move from one line to another inside the same station; for example, 14 St-Union Sq is a single station served by lines L, N, Q, R, W, 4, 5, and 6, and no adjustment is needed for such stations. The second kind is the free subway transfer and the free out-of-system transfer, where passengers must walk from one station to another to transfer. The data for these transfers has some problems. For example, there is a free subway transfer between Court Sq-23 St (E, M) and Court Sq (G, 7), but the dataset records the station and line as Court Sq-23 St: EGM, Court Sq: EGM, and Court Sq: 7. To deal with problems like this, we reassigned the lines of stations involved in a free subway transfer or a free out-of-system transfer according to the New York City Subway guide, so that afterwards we only need to handle in-station transfers.
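
A minimal sketch of this reassignment for the Court Sq example (the station spellings and corrected line lists are written here for illustration; the full correction follows the subway guide):

library(dplyr)

passenger_df <- passenger_df %>%
  mutate(linename = case_when(
    station == "COURT SQ-23 ST" ~ "EM",   # was recorded as EGM
    station == "COURT SQ"       ~ "G7",   # was split into EGM and 7
    TRUE                        ~ linename
  ))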

Outliers of entries and exits

For each station and time slot, we obtained the actual entries and exits by taking the difference between the cumulative counts at the current time and at the previous time. However, the resulting counts contain some outliers: some entries and exits are negative or extremely large. We replaced these outliers with the mean of the last two observations at the same time and station; this is done in FutherCleanPassenger.py.
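
A minimal R sketch of the same idea (the actual cleaning is done in FutherCleanPassenger.py; column names are lower-case versions of the raw table above, and the 200000 cap is an illustrative cutoff):

library(dplyr)

passenger_df <- passenger_df %>%
  arrange(station, linename, date, time) %>%
  group_by(station, linename) %>%
  mutate(
    entry_diff = entries - lag(entries),   # actual entries in each 4-hour window
    exit_diff  = exits - lag(exits)
  ) %>%
  group_by(station, linename, time) %>%    # same 4-hour slot on previous days
  mutate(
    entry_prev_mean    = (lag(entry_diff, 1) + lag(entry_diff, 2)) / 2,
    entry_diff_imputed = ifelse(entry_diff < 0 | entry_diff > 200000,  # illustrative cutoff
                                entry_prev_mean, entry_diff)
    # exit_diff_imputed is built the same way
  ) %>%
  ungroup()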

Exploratory Data Analysis

We conduct EDA to find trends in the data and provide insights for the model. In the first section, we analyze the subway crime data and produce an interactive Shiny dashboard about subway crime, where users can look up the crime rate at each location and the distribution of each crime type. In the second section, we analyze the passenger data and create a new cluster variable using K-means. Additionally, we build a Shiny app for subway passenger-flow animation and information lookup, as well as a more detailed app for each line.

Subway Crime

New York City can be a dangerous place, and crime from above ground often extends into the NYC subway. We mainly focus on recent subway crime in NYC; in total there are 124,439 complaints from 2006 to the present.

Crime by Location

Heat Map of Subway Crime in NYC, 2006-2021

From this map, you can see where crimes happen most frequently.

Map of Subway Crime in NYC, 2006-2021

From this map, you can check each crime's location, type, time, and the victim's and suspect's information.

Distribution of crime in 7 Clusters

A first glance at the map above gives only a rough sense of how many crimes occur in each part of NYC, so let us look at each cluster more precisely with a bar chart.

Top 10 offense classification

There are 50 kinds of crime recorded in the subway; the bar chart shows the 10 most common.

The most frequent crimes are mainly grand larceny and assault.

Top 20 stations where crime happens frequently

From this chart, you can see which stations are the most dangerous.

Bar charts of victims in each borough

From these charts, you can see which races are more vulnerable in the subway in each borough.

As you can see, the proportion of African American victims is stable across clusters; in Manhattan, which has the most crimes, white people (including white Hispanics) are more vulnerable than African Americans.

Female Age Distribution for Sex Crimes

Let us look more closely at the age groups for two specific offenses: SEX CRIMES and HARASSMENT 2.
Most victims are in the 25-44 age group.

Crime Rate Top 20

Sometimes we care more about the crime rate in the subway than the raw number of crimes, because we also care about the probability that a person standing near us could be a suspect.

linename station flow crime crime_rate
AS BROAD CHANNEL 230815 271 0.0011741
FM 6 AV 1049168 1096 0.0010446
23 125 ST 10012389 2900 0.0002896
3 NOSTRAND AV 4981435 1436 0.0002883
1 FRANKLIN ST 4151331 1035 0.0002493
1 23 ST 13849705 2870 0.0002072
7 74 ST-BROADWAY 5244829 1068 0.0002036
1 CANAL ST 6143790 1061 0.0001727
1 125 ST 8760782 1437 0.0001640
25 INTERVALE AV 4020500 621 0.0001545
ACJLZ BROADWAY JCT 10856301 1584 0.0001459
6 3 AV 138 ST 5030499 730 0.0001451
J 111 ST 1578616 208 0.0001318
BD 182-183 STS 2054457 241 0.0001173
4 KINGSBRIDGE RD 8467318 922 0.0001089
25 JACKSON AV 3582715 381 0.0001063
L ATLANTIC AV 1447978 147 0.0001015
23 CENTRAL PK N110 9509973 951 0.0001000
F AVENUE P 1970015 194 0.0000985
C 155 ST 2984168 282 0.0000945
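
The table above could be assembled roughly as follows (a minimal sketch assuming the cleaned crime_df and passenger_df described earlier, after the station matching; not the exact code we ran):

library(dplyr)

crime_rate_df <-
  passenger_df %>%
  group_by(linename, station) %>%
  summarise(flow = sum(entry_diff_imputed + exit_diff_imputed), .groups = "drop") %>%
  inner_join(
    crime_df %>% count(linename, station, name = "crime"),
    by = c("linename", "station")
  ) %>%
  mutate(crime_rate = crime / flow) %>%
  arrange(desc(crime_rate)) %>%
  slice_head(n = 20)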

Crime by Time

We already have a general picture of which stations are comparatively dangerous and which types of crime events happened frequently over the last several years.

Now let's dig into the relationship between time and the occurrence of various crime events.

In the next few plots, we try to explore questions like…

  • Which year has the highest frequency of crime events?

  • Which season has the highest frequency of crime events?

  • Which times of day see more crime events?

Number of events each year

sub_crime_year = 
  raw_sub_crime %>% 
  select(start_date, start_time_hour, crime_event, law_cat) %>% 
  mutate(start_date = substring(start_date, 1, 4))

plot_1 = 
  sub_crime_year %>% 
  group_by(start_date) %>% 
  summarise(event_num = n() / 1000) %>% 
  plot_ly(
    x = ~start_date, y = ~event_num, type = "bar"
  )

layout(plot_1, title = "Crime events over years", xaxis = list(title = "Year"), yaxis = list(title = "Number of Crime Events (K)"))

Number of events each quarter

sub_crime_season = 
  raw_sub_crime %>% 
  select(start_date, start_time_hour, crime_event, law_cat) %>% 
  separate(start_date, into = c("year", "month", "day"), sep = "-") %>% 
  mutate(season = case_when(
    month %in% c("01","02","03") ~ "Q1",
    month %in% c("04","05","06") ~ "Q2",
    month %in% c("07","08","09") ~ "Q3",
    month %in% c("10","11","12") ~ "Q4"
  )) %>% 
  mutate(event_quarter = paste(year, season, sep = "-"))
  
plot_2 = 
  sub_crime_season %>% 
  group_by(event_quarter) %>% 
  summarise(event_num = n()) %>% 
  plot_ly(
    x = ~event_quarter, y = ~event_num, type = "scatter", mode = "markers"
  )

layout(plot_2, title = "Crime events over quarter", xaxis = list(title = "Quarter"), yaxis = list(title = "Number of Crime Events"), width = 800)

Number of events each month

sub_crime_month = 
  raw_sub_crime %>% 
  select(start_date, start_time_hour, crime_event, law_cat) %>% 
  separate(start_date, into = c("year", "month", "day"), sep = "-") %>% 
  mutate(month = month.name[as.integer(month)]) %>% 
  mutate(month = ordered(month, levels = month.name))
  
plot_3 = 
  sub_crime_month %>% 
  group_by(month, law_cat) %>% 
  summarise(event_num = n() / 1000) %>% 
  plot_ly(
    x = ~month, y = ~event_num, type = "scatter", mode = "markers", color = ~law_cat, colors = "viridis"
  )


plot_4 = 
  sub_crime_month %>% 
  group_by(month) %>% 
  summarise(event_num = n() / 1000) %>% 
  plot_ly(
    x = ~month, y = ~event_num, type = "scatter", mode = "markers", color = "TotalNumber", colors = "#2171B5"
  )

subplot = subplot(plot_4, plot_3)

layout(subplot, title = "Event number over month", xaxis = list(title = "Month"), yaxis = list(title = "Number of crime events (K)"), width = 1000)

Application: Crime Map

On the Shiny dashboard, two tabs are designed (click here). The first tab shows the crime data, with a cluster map and bar charts by service and by cluster; users can control the crime type and time period to view the plots under different scenarios. The second tab mainly shows the crime rate, with a zipcode map and bar charts by service and by cluster; users can select the time period down to the day and choose the level of offense.

Subway Passenger

Subway passenger EDA with location

Passenger flow is closely related to crime: the larger the passenger flow at a station, the more crime tends to occur there. Therefore, we conduct EDA to:

  • Find the relationship between location and passenger flow
  • Determine the most appropriate location variable for the model

Total passengers in each station

The color of each circle indicates the subway service and its size indicates the total number of passengers in 2021.

df = 
  passenger_df %>% 
    drop_na(entry_diff_imputed, exit_diff_imputed) %>% 
    group_by(station, service, linename, sublocality, postal_code, lat, long) %>% 
    summarise(total_entry = sum(entry_diff_imputed),
              total_exit = sum(exit_diff_imputed)) %>% 
    mutate(passenger_flow = total_entry + total_exit,
           # set passenger_flow to int
           passenger_flow = as.integer(passenger_flow))

# df %>% 
#   leaflet() %>% 
#   addTiles() %>% 
#   addCircleMarkers(~long, ~lat,radius= df$passenger_flow/100000000, weight= 0.9)


qpal <- colorQuantile("YlOrRd", df$passenger_flow, n = 4)

pal <- 
   colorFactor(palette = c("blue", "azure4", "orange",'green','green','brown','yellow','red','forestgreen','purple'), 
               levels = c('8 Avenue(ACE)',
                          'Shuttle(S)',
                          '6 Avenue(BDFM)',
                          'Brooklyn-Queens Crosstown(G)',
                          'Brooklyn-Queens(G)',
                          '14 St-Canarsie(L)',
                          'Broadway(NQRW)',
                          '7 Avenue(123)',
                          'Lexington Av(456)',
                          'Flushing(7)'))


df %>% 
  mutate(service = ifelse(service == 'Brooklyn-Queens Crosstown(G)', 'Brooklyn-Queens(G)', service)) %>% 
  mutate(passenger_flow2 = 10*log(passenger_flow)) %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircles(lng = ~long, lat = ~lat, weight = 1, stroke = FALSE,
    radius = ~sqrt(passenger_flow)/20, popup = ~station, color = ~pal(service), opacity = 0.75, fillOpacity = 0.75) %>%
  addLegend("topright", pal = pal, values = ~service, 
            title = "Subway Service", opacity = 0.75) %>% 
  setView(-73.8399986, 40.746739, zoom = 11)

There are clear patterns between station location and total passenger flow. Big stations are mostly located in lower and middle Manhattan, and there are some sub-center stations in other areas, such as the 9th Street station in Brooklyn and the FLUSHING-MAIN station in Queens.

By sublocality

df %>% 
  ungroup %>% 
  filter(sublocality != 'None') %>% 
  drop_na() %>% 
  group_by(sublocality) %>%  
  summarise(passenger_flow = sum(passenger_flow)) %>% 
  mutate(sublocality = as.factor(sublocality)) %>% 
  arrange(-passenger_flow) %>% 
  filter(passenger_flow < 500000 |passenger_flow > 60000000) %>% 
  knitr::kable()
sublocality passenger_flow
Manhattan 890710230
Brooklyn 288469501
Queens 174401767
Bronx 98407980
Staten Island 297340

Manhattan had the most subway passengers in 2021 and Staten Island the fewest. Additionally, sublocality has only 5 levels, which is too few for a machine learning model.

EDA with zipcode

Total passengers in each zipcode
# cache ZCTA (zip code) boundaries downloaded via the tigris package
options(tigris_use_cache = TRUE)


# get ZCTA boundaries and rename the GEOID column to postal_code
char_zips = zctas(cb = TRUE)
char_zips = 
  char_zips %>% 
  rename(postal_code = GEOID10)

summary_df<-
  df %>%
  mutate(postal_code) %>% 
  group_by(postal_code) %>%
  summarise(passenger_flow = sum(passenger_flow),
            station_cnt = n_distinct(station, linename)) 


summary_df<-geo_join(char_zips, 
                      summary_df, 
                      by_sp = "postal_code", 
                      by_df = "postal_code",
                      how = "left") %>% 
  filter(passenger_flow>=0)

pal <- colorNumeric(
  palette = "Greens",
  domain = summary_df$passenger_flow,
  na.color = "white")

labels <- 
  paste0(
    "Zip Code: ",
    summary_df$postal_code, "<br/>",
    "Flow of Passengers: ",
    summary_df$passenger_flow) %>%
  lapply(htmltools::HTML)

# summary_df2 = 
#   char_zips %>% 
#     select(postal_code) %>% 
#     left_join(summary_df, by = 'postal_code') 

summary_df %>%  
  mutate(postal_code_int = as.integer(postal_code)) %>% 
  filter(postal_code_int >= 10000 & postal_code_int < 14900) %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
   addPolygons(fillColor = ~pal(passenger_flow),
              weight = 2,
              opacity = 1,
              color = "white",
              dashArray = "3",
              fillOpacity = 0.7,
              highlight = highlightOptions(weight = 2,
                                           color = "#666",
                                           dashArray = "",
                                           fillOpacity = 0.7,
                                           bringToFront = TRUE),
              label = labels) %>% 
  addLegend(pal = pal, 
            values = ~passenger_flow, 
            opacity = 0.7, 
            title = htmltools::HTML("Total Passengers 2021"),
            position = "bottomright") %>% 
  setView(-73.8399986, 40.746739, zoom = 10)
Total subway stations in each zipcode
labels <- 
  paste0(
    "Zip Code: ",
    summary_df$postal_code, "<br/>",
    "Stations count: ",
    summary_df$station_cnt) %>%
  lapply(htmltools::HTML)


pal <- colorNumeric(
  palette = "Purples",
  domain = summary_df$station_cnt,
  na.color = "white")

summary_df %>%  
  mutate(postal_code_int = as.integer(postal_code)) %>% 
  filter(postal_code_int >= 10000 & postal_code_int < 14900) %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
   addPolygons(fillColor = ~pal(station_cnt),
              weight = 2,
              opacity = 1,
              color = "white",
              dashArray = "3",
              fillOpacity = 0.7,
              highlight = highlightOptions(weight = 2,
                                           color = "#666",
                                           dashArray = "",
                                           fillOpacity = 0.7,
                                           bringToFront = TRUE),
              label = labels) %>% 
  addLegend(pal = pal, 
            values = ~station_cnt, 
            opacity = 0.7, 
            title = htmltools::HTML("Total Stations 2021"),
            position = "bottomright") %>% 
  setView(-73.8399986, 40.746739, zoom = 10)

The zipcode map does not clearly show the relationship between location and passenger flow. For instance, some zipcodes in lower Manhattan, such as 10002 and 10011, should have more passengers, yet few stations are located there. The underlying cause of this mismatch is that subway stations are not laid out according to zipcode boundaries.

Kmeans analysis of station

We set the number of clusters to 8 and use K-means to cluster the latitudes and longitudes. The color of each circle indicates the K-means cluster it belongs to, and its size indicates the total number of passengers in 2021.

# conduct kmeans
df_sub = 
  df %>% 
  ungroup() %>% 
  select(long, lat) %>% 
  drop_na()

k2 = kmeans(df_sub, centers = 8, nstart = 25)
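# note: k-means cluster numbers depend on the random initialization,
# so setting a seed (set.seed) before kmeans() keeps the labels below stable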

# EDA with Kmeans results
df$cluster = k2$cluster

df = 
  df %>%
  mutate(cluster = case_when(
    cluster == 1 ~ 'Queens',
    cluster == 2 ~ 'Upper Manhattan',
    cluster == 3 ~ 'Queens-Brooklyn',
    cluster == 4 ~ 'Middle Manhattan',
    cluster == 5 ~ 'Bronx',
    cluster == 6 ~ 'Brooklyn',
    cluster == 7 ~ 'Lower Manhattan',
    cluster == 8 ~ 'Rockaway Beach'
  ))

pal = colorFactor(
  brewer.pal(n = 8, name = "Set1"),
  df$cluster,
  levels = NULL,
  ordered = FALSE,
  na.color = "#808080",
  alpha = FALSE,
  reverse = FALSE
)

df %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircles(lng = ~long, lat = ~lat, weight = 1, stroke = FALSE,
    radius = ~sqrt(passenger_flow)/20, popup = ~station, color = ~pal(cluster), opacity = 1, fillOpacity = 1) %>%
  addLegend("topright", pal = pal, values = ~cluster, 
            title = "Kmeans Cluster", opacity = 1) %>% 
  setView(-73.8399986, 40.746739, zoom = 11)

The K-means algorithm clusters Manhattan into three parts: lower Manhattan, middle Manhattan, and upper Manhattan. Brooklyn and Queens share three clusters, and there is a separate cluster for the Bronx. We think the K-means result is easier to interpret than the zipcode or sublocality groupings, and it partly captures the relationship between passenger flow and location. Therefore, we use the K-means result as the location variable in our model.

Model

Through exploratory data analysis, we found that crime is related to the victim, the time, and the location. While a fully individual-level analysis would be computationally expensive, a group/class-level analysis is relatively fast and accurate, so we analyze crime at the group level against these independent variables. The data can naturally be viewed as a graph that represents the relations (crimes) between entities (times and victim groups at particular locations). In addition, because this graph is permutation invariant, we use graph neural networks (GNNs) to solve the resulting link-prediction task.

Methodology

Graph neural networks can extract features and make predictions about entities and relations using richer information; the reader is referred to [1] for details. An end-to-end trainable graph auto-encoder (GAE) has shown significant improvement for link prediction on undirected graph-structured data [2,3]. In what follows, we implement and evaluate a graph auto-encoder on our dataset.

Experiments

Data Processing

Data sources were described earlier. For time, we consider the date and time of each event. Age, race, and sex were selected as the victim variables. For space, we use the subway line service and the neighborhood cluster defined in the previous sections. Considering the computing cost, we grouped the subway crime data from 2006 to 2021, as shown below. To put the data in a form a GNN can accept, we treat each distinct (date, time) pair as an item node and each distinct (age, race, sex, service, cluster) vector as a user node, and the link-prediction task is defined between user and item nodes. In total, there are 1,612 user nodes, 2,196 item nodes, and 28,126 edges between them.

date 366 days of the year
time 6 time intervals with a length of 4 hours
age 5 age groups by raw data
race 6 race groups by raw data
sex male and female
service 8 service groups obtained by previous section
cluster 8 cluster groups obtained by previous section
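
A minimal sketch of how the user/item node sets and the edge list could be built from the grouped crime data (column names are illustrative, not the exact preprocessing code):

library(dplyr)

item_nodes <- crime_df %>%
  distinct(date, time) %>%
  mutate(item_id = row_number())            # each (date, time) pair becomes an item node

user_nodes <- crime_df %>%
  distinct(age, race, sex, service, cluster) %>%
  mutate(user_id = row_number())            # each victim/location profile becomes a user node

edges <- crime_df %>%
  left_join(item_nodes, by = c("date", "time")) %>%
  left_join(user_nodes, by = c("age", "race", "sex", "service", "cluster")) %>%
  distinct(user_id, item_id)                # one edge per observed (user, item) pair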

Model selection

The data set was divided into a training set, a validation set (with negative sampling), and a test set (with negative sampling) in the ratio 0.6 : 0.2 : 0.2. After tuning the hyperparameters for the best validation performance, we used a two-layer graph convolutional network with 0.5 dropout between the layers as the encoder; the dropout adds noise between layers and improves the robustness of the model. The inner product was used as the decoder. The learning rate, selected on the validation set, was 0.006. With binary cross-entropy as the loss function, the model was fully specified.
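
Following [2], the graph auto-encoder can be summarized by the formulas below, where $A$ is the adjacency matrix, $X$ the node features, $Z$ the node embeddings, and $W_0, W_1$ the weights of the two GCN layers (notation added here for clarity):

$$Z = \mathrm{GCN}(X, A) = \tilde{A}\,\mathrm{ReLU}\!\big(\tilde{A} X W_0\big) W_1, \qquad \hat{A}_{uv} = \sigma\big(z_u^{\top} z_v\big),$$

with $\tilde{A} = D^{-1/2}(A + I)D^{-1/2}$, and the binary cross-entropy loss

$$\mathcal{L} = -\sum_{(u,v)} \Big[ A_{uv}\log \hat{A}_{uv} + (1 - A_{uv})\log\big(1 - \hat{A}_{uv}\big) \Big].$$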

Results

The AUC was used to evaluate the model. As shown in the plot, although there is slight overfitting in the final epochs, the AUC of the GAE on the validation set is 0.8539, meaning the probability that a predicted positive case is ranked ahead of a negative case is 0.8539. On the test set, the AUC is 0.8543, which shows that the GAE is an effective classifier (>0.85).

GNN to Application

In the crime-prediction task, we value recall more than precision, because a false negative is more serious than a false positive and we do not want to take that risk. Therefore, we set the threshold to 0.45 and predict a positive outcome (crime occurring) whenever the model output exceeds it.

Model Reproduction

See the README.txt file in the GitHub model folder.

Model Application

We built the No Crime Navigation app based on the Google Maps API and the GNN model.

Input parameters

In the left panel, users can select their information and type in their current location and destination. This includes:

  • Who are you: gender, age and race
  • When you leave: date and time
  • Where to go: your location and destination

Routes

When users input their information and click the submit button, several candidate routes are displayed in the table on the right. The table shows several pieces of information about each route, including:

  • time: time to get from the start location to the destination
  • walking distance: walking distance along this route
  • crime score: the likelihood of being the victim of a crime event
  • crowdedness score: how crowded the route is
  • line[stops]: a brief summary of the route, i.e., how many stops are taken on each line

Interactive route map

Users can click on each row in the routes table to show the details of that route on the map. Multiple routes can be selected.

Summary

Results

Event and Location

We explored which stations are the most dangerous, in terms of both the number of crimes and the crime rate, over these years, and examined the crime distribution by cluster and by type of offense. Furthermore, using bar charts, we found which age groups, genders, and races are more likely to be assaulted.

Event and time

For the relationship between crime events and time of occurrence (year, quarter, month, day of the week, and time of day), we explore how crime frequency varies with each of these time variables. For instance, we identify which year has the most crime events, which could motivate further investigation of the variation across years, and which time of day on which day of the week is the most dangerous for people to be out, which may help citizens avoid those times for outdoor activities.

Subway Passenger

Subway passenger flow and subway stations are not evenly distributed in NYC: most stations and passengers are in middle and lower Manhattan. Therefore, we use K-means to better describe the distribution of the subway stations. We also build an app for users to look up details of the subway information.

Limitations

Data limitations

The crime data is merged from two datasets, one covering 2006-2020 and the other covering 2021 to 2022. Although they have similar columns and both contain all the variables we are interested in, the definitions of offenses and the way the data was collected still differ somewhat.

Some of the original records are not entirely reasonable. For instance, the reported duration of some violation or misdemeanor cases is more than several years, and it is unclear whether those cases really lasted that long, which produces extreme outliers. Second, some periods are censored in the original data set: for example, only a few days in January and February of 2020 were recorded, so it is hard to learn the true crime situation during that period.

The original passenger data only contains cumulative entries and exits. When we difference the entries and exits, the results contain negative and unreasonably large values, which we imputed. Besides that, the daytime subway lines at some stations differ from the nighttime lines; due to the data limitation, we cannot split the passenger data by day and night lines and have to analyze it based on the daytime lines.

Another problem is that station names in the crime data and in the original passenger data cannot be matched directly. We matched them through an external source (the Google Maps API): we use the API to get the exact coordinates of each station and then assign each crime record to a station, so mismatches can occur. Crime records are also dropped based on their distance to the closest station (a 0.0001 threshold in our cleaning script). Furthermore, some crime records have no latitude or longitude at all, and some have wrong locations that fall in Canada.

Finally, when several different lines serve one station, we have to divide the total passenger count evenly across the lines, because we do not know the passenger count for each line; this introduces some error.

Model limitations

Considering the computation cost, our group-level analysis is relatively fast and accurate; with more computing power, the model could be made stronger. In addition, because of the number of missing values, the results would improve with cleaner data. Finally, the decoder we used is the default inner product, and it could be specially designed for this task.

Acknowledgement

We would like to thank Zhuohui Liang, who gave us suggestions on this project. In addition, we want to thank last year's team 'Police Violence and Protest': their interactive map gave us the idea of building an interactive crime map. Moreover, we would like to thank Rebekah Hughes from that team for answering our questions about the Shiny dashboard navbar. Finally, we want to thank the Google Maps API team for their open-source code and free-to-use API; the No Crime Navigation app would not be possible without their contributions.

Reference

[1] Daigavane, et al., “Understanding Convolutions on Graphs”, Distill, 2021.

[2] Thomas N. Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. NIPS Bayesian Deep Learning Workshop (2016).

[3] Berg, Rianne van den, Thomas N. Kipf, and Max Welling. “Graph convolutional matrix completion.” arXiv preprint arXiv:1706.02263 (2017).