Justin Soll (jsoll1@uchicago.edu)
William Zhu (wzhu4@uchicago.edu)
Introduction
Divvy is a public bike sharing service in Chicago operated by Lyft. Launched in 2013, Divvy currently operates 681 bike stations and recorded about 337 thousand rides in April 2021. Riders can rent a bike in two ways: casual riders check out Divvy bikes using a one-time check-out system installed at each station, while members unlock bikes by scanning the QR code on each bike with the Lyft app on their phone. Figure I1 shows the current pricing model (as of June 2021).
Figure I1: Divvy bike share service pricing (as of June 2021)
Divvy offers riders two types of bikes. The blue classic bike has no special features; it must be checked out from and returned to Divvy bike stations. In contrast, the black e-bike has a pedal-assist motor to boost riding speed. Every black e-bike is also equipped with an internal lock, so riders can lock it at any bike rack or signpost after a trip. These locked e-bikes may be unlocked by another member or picked up by Divvy service vehicles. In April 2021, e-bikes accounted for about 29% of all trips.
Figure I2: Network graph of Chicago Divvy bike stations and trips (April 2021)
Figure I2 shows the network graph of Divvy bike trips. Each node represents a bike station, and the width of each undirected edge between two nodes represents the number of trips between those stations. Bike stations clearly vary widely in usage volume. In this project, the usage volume of a Divvy bike station is defined as the number of station-to-station Divvy bike trips in April 2021 that started at the station plus the number that ended there. Stations in downtown Chicago and on the North Side are more popular than those in other areas. Although stations are well distributed across the West Side and South Side, Hyde Park is the only South Side neighborhood with high Divvy bike usage volume in April 2021.
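To make the definition concrete, here is a minimal sketch of how usage volume can be computed from the public trip file with pandas. The file and column names follow Divvy's published export and should be treated as assumptions of this sketch.

```python
import pandas as pd

# Assumed file and column names from the public Divvy trip export.
trips = pd.read_csv("202104-divvy-tripdata.csv")

# Keep station-to-station trips only (both station names present).
trips = trips.dropna(subset=["start_station_name", "end_station_name"])

# Usage volume = trips starting at the station + trips ending at the station.
starts = trips["start_station_name"].value_counts()
ends = trips["end_station_name"].value_counts()
total_count = starts.add(ends, fill_value=0).astype(int).sort_values(ascending=False)
```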
Figure I3: Distribution of divvy bike stations by usage volume (April 2021)
Figure I3 shows the distribution of Divvy bike stations by usage volume, which ranges widely: some stations were used only once in April 2021, while the most popular station was used 7,109 times. What factors explain the variation in usage volume across Divvy bike stations? To answer this question, this project collected a wide range of data sources, including images, tabular data, and text reviews. Here are the three primary data sources:
- Divvy historical trip data (April 2021) is the primary dataset that links all other data sources in this project. It is stored in tabular format and is publicly available on Divvy's official website. Each row records one Divvy bike trip, including the start and end station names, coordinates, and timestamps.
- Google Street View images of the 681 Divvy bike stations were collected via the Google Maps API using the coordinates from the Divvy trip data (a minimal retrieval sketch appears after this list). Most of these images were taken between 2015 and 2018. We categorized them into three groups based on each station’s usage volume in April 2021.
- We also web-scraped 322 Yelp ratings and text reviews of the Divvy bike share service posted since 2013 to enrich our understanding of users’ feedback on the service.
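For reference, below is a minimal image retrieval sketch, assuming the Street View Static API was the image source; the API key, image size, and output path are placeholders rather than the exact parameters of our collection script.

```python
import requests

API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"  # placeholder

def fetch_station_image(lat, lng, out_path):
    """Download one Street View image for a station's coordinates."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/streetview",
        params={"size": "640x640", "location": f"{lat},{lng}", "key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

# Example: illustrative coordinates for one station from the trip data.
fetch_station_image(41.8781, -87.6298, "station_example.jpg")
```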
Besides these three data sources, we also collected data from the City of Chicago Data Portal, US Census (2010), and Zillow to investigate the associations between the popularity of bike stations and various factors.
The following blog post is separated into three parts.
- In part one, we employed both inference (multivariate OLS regression) and prediction (deep learning models) methodologies to analyze Divvy historic trip data (April 2021).
- Inference: By compiling a station level dataset based on trip level data and seven other external datasets, we found interesting associations between station usage volume and factors including location, crime, socio-economic status, and demography.
- Prediction: Using three variations of deep learning models (baseline, wide and deep, deep and cross), we achieved limited progress in predicting the usage volume class of the end station based on start station and trip information. The deep and cross model consistently performs better than the other two models for this particular prediction task.
- In part two, we analyzed Google Street View images of bike station locations to attempt to discover factors separating rarely used stations from frequently used ones.
- In part three, we employed various computational content analysis techniques to investigate users’ opinions of the Divvy bike share service. We found that the service is appreciated for its convenience in helping riders explore the city of Chicago. Meanwhile, some users were dissatisfied with the payment process and occasional difficulties in checking bikes in and out of stations.
Part one: Investigating station usage volume using Divvy historic trip data
Overview of Divvy historic trip data (April 2021)
Figure P1-a: Distribution of station-to-station Divvy bike trips by date (April 2021)
Figure P1-b: Distribution of station-to-station Divvy bike trips by hour of the day (April 2021)
The April 2021 Divvy historic trip data records 337,230 trips in total, among which 298,207 trips were from station to station. Figure P1-a shows the distribution of Divvy bike trips by date. We can see that the number of bike trips spiked during weekends (highlighted in red boxes) and decreased on rainy days (highlighted in blue boxes). Figure P1-b shows the breakdown of bike trips by the hours of the day. It is clear that Divvy bike usage peaked during the afternoon hours (3pm to 7pm). In April 2021, 681 Divvy bike stations were used (to check in or check out a bike) at least once. Figure P1-c shows the top 10 most used Divvy Bike stations in April. As expected, all of them are located in the downtown area and close to the lakeshore.
Figure P1-c: Top 10 divvy bike stations with the highest usage volume (April 2021)
Section one: Inference
What factors caused some Divvy bike stations to be used heavily (7k+ times per month) and others to be hardly used at all? To answer this question, we compiled a station-level dataset from several data sources to explore the impact of six groups of factors on station usage volume in Chicago in April 2021. In the following section, we first introduce the six groups of hypotheses and describe how the relevant independent variables were collected. We then test these hypotheses using multivariate OLS regression models.
Hypotheses Group 1: location and purpose
It is likely that bike stations in certain locations mainly serve users with a particular purpose. For example, bike stations located near tourist attractions are mainly used for sightseeing. Stations along a long bike trail are used for exercising. Stations in residential areas are used for commuting. These locations and user objectives may impact the volume of station usage. Tourist sites are likely to attract more bike traffic than residential areas or long bike trails. Here are our hypotheses:
- H1a: A high proportion of riders paying one-time fees, which signals proximity to tourist sites, is positively associated with station usage volume.
- H1b: A high average trip distance, which signals proximity to bike trails, is associated with low station usage volume.
- H1c: A high proportion of morning or weekday usage, which signals proximity to residential areas, is negatively associated with station usage volume.
Table P1-1: Variables for Group 1 hypotheses
data source: Divvy system data (https://www.divvybikes.com/system-data) (April 2021)
variables | meaning |
total_count | usage volume (number of trips that used a Divvy bike station as the start station + end station) |
casual_p | percentage of trips (that used the station) that are paid via one-time check-out system |
average_distance | the average length (the Euclidean distance between the start station coordinates and end station coordinates) of trips that used the station |
weekday_p | percentage of trips (that used the station) that took place during the weekdays |
morning_p | percentage of trips (that used the station) with a start time recorded in the morning (6am to noon) |
evening_p | percentage of trips (that used the station) with a start time recorded in the evening (9pm to 5:59am) |
Table P1-1 lists the variables we collected to test these hypotheses. In the station-level dataset compiled from the Divvy historic trip data, every row represents a unique Divvy bike station that was used at least once in station-to-station trips in April 2021. ‘total_count’ is the outcome variable, which measures a station’s usage volume.
Predictor variables include payment process (‘casual_p’), distance (‘average_distance’) and usage time period (‘weekday_p’, ‘morning_p’, ‘evening_p’), all of which were collected from the Divvy historic trip data (April 2021). Figure P1-1 shows the distribution of stations by these variables.
Figure P1-1: Histograms of variables for Group 1 hypotheses (y=number of stations)
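As an illustration, the Group 1 predictors can be derived from the trip-level table roughly as follows. Column names follow the public Divvy export and are assumptions of this sketch; ‘evening_p’ follows the same pattern as ‘morning_p’.

```python
import numpy as np
import pandas as pd

trips = pd.read_csv("202104-divvy-tripdata.csv", parse_dates=["started_at"])
trips = trips.dropna(subset=["start_station_name", "end_station_name"])

# Trip-level features.
trips["distance"] = np.sqrt(
    (trips["start_lat"] - trips["end_lat"]) ** 2
    + (trips["start_lng"] - trips["end_lng"]) ** 2
)
trips["casual"] = (trips["member_casual"] == "casual").astype(int)
trips["weekday"] = (trips["started_at"].dt.dayofweek < 5).astype(int)
trips["morning"] = trips["started_at"].dt.hour.between(6, 11).astype(int)

# Count each trip once for its start station and once for its end station,
# then aggregate to the station level.
long = pd.concat([
    trips.assign(station=trips["start_station_name"]),
    trips.assign(station=trips["end_station_name"]),
])
station_level = long.groupby("station").agg(
    total_count=("station", "size"),
    casual_p=("casual", "mean"),
    average_distance=("distance", "mean"),
    weekday_p=("weekday", "mean"),
    morning_p=("morning", "mean"),
)
```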
Group 2: crime rates
Figure P1-2a: Top 10 crime categories by number of cases in Chicago (2020)
It is likely that bike stations located in areas with high crime rates were used less frequently than those in low-crime areas. We collected the Chicago crime record data for 2020 from the City of Chicago Data Portal. Figure P1-2a shows the top 10 crime categories by number of cases in Chicago (2020). Notice that, except for ‘deceptive practice’, the other nine categories are physical crimes that mostly take place on public streets. In contrast, deceptive practices, which include identity theft and financial fraud, represent white collar crimes that are most likely to take place in downtown office buildings. We therefore separately measure the number of each of these two types of crime that took place near each bike station in 2020. We hypothesize that:
- H2a: A high number of physical crime cases nearby is negatively associated with station usage volume.
- H2b: A high number of white collar crime cases nearby is positively associated with station usage volume.
Table P1-2 lists the two variables we collected to test these hypotheses. Figure P1-2 shows the distribution of stations by the two types of crime.
Table P1-2: Variables for Group 2 hypotheses
data source: Chicago Data Portal (Crimes – 2020) (https://data.cityofchicago.org/Public-Safety/Crimes-2020/qzdf-xmn8)
variables | meaning |
num_phys_crime | the number of crime cases (excluding ‘deceptive practice’) within 0.004 degrees (444m) in longitude and latitude of the station (in April 2020) |
num_wc_crime | the number of ‘deceptive practice’ crime cases within 0.004 degrees (444m) in longitude and latitude of the station (in the 2020 calendar year) |
Figure P1-2: Histograms of variables for Group 2 hypotheses (y=number of stations)
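A rough sketch of how the nearby-crime counts can be computed is shown below; the file paths and column names (e.g. ‘primary_type’) are assumptions based on the portal’s export format, not the exact code of our pipeline.

```python
import pandas as pd

stations = pd.read_csv("stations.csv")       # assumed columns: station, lat, lng
crimes = pd.read_csv("crimes_2020.csv").dropna(subset=["latitude", "longitude"])

def count_nearby(station, crime_df, box=0.004):
    """Count crime cases within +/- `box` degrees of latitude and longitude."""
    return (
        crime_df["latitude"].between(station.lat - box, station.lat + box)
        & crime_df["longitude"].between(station.lng - box, station.lng + box)
    ).sum()

wc = crimes[crimes["primary_type"] == "DECEPTIVE PRACTICE"]
phys = crimes[crimes["primary_type"] != "DECEPTIVE PRACTICE"]

stations["num_wc_crime"] = stations.apply(count_nearby, axis=1, crime_df=wc)
stations["num_phys_crime"] = stations.apply(count_nearby, axis=1, crime_df=phys)
```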
Group 3: Local supply and demand
We hypothesize that local Divvy bike station density and local population density act as the supply of and demand for each bike station. A large supply of nearby bike stations means each station is used less frequently, while a large demand for Divvy bikes means each station is used more often. Here are our hypotheses:
- H3a: High bike station density is negatively associated with the usage volume of each station.
- H3b: High population density is positively associated with the usage volume of each station.
Table P1-3 shows the two variables we collected. “num_bike_stations” counts the number of other Divvy bike stations within 0.008 degrees (888m) in longitude and latitude of the station. “population_density” measures the population density of the zipcode where the station is located; it is accessed via the “uszipcode” Python package, which sources its data from the 2010 US Census. Figure P1-3 shows the distribution of bike stations by the two variables.
Table P1-3: Variables for Group 3 hypotheses
variables | meaning | data source |
num_bike_stations | the number of other divvy bike stations within 0.008 degrees (888m) in longitude and latitude of the station (April 2021) | Divvy system data
(https://www.divvybikes.com/system-data) |
population_density | the population density of the zipcode where the station is located (2010 Census data) | uszipcode python package, (https://pypi.org/project/uszipcode) |
Figure P1-3: Histograms of variables for Group 3 hypotheses (y=number of stations)
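A minimal sketch of the zipcode lookup is below, assuming the “uszipcode” package’s SearchEngine API (details may vary across package versions); the station density count reuses the same bounding-box logic as the crime counts above, with a 0.008-degree box.

```python
from uszipcode import SearchEngine  # pip install uszipcode

search = SearchEngine()  # ships with 2010 Census statistics per zipcode

def population_density_at(lat, lng):
    """Look up the population density of the zipcode containing a coordinate."""
    result = search.by_coordinates(lat, lng, returns=1)
    return result[0].population_density if result else None

# Illustrative coordinates; real values come from the station-level dataset.
print(population_density_at(41.8781, -87.6298))
```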
Group 4: alternative public transportation
Chicago has 144 CTA rail stations (as of December 2018) and 10,847 bus stops (as of November 2020). These alternative public transportation facilities may have complex associations with bike stations’ usage volume. After controlling for population density, we suspect two conflicting effects: (1) Substitution effect: buses, trains, and public bike share likely serve similar purposes and compete with each other for passengers, so having many bus stops or rail stations nearby would reduce a bike station’s popularity. (2) Complement effect: public bike share is likely used for shorter trips than the bus or rail system, so passengers may often use the bike share service before or after commuting by bus or rail; having many bus stops nearby or being close to a rail station would then increase a bike station’s popularity. We hypothesize that the complement effect plays the stronger role:
- H4: A large number of nearby bus stops or close proximity to a rail station is associated with a high usage volume of the divvy bike station.
Table P1-4 lists the two relevant variables. “min_dis_rail_station” measures the Euclidean distance between the bike station and the nearest CTA rail station. “num_bus_stop” counts the number of bus stops within 0.002 degrees (222m) in longitude and latitude of the bike station. Both datasets were collected from the City of Chicago Data Portal. Figure P1-4 shows the distribution of stations by these two variables.
Table P1-4: Variables for Group 4 hypotheses
variables | meaning | data source |
min_dis_rail_station | the Euclidean distance between the bike station and the nearest CTA rail station (data updated on December 31 2018) | Chicago Data Portal (CTA – ‘L’ (Rail) Stations) (https://data.cityofchicago.org/Transportation/CTA-L-Rail-Stations-kml/4qtv-9w43) |
num_bus_stop | the number of bus stops within 0.002 degrees (222m) in longitude and latitude of the station (data updated on November 9th 2020) | Chicago Data Portal (CTA – Bus Stops) (https://data.cityofchicago.org/Transportation/CTA-Bus-Stops-kml/84eu-buny) |
Figure P1-4: Histograms of variables for Group 4 hypotheses (y=number of stations)
Group 5: socio-economic status
We suspect that the Divvy bike share service is mainly used by the middle class: residents of disadvantaged neighborhoods may not feel comfortable paying for public bike rides, while wealthier residents may prefer their own bikes or vehicles over public bikes. We therefore hypothesize an inverse U-shaped relationship between socio-economic status and usage volume:
- H5: As the average home value of a neighborhood increases, the usage volume of bike stations in the neighborhood first increases, then decreases.
Table P1-5 shows the variable, “average_home_value”, which measures the average home value of the zip code where the bike station is located. The variable comes from Zillow housing data recorded on April 30th, 2021. Figure P1-5 shows the distribution of stations by home value.
Table P1-5: Variables for Group 5 hypotheses
variables | meaning | data source |
average_home_value | the average home value (SFR, Condo/Co-op) of the zipcode where the station is located (April 30th, 2021) | Zillow Housing data (home value index) (https://www.zillow.com/research/data) |
Figure P1-5: Histograms of variables for Group 5 hypotheses (y=number of stations)
Group 6: demographics
Finally, it is likely that preferences for using the public bike share system differ by race and age group. We hypothesize that:
- H6a: The racial composition of the region affects the usage volume of the bike station located in the region.
- H6b: A larger proportion of young people in the region is positively associated with the usage volume of the bike stations located in the region.
Table P1-6 shows the demographic variables collected from the City of Chicago Data Portal (recorded in 2019). Figure P1-6 shows the distribution of bike stations by these regional demographics variables.
Table P1-6: Variables for Group 6 hypotheses
Data source: Chicago Data Portal (Chicago Population Counts) (2019)
(https://data.cityofchicago.org/Health-Human-Services/Chicago-Population-Counts/85cm-7uqa)
variables | meaning |
black_p | the percentage of population that self-identified as black in the zipcode where the station is located |
asian_p | the percentage of population that self-identified as asian in the zipcode where the station is located |
latinx_p | the percentage of population that self-identified as latinx in the zipcode where the station is located |
white_p | the percentage of population that self-identified as white in the zipcode where the station is located |
age18_29_p | the percentage of population with age between 18 and 29 in the zipcode where the station is located |
age30_39_p | the percentage of population with age between 30 and 39 in the zipcode where the station is located |
age40_49_p | the percentage of population with age between 40 and 49 in the zipcode where the station is located |
age50_59_p | the percentage of population with age between 50 and 59 in the zipcode where the station is located |
age65_p | the percentage of population with age greater than 65 in the zipcode where the station is located |
Figure P1-6: Histograms of variables for Group 6 hypotheses (y=number of stations)
Table P1-7 presents summary statistics for all variables, and Figure P1-7 shows the correlation heat map. A few observations: (1) the number of bike stations nearby is highly correlated with the number of white collar crimes nearby, suggesting that places with high values on both are likely in downtown Chicago; both variables are positively correlated with the outcome variable (total_count). (2) The proportion of white population is highly correlated with the proportion of young people (age 18-39) in an area; these variables are also positively correlated with the outcome variable.
Table P1-7: summary table of variables
count | mean | std | min | 25% | 50% | 75% | max | |
total_count | 681 | 875.79 | 1,083.94 | 1.00 | 62.00 | 431.00 | 1,335.00 | 7,109.00 |
casual_p | 681 | 0.48 | 0.22 | 0.00 | 0.32 | 0.40 | 0.58 | 1.00 |
average_distance | 681 | 0.02 | 0.01 | 0.00 | 0.02 | 0.02 | 0.03 | 0.07 |
weekday_p | 681 | 0.70 | 0.11 | 0.00 | 0.65 | 0.70 | 0.75 | 1.00 |
morning_p | 681 | 0.21 | 0.09 | 0.00 | 0.17 | 0.21 | 0.25 | 1.00 |
evening_p | 681 | 0.12 | 0.09 | 0.00 | 0.07 | 0.10 | 0.14 | 1.00 |
num_phys_crime | 681 | 17.58 | 13.47 | 0.00 | 8.00 | 15.00 | 24.00 | 75.00 |
num_wc_crime | 681 | 35.21 | 35.06 | 0.00 | 15.00 | 26.00 | 41.00 | 223.00 |
num_bike_stations | 681 | 7.34 | 7.33 | 0.00 | 2.00 | 5.00 | 10.00 | 35.00 |
population_density | 681 | 16,949.41 | 7,997.76 | 1,259.00 | 10,459.00 | 15,920.00 | 21,570.00 | 35,505.00 |
min_dis_rail_station | 681 | 0.01 | 0.02 | 0.00 | 0.00 | 0.01 | 0.01 | 0.11 |
num_bus_stop | 681 | 5.10 | 3.45 | 0.00 | 2.00 | 5.00 | 7.00 | 16.00 |
average_home_value | 662 | 368,576.08 | 150,504.76 | 86,191.00 | 219,731.00 | 379,250.00 | 503,115.00 | 662,782.00 |
black_p | 664 | 0.30 | 0.33 | 0.01 | 0.05 | 0.15 | 0.56 | 0.95 |
asian_p | 664 | 0.10 | 0.10 | 0.00 | 0.03 | 0.07 | 0.14 | 0.39 |
latinx_p | 664 | 0.15 | 0.16 | 0.01 | 0.06 | 0.08 | 0.18 | 0.83 |
white_p | 664 | 0.42 | 0.27 | 0.01 | 0.15 | 0.46 | 0.64 | 0.82 |
age18_29_p | 664 | 0.25 | 0.08 | 0.12 | 0.18 | 0.24 | 0.30 | 0.47 |
age30_39_p | 664 | 0.20 | 0.08 | 0.11 | 0.14 | 0.20 | 0.24 | 0.46 |
age40_49_p | 664 | 0.12 | 0.02 | 0.07 | 0.11 | 0.12 | 0.13 | 0.16 |
age50_59_p | 664 | 0.11 | 0.03 | 0.06 | 0.08 | 0.11 | 0.12 | 0.17 |
age65_p | 664 | 0.11 | 0.04 | 0.01 | 0.08 | 0.11 | 0.14 | 0.21 |
Figure P1-7: Variable Correlation heat map
Results
To ensure interpretability, we employ multivariate OLS regression models to test our hypotheses. Because the distributions of most variables are highly skewed, we applied a log transformation to all variables, including the outcome variable, for consistent interpretation. Table P1-8 shows the results of two regression models. Model (1) contains all variables discussed in the six hypothesis groups. Model (2) is the final model, which includes only variables with statistically significant associations (p<0.1) with the outcome. Both models have adjusted R-squared values of about 0.75, meaning the predictor variables explain about 75% of the variation in Divvy bike station usage volume.
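A sketch of how model (1) can be fit with statsmodels is shown below. The handling of zero values before the log transform and the use of a squared home-value term to test H5’s curvature are assumptions of this sketch, not necessarily the exact specification behind Table P1-8.

```python
import numpy as np
import statsmodels.formula.api as smf

# `df` is the station-level dataset summarized in Table P1-7.
eps = 1e-3  # small constant so zero-valued observations survive the log
logged = df.select_dtypes("number").apply(lambda s: np.log(s + eps))

formula = (
    "total_count ~ casual_p + average_distance + weekday_p + morning_p + evening_p"
    " + num_phys_crime + num_wc_crime + num_bike_stations + population_density"
    " + min_dis_rail_station + num_bus_stop"
    " + average_home_value + I(average_home_value ** 2)"   # quadratic term for H5
    " + black_p + asian_p + latinx_p + white_p"
    " + age18_29_p + age30_39_p + age40_49_p + age50_59_p + age65_p"
)
model = smf.ols(formula, data=logged).fit()
print(model.summary())
```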
Table P1-8: OLS regression results
Group 1:
The result from model (1) provides no support for hypothesis H1a. It provides support for H1b, and suggests an opposite effect for H1c.
The coefficient for log(casual_p) is not statistically significant (p>0.1). There are three possible explanations: (a) a high one-time payment percentage does not indicate that the station is located near tourist attractions, (b) being located at tourist sites does not lead to high bike usage volume, or (c) because the data were recorded during the COVID-19 pandemic (April 2021), tourist sites did not attract as many visitors as in pre-pandemic times.
The coefficient for log(average_distance) is -0.24 and statistically significant (p<0.01). This means that, controlling for other factors, a 1 percent increase in the average distance of trips that use the station (in the Divvy bike station distribution) is associated with a 0.24% reduction in station usage volume.
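For readers less familiar with this interpretation: because both sides of the regression are in logs, each coefficient is (approximately) an elasticity, e.g.

```
\log(\text{total\_count}) = \alpha + \beta \log(\text{average\_distance}) + \dots
\quad\Rightarrow\quad
\beta \approx \frac{\%\,\Delta\,\text{total\_count}}{\%\,\Delta\,\text{average\_distance}}
```

so β = -0.24 reads as “a 1% higher average trip distance goes with roughly a 0.24% lower usage volume, other factors held constant.”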
Controlling for other factors, a 1 percent increase in weekday usage (in the divvy bike station distribution) is associated with a 0.35% increase in station usage volume, and a 1 percent increase in morning usage (in the divvy bike station distribution) is associated with a 0.17% increase in station usage volume (p<0.01). These results suggest an opposite effect from what we hypothesized in H1c. A possible explanation is that stations mainly used for commute or daily errands are associated with a higher volume of usage.
Furthermore, a one percent increase in the evening (9pm to 6am) usage (in the divvy bike station distribution) is linked with a 0.08% increase in station usage volume (p<0.01). It may be because stations that are used often in the evening are located in areas with good lighting, most likely in geographically important locations.
Group 2
The results in model (1) provide strong support for H2a and H2b. Controlling for other factors, a one percent increase in physical crime cases nearby (in the divvy bike station distribution) is associated with a 0.14% reduction in station volume. A one percent increase in white collar crime cases nearby is associated with a 0.15% increase in station volume (p<0.01).
Group 3
Results in model (1) provide support for H3b and suggest an effect opposite to H3a. Controlling for other factors, a one percent increase in the population density of the zipcode in which the station is located is linked to a 0.36% increase in bike station usage volume (p<0.01). Meanwhile, a one percent increase in the number of other Divvy bike stations nearby is also associated with a 0.04% increase in bike station usage volume (p<0.1). A network effect may explain this positive link: having more bike stations nearby makes it more convenient for riders to use Divvy bikes for short trips. The network effect may also explain H1b, where stations with high average trip distances, which signals that they are far from other stations (an absence of the network effect), are associated with low usage volume.
Group 4:
Results in model (1) do not support H4. Controlling for other factors, including population density, having rail stations or bus stops nearby is not linked to the popularity of bike stations (p>0.1). It may be that the complement and substitution effects cancel each other out. Further investigation is needed to understand how and where these interactions take place.
Group 5:
Results in model (1) show an effect opposite to H5. In H5, we expected an inverse U-shaped curve, suggesting that stations in middle-class zip code areas are used more often than stations in wealthy or disadvantaged zip code areas. To our surprise, the result suggests a U-shaped curve instead: as the average home value in a zip code area increases, the usage volume of bike stations first decreases, then increases (p<0.05) (see Figure P1-8). So far, we have not found a convincing explanation.
Figure P1-8: results on home price
Group 6:
Results in model (1) support both H6a and H6b. Preferences for using public bike share vary with the racial composition of the area. Controlling for other factors, a one percent increase in the proportion of Asian population in the zip code area where the bike station is located (in the Divvy bike station distribution) is associated with a 0.1% increase in bike station usage volume (p<0.05). A one percent increase in the proportion of Latinx population is associated with a 0.17% decrease in usage volume (p<0.01). A one percent increase in the proportion of white population is associated with a 0.35% increase in usage volume (p<0.01). Changes in the proportion of Black population are not associated with bike station usage volume (p>0.1).
A greater proportion of young people in an area is also associated with higher public bike share usage. Controlling for other factors, a one percent increase in the proportion of the population aged 18 to 29 in the zip code area where the bike station is located (in the Divvy bike station distribution) is associated with a 0.88% increase in bike station usage volume (p<0.01). A one percent increase in the proportion of the population aged 30 to 39 is linked to a 1.64% increase in bike station usage volume (p<0.01).
Table P1-9: Summary of the hypothesis testing results
variable (in log scale) | hypothesized association with station usage volume | whether supported by regression results | |
H1a | proportion of casual riders (one time payment) | positive | No |
H1b | average trip distance | negative | Yes |
H1c | proportion of weekday or morning trips | negative | Opposite (positive) |
H2a | number of physical crime cases nearby | negative | Yes |
H2b | number of white collar crime cases nearby | positive | Yes |
H3a | number of divvy bike stations nearby | negative | Opposite (positive) |
H3b | population density nearby | positive | Yes |
H4 | distance to the nearest rail station, number of bus stops nearby | positive | No |
H5 | average home value nearby | inverse U shaped curve | Opposite (U shaped curve) |
H6a | racial composition of population nearby | an association exists | Yes (positive for Asian and white, negative for Latinx, no association for Black) |
H6b | proportion of young people nearby | positive | Yes |
Section two: Prediction
Given the start station of a bike trip and its trip information, how well can we predict whether the trip will end at a station with high or low usage volume? To answer this question, we trained deep learning models on the April 2021 trip-level records from the Divvy system data. We grouped the end stations into eight classes based on usage volume, and found that the best-performing model (the deep and cross model) predicts the end station class of trips with an accuracy of 39.3 percent.
Table P2-1: Distribution of divvy bike stations by usage volume (April 2021)
station usage volume range | number of stations | class assignment |
fewer than 1K | 458 | 0 |
1K-2K | 124 | 1 |
2K-3K | 60 | 2 |
3K-4K | 26 | 3 |
4K-5K | 7 | 4 |
5K-6K | 4 | 5 |
6K-7K | 1 | 6 |
7K+ | 1 | 7 |
The April 2021 Divvy trip record contains 298,207 station-to-station trips. Table P2-1 shows the distribution of stations by usage volume; based on this table, we grouped the stations into eight classes. Table P2-2 shows the distribution of trips by end station class, which is the outcome variable we aim to predict. Class “1” contains the largest percentage of trips (29.8%). This is the baseline accuracy that our models should exceed, because a model that always returns class “1” would achieve an accuracy of 29.8%. Table P2-3 and Figure P2-1 show the cross-table percentages and heat map between the distributions of start station class and end station class for all trips. They suggest that trips starting at any station class are most likely to end either at the same station class or at classes 1, 2, and 3.
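As a small implementation sketch (the exact treatment of boundary values is an assumption), the class assignment of Table P2-1 and the per-trip prediction target can be derived with pandas as follows:

```python
import pandas as pd

# `total_count` is a Series of station usage volumes indexed by station name.
bins = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, float("inf")]
station_class = pd.cut(total_count, bins=bins, labels=range(8), right=False)

# Attach the end station's class to every trip as the prediction target.
trips["end_station_class"] = trips["end_station_name"].map(station_class).astype(int)
```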
Table P2-2: Distribution of trips (April 2021) by the class of end station
class | number of trips | proportion of all trips |
0 | 60224 | 20.2% |
1 | 88984 | 29.8% |
2 | 72769 | 24.4% |
3 | 43600 | 14.6% |
4 | 15412 | 5.2% |
5 | 10514 | 3.5% |
6 | 3250 | 1.1% |
7 | 3454 | 1.2% |
total | 298207 | 100% |
Table P2-3: Cross table between the classes of start and end stations for trips (April 2021)
class of end station | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
class of start station | ||||||||
0 | 50.1% | 27.4% | 13.2% | 6.1% | 1.8% | 0.9% | 0.2% | 0.3% |
1 | 18.4% | 38.6% | 22.5% | 12.7% | 4.0% | 2.5% | 0.5% | 0.7% |
2 | 11.1% | 27.2% | 33.9% | 16.3% | 5.7% | 3.9% | 1.0% | 1.0% |
3 | 8.7% | 26.4% | 26.8% | 25.1% | 6.5% | 4.4% | 1.1% | 1.0% |
4 | 7.1% | 22.6% | 27.4% | 18.3% | 15.6% | 5.9% | 1.9% | 1.3% |
5 | 5.5% | 21.4% | 26.8% | 19.3% | 7.7% | 16.0% | 1.5% | 1.8% |
6 | 2.8% | 12.9% | 20.4% | 15.5% | 12.0% | 7.4% | 23.2% | 5.8% |
7 | 3.4% | 15.7% | 21.2% | 13.2% | 7.6% | 5.9% | 6.5% | 26.3% |
Figure P2-1: Heat map between the classes of start and end stations for trips (April 2021)
For the prediction task, the outcome variable is the trip’s end station class. The predictor variables include 6 numerical variables (the latitude and longitude of the start station, the usage volume of the start station in April 2021, the date and hour of the start time, and the duration of the trip), and 3 categorical variables (e-bike vs classics, one-time payment vs membership, and the unique station id of the start station). Table P2-4 summarizes the variables involved in the prediction task.
Table P2-4: Variables in the prediction task
variable | type |
the class of a trip’s end station | categorical (outcome) |
start station latitude | numeric |
start station longitude | numeric |
usage volume of start station (April 2021) | numeric |
date of trip start time | numeric |
hour of trip start time | numeric |
duration of the trip (minutes) | numeric |
bike type (classics vs ebike) | categorical |
start station unique id | categorical |
payment type (member vs one time) | categorical |
Table P2-5: Hyperparameters
learning rate | 0.001 |
dropout rate | 0.1 |
batch size | 265 |
number of epochs | 10 to 50 |
hidden units | 32, 32 |
We employed three deep learning prediction models: a baseline model, a wide and deep model, and a deep and cross model, following the Keras example (https://keras.io/examples/structured_data/wide_deep_cross_networks/). Table P2-5 lists the hyperparameters used in all three models. We experimented with the “number of epochs” parameter ranging from 10 to 50. We trained each model on 70% of the trip data and evaluated its performance on the remaining 30% test set. Figure P2-2 shows the classification results on the test sets. Overall, the deep and cross model performs slightly better than the other two model types: it achieves the highest test accuracy of 39.3%, about 10 percentage points above the 29.8% baseline. Figures P2-3, P2-4, and P2-5 show the layers of the three model types. We are hopeful that by fine-tuning other hyperparameters we can improve the prediction accuracy in the future.
Figure P2-2: prediction results of three deep learning models
Figure P2-3: Layers of the baseline model
Figure P2-4: Layers of the wide and deep model
Figure P2-5: Layers of the deep and cross model
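For concreteness, here is a simplified sketch of the baseline model with the Table P2-5 hyperparameters. The feature/label objects and the one-hot encoding of categorical predictors are assumptions of this sketch; the wide and deep and deep and cross variants follow the linked keras.io example instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# `features` holds the Table P2-4 predictors; `labels` holds the end station class (0-7).
X = pd.get_dummies(
    features, columns=["bike_type", "payment_type", "start_station_id"]
).astype("float32")
X_train, X_test, y_train, y_test = train_test_split(
    X.values, labels.values, test_size=0.3, random_state=0
)

model = keras.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(8, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(X_train, y_train, batch_size=265, epochs=50,
          validation_data=(X_test, y_test))
```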
Part two: predicting station usage volume using Google Street View images
The next segment of our project was to work with Google Maps image data to predict station popularity. From our tabular data we knew that some bike locations were significantly more popular than others. While many factors influence the popularity of a given locale, we would have been remiss not to examine whether any geographical predictors could be gathered from image data. For example, would more open spaces be more popular? This is easy to envision, since they should be fairly easy to bike through. However, we could also imagine that more congested spaces with significantly more traffic would be more popular locations for transit in general (because of base rates), and so would also see higher usage. In the best-case scenario, a machine learning model would be able to differentiate low-popularity sites from high-popularity sites, and we would be able to identify specific features of the high-popularity sites that make them easier to reach by bike.
As you might recall, the distribution of Divvy Bike stations is extremely right-skewed. This makes logical sense, as one would expect that there are significant differences in both the location and quality of bike stations which may impact people’s decisions to use them.
Figure G1: Distribution of divvy bike stations by usage volume (April 2021)
We decided to use a classifier to try to distinguish between sparsely used, moderately used, and frequently used stations. We set the cut-offs so that each category contained a roughly equal number of images: sparsely used stations had fewer than 120 uses in April, moderately used stations had between 120 and 900 uses, and frequently used stations had more than 900 uses during the same month. (The most frequently used station was used 7,109 times!) There are 213 stations in the first category, 230 in the second, and 238 in the third. While the category boundaries are arbitrary (chosen to give a large enough sample size for training), this shouldn’t be an issue if neural networks can make meaningful distinctions between these categories (i.e., achieve fairly high accuracy and low loss).
We took two main approaches. The first was to build our own convolutional neural network in Keras. The second was to use powerful models pretrained on ImageNet in an attempt to make more accurate classifications.
The details of the model we built from scratch can be seen in full in the annotated appendix. For the purposes of this section, the relevant details are that we used Keras to implement a relatively basic convolutional neural network. We used early stopping to make sure training didn’t end until the model had truly stopped learning: the network keeps track of its lowest validation loss so far, and once the validation loss has failed to beat that best value for three consecutive epochs (a signal that further training is no longer improving), training stops and the model reverts to its most accurate state. In this run, the model stopped after 44 epochs, which means the best version of the model was at epoch 41. This will not be consistently true; when run at different times, the model should stop after slightly different epoch numbers.
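A condensed sketch of this setup is below; the image size, filter counts, and dataset objects are placeholders (e.g. datasets built with keras.utils.image_dataset_from_directory), and the full annotated version is in the appendix.

```python
from tensorflow import keras
from tensorflow.keras import layers

# `train_ds` and `val_ds` are image datasets with the 3 usage-volume classes.
model = keras.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Rescaling(1.0 / 255),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),   # second Conv2D layer
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),     # sparse / moderate / frequent
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping as described above: stop after three epochs without a new best
# validation loss and revert to the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
history = model.fit(train_ds, validation_data=val_ds,
                    epochs=100, callbacks=[early_stop])
```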
Below is the graph of the model’s loss on the validation set during each epoch. It doesn’t really begin learning until around epoch 18, at which point it starts making rapid improvements until the stopping point. An interesting characteristic is the sudden jump in validation loss at around epoch 37, which is likely due to chance. Another interesting result is that, while the model stopped because the validation loss stopped declining after epoch 41, the training loss began to decrease again near the end. We don’t think this is a cause for concern, because we trained this model several times and each run ended at an epoch in the mid-forties.
Figure G2: Basic model loss
Below is the graph of the model’s accuracy over time. Accuracy on the training set and the validation set stay close throughout, which is a good sign that we’re observing a real differentiation rather than overfitting. The accuracy follows a similar pattern to the loss: it stays level until beginning to improve around epoch 20, but it levels off at about epoch 30, a little earlier than the loss does. Had we been optimizing for accuracy instead of loss, the model would likely have finished training much sooner.
Figure G3: basic model accuracy
The best validation loss of this model was 0.9768, with a corresponding validation accuracy of 0.4747. Had we optimized for maximizing accuracy, the best validation accuracy would have been 0.5091. On the test set, the model’s loss was 1.015 and its accuracy was 0.4833.
This performs significantly better than chance (which would give an accuracy of around 0.33), so it stands to reason that our model finds some features of the data to use in its classification decisions. We extracted the weights of its second Conv2D layer in an attempt to catch a glimpse of the prototypical images for each of the three categories.
Sparsely Used Image:
Figure G4: Sparsely used image
Moderately Used Image:
Figure G5: Moderately used image
Frequently Used Image:
Figure G6: Frequently used image
Unfortunately, these representations all look extremely similar. We have two possible hypotheses for this. One is that our network might simply not be very good, and a network able to make these distinctions with an accuracy around 0.7 or 0.8 might show more differentiation. The other is that Google street imagery is generally very similar: there are plenty of buildings everywhere. It could be that while there are subtle differences the network is able to pick up on, those differences are difficult to make discernible to the human eye.
Nevertheless, it would be remiss of us not to at least attempt to make inferences from these images. This may not be accurate, since the naked eye isn’t great at differentiating very similar images, but each image appears progressively slightly brighter, with the sparsely used image as the darkest and the frequently used image as the brightest. One theory is that this could have to do with sunlight access: more open spaces have more access to sunlight and are also more easily reachable by bike, while extremely closed-in spaces, which are difficult and less than ideal to navigate by bicycle, tend to be darker, even during the day. Sadly, with the information we’ve gathered thus far, it’s impossible to make stronger hypotheses.
The second phase of our work with image models was to apply models pretrained on ImageNet to our data, also known as fine-tuning. These are much more powerful models, which had the benefit of far more computational resources than we did, and are very likely to give more accurate predictions than our own network. Our initial model was ResNet, and our implementation can be seen in full in the annotated appendix. We didn’t use early stopping because we didn’t have time to work out how to implement it with a pretrained network; instead, we chose 40 epochs in order to have a fair comparison with the model we trained ourselves.
This process resulted in a model with 0.547 accuracy and 1.077 loss on the validation set, discernibly better than the model we trained ourselves. Unfortunately, we were unable to plot this model’s learning over time or extract its weights for visualization. However, it makes sense that it performed better than ours: besides building on ImageNet pretraining, it also used a decaying learning rate, which is a better way of performing gradient descent.
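Below is a sketch of the fine-tuning setup, assuming ResNet50 with ImageNet weights and an exponential learning-rate decay; the schedule values, image size, and dataset objects are placeholders rather than our exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# `train_ds` / `val_ds` are image datasets with the 3 usage-volume classes.
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet.preprocess_input(inputs)
x = base(x)                                  # all ResNet layers remain trainable
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(3, activation="softmax")(x)
model = keras.Model(inputs, outputs)

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=1000, decay_rate=0.9)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=40)
```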
Here’s an example of an image and the prediction it made. Group three here corresponds to the high-frequency group.
Figure G7: predicted image 1
Our second pretrained approach used the same ResNet model but froze the entire network except for the last layer, using it as a fixed feature extractor. We also trained this model for 40 epochs to have a fair point of comparison. Its best validation loss and accuracy were 1.0035 and 0.5766, respectively. This is a significantly stronger performance than our original model and slightly stronger than our previous fine-tuning attempt, though it’s unclear whether that difference is significant or whether this variant just happened to perform slightly better in this particular run. With more time, we would have liked to examine the features the model found for the three categories to see if the distinctions it makes are more discernible to the human eye than those of our own model.
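The fixed-feature-extractor variant differs from the sketch above only in freezing the pretrained backbone so that only the final classification layer is trained (again a sketch, not our exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                       # freeze every pretrained layer

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet.preprocess_input(inputs)
x = base(x, training=False)                  # keep batch-norm statistics fixed
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(3, activation="softmax")(x)   # only this layer is trained
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=40)
```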
Here’s an example of an image, and the prediction it made. This image was also predicted to be in group three.
Figure G8: predicted image 2
Part Three: Exploring Divvy users’ opinions using Yelp reviews
To extend beyond predicting the usage volume of stations, we were interested in learning about riders’ thoughts and feedback on the Divvy bike share service. To do so, we web-scraped Divvy’s Yelp reviews (n=322) posted since 2013. Figure Y1 shows the number of Divvy Yelp reviews by year: the number of reviews peaked in 2014 and gradually declined afterward. Figure Y2 shows the distribution of reviews by rating. Most reviews are either 1 star or 5 stars. This extreme rating distribution is likely caused by selection bias: users motivated to leave a review on Yelp either really enjoyed the service or were dissatisfied with it.
Figure Y1: Numbers of Divvy bike share service review on Yelp by year (2013- May 2021)
Figure Y2: Rating distribution of Divvy bike sharing service on Yelp
Figures Y3 and Y4 apply dimensionality reduction techniques (t-SNE and UMAP) to visualize the text content of the reviews. Both figures show that the positions of reviews differ noticeably by rating. In Figure Y3, high-rating reviews are concentrated in the bottom half of the figure, while low-rating reviews are concentrated at the top. In Figure Y4, high-rating reviews are concentrated on the left side of the figure and low-rating reviews on the right.
Figure Y3: TSNE Projection of Yelp reviews by ratings
Figure Y4: UMAP Projection of Yelp reviews by ratings
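The projections in Figures Y3 and Y4 can be reproduced in spirit with scikit-learn and umap-learn; the TF-IDF representation used here is an assumption, and the exact text features behind the figures may differ.

```python
import matplotlib.pyplot as plt
import umap  # umap-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# `reviews` is a list of review texts; `ratings` the matching star ratings (1-5).
X = TfidfVectorizer(stop_words="english", max_features=2000).fit_transform(reviews)

tsne_xy = TSNE(n_components=2, random_state=0).fit_transform(X.toarray())
umap_xy = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

plt.scatter(tsne_xy[:, 0], tsne_xy[:, 1], c=ratings, cmap="RdYlGn")
plt.colorbar(label="star rating")
plt.title("t-SNE projection of Yelp reviews")
plt.show()
```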
Since one- and five-star reviews differ significantly in textual content, what topics do they cover? Figures Y5 and Y6 zoom in on the keywords of five-star reviews. Figure Y5 shows that the top 20 keywords for five-star reviews (after filtering out common English stop words) include “membership”, “easy”, “city”, and “dock”. Figure Y6 highlights keywords such as “quick” and “navigate”. From these keywords, we infer that riders gave the Divvy bike share service top ratings for its convenience, both in the ease of checking bikes in and out at station docks and in its usefulness for exploring the city of Chicago.
Figure Y5: Top 20 most frequently used words (5 star reviews)
Figure Y6: Visualizing word embeddings (5 star reviews)
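The keyword lists in Figures Y5 and Y7 come down to simple frequency counts after stop word removal; a sketch with scikit-learn’s CountVectorizer, applied here to the five-star subset, is below.

```python
from sklearn.feature_extraction.text import CountVectorizer

# `five_star_reviews` is the list of 5-star review texts; the same routine
# applied to the 1-star subset yields the Figure Y7 list.
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(five_star_reviews)

freqs = counts.sum(axis=0).A1                # total count of each word
top20 = sorted(zip(vec.get_feature_names_out(), freqs),
               key=lambda pair: pair[1], reverse=True)[:20]
print(top20)
```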
Figures Y7 and Y8 highlight the keywords of one-star reviews. Figure Y7 lists the top 20 most frequently used words in one-star reviews (after filtering out common English stop words), including “customer service”, “charged”, and “pass”. Figure Y8 highlights words such as “secur”, “night”, and “unlock”. It seems that most of the one-star reviewers’ frustrations come from poor customer service, issues with charges, and difficulty unlocking bikes.
Figure Y7: Top 20 most frequently used words (1 star reviews)
Figure Y8: Visualizing word embeddings (1 star reviews)
Access all data and code here:
https://drive.google.com/drive/folders/13Pnae4vdX_XpYcN2u0cCtTQJt2rufmKF?usp=sharing