Chicago Bikeshare Trip Projections
BACKGROUND
In 2013, the City of Chicago launched Divvy, a new bike share system designed to connect people to transit and to make short trips across town easy. While usage is growing, the system faces a major problem: unbalanced bike flows frequently leave stations empty or full, making the bikeshare unreliable to use. This is a major contributor to the loss of potential customers.
Currently, Divvy staff drive vans to reallocate bikes based on real-time availability, but this usually lags behind demand because of the time it takes to drive across the city, especially during rush hours. Furthermore, research has shown that rebalancing is the single largest operating expense for bikeshare systems, costing up to 1,944 dollars per bike annually.
To improve the effectiveness of rebalancing operations and minimize operating costs, operators need to know where and when imbalance is likely to occur so that rebalancing can be planned in advance. There are already successful precedents for this approach. For example, the bikeshare program in Moscow deployed a predictive model to forecast starting and ending trips at bike stations and used the predictions as input for rebalancing decisions, an approach reported to have increased bikeshare trips by 50%.
This project aims to develop a valid and transferable model that projects bikeshare departure trips in the next hour. It could serve as an initial demo for a later, fully functional application on mobile devices such as phones and tablets. While this project only shows how to predict departures, the same mechanism is applicable to predicting arrivals. By combining predictions of departures and arrivals, the imbalance across space and the availability of bikes at each station in the next hour could be predicted and fed into the rebalancing decision-making process.
METHODOLOGY
This project developed a linear regression model for prediction. In the process, we first included a set of candidate predictors in the model and selected the final predictors based on their predictive power, as indicated by the test statistics. To test whether the model generalizes across time, we trained it on one week of bikeshare trip data and used the resulting model to predict bike trips in the following week. If the model's goodness of fit is comparable for the training week and the test week, the model is generalizable across time. We also checked whether the model's predictive power is consistent across space by examining whether there is spatial autocorrelation in the prediction errors.
Data Wrangling
The data used for this prediction include bike trip data from Divvy and station area characteristics from the ACS and the Chicago Open Data Portal. These data were processed in both R and ArcGIS. The model was built on 147 stations near downtown Chicago for the second week of June 2017, because rebalancing demand is greater there due to the prominent commuting patterns on warm days, as shown in the two graphics below.
We therefore summarized departures by station, hour, and date for that week and added zeros for station hours with no departures to produce a complete dataset. We did the same for the following week to build the test set.
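A minimal sketch of that aggregation in R, assuming a trips data frame with hypothetical station_id and start_time columns:

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Count departures per station, date, and hour, then fill missing
# station-hours with zero so every station has a full hourly record.
hourly_departures <- trips %>%
  mutate(date = as.Date(start_time), hour = hour(start_time)) %>%
  group_by(station_id, date, hour) %>%
  summarise(departures = n(), .groups = "drop") %>%
  complete(station_id, date, hour = 0:23, fill = list(departures = 0))
```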
As for predictors, we included four types of variables: 1) spatially and temporally lagged hourly departures, 2) socio-economic and bike network characteristics of station areas, 3) distance to, and average ridership of, the nearest transit station for each bike station, and 4) time of departure.
For the first type, we assumed that demand is usually concentrated in certain time periods and locations. Therefore, we included four lag variables: departures in the previous hour, departures a week earlier during the same hour, the nearest neighbor's departures in the previous hour, and the nearest neighbor's departures a week earlier during the same hour. To demonstrate the correlation between the original hourly departures and the spatial and temporal lags, we can randomly select one bike station and plot the historical departures for itself and its nearest neighbor on a randomly selected day.
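A sketch of how these lags might be constructed, assuming the hourly_departures table above spans at least two consecutive weeks and a hypothetical nearest_neighbor lookup table with columns station_id and nn_id:

```r
library(dplyr)

# Each station's own lags: previous hour and same hour one week earlier
dep <- hourly_departures %>%
  arrange(station_id, date, hour) %>%
  group_by(station_id) %>%
  mutate(
    lag_hour = lag(departures, 1),
    lag_week = lag(departures, 24 * 7)
  ) %>%
  ungroup()

# Nearest neighbor's lags: join the neighbor's lagged series onto each station
dep <- dep %>%
  left_join(nearest_neighbor, by = "station_id") %>%
  left_join(
    dep %>% select(nn_id = station_id, date, hour,
                   nn_lag_hour = lag_hour, nn_lag_week = lag_week),
    by = c("nn_id", "date", "hour")
  )
```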
The last three types of independent variables are all theoretically related to bike trip demand. These variables were processed with ArcGIS spatial joins and near tables, then summarized and joined in R. As shared bikes have become a mode that serves the first and last mile of public transit trips, bike stations located near rail stations generally have higher hourly departures. Because biking complements transit, bike trips should be correlated not only with the distance to public transit but also with the ridership of the nearby transit station.
Finally, the time of day and day of week can also influence the ridership of bike stations. As shown in the fifth and sixth plots, travel patterns differ between weekdays and weekends, and there are clearly more departures during peak hours.
Model Building
With our dataset ready, we can now build the OLS regression. First, note that our dependent variable is not exactly normally distributed, with or without a log transformation. Given that we have more than 20,000 observations, this is acceptable. For ease of interpretation, we chose hourly departures (without log transformation) as the dependent variable. For replication, you can choose model forms other than OLS.
Before we start, we can take a look at the summary statistics and correlation table of all the potential continuous variables in our model.
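For instance, something like the following, with hypothetical column names standing in for the actual predictors:

```r
library(dplyr)

cont_vars <- dep %>%
  select(departures, lag_hour, lag_week, nn_lag_hour, nn_lag_week,
         emp_density, dist_to_rail, rail_ridership)

summary(cont_vars)                                # summary statistics
round(cor(cont_vars, use = "complete.obs"), 2)    # pairwise correlation table
```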
To select the most significant and influential predictors and obtain a refined model, we can use stepwise regression, which iteratively adds and removes candidate predictors according to how much each contributes to the model. Here we use bidirectional elimination, which tests variables for both inclusion and exclusion at each step. A summary of the three kinds of stepwise models is provided below.
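A sketch of fitting the three stepwise variants with base R's step(), where model_data is assumed to hold the dependent variable and all candidate predictors (names are placeholders):

```r
full <- lm(departures ~ ., data = model_data)   # all candidate predictors
null <- lm(departures ~ 1, data = model_data)   # intercept only

step_fwd  <- step(null, scope = list(lower = null, upper = full),
                  direction = "forward", trace = FALSE)
step_back <- step(full, direction = "backward", trace = FALSE)
step_both <- step(null, scope = list(lower = null, upper = full),
                  direction = "both", trace = FALSE)

summary(step_both)
```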
According to the table, the three regressions have almost the same R-squared values. Not all of the coefficients in the forward selection model are significant, while the backward and bidirectional stepwise regressions produce the same results, with most of the variables significant. Therefore, we decided to build our final model on the results of the latter two regressions.
Based on the coefficient table, the percentage of households with no vehicle and the number of rail stations near each bike station are not particularly significant. Therefore, we ran ANOVA tests to see whether adding these variables increases the explanatory power of our model. The tests suggested that only the number of rail stations significantly improves the explanatory power of our model.
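The nested-model comparisons can be run with anova(), for example (variable names are placeholders):

```r
# Does adding each borderline variable improve the stepwise model?
anova(step_both, update(step_both, . ~ . + pct_no_vehicle))
anova(step_both, update(step_both, . ~ . + n_rail_stations))
```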
We then added the number of rail stations to the model and compared its generalizability with that of the original stepwise model. We drew 75% of our data to fit a model, predicted for the remaining 25%, and calculated the standard deviation of the absolute error (|predicted - observed|). By running this several times and comparing the results with and without the number of rail stations in the model, we found that the model without it generally has lower absolute error, suggesting higher generalizability. Since our explanatory power is already high, we chose the model with better generalizability. Our final model is shown below.
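One iteration of that comparison might look like the following; repeating it several times and comparing the error summaries for the two candidate formulas reproduces the check described above:

```r
set.seed(123)
train_idx <- sample(nrow(model_data), size = 0.75 * nrow(model_data))
train   <- model_data[train_idx, ]
holdout <- model_data[-train_idx, ]

fit  <- lm(formula(step_both), data = train)   # or the formula with n_rail_stations added
pred <- predict(fit, newdata = holdout)

abs_error <- abs(pred - holdout$departures)
c(mean = mean(abs_error), sd = sd(abs_error))
```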
As shown in the results, the adjusted R-squared is 0.702, which means our model explains about 70% of the variation in hourly departures at each station across time. To make sure that our model is not affected by multicollinearity, we checked the variance inflation factors (VIF). No predictor has a VIF greater than 5, indicating little to no multicollinearity.
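One way to run this check is with the car package, assuming the final fit is stored as final_model:

```r
library(car)

sort(vif(final_model), decreasing = TRUE)   # flag any predictor above 5
```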
We can also plot the predicted hourly departures as a function of observed hourly departures to visualize the relationship between predicted and observed values.
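A minimal version of that plot with ggplot2, using the model's fitted values (column names follow the earlier placeholders):

```r
library(ggplot2)

model_data$predicted <- predict(final_model)

ggplot(model_data, aes(x = departures, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "red") +  # perfect-prediction line
  labs(x = "Observed hourly departures", y = "Predicted hourly departures")
```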
The plot shows that the predicted results line up closely with the observed results. The only weakness is that our model underpredicts stations with extremely high departures. Given how rarely these values occur, this is not especially problematic.
But what are the most influential predictors? We can generate a bar chart of the absolute standardized coefficients of the independent variables to visualize their importance. As shown in the graphic below, among the 12 significant predictors in our model, historical departures are the most important predictor, followed by departure time, neighboring station departures, rail station ridership, and employment density.
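One way to compute and plot the absolute standardized coefficients, assuming all model variables are numeric (factor predictors would need dummy coding before scaling):

```r
library(ggplot2)

std_fit  <- lm(formula(final_model), data = as.data.frame(scale(model_data)))
std_coef <- abs(coef(std_fit))[-1]   # drop the intercept

ggplot(data.frame(predictor = names(std_coef), beta = std_coef),
       aes(x = reorder(predictor, beta), y = beta)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Absolute standardized coefficient")
```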
Repeating the 75/25 split described above many times also lets us check generalizability across random subsets of observations. The resulting bar charts show that both the R-squared values and the mean absolute errors are tightly concentrated, with little variation. This suggests that the predictive power of our model is consistent across different observations.
So how well does our model predict for different time periods of day? We can plot the prediction errors by hour to check that!
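For example, mean absolute error by hour of day, assuming a hypothetical test_week data frame built the same way as model_data for the following week:

```r
library(dplyr)
library(ggplot2)

# test_week: the following week's data, built the same way as model_data
test_week$predicted <- predict(final_model, newdata = test_week)

test_week %>%
  mutate(abs_error = abs(predicted - departures)) %>%
  group_by(hour) %>%
  summarise(mae = mean(abs_error), .groups = "drop") %>%
  ggplot(aes(x = hour, y = mae)) +
  geom_col() +
  labs(x = "Hour of day", y = "Mean absolute error")
```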
We can see that there is also no significant temporal autocorrelation, since the errors are randomly distributed across different times of day.
We also need to check whether the model performs well for both weekdays and weekends. First, we create two data frames containing data for one weekday and one weekend day from the test set, respectively.
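For example, picking one weekday and one weekend day from the test week (the specific days chosen here are arbitrary placeholders):

```r
library(lubridate)

test_weekday <- subset(test_week, wday(date, label = TRUE) == "Tue")
test_weekend <- subset(test_week, wday(date, label = TRUE) == "Sat")
```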
After checking temporal autocorrelation, let's look at whether there is spatial autocorrelation, in other words, whether our model is generalizable across bike stations in different places.
First, we randomly select one hour of a day in the following week and conduct a Moran's I test on the prediction errors. We ran it several times, and the p-value of Moran's I is generally larger than 0.05, suggesting that there is no significant spatial autocorrelation in the errors. We can also visualize one of the randomly selected sets of prediction errors on a map. As shown in the map, the residuals are randomly distributed, consistent with the statistical test.
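A sketch of one such Moran's I run with the spdep package, using a k-nearest-neighbor spatial weights matrix built from hypothetical station lon/lat columns:

```r
library(spdep)

# Pick one hour at random from the test week and compute its prediction errors
pick_date <- sample(unique(test_week$date), 1)
pick_hour <- sample(0:23, 1)
one_hour  <- subset(test_week, date == pick_date & hour == pick_hour)
one_hour$error <- one_hour$predicted - one_hour$departures

# k-nearest-neighbor spatial weights from station coordinates
coords <- as.matrix(one_hour[, c("lon", "lat")])
nb     <- knn2nb(knearneigh(coords, k = 5))
lw     <- nb2listw(nb, style = "W")

moran.test(one_hour$error, lw)
```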
Discussion
In conclusion, the model successfully explains most of the variation in hourly departures, with consistently satisfying performance across different times of day, days of week, and locations. This means that by feeding historical bikeshare trip data into the model, we can predict the demand for bikes at each station in the following hour, which provides very useful information for bikeshare rebalancing.
However, there is still room for improvement. To start with, since weather is a major factor influencing bike trips, in the future we should include real-time weather data as another predictor.
Furthermore, since the relationship between bike trip demand and these predictors is theoretically not linear (demand could grow rapidly once the independent variables pass certain thresholds), we should consider generalized linear models such as logistic or Poisson regression, or non-linear regression models, in the future.
Lastly, the current model is not targeted at stations suffering from extreme bike imbalance over time; as an earlier plot shows, our model tends to underpredict stations with extremely high departures. However, these are the stations that need the most attention during rebalancing operations. Future improvements could focus on targeting stations that suffer from continuous and extreme bike imbalance.