Caltrain On-Time Performance

Data visualization of on-time performance, delays, and more

Overall On-Time Performance

Daily On-Time Statistics

Commute Delays

Delay Minutes

Methodology

1. Data Collection

The system fetches real-time train location data from the GTFS-RT (General Transit Feed Specification - Real-time) API using the API key for the Caltrain operator. The data collected includes each train’s ID, its latitude and longitude, stop ID, and the timestamp of when the location was recorded. Collected data from to with data points.
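As a rough illustration of the fields listed above, the sketch below flattens an already-decoded vehicle-positions feed into simple rows. It assumes the feed has been converted to a plain dictionary that keeps the GTFS Realtime field names (`entity`, `vehicle`, `trip`, `position`, `stop_id`, `timestamp`), and that the trip ID serves as the train identifier. The function name `extract_positions` is hypothetical, and fetching the feed with the API key is deliberately left out.

```python
from typing import Any


def extract_positions(feed: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a decoded GTFS Realtime vehicle-positions feed into rows.

    Assumption: `feed` is a plain dict that mirrors the GTFS Realtime
    message layout (FeedMessage -> entity -> vehicle).
    """
    rows = []
    for entity in feed.get("entity", []):
        vehicle = entity.get("vehicle")
        if not vehicle:
            continue  # this entity carries no vehicle position
        position = vehicle.get("position", {})
        rows.append({
            "trip_id": vehicle.get("trip", {}).get("trip_id"),
            "stop_id": vehicle.get("stop_id"),
            "latitude": position.get("latitude"),
            "longitude": position.get("longitude"),
            "timestamp": vehicle.get("timestamp"),
        })
    return rows


# Made-up sample feed, for illustration only.
sample_feed = {
    "entity": [
        {
            "id": "1",
            "vehicle": {
                "trip": {"trip_id": "101"},
                "position": {"latitude": 37.4438, "longitude": -122.1651},
                "stop_id": "70171",
                "timestamp": 1700000000,
            },
        },
        {"id": "2"},  # an entity without a vehicle position is skipped
    ]
}
positions = extract_positions(sample_feed)
```

Missing fields come back as `None` rather than raising, which keeps a single malformed entity from aborting a whole collection run.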

2. Storing Data

The data is stored in an SQLite database (caltrain_lat_long.db). A table called train_locations is created if it doesn't already exist and holds the collected records. Each record is uniquely identified by the combination of timestamp, trip ID, and stop ID, which prevents duplicate entries.
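One way to get this behavior in SQLite is a composite primary key combined with `INSERT OR IGNORE`. The sketch below is a minimal version under assumptions: the column names and types are guesses based on the fields described above, and the helper names `init_db` and `save_location` are hypothetical. It uses an in-memory database so it can run without touching caltrain_lat_long.db.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create train_locations if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS train_locations (
            timestamp INTEGER NOT NULL,
            trip_id   TEXT    NOT NULL,
            stop_id   TEXT    NOT NULL,
            latitude  REAL,
            longitude REAL,
            PRIMARY KEY (timestamp, trip_id, stop_id)
        )
        """
    )
    conn.commit()


def save_location(conn: sqlite3.Connection, row: tuple) -> bool:
    """Insert one record; return False if it duplicates an existing key."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO train_locations "
        "(timestamp, trip_id, stop_id, latitude, longitude) "
        "VALUES (?, ?, ?, ?, ?)",
        row,
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")  # swap in the database file for real use
init_db(conn)
init_db(conn)  # safe to call again: the table already exists
record = (1700000000, "101", "70171", 37.4438, -122.1651)
first_insert = save_location(conn, record)
second_insert = save_location(conn, record)  # same key, silently ignored
row_count = conn.execute("SELECT COUNT(*) FROM train_locations").fetchone()[0]
```

Letting the database enforce uniqueness means the collector can be re-run over overlapping feed snapshots without first checking whether a record already exists.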

3. Data Processing

The stored data is processed by joining it with the static GTFS data (schedules and stops). The system uses the Haversine formula to calculate the distance between a train's current position and the scheduled stop, which lets it detect when a train has arrived at that stop.
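The distance check can be sketched as follows. The Haversine formula itself is standard; the mean Earth radius of 6,371 km, the 100 m arrival threshold, and the function names `haversine_m` and `has_arrived` are assumptions made for illustration, not values taken from the system.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def has_arrived(train: tuple, stop: tuple, threshold_m: float = 100.0) -> bool:
    """Treat the train as arrived when it is within threshold_m of the stop.

    The threshold is a hypothetical parameter chosen for this sketch.
    """
    return haversine_m(train[0], train[1], stop[0], stop[1]) <= threshold_m


# Illustrative coordinates only.
stop = (37.4438, -122.1651)
near = has_arrived((37.4438, -122.1651), stop)
far = has_arrived((37.4538, -122.1651), stop)  # 0.01 degrees of latitude away
```

A fixed radius is a simplification: a train that passes a stop between two position samples may never fall inside it, which is one reason to also consider the closest observed point.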

4. On-Time Performance Analysis

The on-time performance is calculated by comparing the actual arrival time (when the train is closest to the stop) with the scheduled arrival time. The delay for each trip is computed in minutes, and a trip is flagged as "delayed" if it arrives more than 4 minutes late.
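The comparison described above might look like the sketch below. It assumes each observation of a train near a stop is a `(timestamp, distance_in_meters)` pair and that scheduled times are already available as `datetime` values; the names `actual_arrival`, `delay_minutes`, and `is_delayed` are hypothetical. Only the 4-minute threshold comes from the text.

```python
from datetime import datetime

DELAY_THRESHOLD_MIN = 4  # a trip is delayed if it is more than 4 minutes late


def actual_arrival(observations: list) -> datetime:
    """Pick the timestamp of the observation closest to the stop.

    Assumption: observations are (timestamp, distance_m) tuples.
    """
    timestamp, _distance = min(observations, key=lambda obs: obs[1])
    return timestamp


def delay_minutes(scheduled: datetime, actual: datetime) -> float:
    """Delay in minutes; negative when the train is early."""
    return (actual - scheduled).total_seconds() / 60


def is_delayed(scheduled: datetime, actual: datetime) -> bool:
    """Flag the trip only when the delay exceeds the threshold."""
    return delay_minutes(scheduled, actual) > DELAY_THRESHOLD_MIN


# Made-up observations for one train approaching one stop.
observations = [
    (datetime(2024, 1, 8, 8, 4, 30), 850.0),
    (datetime(2024, 1, 8, 8, 6, 0), 40.0),   # closest point to the stop
    (datetime(2024, 1, 8, 8, 7, 0), 600.0),
]
scheduled = datetime(2024, 1, 8, 8, 0, 0)
arrival = actual_arrival(observations)
delay = delay_minutes(scheduled, arrival)
delayed = is_delayed(scheduled, arrival)
```

Because the comparison is strict, a train that arrives exactly 4 minutes late still counts as on time under this reading of the rule.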

5. Performance Metrics

6. Visualizations

Plotly is used to generate graphs for metrics such as daily on-time performance, delay severity, and delays during morning and evening commute times. These graphs are saved as HTML files and embedded in the website using iframes.