Tyler A. Green

In Transit

Category: Projects

Graphing Transit Systems, Part III – Centrality Extended

This is the third post diving into the graph structure of the New York City subway system. Read the first two for more background!

At the start of last post, I threw out two questions:

  1. Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?
  2. Can all stations in a transit network be important?

We discussed the difference between centrality metrics and node influence metrics. The former identify the most important nodes in a network, while the latter rank every node by importance. We’ll use the node influence metrics to answer these questions.

To support our discussion, I whipped up a map showing the MTA subway ridership data by itself using Carto. Here’s the interactive map! The data is from the years 2010 to 2015 and is provided by the MTA.

Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?

To answer this question, I calculated the correlation between ridership and centrality. In the scatter plots below, the independent variable is the centrality score per station, and the dependent variable is the ridership at that station, averaged over the years 2010 through 2015. This might seem backwards, but I chose this because the centrality metric is a reflection of the network structure and we are studying the effect of network structure on ridership.


The correlation coefficients for these two data sets show a moderate positive correlation.

  • Closeness centrality, r = 0.43
  • Outward accessibility, r = 0.30

Remember, correlation does not imply causation, but these figures suggest that for an increase in the centrality metric, you can expect a moderate increase in ridership.
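For anyone who wants to reproduce this kind of number, here is a minimal sketch of the Pearson correlation coefficient. The station values below are made up for illustration and are not the MTA ridership data; the function name is hypothetical too.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, each left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (NOT the real MTA data).
centrality = [0.10, 0.20, 0.30, 0.40, 0.50]
ridership_millions = [1.0, 3.0, 2.0, 5.0, 4.0]
r = pearson_r(centrality, ridership_millions)
```

A value near 1 means ridership rises almost in lockstep with the centrality score, while a value near 0 means the score tells you little about ridership.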

Did you notice Times Square on the scatter plots? Yep, with an average annual ridership of almost 63 million, it’s the outlier. Based on its position on the horizontal axis, closeness centrality thinks Times Square is an important station in the network, while outward accessibility does not. If you remember from last post, PageRank also finds Times Square to be important and Katz just confused us all. That answers our first question!

Before we go on, I have a theory that any outliers in these plots are the result of externalities. For example, the average ridership at Yankee Stadium – 161 St is 8.7 million, but its neighboring stations have ridership of 1.3, 3, 3, and 4.3 million each on average. What is its externality? The world-famous New York Yankees. Times Square – 42 St is a similar situation. Not only is it a transfer point for 12 NYC subway services, it is also below the mega tourist attraction and its namesake, Times Square. I have no hard data on this outlier theory, but more research could be done on this!

Can all stations in a transit network be important?

Why would we want all stations to be “important”? If our goal is for all citizens to have equal access to quality public transportation, we would like everyone to live near a station which provides this gateway. A transit network will always have stations which are more centrally located than others, but is it possible to minimize the differences between the most connected and the least connected stations? Let’s see how our metrics evaluate the structure of another world-class network in this regard. Enter Paris, its minimal geographical constraints, and its lovely radial network.

The two histograms below sit on the same range on the horizontal axis. The count on the vertical axis is the number of stations which fall into the horizontal range represented by its bar. As you can see, Paris has many, many stations which score higher than all of New York City’s.


There is one large caveat here: land area. Officially, the area of New York City is 302.6 square miles, while Paris is only 40.7 square miles. Another measure is the longest subway line: New York’s A train extends 31 miles, while Paris’ Line 13 is just over 15 miles. Closeness centrality uses the shortest path between station pairs, which in my graph is measured in seconds of travel time. A 31-mile subway trip will take longer than a 15-mile subway trip, so this metric is stacked against the New York City subway and the large area it covers.

Concerning our question, even though Paris’ stations score much higher than New York’s, they are not all equal. This gets back to my earlier point: there will be an importance continuum among stations, but improving the importance of the least connected stations can still provide a benefit to citizens.

Next, let’s look at the histograms for New York and Paris outward accessibility scores.

There is not as much difference between New York City and Paris in the histograms for outward accessibility. This metric is independent of network area or subway line length, so this does not surprise me. It may hint that much of the difference between the networks for closeness centrality is due to geographical area.

If you look at the densest parts of the Paris network and see how interconnected it is, the upper bound for its accessibility distribution being higher than New York’s also will not be a surprise.

Next Stop

Now that we have evaluated the centrality of multiple transit networks and performed limited cross-network comparisons, I want to know whether these metrics can tell us the best future subway routes. For example, given the budget for a single new subway line, what is the best route for this new line? It will be a very empirical and barely human analysis, so we may have to take the results with a grain or six of salt, but hopefully the results will have value besides making shapes on maps.

See you then!

Graphing Transit Systems, Part II – Centrality

This post is the second of four looking into the graph structure of the New York City subway system. In the previous post, I discussed a frontend I built to visualize a depth-first search, breadth-first search, and shortest path algorithm. I ended with a discussion of centrality algorithms. We pick up our hero there…

Centrality metrics identify important nodes in a graph. In the gtfs-graph world, nodes represent subway stations. Why might we want to identify important stations in the NYC subway network? Honestly, my initial reason was I thought it sounded cool. I was curious to see if there are numbers (besides ridership…we’ll get to that in the next post!) to rank stations which align with our human perception of important stations in the system. Meaning: does the network structure indicate Times Square is a critical station, or is that just where the most riders board? That was the first question I wanted to explore. The next question would challenge the Lake Wobegon effect. That is: can all stations in a network be important?

To answer these questions, I created a web app for three cities and their heavy rail networks: New York City, Boston, and Paris.

Each city has results for four centrality metrics: PageRank, Katz centrality, closeness centrality, and outward accessibility. I will be discussing the results in terms of the New York City network.

It is worth noting at this point that analyzing a transit network only using stops and edges is a very simplified model. To make any real decisions on the system as it relates to the city and population it serves, we would need to consider population density and employment centers at minimum. Knowing that, let’s proceed!

PageRank

If PageRank sounds familiar to you, it’s likely because it is the algorithm used by book publishers to identify pages, and definitely not because it was invented by Google co-founders Larry Page and Sergey Brin to rank web pages for their search engine. In this algorithm, a node’s importance is derived from the importance of all the nodes which link to it. Mapped over to transit, a station’s importance is derived from the importance of all the stations which have direct connections to it.

The PageRank results look interesting and definitely pick out important stations, but they do not give us insight into the entire distribution of stations.

I was giddy while implementing this and my brain swirled with grand visions of unlocking new insights to generations-old transit networks. As it turns out, PageRank is not a great model for a transit network. Let’s look at an example.

In the NYC PageRank view, you can see that Times Square comes out on top. Let’s collectively channel our inner undergrad physics lab student and breathe a sigh of relief that the numbers show us what we expected. Phewwwwwww. However, if we look at one of its neighbors, 34 St – 11 Av AKA the 7 train extension, we see that it ranks last. Not just maybe not top ten or top 100, but dead last. PageRank is saying that the 7 train extension produced a station that is literally the least important in the NYC network.

Have no fear, Andrew Cuomo; let’s consider the model again. If you plug sample numbers into the PageRank formula, you can see that the above behavior is correct. 34 St – 11 Av has only one “link,” and although that neighbor’s PageRank is high, the neighbor also has a high out-degree, so its rank is split many ways. Using the random surfer / random transit rider model, a rider passing through Times Square is not likely to end up at 34 St – 11 Av. Sorry 7 train, but PageRank just does not do your $2.4 billion price tag justice. Let’s see how the other centrality metrics view the subway network!
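To make the dead-end effect concrete, here is a minimal power-iteration sketch of PageRank on a toy graph. It is not the gtfs-graph implementation: the node names, the damping factor of 0.85, and the assumption that every node has at least one outgoing link are all illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    `links` maps each node to the list of nodes it links to. This sketch
    assumes every node has at least one outgoing link.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbors in links.items():
            # A node splits its rank evenly among the nodes it links to.
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

# Hypothetical toy network: a busy hub, a dead-end spur that connects only
# to the hub, and three stations that also connect to each other.
toy_links = {
    "hub": ["spur", "a", "b", "c"],
    "spur": ["hub"],
    "a": ["hub", "b", "c"],
    "b": ["hub", "a", "c"],
    "c": ["hub", "a", "b"],
}
toy_ranks = pagerank(toy_links)
```

In this toy network the hub comes out on top and the spur comes out last, even though the spur's only neighbor is the highest-ranked node: the hub hands the spur just a quarter of its rank.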

Poor 34 St – 11 Av doesn’t get any love from PageRank. The data on the right shows the top 10 stations serve several subway routes each. This is not a coincidence; PageRank picks out highly connected nodes.

Katz Centrality

Katz Centrality builds on PageRank by considering all walks between two stops in a network, as opposed to only the shortest path between nodes. This appealed to me in a transit context because in a dense network such as Paris, there are often numerous routes between any two stops. The lack of this built-in redundancy has been brought up recently as a weakness of the DC Metro during the ongoing two-track vs. four-track debate and how it affects the maintenance window for a major heavy rail system.

Now is a good time to mention that I would highly recommend the Wikipedia entry for Katz centrality and all the metrics in this post. The original Katz paper is insightful as well.

The results from Katz are……confusing. If you picked South Ferry as the most important MTA station, you either love platform extenders or misguidedly added the Staten Island Ferry to your subway network. The Staten Island Railway data is included in the MTA subway GTFS feed, so I kept it on my map. Closeness centrality (up next!) requires all nodes to be reachable from every other node, so I threw a fake edge in to the graph to represent the ferry. Believe me: the results were just as confusing before I added the ferry route. Due to the multiplicative nature of Katz centrality, the resulting distribution ranges from 0.00244 (Ozone Park – Lefferts Blvd) to 693,246.863 (St George, just across from South Ferry on the south-bound ferry).

Here’s all the insight I can offer on Katz centrality: all traffic between two well-connected sections of the graph (Staten Island and the entire rest of the MTA subway) has to pass through two stations: South Ferry and St George. Therefore, they are “important” and “central” and I am “confused” and “ready to talk about other metrics.”
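For reference, here is a minimal sketch of how Katz scores can be computed by simple iteration. The attenuation factor `alpha`, the constant `beta`, and the three-stop toy line are assumptions for illustration; the post does not say which parameters gtfs-graph used. The iteration only settles when `alpha` is smaller than one over the largest eigenvalue of the adjacency matrix, and scores grow without bound as `alpha` approaches that limit, which may be worth keeping in mind when a distribution spans many orders of magnitude.

```python
def katz_centrality(adjacency, alpha=0.1, beta=1.0, iterations=200):
    """Iterate x = alpha * A x + beta, where A is the adjacency matrix.

    `adjacency` maps each node to a list of its neighbors. Each pass adds
    the contribution of walks that are one step longer, damped by alpha.
    """
    scores = {node: 0.0 for node in adjacency}
    for _ in range(iterations):
        scores = {
            node: alpha * sum(scores[nbr] for nbr in neighbors) + beta
            for node, neighbors in adjacency.items()
        }
    return scores

# Hypothetical three-stop line: a - b - c.
toy_line = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
katz_scores = katz_centrality(toy_line)
```

The middle stop scores highest here because every walk between the two ends has to pass through it, which is the same reasoning applied to South Ferry and St George above.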

Katz says the subway network is equally unimpressive. Except for South Ferry. What a champ.

Closeness Centrality

My friend Calvin and I half made-up, half realized-it-was-already-a-thing, a centrality metric which promised a return to the fundamentals. Closeness centrality (or as Cal and I called it, the squiggly-doo) is intuitive in that the closer a node is to all other nodes, the more “central” it is. It ranks a node by the sum of its shortest-path distances to all other nodes in the network: the smaller the sum, the higher the rank. As you may remember from last post, the distance of each edge in our network is the number of seconds to travel via that route segment according to that system’s GTFS feed.
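Here is a minimal sketch of that calculation. To keep it short it uses the Floyd-Warshall all-pairs algorithm rather than whatever gtfs-graph runs, and it reports the common normalized form, (N - 1) divided by the sum of shortest travel times. The station names and travel times are hypothetical, and the graph must be connected or the sums become infinite.

```python
def closeness_centrality(nodes, edges):
    """Closeness = (N - 1) / sum of shortest travel times to all other nodes.

    `edges` is a list of (u, v, seconds) tuples, treated as undirected.
    """
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, seconds in edges:
        dist[u][v] = min(dist[u][v], seconds)
        dist[v][u] = min(dist[v][u], seconds)
    # Floyd-Warshall: allow each node k in turn as an intermediate stop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    n = len(nodes)
    return {
        u: (n - 1) / sum(dist[u][v] for v in nodes if v != u)
        for u in nodes
    }

# Hypothetical line: A to B takes 60 seconds, B to C takes 120 seconds.
closeness = closeness_centrality(["A", "B", "C"], [("A", "B", 60), ("B", "C", 120)])
```

The middle station B scores highest because its total travel time to everyone else (180 seconds) is the smallest, while C, at the far end of the slow segment, scores lowest.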

At this point of confusing results from two metrics, I discovered the term “node influence metrics.” These metrics seek to answer my second question from earlier: can all nodes in a network be important? PageRank and Katz identify important nodes, but only the top of their resulting distribution should be considered. This means the metric results for the bottom half of the distribution are more or less meaningless. Technically, closeness centrality is not a node influence metric, but I treat it as such. Intuition tells me that its results have meaning for the entire distribution of nodes. Please comment if you feel otherwise!

Neapolitan ice cream anyone? Closeness centrality results have no surprises.

Manhattan stations are ranked highly by closeness centrality. This uniformity is in contrast to the Manhattan results for outward accessibility.

The closeness centrality results are extremely straightforward. Subway stations in Manhattan score higher because riders can reach all other stations in less time there than elsewhere. The opposite is true for Far Rockaway. This algorithm will play an important role in the next post!

Outward Accessibility

Outward accessibility is one of the primary node influence metrics. It produces a normalized version of diversity entropy proposed in this paper by Travençolo and Costa. A node ranks highly when many unique paths can be taken from it over a course of random walks of varying distances. Sections of a graph which rank highly by this metric are found to have high network redundancy and high accessibility from the rest of the network. Redundancy and accessibility are both critical when evaluating a transit network, so this seemed like a good fit!

Two drawbacks to the outward accessibility metric are performance and repeatability. Before calculating the actual metric, one must perform a series of random walks of varying distances from each node. For these walks to be representative, the walk count must be high, which can lengthen execution time of the analysis. Due to the nature of random calculations, the answers change every time! This could be solved by using a consistent random number generator seed when running the analysis, or by always running enough random walks for the results to converge.
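Here is a minimal sketch of the random-walk estimate with a fixed seed for repeatability. It rests on several assumptions: accessibility is taken as exp(diversity entropy) / (N - 1), which is my reading of the metric; the walks are plain random walks that may backtrack; and the star-shaped toy graph, function name, and parameters are hypothetical. It is not the gtfs-graph code.

```python
import math
import random

def outward_accessibility(adjacency, start, steps, walks=10000, seed=42):
    """Estimate the accessibility of `start` after `steps` hops."""
    rng = random.Random(seed)  # fixed seed, so repeated runs agree
    counts = {}
    for _walk in range(walks):
        node = start
        for _step in range(steps):
            node = rng.choice(adjacency[node])
        counts[node] = counts.get(node, 0) + 1
    # Diversity entropy of the distribution of walk end points.
    entropy = -sum((c / walks) * math.log(c / walks) for c in counts.values())
    # exp(entropy) acts as an "effective number" of reachable nodes.
    return math.exp(entropy) / (len(adjacency) - 1)

# Hypothetical star network: one hub connected to four end stations.
star = {
    "hub": ["n", "e", "s", "w"],
    "n": ["hub"], "e": ["hub"], "s": ["hub"], "w": ["hub"],
}
from_hub = outward_accessibility(star, "hub", steps=1)
from_end = outward_accessibility(star, "n", steps=1)
```

From the hub, one step can land on any of four stations with equal probability, so the score is close to 1. From an end station, one step always lands on the hub, so the score is exactly 1/4. Passing the same seed gives the same answer on every run; raising `walks` tightens the estimate at the cost of run time.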

Outward accessibility gives us the weather map appearance similar to closeness centrality, but are its individual stations ranked similarly?

Outward accessibility picks out hot spots of importance in a graph network. These can vary slightly due to the random nature of this algorithm, but should converge over time with enough random walks.

The results for outward accessibility appear to parallel those of closeness centrality at first glance. However, a closer look at the accessibility results shows hot spots. The metric tells us these are the nodes which allow riders to traverse the most unique routes in a given distance. Translated to the real world, this is valuable to the rider’s perception of a transit network. If I can go to 20 different stations within 10 subway stops (on any route), my location is better served by public transit than if I can only go to 10 stations within 10 stops.

Accessibility also has a strange property of ranking end stations higher. The logic is that if I start from the second from the end station, half of my random walks will go outwards and produce little diversity entropy. Conversely, if I start from the end station, all of my walks will go towards the potentially more diverse part of the graph. I emailed the paper authors to comment on this behavior, but have not heard back. If you are reading this, Travençolo or Costa, please comment with insight!

Next Stop

If you’ve hung with me this long and have noticed I haven’t answered either question posed at the start of this post, I’m going to grant you a short break. In the next post, we’ll discuss how the closeness centrality and outward accessibility results correlate to the NYC subway ridership numbers, as well as how these metrics compare between NYC and Paris. I hope you’ll stay on board!

Graphing Transit Systems

I’ve been away from the blogging world for a while! The last few months included a fantastic and inspiring trip to Transportation Camp NYC and loads of (mostly) fun weekend work on transit graphs.

In a hodgepodge effort to improve my JavaScript, learn React, create a generic graph representation of a GTFS feed, and implement a few graph algorithms, I finally have a working TRANSIT GRAPH DEMO.

Why transit graphs?

While reviewing algorithms on Jason Park’s algorithm visualizer, I thought, “WE CAN APPLY THESE TO TRANSIT.” It was a moment of pure destiny. To call it multidisciplinary intrigue would be underselling my excitement. Of course, I was not the first person to connect transit and graphs; Google Maps, Open Trip Planner, and Mapzen’s Valhalla are all built on graph representations.

My original goal was to display an animated graph traversal of the New York City subway system. I’ve ended up with a platform to study graph algorithms on transit maps. (I learned that if I’m unsure what I’m building, just call it a platform. The solutions will follow.)

As is the norm in 2016 JavaScript, I used almost as many tools and libraries as there are NYC subway stations. My goal in all projects is to use as little custom data as possible, so I stuck with my Boston model and loaded the MTA GTFS feed into an Amazon RDS Postgres instance. The backend is a Node.js server which boots up after constructing a graph. I used the new ES6 ‘class’ keyword to create a TransitGraph in the style of the object-oriented languages I was raised on. The original frontend was written using jQuery, but when I reached the point of implementing an autocomplete search box, I knew I needed to up my tool game. Enter: React. Facebook’s documentation on the library is quite comprehensive and I latched on to the object-oriented feel and state-based programming model. All the data (stops and routes/edges) is communicated via a WebSocket that persists through an entire client connection.

As you can see when using the graph demo, there are three modes. A bit on each…

Shortest Path

Dijkstra’s is the classic gateway algorithm to finding shortest paths in graphs. Wikipedia’s explanation is as clear-worded as I’ve read, so I’ll defer to them:

It picks the unvisited vertex with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller.

Fire up the algorithm visualizer to help picture this. In my graph, the edge weights are the time between stations. After running Dijkstra’s, we have an ordered sequence of nodes which represents the shortest path between the origin and destination, along with the time it would take to travel it.
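The procedure described above can be sketched in a few lines. This version uses a priority queue (Python's `heapq`) to pick the unvisited node with the lowest time; the station names and travel times are hypothetical, and this is not the TransitGraph code from the demo.

```python
import heapq

def shortest_path(graph, origin, destination):
    """Dijkstra's algorithm. `graph[u]` maps each neighbor of u to seconds."""
    best = {origin: 0}   # best known travel time to each node
    previous = {}        # back-pointers used to rebuild the path
    queue = [(0, origin)]
    visited = set()
    while queue:
        seconds, node = heapq.heappop(queue)
        if node in visited:
            continue  # stale queue entry; a shorter route was already found
        visited.add(node)
        if node == destination:
            break
        for neighbor, weight in graph[node].items():
            candidate = seconds + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if destination not in best:
        return None, float("inf")  # destination is unreachable
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    return path[::-1], best[destination]

# Hypothetical network with two ways to get from A to D.
toy_graph = {
    "A": {"B": 90, "C": 120},
    "B": {"A": 90, "D": 200},
    "C": {"A": 120, "D": 100},
    "D": {"B": 200, "C": 100},
}
path, seconds = shortest_path(toy_graph, "A", "D")
```

Here the route through C wins (220 seconds) even though the first hop to B is quicker, because the total through B is 290 seconds.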

A sample shortest path from 50th St to 1 Av. The routes which serve each station are derived from the GTFS feed based on the trips that pass through that stop. This can periodically result in slightly different route listings than the official MTA map.

The user interface to pick the origin and destination nodes. I studied Pinterest’s CSS to help build the stop tokens that populate the input fields when selected. The route details at the bottom use “display: flex”, a tip I picked up from the Google Maps CSS.

Sound familiar? Google Maps transit directions do the exact same thing. And much better! Knowing when to switch trains becomes a luxury after using my tool.

Depth-First Search

A depth-first search, or DFS as the real algorithm geeks call it, is a classic traversal method for both graphs and trees. The idea of a traversal is to visit all the nodes in the graph which can be reached given a starting node. The depth-first variety is contrasted with the breadth-first procedure (up next!) in that, given a starting node, one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, and so on. Was there anything weird about that last sentence? This is a recursive algorithm! When a visited node has no unvisited neighbors, the algorithm pops back up the call stack, testing for unvisited neighbors at each level.

A snapshot of visited nodes early in a depth-first search from 161 St – Yankee Stadium. A red line segment is an edge that has been visited, but not unvisited, while a blue line segment has already been unvisited. As the recursive function pops higher up the call stack, more edges turn blue.

We can see that at the completion of a DFS from 161 St – Yankee Stadium, the entire MTA subway system has been visited. The only nodes that have not been visited belong to the Staten Island Railway, which has no rail connections to the subway system and therefore no edges in my graph.
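The recursive depth-first traversal described in this section can be sketched as follows. The five-node graph is hypothetical, with an isolated node "E" standing in for a disconnected piece like the Staten Island Railway. Note that Python caps recursion depth (1000 frames by default), so a very large graph would call for an explicit stack instead.

```python
def depth_first_search(graph, node, visited=None):
    """Recursively visit every node reachable from `node`, in visit order."""
    if visited is None:
        visited = []
    visited.append(node)
    for neighbor in graph[node]:
        if neighbor not in visited:
            # Go as deep as possible before trying the next neighbor.
            depth_first_search(graph, neighbor, visited)
    # Returning here is the "pop back up the call stack" step.
    return visited

# Hypothetical graph; "E" has no edges, so no traversal from "A" reaches it.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
    "E": [],
}
dfs_order = depth_first_search(toy_graph, "A")
```

Starting from A, the search dives through B to D before it ever looks at C, and E never shows up in the result.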

Breadth-First Search

A breadth-first search is another traversal variant whose lofty goal is to identify connected components of a graph while providing zero valuable info to passengers riding transit. (Now would be a good time to say that identifying connected components will play a key role in merging nodes during a later step in this project. Traversals are a necessary part of any graph analyzer’s toolkit!) As you may have guessed, a BFS goes wide before it goes deep. From a given node, all of its neighboring nodes are visited before any of their neighbors are visited. This produces a different exploration pattern, which is illustrated in the following three images.

A snapshot early in a breadth-first search from Queensboro Plaza. We see that the visited nodes are spreading outward from the source. Think: diseases. Depth-first search is how you solve a maze and breadth-first search is how you get sick.

A bit farther in the breadth-first search, we can see the disease…err…graph traversal has continued to spread outward.

The visited edges at the completion of the breadth-first search. There are no blue edges because this is not a recursive algorithm.
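A breadth-first search swaps the call stack for a queue, and repeating it from every not-yet-seen node yields the connected components mentioned earlier. This is a minimal sketch on a hypothetical graph, not the gtfs-graph implementation.

```python
from collections import deque

def breadth_first_search(graph, start):
    """Visit nodes in order of hop count from `start`, using a FIFO queue."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited

def connected_components(graph):
    """Group nodes into components by running a BFS from each unseen node."""
    seen = set()
    components = []
    for node in graph:
        if node not in seen:
            component = breadth_first_search(graph, node)
            seen.update(component)
            components.append(component)
    return components

# Hypothetical graph; "E" is isolated, like a line with no rail connection.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
    "E": [],
}
bfs_order = breadth_first_search(toy_graph, "A")
components = connected_components(toy_graph)
```

From A, both direct neighbors (B and C) are visited before D, which is two hops away. The component list comes back with two groups: the connected network and the lone node E.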

What’s next?

“gtfs-graph” (the GitHub project name for now – please help me come up with a better one!) is built to be system-agnostic. I have graph representations for Boston and Paris in addition to New York City. While the GTFS standard allowed me to construct all three graphs in similar ways, there were still a few quirks, resulting mainly from how the different systems represent sub-stops (parent/child or northbound/southbound).

Recently, I have been implementing centrality algorithms to see how the results varied from system to system. Paris’ RATP heavy rail lines certainly look to have higher connectivity than Boston’s hub-centric design, and I’m working to find the numbers to prove this. If I can indeed prove this, I’d like to use a genetic algorithm to efficiently enhance (add lines and stops) a system to match the connected-ness/centrality distribution/equity/whatever-metric-I-end-up-with of a higher quality system.

After implementing Google’s PageRank algorithm, I decided it is a poor model for transit. The rankings currently displayed are a modified version of closeness centrality. I really enjoyed this white paper on a node importance algorithm and plan to implement this soon. It uses random walks to calculate the entropy of a given node after a given number of steps.

I hope to have a much more detailed post on these metrics in the coming weeks! I would love to hear any thoughts or ideas you might have about any or all of this!

Let’s build awesome things to help transit, cities, and, most of all, people.

LIVE: The Boston T Party

I’m a few months late on this one, but I recently wanted to learn about WebSockets and GTFS-realtime feeds. The result: a real-time Boston transit map! I apologize if you were expecting a historical reenactment.

Try clicking on a marker for more information on the subway/bus/light rail/commuter rail vehicle it represents!

The sidebar of the application appears when you click on a vehicle. This area could be populated with tons more info from the GTFS static feed!

How It’s Built

The app runs on a Node.js server that accepts both a socket connection and an API call. Why both? Ask two-months-ago me. A new socket connection is formed when a client (web browser) connects to the server. The server periodically polls for the latest GTFS-realtime update (the MBTA posts updates to http://developer.mbta.com/lib/GTRTFS/Alerts/VehiclePositions.pb every ~18 seconds) and decodes the resulting protocol buffer using the Google gtfs-realtime-bindings. The decoded data is then broadcast to all socket connections. The frontend client is a simple AngularJS controller which manages the socket connection and updates the markers with the latest vehicle position information.

The basic architecture described until this point can operate completely independently of a GTFS static feed, but this would only produce a bunch of dots on a map which move periodically. Which, don’t get me wrong, made me ecstatic when that was all I had. But linking up a GTFS static feed gives each dot context. I decided to load the MBTA feed into a Postgres database on Amazon’s Relational Database Service using this schema. The GTFS static connection allows for two features: 1) the client issues an API call to fetch the route and headsign when you click on a vehicle, which is fulfilled by the server through a database query, and 2) the colored route lines are pre-generated into a GeoJSON file using a Node.js script which runs a database query to fetch the official MBTA color for each route.

The purple lines represent the commuter rail routes. I chuckled the first time these lines loaded and I kept having to zoom out to see where they stop. To Providence and beyond!

Up Next

The map doesn’t have nearly the feature set of NextBus, with gobs of detail about every bus and stop you click on. I do find it clumsy that you have to select routes to view in NextBus, leading me to make sure all the lines appear at load time in my map (or shortly after (#LargeGeoJSONFile)). Feel free to check out the code or even add features yourself; the code lives on GitHub!

A big maintenance issue with the app as constructed is that it requires a manual reload of the GTFS feed after each update by MBTA which changes any trip IDs. The Green Line trains do not have valid trip_ids in the GTFS-realtime feed, so I programmed the app to display any vehicle with an unknown trip_id (one that did not match with the GTFS static feed) as a Green Line trip. After a GTFS static update, you will often see many vehicle markers say they represent a Green Line train, when we really just need to load the new GTFS feed into the Postgres database. Who wants to automate this for me?

You may not need this map to plan your commute from Back Bay to South Station, but it was certainly a fantastic learning experience for me. Many of its components will make an appearance in my next project! (Hint: it involves representing transit networks as a connectivity graph!)

Until next time, ride on!

I never get tired of staring at these colored lines until the markers all jump to their next position! The yellow is the official color specified for the bus routes in the MBTA GTFS static feed. Anyone know the reason for this? It also looks like the Silver Line goes a bit crazy right after exiting the Ted Williams Tunnel. Correct me if I’m wrong, but I think this is where Silver Line buses switch from diesel power to trolleybuses?

“Next Stop…Transitland”: A TransportationCamp Colorado Presentation

I attended the inaugural TransportationCamp Colorado last week! Sticking with the format of an “unconference,” attendees were encouraged to propose their own sessions to present their work and/or lead discussions. I took them up on the format and presented the following slides on Transitland and how I used it to create my New York City transit frequency visualization. We had a really great interactive session with many good ideas exchanged!

Some stories and explanations were left off the slides. Feel free to hit me up for more explanation on anything!

 

The slides can also be downloaded here.

Updated: New York City Transit Frequency Visualization

Since I detailed my New York City transit frequency visualization project last month, there have been a few updates. Check out the web tool to view the changes!

What’s new?

  • The frequency buckets have been realigned to better parallel the psychology of how we use transit. The bins now group routes with fewer than 4 trips per hour, 4 to 8 trips per hour, and more than 8 trips per hour. Fewer than 4 trips per hour is generally the threshold where riders should consult a schedule before waiting on a curb, so it was important to separate these visually. The thickness of each edge now also increases with frequency.
  • There is now much more coverage in Queens bus data. No, MTA did not see my first update and decide to expand Queens service, though that would be awesome! I communicated with the Transitland team and my tool helped them discover they were previously missing the feed for the MTA Bus Company. It was historically a separate company and still has its own GTFS feed. I came up with some wild conclusions in my previous post on this project, several of which were rendered invalid by the completion of the data set.
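For the curious, the realigned buckets boil down to a small classification step. Here is a minimal sketch of the idea, not the tool's actual code: the method name frequency_bucket, the labels, and the stroke widths are all hypothetical, and I'm assuming that exactly 4 and exactly 8 trips per hour both land in the middle bucket.

```ruby
# Hypothetical helper: classify an edge by its trips per hour.
# Boundary handling (4 and 8 both fall in the middle bucket) is an assumption.
def frequency_bucket(trips_per_hour)
  if trips_per_hour < 4
    { label: "fewer than 4 trips/hour", stroke_width: 1 }
  elsif trips_per_hour <= 8
    { label: "4 to 8 trips/hour", stroke_width: 3 }
  else
    { label: "more than 8 trips/hour", stroke_width: 5 }
  end
end

[2, 6, 12].each do |tph|
  bucket = frequency_bucket(tph)
  puts "#{tph} trips/hour -> #{bucket[:label]} (width #{bucket[:stroke_width]})"
end
```

Returning the stroke width alongside the label keeps the "thicker means more frequent" rule in one place.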

What’s up next?

I’d still like to filter the express bus routes, provide finer-grained sorting by mode, and increase the dynamic nature of the tool in general. I’ve been working on an updated Ruby client to pair with the Transitland datastore, and have already updated my project source with the new interface. I’ve also begun dabbling with GTFS-realtime and plan to build a project with this specification soon.

We’re all #InTransit every day and I hope to have many more updates soon!

What kind of things are you working on? Let me know in the comments below!

The frequency data for subway routes on a Friday morning in New York City. The darker the color, the higher the frequency!

A Ruby Gem for GTFS to GeoJSON Conversion

I published my first Ruby gem: gtfs-geojson! You can view the source on GitHub. gtfs-geojson is a Ruby utility to convert a GTFS feed to a GeoJSON file. It’s a simple endeavor, for sure, but I’m pleased with what I learned along the way.

Let’s start out with some before-and-after views of the data. These images were created using QGIS, OpenStreetMap, Transfort’s GTFS feed, and the gtfs-geojson library.

The Transfort GTFS data loaded in QGIS before applying the Ruby gem for GTFS to GeoJSON conversion.

This map displays the shapes.txt file from Transfort’s GTFS feed loaded into QGIS. The seemingly-inconsistent shading on the lines is because there are no lines at all; each “line” is made up of a sequence of points. Each point contains a route ID and is ordered relative to the other points in its route by a point sequence value.

The Transfort GTFS data loaded in QGIS after applying the Ruby gem for GTFS to GeoJSON conversion.

After running the GTFS feed through gtfs-geojson, you have a GeoJSON file in which each feature is a route from the original feed. I used “Categorized” styles in QGIS to quickly apply a unique color to each route.

As with most transit projects, the input to gtfs-geojson is a GTFS feed. GTFS is the standard format published by transit agencies worldwide to make their routes, stops, and even fares usable by developers. The data is a series of comma-separated text files. To validate a GTFS feed, I used an existing gem, gtfs. It will fail gracefully if the shapes.txt file is not present, and shapes.txt is the only file I actually need for the conversion to GeoJSON.

gtfs-geojson implements the same algorithm as the “Points to path” QGIS tool I used when looking at Transfort bus data. The main trick is that the points within each route ID must be sorted by their point sequence value. Several other QGIS plugins I tried did not do this correctly, so don’t forget this if implementing this yourself!
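Here is a minimal sketch of that sort-then-connect step, written independently of the gem's source. It assumes the column names from the GTFS spec for shapes.txt (shape_id, shape_pt_lat, shape_pt_lon, shape_pt_sequence); the method name shapes_to_geojson and the sample rows are made up for illustration.

```ruby
require "csv"
require "json"

# Sketch only: turn shapes.txt rows into one GeoJSON LineString per shape_id.
def shapes_to_geojson(shapes_csv)
  rows = CSV.parse(shapes_csv, headers: true)

  features = rows.group_by { |row| row["shape_id"] }.map do |shape_id, points|
    # The key step: order points by sequence number, numerically rather than as strings.
    ordered = points.sort_by { |row| row["shape_pt_sequence"].to_i }
    {
      "type" => "Feature",
      "properties" => { "shape_id" => shape_id },
      "geometry" => {
        "type" => "LineString",
        # GeoJSON positions are [longitude, latitude].
        "coordinates" => ordered.map { |row| [row["shape_pt_lon"].to_f, row["shape_pt_lat"].to_f] }
      }
    }
  end

  JSON.generate({ "type" => "FeatureCollection", "features" => features })
end

# Made-up sample rows, deliberately out of order.
sample = <<~CSV
  shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
  A,40.59,-105.08,2
  A,40.58,-105.08,1
  A,40.60,-105.07,10
CSV

puts shapes_to_geojson(sample)
```

Note the to_i on the sequence value. Sorting the raw strings would put "10" ahead of "2", which is one plausible way a tool could get the ordering wrong.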

While QGIS tools output shapefiles, gtfs-geojson produces a GeoJSON file, which is a JSON stream with geospatial points and polylines data served up in a standard format. I have previously loaded GeoJSON files in Mapbox applications, and they are also useful in a GIS context. The following three lines will load the library, validate the GTFS feed, convert its shapes.txt file to GeoJSON format, and write the GeoJSON to a file.

require 'gtfs-geojson'                                    # load the library
geojson = GTFS::GeoJSON.generate("gtfs.zip")              # validate the feed and convert shapes.txt
File.open("gtfs.geojson", "w") { |f| f.write(geojson) }   # write the GeoJSON to a file

That’s it! Let me know if you have any suggestions! The README on the GitHub repo gives installation instructions.

The most valuable tip I learned while creating this gem was the use of the $RUBYLIB environment variable. This isn’t necessary when installing a gem onto your system using bundler, but it is extremely helpful during development. $RUBYLIB lets you specify extra directories to be searched when the require method is called. To add paths dynamically from inside a Ruby program, you can push items onto the $: array, which is shorthand for $LOAD_PATH. My require_relative days are over!
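As a small illustration of the in-program route, here is a sketch that prepends a directory to the load path. The helper name add_to_load_path and the ./lib layout are assumptions for the example, not something taken from the gem.

```ruby
# Hypothetical helper: prepend a directory to Ruby's load path at runtime.
def add_to_load_path(dir)
  path = File.expand_path(dir)
  # Avoid adding the same directory twice.
  $LOAD_PATH.unshift(path) unless $LOAD_PATH.include?(path)
  path
end

# Assumed layout: the gem's code lives in ./lib relative to the working directory.
add_to_load_path("lib")
puts $LOAD_PATH.first

# $: is the same array under a shorter name.
puts $:.equal?($LOAD_PATH)
```

The environment-variable route would instead look something like running RUBYLIB=lib ruby script.rb from the project directory.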

If you are considering writing your own gem, I highly recommend RubyGems.org’s “Make Your Own Gem” guide. It is comprehensive and just generally fantastic.

I plan to use gtfs-geojson in a Rails project in the future. And speaking of gems, I’ve also been dabbling with a Ruby API client for Transitland. I hope to have more to share on both fronts soon!

Until then, ride on!

Have any transit projects to share? Let me know!

New York City Transit Depicted With (A New Set Of) Colorful Lines

Update 3/29/16: The transit visualization has been updated! The technical details in this post are still relevant, but some of the conclusions are no longer valid. Read about the updates here!

Stop the buses! Hold the phone! I now have visual proof that buses and subways in the Big Apple run more often on Fridays than Saturdays. How insightful, right? Okay, so maybe not, but I still enjoyed making a New York City transit frequency visualization using Transitland and Mapbox.

VIEW THE TOOL HERE. Try hovering over each route and turning on different days and modes (subway versus bus) of service.

Below are a few images showing the difference in frequency of transit service on Friday and Saturday, followed by a discussion of each component of the project.

Friday service in a New York City Transit Visualization

Friday morning subway and bus frequency. The coverage and frequencies are impressive!

Saturday service in a New York City Transit Visualization

Saturday morning bus and subway service. As expected, the coverage is similar to Friday’s, but the frequencies drop significantly.

What can we learn from this frequency visualization of New York City transit?

Some items this visualization illustrates are to be expected:

  • Transit runs with higher frequencies during the week.
  • Transit runs with higher frequencies in denser areas (Manhattan, Brooklyn) than less dense areas (Staten Island).

A few things made sense after seeing them, but were ideas I had not anticipated:

  • Even in dense areas, bus frequencies are higher in areas that have less subway service, and vice versa. While this is true in Manhattan (more subways and subway frequency) and Brooklyn (more buses and bus frequency), it is quite noticeable in Queens. When you turn the subway layer off, western Queens appears almost devoid of transit. While its subway connections do not reach to the eastern edge of Queens, they do begin to make up for a lack of bus routes in western Queens. A few images below show this.
  • The inter-borough connections between Queens and Brooklyn that are notoriously absent in all heavy rail maps of the area are almost as weak even when viewing bus data. It just isn’t easy to travel between Long Island’s two boroughs. Maybe the planned streetcar will finally help this.

One thing to keep in mind: the trips per hour numbers that appear when you hover over lines on the map are not specific to a transit route. They encompass all transit services, potentially multiple routes and even modes, between the two stops that create an edge.

Queens bus routes in a New York City Transit Visualization

Bus routes in western Queens. Doesn’t this seem like it’s missing something?

Queens subway routes in a New York City Transit Visualization

Bus and subway routes in western Queens. That’s a bit better.

The Data

Transitland is an open source project that aggregates transit feeds from across the world. You can query its JSON API to create apps and visualizations more easily than by directly crunching the underlying GTFS data.

I was inspired to dig into Transitland by this similar frequency visualization for San Francisco. We both use the stops and schedule_stop_pairs API endpoints to calculate how often the “edge” between any two consecutive transit stops is visited in a given time frame.

I chose an appropriate bounding box to encompass all the transit stops operated by MTA and picked a window of 7:30am to 8:00am on the mornings of Friday, January 22, 2016, and Saturday, January 23, 2016. In addition to buses and subways, ferry service is also returned by Transitland in this bounding box, which explains the trips to Staten Island and oddly-direct routes to New Jersey.

The data returned by Transitland is not real-time data of actual transit performance, only the scheduled service times on those dates. I was able to extrapolate a “trips per hour” frequency metric by dividing the edge weight by the length of my query’s time frame.
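To make the arithmetic concrete: three scheduled trips on an edge in a 30-minute window works out to six trips per hour. Here is a minimal sketch of that counting step. The method name trips_per_hour, the record field names, and the stop IDs are assumptions for illustration rather than a copy of my project code, and I'm treating edges as directed.

```ruby
# Sketch only: count scheduled trips on each stop-to-stop edge, then scale to an hour.
# `pairs` stands in for records from the schedule_stop_pairs endpoint; the field
# names used here are assumptions for illustration.
def trips_per_hour(pairs, window_minutes)
  counts = Hash.new(0)
  pairs.each do |pair|
    edge = [pair["origin_onestop_id"], pair["destination_onestop_id"]]
    counts[edge] += 1 # edge weight = number of scheduled trips in the window
  end
  hours = window_minutes / 60.0
  counts.transform_values { |trips| trips / hours }
end

# Made-up records: three trips from stop A to stop B, one from B to C.
pairs = [
  { "origin_onestop_id" => "s-stop-a", "destination_onestop_id" => "s-stop-b" },
  { "origin_onestop_id" => "s-stop-a", "destination_onestop_id" => "s-stop-b" },
  { "origin_onestop_id" => "s-stop-a", "destination_onestop_id" => "s-stop-b" },
  { "origin_onestop_id" => "s-stop-b", "destination_onestop_id" => "s-stop-c" }
]

trips_per_hour(pairs, 30).each do |(from, to), tph|
  puts "#{from} -> #{to}: #{tph} trips/hour"
end
```

Because the count is keyed only on the pair of stops, every route and mode serving that pair is lumped together, which is why the hover numbers are not specific to a single route.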

The Map

I considered publishing a map using QGIS, but I was fortunate enough to stumble upon Mapbox. Mapbox does not have the analytical tools that QGIS does, but its ease of creating interactive web-based maps is impressive.

GeoJSON is a standard JSON variant that holds geographical information, such as points and line segments. In addition to its required fields, I loaded the GeoJSON output files with styling from Mapbox’s simplestyle-spec based on the frequency for that line segment. Mapbox interprets these “properties” fields when it displays the data on a map.
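Here is a rough sketch of what one styled edge might look like when built in Ruby. The keys stroke, stroke-width, stroke-opacity, and title come from simplestyle-spec; the method name edge_feature, the coordinates, the color, and the width are made-up values rather than the ones my tool uses.

```ruby
require "json"

# Sketch only: one GeoJSON Feature for an edge between two stops, with
# simplestyle-spec keys in "properties". Color and width are illustrative.
def edge_feature(from_lonlat, to_lonlat, trips_per_hour, color:, width:)
  {
    "type" => "Feature",
    "geometry" => {
      "type" => "LineString",
      # GeoJSON positions are [longitude, latitude].
      "coordinates" => [from_lonlat, to_lonlat]
    },
    "properties" => {
      "title" => "#{trips_per_hour} trips per hour",
      "stroke" => color,
      "stroke-width" => width,
      "stroke-opacity" => 1.0
    }
  }
end

# Illustrative coordinates in Midtown Manhattan.
feature = edge_feature([-73.9857, 40.7580], [-73.9772, 40.7527], 12.0,
                       color: "#08306b", width: 5)
puts JSON.pretty_generate({ "type" => "FeatureCollection", "features" => [feature] })
```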

A good tool should be simple enough to let you spend time solving real problems, and I found Mapbox to reach this goal swimmingly (is there a similar term for transit??). The small amount of code needed to plot four GeoJSON files, toggle between them, show a map legend, and allow zooming and a loading screen, all on top of a satisfactory OpenStreetMap base map, was remarkable. I will most definitely be using Mapbox for future transit projects!

The Code

As the Transitland JSON interface is language-agnostic, any scripting language could be used. Ruby is by far my favorite, so I stuck with what I know. You can view the code for the visualization in my GitHub repository. The code is divided into an HTML front-end and Ruby back-end, though they do not connect directly. A few ideas I have for the future of this project:

  • The TransitlandAPIReader class could be generalized into a gem with a decent test suite, similar to one Transitland used to maintain and intends to bring back.
  • The run.rb script could take a job spec input to produce GeoJSON files for multiple days and cities in a single run.
  • The Mapbox front-end could be used to visualize any arbitrary transit system’s GTFS shape data. This would likely be done using a Rails back-end, rather than the offline Ruby script I am currently using.

Other News

I spent another few hours this week getting lost reading about the Cincinnati subway. If you haven’t dived into that tunnel of information before, I’d highly recommend it. Something about using an old canal which had become economically unfeasible due to competition from railroads to build a tunnel system that was halted due to a moratorium on capital bonds during World War I and never successfully revived just fascinates me. Seriously, any single part of that last sentence would make for a good story, but all those together create a sort of transit tragedy worthy of a Shakespearean drama.

In the bed of the canal née Erie

doth thou venture to lay parallel rails.

To endure and inspire they began,

ere citizens above were admonished

their Sisyphean ambitions would fail.

I’m getting cold shivers just imagining a chorus reciting that at the opening of a transit conference. Please let me know of any other examples of transit stories told in iambic pentameter.

Until next time, ride on!

Baseball Transit Authority. We’ll wave you home.

I like baseball. I like stadiums. I like maps. I really like transit. The result: the Baseball Transit Authority!

What baseball fan hasn’t dreamed of a single-seat ride between Seattle and Washington, D.C. to catch a Mariners/Nationals doubleheader? The Baseball Transit Authority’s new stadium subway is here to ferry you between ball parks faster than Rickey Henderson could slide into second.

You may notice there are only 25 stations for the 30 MLB teams, with multiple teams combined into single transit stops in Chicago, Los Angeles, New York, Washington, D.C./Baltimore, and San Francisco. Now may be a good time to say: the map is entirely fictional. You can rest assured no fans heading to an Orioles game will be dropped off halfway between Nationals Park and Camden Yards. Best of all, the Baseball Transit Authority system is entirely underground. An original revision did not contain the route between the Twins and Mariners, but fortunately for baseball-transit-fiction-land, Rob Manfred developed some creative financing schemes and was able to complete the entire subterranean Northwest Corridor on-schedule and under-budget.

Baseball Transit Authority Subway Map

The Baseball Transit Authority prides itself on performance. While “I don’t care if I never get back” is the chorus of its ridership, this iron rail is more reliable than Cal Ripken, Jr.

Joking about highly unfeasible transit systems aside, I learned loads about incredibly useful vector graphics while producing this map! How can you not be a fan of an infinite-resolution image? Raster graphics (of which JPEG is one format) define images using an array of pixels, whereas vector graphics define images using a list of shapes, text, and colors, written as XML in the case of SVG (Scalable Vector Graphics). While you can write the code yourself and understand SVGs 110% better as a result, fortunately this is not necessary. Cam Booth of the Transit Maps blog recommends the industry-standard Adobe Illustrator to create vector transit maps. For beginners and designers on a budget, Inkscape is a capable alternative. Best of all, it’s free to download! I learned about shapes, grids, paths, and nodes, and was quickly able to apply them to my project of confusing baseball and transit fans everywhere.
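If you do want a taste of writing the code yourself, here is a tiny hand-written example: a two-station "line" written out from Ruby. It is only a sketch to show the shapes-as-XML idea. The method name tiny_transit_svg, the station names, the colors, and the coordinates are invented, and it has nothing to do with how the actual map was drawn in Inkscape.

```ruby
# Sketch only: a two-station "line" described as SVG shapes and text.
# Station names, colors, and coordinates are invented for illustration.
def tiny_transit_svg
  <<~SVG
    <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 60">
      <line x1="30" y1="30" x2="170" y2="30" stroke="#c8102e" stroke-width="6"/>
      <circle cx="30" cy="30" r="8" fill="white" stroke="black" stroke-width="2"/>
      <circle cx="170" cy="30" r="8" fill="white" stroke="black" stroke-width="2"/>
      <text x="30" y="55" font-size="8" text-anchor="middle">Station A</text>
      <text x="170" y="55" font-size="8" text-anchor="middle">Station B</text>
    </svg>
  SVG
end

# Write the file to the current directory; open it in a browser or Inkscape.
File.write("tiny_transit.svg", tiny_transit_svg)
```

Because the file stores a line and two circles rather than pixels, it stays crisp at any zoom level.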

Click here to download the map as a PDF. And, as always, let me know about any recommendations you may have!

Don’t forget: next time you head out to the ball park…

Ride BTA. We’ll wave you home.

Transfort Bus Stops Through the Lens of GIS

To better understand the Fort Collins population and what percentage of it is adequately served by Transfort bus stops, I decided to jump on board the GIS-hype train. I downloaded QGIS, read a bit at qgistutorials.com, and felt ready to dive in.

You’re talkin’ about Transfort bus stops?

You bet I am! To begin (and prove to myself this wouldn’t be the most manual project I’d ever taken on), I collected data from several sources. I have included a Links section with paths to download the data yourself. You can also jump straight to the data, though you’ll miss some sweet graphics along the way.

  1. Transfort – I could not find shapefiles for either the Transfort stops or routes, so I began with the GTFS feed. Data in this transit agency standard format is in a series of comma-separated text files. The three of interest to me were the stops.txt, shapes.txt, and routes.txt.
  2. City of Fort Collins – I used two shapefiles provided by the city: one of city limits and one of street centerlines.
  3. Colorado Information Marketplace – Fortuitously, Colorado publishes population data on the census block level. These correspond to city blocks, which were necessary for analyzing the population within Fort Collins.

To visualize the population density, I began with a heatmap. The census blocks shapefile is essentially a table of polygons, each with an attribute containing the population of that block in 2010. I filtered the layer to only include blocks within Larimer County and then created a layer of the census block centroids, which turned each polygon into a point. At this point, the default layer unit was degrees. To analyze this layer in meters, I reprojected the layer to SIRGAS 2000 / UTM Zone 13N. I then created a raster heatmap with a radius of 402 meters, which corresponds to a quarter-mile radius. This is an area of approximately 0.2 square miles, which is also listed in the map legend.
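The unit conversion and area figure are easy to double-check. This short sketch just redoes the arithmetic (area = π × radius²); the method name circle_area_sq_mi is my own for this example.

```ruby
METERS_PER_MILE = 1609.344

# Area in square miles of a circle whose radius is given in meters.
def circle_area_sq_mi(radius_m)
  radius_mi = radius_m / METERS_PER_MILE
  Math::PI * radius_mi**2
end

puts format("402 m = %.3f miles", 402 / METERS_PER_MILE)
puts format("heatmap radius area: %.3f square miles", circle_area_sq_mi(402))
```

A 402-meter radius comes out at just under a quarter mile, and the circle it sweeps covers roughly 0.196 square miles, which rounds to the 0.2 quoted above.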

Before analyzing bus stops, I wanted to visually present each Transfort route. This required converting the shapes.txt file into a routes shapefile. QGIS can do this with the “Points to path” tool under Vector Creation algorithms. I have uploaded the resulting shapefile, along with PDFs of the following maps, in the Downloads section.

Population Density with Transfort Bus Stops

In addition to population density, I wanted to study walking distance from Transfort bus stops. Latitude and longitude information for each stop is contained within the stops.txt file. The QGIS plugin MMQGIS allows imports of these using “Geometry Import from CSV File.” I again needed to reproject the resulting layer to SIRGAS 2000 / UTM Zone 13N to ensure the layer units were meters. I wanted to see the results of a 10-minute walk radius, so I created 804-meter (half-mile) buffers around each bus stop.

Ten-Minute Walk Radii from Transfort Bus Stops

Since the half-mile coverage seemed surprisingly complete, I created a layer of 402 meter (quarter-mile) buffers around each bus stop to show the area within a 5-minute walk.
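Outside of QGIS, a circular buffer amounts to a distance test: a point is inside a stop's buffer if it lies within the radius. The sketch below swaps in a different technique from what I did in QGIS. Instead of reprojecting to a meter-based coordinate system and buffering, it measures great-circle distance on raw latitude and longitude with the haversine formula. The method names and the coordinates are made up for illustration.

```ruby
EARTH_RADIUS_M = 6_371_000.0

# Great-circle distance in meters between two [lat, lon] pairs (haversine formula).
def haversine_m(a, b)
  lat1, lon1, lat2, lon2 = [*a, *b].map { |deg| deg * Math::PI / 180 }
  dlat = lat2 - lat1
  dlon = lon2 - lon1
  h = Math.sin(dlat / 2)**2 + Math.cos(lat1) * Math.cos(lat2) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h))
end

# True if `point` lies inside the buffer of at least one stop.
def within_walk?(point, stops, radius_m)
  stops.any? { |stop| haversine_m(point, stop) <= radius_m }
end

stops = [[40.5853, -105.0844]]   # made-up stop location
centroid = [40.5907, -105.0844]  # made-up census block centroid, about 600 m north

puts within_walk?(centroid, stops, 402) # quarter-mile (5-minute) buffer
puts within_walk?(centroid, stops, 804) # half-mile (10-minute) buffer
```

The sample centroid sits about 600 meters from the stop, so it falls outside the quarter-mile buffer but inside the half-mile one.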

Five-Minute Walk Radii from Transfort Bus Stops

To allow the population density layer to blend with the walk distance buffers, I changed the layer blending mode to Darken. The shades of green in the image below show dense areas overlapping with a 5-minute walk radius from a bus stop.

Population Density with Five-Minute Walk Radii from Transfort Bus Stops

Do you have any numbers I can ‘wow’ my friends with?

Fort Collins Population within Walking Distance of Transfort Bus Stops

I’m glad you asked! Another powerful feature of GIS is quantitative analysis. I used “Basic statistics” on the 2010 population field of the census block centroids layer to calculate the population of Larimer County. For the population of Fort Collins, I selected the census blocks that lie within the city limits polygon using a Spatial Query. Running statistics on these selected census block centroids produced the city population number in the table above.

You can see that the first half-mile buffer population is larger than the city population. I calculated the population within the walking distance buffers using two methods to adjust for this:

  1. No Flex Route – The FLEX is a commuter route operated by Transfort whose northern terminus is the South Transit Center. Several of its northern-most stops are inside Fort Collins city limits. I made the decision to remove the population near FLEX stops as commuter bus service has lower frequencies, and therefore different usage patterns, than a typical city bus. This was accomplished by joining the stops shapefile with stop_times.txt and trips.txt to give each stop a column with its route name. I then used the Query Builder to select all stops whose route name was not “FLEX”.
  2. City Limits – The northwest corner of the Transfort routes actually runs outside of Fort Collins city limits, meaning the people living close to these stops were not included in my calculated city population. I performed a Spatial Query to select the bus stops within the city limits polygon boundary. These are the only stops I calculated buffers around when selecting census blocks for this method.
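The join in the first method can also be sketched outside of QGIS using the GTFS text files directly. This is a minimal illustration under stated assumptions, not what I actually ran: it assumes the route's name is stored in route_short_name in routes.txt, it keeps any stop that is served by at least one other route, and the method name and sample rows are invented.

```ruby
require "csv"
require "set"

# Sketch only: stop_ids served by at least one route other than the excluded one.
# Inputs are the text of trips.txt, stop_times.txt, and routes.txt.
def stops_excluding_route(trips_csv, stop_times_csv, routes_csv, excluded_name)
  # route_id => route name (assumes the name lives in route_short_name)
  route_names = CSV.parse(routes_csv, headers: true)
                   .map { |r| [r["route_id"], r["route_short_name"]] }.to_h
  # trip_id => route name, via the trip's route_id
  trip_routes = CSV.parse(trips_csv, headers: true)
                   .map { |t| [t["trip_id"], route_names[t["route_id"]]] }.to_h

  kept = Set.new
  CSV.parse(stop_times_csv, headers: true).each do |st|
    kept << st["stop_id"] unless trip_routes[st["trip_id"]] == excluded_name
  end
  kept
end

# Made-up miniature feed: stop s1 is FLEX-only, s2 is shared, s3 is MAX-only.
routes = "route_id,route_short_name\n1,FLEX\n2,MAX\n"
trips = "route_id,trip_id\n1,t1\n2,t2\n"
stop_times = "trip_id,stop_id\nt1,s1\nt1,s2\nt2,s2\nt2,s3\n"

puts stops_excluding_route(trips, stop_times, routes, "FLEX").to_a.inspect
```

In the miniature feed, only the FLEX-only stop drops out; the shared stop survives because another route also serves it.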

In the figure below, the bottom-center red circle shows the location of the first adjustment and the top-left red circle the second.

Exceptions Made When Analyzing Data for Transfort Bus Stops

What does this mean?

You can see that the east side of Fort Collins contains both fewer dense areas and fewer routes, especially north-south routes. I have seen a pre-MAX Transfort map (pre-May 2014) that contained a north-south route on Timberline Road, the easternmost arterial in Fort Collins. While it is disappointing that ridership supposedly did not justify keeping this route, the density numbers back up this service change.

The half-mile buffers confirm that the city is broken into a square mile grid. The two east-west routes (Horsetooth and Harmony) in the southeast corner of the map show the half-mile circles bumping against each other, creating a distance of a mile between the two roads.

Depending on the metric, between 60% and 63% of Fort Collins residents are within a five-minute walk of a Transfort bus stop. This is significantly higher than I would have guessed. However, being near a bus stop is only part of the story; frequency of service and driving disincentives also play a major role in whether a resident will ride the bus or not. Parking is quite easy in most of Fort Collins and the areas where it is harder, mainly Old Town, provide markedly sub-market value parking. The headway on the routes, excluding the MAX, is either 30 or 60 minutes. And there is no Sunday service. All this goes to say living within 5 minutes of a bus stop does not necessarily make for a transit heaven. It should also be noted that the block-level populations may have significantly changed since the 2010 census.

Regardless, having 87% of your population live within a 10-minute walk of a bus stop indicates an overall lack of transit deserts and a fairly comprehensive bus system. I think Fort Collins is very close to a big shift in transit culture!

For better or worse, I will now scrutinize the density surrounding my Transfort bus stops and routes even more. Here’s what I can tell you from my observations thus far: it’s not great. Here’s what the data says: it’s not great. And here’s to using data to continue to improve our city and its bus service!

Let me know if there is more I can do with this data! I’d also enjoy seeing analysis of this type in your own city.

Until next time, ride on!

Downloads

Transfort Routes Shapefile

Population Density with Transfort Bus Routes PDF

Ten-Minute Walk Radii from Transfort Bus Routes PDF

Five-Minute Walk Radii from Transfort Bus Routes PDF

Population Density with Five-Minute Walk Radii from Transfort Bus Routes PDF

Links

Transfort GTFS Feed
http://www.ridetransfort.com/developers OR
https://code.google.com/p/googletransitdatafeed/wiki/PublicFeeds OR
http://transitfeeds.com/

City of Fort Collins
http://www.fcgov.com/gis/downloadable-data.php

Colorado Information Marketplace
https://data.colorado.gov/Demographics/Census-Blocks-2010/xipb-k5bu