Tyler A. Green

In Transit

Tag: graph

Hello Transportation Techies!

Last Wednesday, I rode the Amtrak Northeast Regional to our nation’s capital to attend my first Transportation Techies meetup! Michael had been urging me to attend since we first met at TransportationCamp Colorado 2016. Now that I’m on the east coast, it was an easy trip!

Michael always puts on an incredible event and Metro Hack Night VII was no exception! We had a great line-up and turnout. I presented on how we can use graph theory to study transit networks, specifically WMATA Metrorail with and without the Purple line.

You can browse my slides below, or download the PDF deck.

 

All the presenters blew me away with their excellent, excellent projects. The DC transit developer community is special and I loved getting to experience it for one night. You can read about the all projects and presenters in this write-up from Mobility Lab.

If you live in the DC region (or anywhere along the Northeast Corridor), I encourage you to drop in for the next gathering of the Transportation Techies!

Graphing Transit Systems, Part III – Centrality Extended

This is the third post diving into the graph structure of the New York City subway system. Read the first two for more background!

At the start of last post, I threw out two questions:

  1. Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?
  2. Can all stations in a transit network be important?

We discussed the difference between centrality metrics and node importance metrics. The former identify important nodes in a network, while the latter ranks nodes by importance. We’ll use the node importance metrics to answer these questions.

To support our discussion, I whipped up a map showing the MTA subway ridership data by itself using Carto. Here’s the interactive map! The data is from the years 2010 to 2015 and is provided by the MTA.

Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?

To answer this question, I calculated the correlation between ridership and centrality. In the scatter plots below, the independent variable is the centrality score per station, and the dependent variable is the ridership at that station, averaged over the years 2010 through 2015. This might seem backwards, but I chose this because the centrality metric is a reflection of the network structure and we are studying the effect of network structure on ridership.


The correlation coefficient for these two data sets show a moderate positive correlation.

  • Closeness centrality, r = 0.43
  • Outward accessibility, r = 0.30

Remember, correlation does not imply causation, but these figures suggest that for an increase in the centrality metric, you can expect a moderate increase in ridership.

Did you notice Times Square on the scatter plots? Yep, with an average annual ridership of almost 63 million, it’s the outlier. Based on its position on the horizontal axis, closeness centrality thinks Times Square is an important station in the network, while outward accessibility does not. If you remember from last post, PageRank also finds Times Square to be important and Katz just confused us all. That answers our first question!

Before we go on, I have a theory that any outlier in these plots are the result of externalities. For example, the average ridership at Yankee Stadium – 161 St is 8.7 million, but its neighboring stations have ridership of 1.3, 3, 3, and 4.3 million each on average. What is its externality? The world-famous New York Yankees. Times Square – 42 St is a similar situation. Not only is it a transfer point for 12 NYC subway services, it is also below the mega tourist attraction and its namesake, Times Square. I have no hard data on this outliers theory, but more research could be done on this!

Can all stations in a transit network be important?

Why would we want all stations to be “important”? If our goal is for all citizens to have equal access to quality to public transportation, we would like everyone to live near a station which provides this gateway. A transit network will always have stations which are more centrally located than others, but is it possible to minimize the differences between the most connected and the least connected stations? Let’s see how do our metrics evaluate the structure of another world-class network in this regard. Enter Paris, its minimal geographical constraints, and its lovely radial network.

The two histograms below sit on the same range on the horizontal axis. The count on the vertical axis is the number of stations which fall into the horizontal range represented by its bar. As you can see, Paris has many, many stations which score higher than all of New York City’s.


There is one large caveat here: land area. Officially, the area of New York City is 302.6 square miles, while Paris is only 40.7 square miles. Another metric is longest subway line: New York’s A train extends 31 miles, while Paris’ Line 13 is just over 15 miles. Closeness centrality uses shortest path between station pairs, which in my graph, are the number of seconds for a trip. A 31-mile subway trip will take longer than a 15-mile subway trip, so this metrics are stacked against New York City subway and the large area it covers.

Concerning our question, even though Paris’ stations score much higher than New York’s, they are not all equal. This gets back to my earlier point: there will be an importance continuum among stations, but improving the importance of the least connected stations can still provide a benefit to citizens.

Next, let’s look at this histograms for New York and Paris outward accessibility scores.

There is not as much difference between New York City and Paris in the histograms for outward accessibility. This metric is independent of network area or subway line length, so this does not surprise me. It may hint that more of the difference between the networks for closeness centrality may be due to geographical area.

If you look at the densest parts of the Paris network and see how interconnected it is, the upper bound for its accessibility distribution being higher than New York’s also will not be a surprise.

Next Stop

Now that we have evaluated the centrality of multiple transit networks and performed limited cross-network comparisons, I want to know whether these metrics can tell us the best future subway routes. For example, given the budget for a single new subway line, what is the best route for this new line? It will be a very empirical and barely human analysis, so we may have to take the results with a grain or six of salt, but hopefully the results will have value besides making shapes on maps.

See you then!

Graphing Transit Systems, Part II – Centrality

This post is the second of three four looking into the graph structure of the New York City subway system. In the previous post, I discussed a frontend I built to visualize a depth-first search, breadth-first search, and shortest path algorithm. I ended with a discussion of centrality algorithms. We pick up our hero there…

Centrality metrics identify important nodes in a graph. In the gtfs-graph world, nodes represent subway stations. Why might we want to identify important stations in the NYC subway network? Honestly, my initial reason was I thought it sounded cool. I was curious to see if there are numbers (besides ridership…we’ll get to that in the next post!) to rank stations which align with our human perception of important stations in the system. Meaning: does the network structure indicate Times Square is a critical station, or is that just where the most riders board? That was the first question I wanted to explore. The next question would challenge the Lake Wobegon effect. That is: can all stations in a network be important?

To answer these questions, I created a web app for three cities and their heavy rail networks:

Each city has results for four centrality metrics: PageRank, Katz centrality, closeness centrality, and outward accessibility. I will be discussing the results in terms of the New York City network.

It is worth noting at this point that analyzing a transit network only using stops and edges is a very simplified model. To make any real decisions on the system as it relates to the city and population it serves, we would need to consider population density and employment centers at minimum. Knowing that, let’s proceed!

PageRank

If PageRank sounds familiar to you, it’s likely because it is the algorithm used by book publishers to identify pages, and definitely not because it was invented by Google co-founders Larry Page and Sergey Brin to rank web pages for their search engine. In this algorithm, a node’s importance is derived from the importance of all the nodes which link to it. Mapped over to transit, a station’s importance is derived from the importance of all the stations which have direct connections to it.

The PageRank results look interesting and definitely pick out important stations, but they do not give us insight into the entire distribution of stations.

The PageRank results look interesting and definitely pick out important stations, but they do not give us insight into the entire distribution of stations.

I was giddy while implementing this and my brain swirled with grand visions of unlocking new insights to generations-old transit networks. As it turns out, PageRank is not a great model for a transit network. Let’s look at an example.

In the NYC PageRank view, you can see that Times Square comes out on top. Let’s collectively channel our inner undergrad physics lab student and breathe a sigh of relief that the numbers show us what we expected. Phewwwwwww. However, if we look at one of its neighbors, 34 St – 11 Av AKA the 7 train extension, we see that it ranks last. Not just maybe not top ten or top 100, but dead last. PageRank is saying that the 7 train extension produced a station that is literally the least important in the NYC network.

Have no fear Andrew Cuomo, let’s consider the model again. If you throw in sample numbers using the PageRank formula, you can see that the above behavior is correct. 34 St – 11 Av only has one “link” and that node’s PageRank is high, but it also has a high out-degree. Using the random surfer / random transit rider model, a rider passing through Times Square is not likely to end up at 34 St – 11 Av. Sorry 7 train, but PageRank is just does not do your $2.4 billion price tag justice. Let’s see how the other centrality metrics view the subway network!

Poor 34 St - 11 Av doesn't get any love from PageRank. The data on the right shows the top 10 stations serve several subway routes each. This is not a coincidence; PageRank picks out highly connected nodes.

Poor 34 St – 11 Av doesn’t get any love from PageRank. The data on the right shows the top 10 stations serve several subway routes each. This is not a coincidence; PageRank picks out highly connected nodes.

Katz Centrality

Katz Centrality builds on PageRank by considering all walks between two stops in a network, as opposed to only the shortest path between nodes. This appealed to me in a transit context because in a dense network such as Paris, there are often numerous routes between any two stops. This built-in redundancy has been brought up recently as a weakness of the DC metro during the on-going two-track vs. four-track debate and how it affects the maintenance window for a major heavy rail system.

Now is a good time to mention that I would highly recommend the Wikipedia entry for Katz centrality and all the metrics in this post. The original Katz paper is insightful as well.

The results from Katz are……confusing. If you picked South Ferry as the most important MTA station, you either love platform extenders or misguidedly added the Staten Island Ferry to your subway network. The Staten Island Railway data is included in the MTA subway GTFS feed, so I kept it on my map. Closeness centrality (up next!) requires all nodes to be reachable from every other node, so I threw a fake edge in to the graph to represent the ferry. Believe me: the results were just as confusing before I added the ferry route. Due to the multiplicative nature of Katz centrality, the resulting distribution ranges from 0.00244 (Ozone Park – Lefferts Blvd) to 693,246.863 (St George, just across from South Ferry on the south-bound ferry).

Here’s all the insight I can offer on Katz centrality: all traffic between two well-connected sections of the graph (Staten Island and the entire rest of the MTA subway) has to pass through two stations: South Ferry and St George. Therefore, they are “important” and “central” and I am “confused” and “ready to talk about other metrics.”

Katz says the subway network is equally unimpressive. Except for South Ferry. What a champ.

Katz says the subway network is equally unimpressive. Except for South Ferry. What a champ.

Closeness Centrality

My friend Calvin and I half made-up, half realized-it-was-already-a-thing, a centrality metric which promised a return to the fundamentals. Closeness centrality (or as Cal and I called it, the squiggly-doo) is intuitive in that the closer a node is to all other nodes, the more “central” it is. It does this by ranking a node by the sum of the shortest paths to all other nodes in the network. As you may remember from last post, the distance of each edge in our network is the number of seconds to travel via that route segment according to that system’s GTFS feed.

At this point of confusing results from two metrics, I discovered the term “node influence metrics.” These metrics seek to answer my second question from earlier: can all nodes in a network be important? PageRank and Katz identify important nodes, but only the top of their resulting distribution should be considered. This means the metric results for the bottom half of the distribution are more or less meaningless. Technically, closeness centrality is not a node influence metric, but I treat it as such. Intuition tells me that its results have meaning for the entire distribution of nodes. Please comment if you feel otherwise!

Neapolitan ice cream anyone? Closeness centrality results have no surprises.

Neapolitan ice cream anyone? Closeness centrality results have no surprises.

Manhattan stations are ranked highly by closeness centrality. This uniformity is in contrast to the Manhattan results for outward accessibility.

Manhattan stations are ranked highly by closeness centrality. This uniformity is in contrast to the Manhattan results for outward accessibility.

The closeness centrality results are extremely straightforward. Subway stations on Manhattan score higher because riders can reach all other stations in less time there than elsewhere. The opposite is true for Far Rockaway. This algorithm will play an important role in the next post!

Outward Accessibility

Outward accessibility is one of the primary node influence metrics. It produces a normalized version of diversity entropy proposed in this paper by Travençolo and Costa. A node ranks highly when many unique paths can be taken from it over a course of random walks of varying distances. Sections of a graph which rank highly by this metric are found to have high network redundancy and high accessibility from the rest of the network. Redundancy and accessibility are both critical when evaluating a transit network, so this seemed like a good fit!

One drawback to the outward accessibility metric is performance and repeatability. Before calculating the actual metric, one must perform a series of random walks of varying distances from each node. For these walks to be representative, the walk count must be high, which can lengthen execution time of the analysis. Due to the nature of random calculations, the answers change every time! This could be solved by using a consistent random number generator seed when running the analysis, or by always running enough random walks for the results to converge.

Outward accessibility gives us the weather map similar to closeness centrality, but are its individual stations ranked similarly?

Outward accessibility gives us the weather map appearance similar to closeness centrality, but are its individual stations ranked similarly?

Outward accessibility picks out hotspots of importance in a graph network. These can vary slightly due to the random nature of this algorithm, but should converge over time with enough random walks.

Outward accessibility picks out hot spots of importance in a graph network. These can vary slightly due to the random nature of this algorithm, but should converge over time with enough random walks.

The results for outward accessibility appear to parallel those of closeness centrality at first glance. However, a closer look at the accessibility results shows hot spots. The metric tells us these are the nodes which allow riders to traverse the most unique routes in a given distance. Translated to the real world, this is valuable to the rider’s perception of a transit network. If I can go to 20 different stations within 10 subway stops (on any route), my location is better served by public transit than if I can only go to 10 stations within 10 stops.

Accessibility also has a strange property of ranking end stations higher. The logic is that if I start from the second from the end station, half of my random walks will go outwards and produce little diversity entropy. Conversely, if I start from the end station, all of my walks will go towards the potentially more diverse part of the graph. I emailed the paper authors to comment on this behavior, but have not heard back. If you are reading this, Travençolo or Costa, please comment with insight!

Next Stop

If you’ve hung with me this long and have noticed I haven’t answered either question posed at the start of this post, I’m going to grant you a short break. In the next post, we’ll discuss how the closeness centrality and outward accessibility results correlate to the NYC subway ridership numbers, as well as how these metrics compare between NYC and Paris. I hope you’ll stay on board!

Graphing Transit Systems

I’ve been away from the blogging world for a while! The last few months included a fantastic and inspiring trip to Transportation Camp NYC and loads of (mostly) fun weekend work on transit graphs.

In a hodgepodge effort to improve on Javascript, learn React, create a generic graph representation of a GTFS feed, and implement a few graph algorithms, I finally have a working TRANSIT GRAPH DEMO.

Why transit graphs?

While reviewing algorithms on Jason Park’s algorithm visualizer, I thought, “WE CAN APPLY THESE TO TRANSIT.” It was a moment of pure destiny. To call it multidisciplinary intrigue would be underselling my excitement. Of course, I was not the first person to connect transit and graphs; Google Maps, Open Trip Planner, and Mapzen’s Valhalla are all built on graph representations.

My original goal was to display an animated graph traversal of the New York City subway system. I’ve ended up with a platform to study graph algorithms on transit maps. (I learned that if I’m unsure what I’m building, just call it a platform. The solutions will follow.)

As is the norm in 2016 JavaScript, I used almost as many tools and libraries as there are NYC subway stations. My goal in all projects is to use as little custom data as possible, so I stuck with my Boston model and loaded the MTA GTFS feed into an Amazon RDS Postgres instance. The backend is a Node.js server which boots up after constructing a graph. I used the new ES6 ‘class’ keyword to create a TransitGraph in the style of the object-oriented languages I was raised on. The original frontend was written using JQuery, but when I reached the point of implementing an autocomplete search box, I knew I needed to up my tool game. Enter: React. Facebook’s documentation on the library is quite comprehensive and I latched on to the object-oriented feel and state-based programming model. All the data (stops and routes/edges) is communicated via WebSocket that persists through an entire client connection.

As you can see when using the graph demo, there are three modes. A bit on each…

Shortest Path

Dijkstra’s is the classic gateway algorithm to finding shortest paths in graphs. Wikipedia’s explanation is as clear-worded as I’ve read, so I’ll defer to them:

It picks the unvisited vertex with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller.

Fire up the algorithm visualizer for to help picture this. In my graph, the edge weights are the time between stations. After running Dijkstra’s, we have an ordered sequence of nodes which represent the shortest path between the origin and destination and the time it would take to do so.

A sample shortest-path from 50th St to 1 Av. The routes are calculated from the GTFS feed based on the trips that pass through that stop. This can periodically result in slightly different route listings than the official MTA map.

A sample shortest path from 50th St to 1 Av. The routes which serve each station are derived from the GTFS feed based on the trips that pass through that stop. This can periodically result in slightly different route listings than the official MTA map.

The user interface to pick the origin and destination nodes. I studied Pinterest's CSS to help build the stop tokens that populate the input fields when selected. The route details at the bottom uses "display: flex;", a tip I picked up from the Google Maps CSS.

The user interface to pick the origin and destination nodes. I studied Pinterest’s CSS to help build the stop tokens that populate the input fields when selected. The route details at the bottom uses “display: flex”, a tip I picked up from the Google Maps CSS.

Sound familiar? Google Maps transit directions do the exact same time. And much better! Knowing when to switch trains becomes a luxury after using my tool.

Depth-First Search

A depth-first search, or DFS as the real algorithm geeks call it, is a classic traversal method for both graphs and trees. The idea of a traversal is to visit all the nodes in the graph which can be reached given a starting node. The depth-first variety is contrasted with the breadth-first procedure (up next!) in that, given a starting node, one of its neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, and so on. Was there anything weird about that last sentence? This is a recursive algorithm! When a visited node has no unvisited neighbors, the algorithm pops back up the call stack, testing for unvisited neighbors at each level.

A snapshot of visited nodes early in a depth-first search from Yankee Stadium. A red line segment is an edge that has been visited, but not unvisited, while a blue line segment has already been unvisited. As the recursive function pops higher up the call stack, more edges turn blue.

A snapshot of visited nodes early in a depth-first search from 161 St – Yankee Stadium. A red line segment is an edge that has been visited, but not unvisited, while a blue line segment has already been unvisited. As the recursive function pops higher up the call stack, more edges turn blue.

We can see that at the completion of a DFS from 161 St - Yankee Stadium, the entire MTA subway system has been visited. The nodes that have not been visited are the Staten Island railway, which has no rail connections to the subway system and therefore no edges in my graph.

We can see that at the completion of a DFS from 161 St – Yankee Stadium, the entire MTA subway system has been visited. The nodes that have not been visited are the Staten Island railway, which has no rail connections to the subway system and therefore no edges in my graph.

Breadth-First Search

A breadth-first search is another traversal variant whose lofty goal is to identify connected components of a graph while providing zero valuable info to passengers riding transit. (Now would be a good time to say that identifying connected components will play a key role in merging nodes during a later step in this project. Traversals are a necessary part of any graph analyzer’s toolkit!) As you may have guessed, a BFS goes wide before it goes deep. From a given node, all of its neighboring nodes are visited before any of their neighbors are visited. This produces a different exploration pattern, which is illustrated in the following three images.

A snapshot early in a breadth-first search from Queensboro Plaza. We see that the visited nodes are spreading outward from the source. Think: diseases. Depth-first search is how you solve a maze and breadth-first search is how you get sick.

A snapshot early in a breadth-first search from Queensboro Plaza. We see that the visited nodes are spreading outward from the source. Think: diseases. Depth-first search is how you solve a maze and breadth-first search is how you get sick.

A bit farther in the breadth-first search, we can see the disease...err...graph traversal has continued to spread outward.

A bit farther in the breadth-first search, we can see the disease…err…graph traversal has continued to spread outward.

The completion of the breadth-first search. There are no blue edges because this is not a recursive algorithm.

The visited edges at the completion of the breadth-first search. There are no blue edges because this is not a recursive algorithm.

What’s next?

“gtfs-graph” (the GitHub project name for now – please help me come up with a better one!) is built to be system-agnostic. I have graph representations for Boston and Paris in addition to New York City. While the GTFS standard allowed me to construct all three graphs in similar ways, there were still a few quirks, resulting mainly from how the different systems represent sub-stops (parent/child or northbound/southbound).

Recently, I have been implementing centrality algorithms to see how the results varied from system to system. Paris’ RATP heavy rail lines certainly look to have higher connectivity than Boston’s hub-centric design, and I’m working to find the numbers to prove this. If I can indeed prove this, I’d like to use a genetic algorithm to efficiently enhance (add lines and stops) a system to match the connected-ness/centrality distribution/equity/whatever-metric-I-end-up-with of a higher quality system.

After implementing Google’s PageRank algorithm, I decided it is a poor model for transit. The rankings currently displayed are a modified version of closeness centrality. I really enjoyed this white paper on a node importance algorithm and plan to implement this soon. It uses random walks to calculate the entropy of a given node after a given number of steps.

I hope to have a much more detailed most on these metrics in the coming weeks! I would love to hear any thoughts or ideas you might have about any or all of this!

Let’s build awesome things to help transit, cities, and, most of all, people.