Tyler A. Green

In Transit

Tag: new york city

Graphing Transit Systems, Part III – Centrality Extended

This is the third post diving into the graph structure of the New York City subway system. Read the first two for more background!

At the start of last post, I threw out two questions:

  1. Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?
  2. Can all stations in a transit network be important?

We discussed the difference between centrality metrics and node importance metrics. The former identify important nodes in a network, while the latter ranks nodes by importance. We’ll use the node importance metrics to answer these questions.

To support our discussion, I whipped up a map showing the MTA subway ridership data by itself using Carto. Here’s the interactive map! The data is from the years 2010 to 2015 and is provided by the MTA.

Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?

To answer this question, I calculated the correlation between ridership and centrality. In the scatter plots below, the independent variable is the centrality score per station, and the dependent variable is the ridership at that station, averaged over the years 2010 through 2015. This might seem backwards, but I chose this because the centrality metric is a reflection of the network structure and we are studying the effect of network structure on ridership.


The correlation coefficient for these two data sets show a moderate positive correlation.

  • Closeness centrality, r = 0.43
  • Outward accessibility, r = 0.30

Remember, correlation does not imply causation, but these figures suggest that for an increase in the centrality metric, you can expect a moderate increase in ridership.

Did you notice Times Square on the scatter plots? Yep, with an average annual ridership of almost 63 million, it’s the outlier. Based on its position on the horizontal axis, closeness centrality thinks Times Square is an important station in the network, while outward accessibility does not. If you remember from last post, PageRank also finds Times Square to be important and Katz just confused us all. That answers our first question!

Before we go on, I have a theory that any outlier in these plots are the result of externalities. For example, the average ridership at Yankee Stadium – 161 St is 8.7 million, but its neighboring stations have ridership of 1.3, 3, 3, and 4.3 million each on average. What is its externality? The world-famous New York Yankees. Times Square – 42 St is a similar situation. Not only is it a transfer point for 12 NYC subway services, it is also below the mega tourist attraction and its namesake, Times Square. I have no hard data on this outliers theory, but more research could be done on this!

Can all stations in a transit network be important?

Why would we want all stations to be “important”? If our goal is for all citizens to have equal access to quality to public transportation, we would like everyone to live near a station which provides this gateway. A transit network will always have stations which are more centrally located than others, but is it possible to minimize the differences between the most connected and the least connected stations? Let’s see how do our metrics evaluate the structure of another world-class network in this regard. Enter Paris, its minimal geographical constraints, and its lovely radial network.

The two histograms below sit on the same range on the horizontal axis. The count on the vertical axis is the number of stations which fall into the horizontal range represented by its bar. As you can see, Paris has many, many stations which score higher than all of New York City’s.


There is one large caveat here: land area. Officially, the area of New York City is 302.6 square miles, while Paris is only 40.7 square miles. Another metric is longest subway line: New York’s A train extends 31 miles, while Paris’ Line 13 is just over 15 miles. Closeness centrality uses shortest path between station pairs, which in my graph, are the number of seconds for a trip. A 31-mile subway trip will take longer than a 15-mile subway trip, so this metrics are stacked against New York City subway and the large area it covers.

Concerning our question, even though Paris’ stations score much higher than New York’s, they are not all equal. This gets back to my earlier point: there will be an importance continuum among stations, but improving the importance of the least connected stations can still provide a benefit to citizens.

Next, let’s look at this histograms for New York and Paris outward accessibility scores.

There is not as much difference between New York City and Paris in the histograms for outward accessibility. This metric is independent of network area or subway line length, so this does not surprise me. It may hint that more of the difference between the networks for closeness centrality may be due to geographical area.

If you look at the densest parts of the Paris network and see how interconnected it is, the upper bound for its accessibility distribution being higher than New York’s also will not be a surprise.

Next Stop

Now that we have evaluated the centrality of multiple transit networks and performed limited cross-network comparisons, I want to know whether these metrics can tell us the best future subway routes. For example, given the budget for a single new subway line, what is the best route for this new line? It will be a very empirical and barely human analysis, so we may have to take the results with a grain or six of salt, but hopefully the results will have value besides making shapes on maps.

See you then!

Graphing Transit Systems, Part II – Centrality

This post is the second of three four looking into the graph structure of the New York City subway system. In the previous post, I discussed a frontend I built to visualize a depth-first search, breadth-first search, and shortest path algorithm. I ended with a discussion of centrality algorithms. We pick up our hero there…

Centrality metrics identify important nodes in a graph. In the gtfs-graph world, nodes represent subway stations. Why might we want to identify important stations in the NYC subway network? Honestly, my initial reason was I thought it sounded cool. I was curious to see if there are numbers (besides ridership…we’ll get to that in the next post!) to rank stations which align with our human perception of important stations in the system. Meaning: does the network structure indicate Times Square is a critical station, or is that just where the most riders board? That was the first question I wanted to explore. The next question would challenge the Lake Wobegon effect. That is: can all stations in a network be important?

To answer these questions, I created a web app for three cities and their heavy rail networks:

Each city has results for four centrality metrics: PageRank, Katz centrality, closeness centrality, and outward accessibility. I will be discussing the results in terms of the New York City network.

It is worth noting at this point that analyzing a transit network only using stops and edges is a very simplified model. To make any real decisions on the system as it relates to the city and population it serves, we would need to consider population density and employment centers at minimum. Knowing that, let’s proceed!

PageRank

If PageRank sounds familiar to you, it’s likely because it is the algorithm used by book publishers to identify pages, and definitely not because it was invented by Google co-founders Larry Page and Sergey Brin to rank web pages for their search engine. In this algorithm, a node’s importance is derived from the importance of all the nodes which link to it. Mapped over to transit, a station’s importance is derived from the importance of all the stations which have direct connections to it.

The PageRank results look interesting and definitely pick out important stations, but they do not give us insight into the entire distribution of stations.

The PageRank results look interesting and definitely pick out important stations, but they do not give us insight into the entire distribution of stations.

I was giddy while implementing this and my brain swirled with grand visions of unlocking new insights to generations-old transit networks. As it turns out, PageRank is not a great model for a transit network. Let’s look at an example.

In the NYC PageRank view, you can see that Times Square comes out on top. Let’s collectively channel our inner undergrad physics lab student and breathe a sigh of relief that the numbers show us what we expected. Phewwwwwww. However, if we look at one of its neighbors, 34 St – 11 Av AKA the 7 train extension, we see that it ranks last. Not just maybe not top ten or top 100, but dead last. PageRank is saying that the 7 train extension produced a station that is literally the least important in the NYC network.

Have no fear Andrew Cuomo, let’s consider the model again. If you throw in sample numbers using the PageRank formula, you can see that the above behavior is correct. 34 St – 11 Av only has one “link” and that node’s PageRank is high, but it also has a high out-degree. Using the random surfer / random transit rider model, a rider passing through Times Square is not likely to end up at 34 St – 11 Av. Sorry 7 train, but PageRank is just does not do your $2.4 billion price tag justice. Let’s see how the other centrality metrics view the subway network!

Poor 34 St - 11 Av doesn't get any love from PageRank. The data on the right shows the top 10 stations serve several subway routes each. This is not a coincidence; PageRank picks out highly connected nodes.

Poor 34 St – 11 Av doesn’t get any love from PageRank. The data on the right shows the top 10 stations serve several subway routes each. This is not a coincidence; PageRank picks out highly connected nodes.

Katz Centrality

Katz Centrality builds on PageRank by considering all walks between two stops in a network, as opposed to only the shortest path between nodes. This appealed to me in a transit context because in a dense network such as Paris, there are often numerous routes between any two stops. This built-in redundancy has been brought up recently as a weakness of the DC metro during the on-going two-track vs. four-track debate and how it affects the maintenance window for a major heavy rail system.

Now is a good time to mention that I would highly recommend the Wikipedia entry for Katz centrality and all the metrics in this post. The original Katz paper is insightful as well.

The results from Katz are……confusing. If you picked South Ferry as the most important MTA station, you either love platform extenders or misguidedly added the Staten Island Ferry to your subway network. The Staten Island Railway data is included in the MTA subway GTFS feed, so I kept it on my map. Closeness centrality (up next!) requires all nodes to be reachable from every other node, so I threw a fake edge in to the graph to represent the ferry. Believe me: the results were just as confusing before I added the ferry route. Due to the multiplicative nature of Katz centrality, the resulting distribution ranges from 0.00244 (Ozone Park – Lefferts Blvd) to 693,246.863 (St George, just across from South Ferry on the south-bound ferry).

Here’s all the insight I can offer on Katz centrality: all traffic between two well-connected sections of the graph (Staten Island and the entire rest of the MTA subway) has to pass through two stations: South Ferry and St George. Therefore, they are “important” and “central” and I am “confused” and “ready to talk about other metrics.”

Katz says the subway network is equally unimpressive. Except for South Ferry. What a champ.

Katz says the subway network is equally unimpressive. Except for South Ferry. What a champ.

Closeness Centrality

My friend Calvin and I half made-up, half realized-it-was-already-a-thing, a centrality metric which promised a return to the fundamentals. Closeness centrality (or as Cal and I called it, the squiggly-doo) is intuitive in that the closer a node is to all other nodes, the more “central” it is. It does this by ranking a node by the sum of the shortest paths to all other nodes in the network. As you may remember from last post, the distance of each edge in our network is the number of seconds to travel via that route segment according to that system’s GTFS feed.

At this point of confusing results from two metrics, I discovered the term “node influence metrics.” These metrics seek to answer my second question from earlier: can all nodes in a network be important? PageRank and Katz identify important nodes, but only the top of their resulting distribution should be considered. This means the metric results for the bottom half of the distribution are more or less meaningless. Technically, closeness centrality is not a node influence metric, but I treat it as such. Intuition tells me that its results have meaning for the entire distribution of nodes. Please comment if you feel otherwise!

Neapolitan ice cream anyone? Closeness centrality results have no surprises.

Neapolitan ice cream anyone? Closeness centrality results have no surprises.

Manhattan stations are ranked highly by closeness centrality. This uniformity is in contrast to the Manhattan results for outward accessibility.

Manhattan stations are ranked highly by closeness centrality. This uniformity is in contrast to the Manhattan results for outward accessibility.

The closeness centrality results are extremely straightforward. Subway stations on Manhattan score higher because riders can reach all other stations in less time there than elsewhere. The opposite is true for Far Rockaway. This algorithm will play an important role in the next post!

Outward Accessibility

Outward accessibility is one of the primary node influence metrics. It produces a normalized version of diversity entropy proposed in this paper by Travençolo and Costa. A node ranks highly when many unique paths can be taken from it over a course of random walks of varying distances. Sections of a graph which rank highly by this metric are found to have high network redundancy and high accessibility from the rest of the network. Redundancy and accessibility are both critical when evaluating a transit network, so this seemed like a good fit!

One drawback to the outward accessibility metric is performance and repeatability. Before calculating the actual metric, one must perform a series of random walks of varying distances from each node. For these walks to be representative, the walk count must be high, which can lengthen execution time of the analysis. Due to the nature of random calculations, the answers change every time! This could be solved by using a consistent random number generator seed when running the analysis, or by always running enough random walks for the results to converge.

Outward accessibility gives us the weather map similar to closeness centrality, but are its individual stations ranked similarly?

Outward accessibility gives us the weather map appearance similar to closeness centrality, but are its individual stations ranked similarly?

Outward accessibility picks out hotspots of importance in a graph network. These can vary slightly due to the random nature of this algorithm, but should converge over time with enough random walks.

Outward accessibility picks out hot spots of importance in a graph network. These can vary slightly due to the random nature of this algorithm, but should converge over time with enough random walks.

The results for outward accessibility appear to parallel those of closeness centrality at first glance. However, a closer look at the accessibility results shows hot spots. The metric tells us these are the nodes which allow riders to traverse the most unique routes in a given distance. Translated to the real world, this is valuable to the rider’s perception of a transit network. If I can go to 20 different stations within 10 subway stops (on any route), my location is better served by public transit than if I can only go to 10 stations within 10 stops.

Accessibility also has a strange property of ranking end stations higher. The logic is that if I start from the second from the end station, half of my random walks will go outwards and produce little diversity entropy. Conversely, if I start from the end station, all of my walks will go towards the potentially more diverse part of the graph. I emailed the paper authors to comment on this behavior, but have not heard back. If you are reading this, Travençolo or Costa, please comment with insight!

Next Stop

If you’ve hung with me this long and have noticed I haven’t answered either question posed at the start of this post, I’m going to grant you a short break. In the next post, we’ll discuss how the closeness centrality and outward accessibility results correlate to the NYC subway ridership numbers, as well as how these metrics compare between NYC and Paris. I hope you’ll stay on board!

Graphing Transit Systems

I’ve been away from the blogging world for a while! The last few months included a fantastic and inspiring trip to Transportation Camp NYC and loads of (mostly) fun weekend work on transit graphs.

In a hodgepodge effort to improve on Javascript, learn React, create a generic graph representation of a GTFS feed, and implement a few graph algorithms, I finally have a working TRANSIT GRAPH DEMO.

Why transit graphs?

While reviewing algorithms on Jason Park’s algorithm visualizer, I thought, “WE CAN APPLY THESE TO TRANSIT.” It was a moment of pure destiny. To call it multidisciplinary intrigue would be underselling my excitement. Of course, I was not the first person to connect transit and graphs; Google Maps, Open Trip Planner, and Mapzen’s Valhalla are all built on graph representations.

My original goal was to display an animated graph traversal of the New York City subway system. I’ve ended up with a platform to study graph algorithms on transit maps. (I learned that if I’m unsure what I’m building, just call it a platform. The solutions will follow.)

As is the norm in 2016 JavaScript, I used almost as many tools and libraries as there are NYC subway stations. My goal in all projects is to use as little custom data as possible, so I stuck with my Boston model and loaded the MTA GTFS feed into an Amazon RDS Postgres instance. The backend is a Node.js server which boots up after constructing a graph. I used the new ES6 ‘class’ keyword to create a TransitGraph in the style of the object-oriented languages I was raised on. The original frontend was written using JQuery, but when I reached the point of implementing an autocomplete search box, I knew I needed to up my tool game. Enter: React. Facebook’s documentation on the library is quite comprehensive and I latched on to the object-oriented feel and state-based programming model. All the data (stops and routes/edges) is communicated via WebSocket that persists through an entire client connection.

As you can see when using the graph demo, there are three modes. A bit on each…

Shortest Path

Dijkstra’s is the classic gateway algorithm to finding shortest paths in graphs. Wikipedia’s explanation is as clear-worded as I’ve read, so I’ll defer to them:

It picks the unvisited vertex with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller.

Fire up the algorithm visualizer for to help picture this. In my graph, the edge weights are the time between stations. After running Dijkstra’s, we have an ordered sequence of nodes which represent the shortest path between the origin and destination and the time it would take to do so.

A sample shortest-path from 50th St to 1 Av. The routes are calculated from the GTFS feed based on the trips that pass through that stop. This can periodically result in slightly different route listings than the official MTA map.

A sample shortest path from 50th St to 1 Av. The routes which serve each station are derived from the GTFS feed based on the trips that pass through that stop. This can periodically result in slightly different route listings than the official MTA map.

The user interface to pick the origin and destination nodes. I studied Pinterest's CSS to help build the stop tokens that populate the input fields when selected. The route details at the bottom uses "display: flex;", a tip I picked up from the Google Maps CSS.

The user interface to pick the origin and destination nodes. I studied Pinterest’s CSS to help build the stop tokens that populate the input fields when selected. The route details at the bottom uses “display: flex”, a tip I picked up from the Google Maps CSS.

Sound familiar? Google Maps transit directions do the exact same time. And much better! Knowing when to switch trains becomes a luxury after using my tool.

Depth-First Search

A depth-first search, or DFS as the real algorithm geeks call it, is a classic traversal method for both graphs and trees. The idea of a traversal is to visit all the nodes in the graph which can be reached given a starting node. The depth-first variety is contrasted with the breadth-first procedure (up next!) in that, given a starting node, one of its neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, then one of it’s neighbors is visited, and so on. Was there anything weird about that last sentence? This is a recursive algorithm! When a visited node has no unvisited neighbors, the algorithm pops back up the call stack, testing for unvisited neighbors at each level.

A snapshot of visited nodes early in a depth-first search from Yankee Stadium. A red line segment is an edge that has been visited, but not unvisited, while a blue line segment has already been unvisited. As the recursive function pops higher up the call stack, more edges turn blue.

A snapshot of visited nodes early in a depth-first search from 161 St – Yankee Stadium. A red line segment is an edge that has been visited, but not unvisited, while a blue line segment has already been unvisited. As the recursive function pops higher up the call stack, more edges turn blue.

We can see that at the completion of a DFS from 161 St - Yankee Stadium, the entire MTA subway system has been visited. The nodes that have not been visited are the Staten Island railway, which has no rail connections to the subway system and therefore no edges in my graph.

We can see that at the completion of a DFS from 161 St – Yankee Stadium, the entire MTA subway system has been visited. The nodes that have not been visited are the Staten Island railway, which has no rail connections to the subway system and therefore no edges in my graph.

Breadth-First Search

A breadth-first search is another traversal variant whose lofty goal is to identify connected components of a graph while providing zero valuable info to passengers riding transit. (Now would be a good time to say that identifying connected components will play a key role in merging nodes during a later step in this project. Traversals are a necessary part of any graph analyzer’s toolkit!) As you may have guessed, a BFS goes wide before it goes deep. From a given node, all of its neighboring nodes are visited before any of their neighbors are visited. This produces a different exploration pattern, which is illustrated in the following three images.

A snapshot early in a breadth-first search from Queensboro Plaza. We see that the visited nodes are spreading outward from the source. Think: diseases. Depth-first search is how you solve a maze and breadth-first search is how you get sick.

A snapshot early in a breadth-first search from Queensboro Plaza. We see that the visited nodes are spreading outward from the source. Think: diseases. Depth-first search is how you solve a maze and breadth-first search is how you get sick.

A bit farther in the breadth-first search, we can see the disease...err...graph traversal has continued to spread outward.

A bit farther in the breadth-first search, we can see the disease…err…graph traversal has continued to spread outward.

The completion of the breadth-first search. There are no blue edges because this is not a recursive algorithm.

The visited edges at the completion of the breadth-first search. There are no blue edges because this is not a recursive algorithm.

What’s next?

“gtfs-graph” (the GitHub project name for now – please help me come up with a better one!) is built to be system-agnostic. I have graph representations for Boston and Paris in addition to New York City. While the GTFS standard allowed me to construct all three graphs in similar ways, there were still a few quirks, resulting mainly from how the different systems represent sub-stops (parent/child or northbound/southbound).

Recently, I have been implementing centrality algorithms to see how the results varied from system to system. Paris’ RATP heavy rail lines certainly look to have higher connectivity than Boston’s hub-centric design, and I’m working to find the numbers to prove this. If I can indeed prove this, I’d like to use a genetic algorithm to efficiently enhance (add lines and stops) a system to match the connected-ness/centrality distribution/equity/whatever-metric-I-end-up-with of a higher quality system.

After implementing Google’s PageRank algorithm, I decided it is a poor model for transit. The rankings currently displayed are a modified version of closeness centrality. I really enjoyed this white paper on a node importance algorithm and plan to implement this soon. It uses random walks to calculate the entropy of a given node after a given number of steps.

I hope to have a much more detailed most on these metrics in the coming weeks! I would love to hear any thoughts or ideas you might have about any or all of this!

Let’s build awesome things to help transit, cities, and, most of all, people.

Updated: New York City Transit Frequency Visualization

Since I detailed my New York City transit frequency visualization project last month, there have been a few updates. Check out the web tool to view the changes!

What’s new?

  • The frequency buckets have been realigned to better parallel the psychology of how we use transit. The bins now group trips of less than 4 trips per hour, 4 to 8 trips per hour, and more than 8 trips per hour. Less than 4 trips per hour is generally the threshold where riders should consult a schedule before waiting on a curb, so it was important to separate these visually. The thickness of each edge now also increases with frequency.
  • There is now much more coverage in Queens bus data. No, MTA did not see my first update and decide to expand Queens service, though that would be awesome! I communicated with the Transitland team and my tool helped them discover they were previously missing the feed for the MTA Bus Company. It was historically a separate company and still has its own GTFS feed. I came up with some wild conclusions in my previous post on this project, several of which were rendered invalid by the completion of the data set.

What’s up next?

I’d still like to filter the express bus routes, provide finer-grained sorting by mode, and increase the dynamic nature of the tool in general. I’ve been working on an updated Ruby client to pair with the Transitland datastore, and have already updated my project source with the new interface. I’ve also begun dabbling with GTFS-realtime and plan to build a project with this specification soon.

We’re all #InTransit everyday and I hope to have many more updates soon!

What kind of things are you working on? Let me know in the comments below!

The frequency data for subway routes on a Friday morning for New York City transit. The darker the color, the higher the frequency!

The frequency data for subway routes on a Friday morning in New York City. The darker the color, the higher the frequency!

I’m a Millennial and I Don’t Expect WiFi on Buses

MTA’s new fleet of buses will have WiFi! Millennials everywhere rejoice! Or so says the narrative playing out in New York this month.

After New York Governor Andrew Cuomo announced that more than 2,000 WiFi-enabled buses will hit the streets of New York City by 2020, I am left wondering: WhyFi?

One quote circulating the transit web this week from MTA Chairman Thomas Prendergast is particularly interesting: “As more and more millennials enter the system and use it daily, these are expectations, not desires, on their part.” Are they though?

I ride two buses for a total of about 35 minutes each weekday morning. I read books. I read articles on my phone. I write emails. I plan my days. Sure, there have been times where I’ve thought about finishing a nagging programming problem on the bus. The truth is: my overall daily effectiveness is actually higher because of my non-WiFi activities. I schedule that time to finish lingering tasks from the day and catch up on blogs I literally only read while I’m on the bus.

Why can’t there be a different headline headline? “Millennials discover new productivity techniques while riding the bus” has a nice ring to it, don’t you think?

On the other hand, access to cell signal is an enormous benefit. I was excited when Chicago added cell signal to its underground routes last year. On a 20-minute ride, the benefits from cell signal versus nothing are far greater than those of WiFi versus cell signal only.

Oh, and the two times I’ve decided to use WiFi on the MAX, I was greeted with the following result.

Bad experiences with WiFi on buses

Even the document error icon looks sad.

Bad experiences with WiFi on buses

Maybe I forgot to connect to the netwo…nope, that’s not it.

Bad experiences with WiFi on buses

If speedtest.net had ever loaded, I doubt I would have been pleased with its figures.

(At this point in writing this, I conducted a not-quite-full-fledged survey on the 16 bus and asked two people about their MAX WiFi experiences. Jake from An Anxious Extrovert relayed he has numerous underwhelming experiences with the MAX WiFi. “Sometimes a Facebook message will come in, but I can’t read it or reply.” Get the picture? Jake did bring up a fantastic use case I had neglected to consider: those with smart phones but without data plans. (Functional) WiFi on the MAX would give him around 40 minutes round-trip connectivity a day he otherwise goes without. Our nameless seatmate gave a convincing “No!” when we asked if he had gotten the WiFi to work. He asked the driver about this once and the driver even went outside the bus to make a tweak, presumably with the intention of improving the internet access. We digressed to make a full series of jokes about the driver crank starting a dedicated WiFi generator, making subtle hand adjustments to a satellite dish, and even sending HTTP requests manually on behalf of the riders before I remembered our original question. I asked our bus-mate if the internet had worked after the tweak and his response was inconclusive: “I think I got off the bus.” And we’re back…)

I understand that WiFi is not a huge expenditure. Transfort allocates $40,000 a year for WiFi service on the MAX (out of an annual budget of if-I’m-reading-it-correctly $57 million) and it will cost $10.2 million for the MTA. Two things to keep in mind for the MTA: 1) this number also includes installation of USB charging ports and, 2) their annual operating budget is $14.8 BILLION.

Even if an agency can reasonably afford to fund WiFi on buses, the planning and implementation of internet connectivity takes away from improvements on its core mission: to increase urban mobility.

Intercity buses are a different story. When you are in a single seat for more than an hour and don’t have people standing in your face, WiFi offers a real benefit. I have heard of good experiences with the WiFi on Bustang and Express Air Coach, the shuttle my friends used to take between the Chicago airports and Purdue. You could argue this extends to express intracity buses, where passengers remain seated for a longer period of time and there are fewer stops and shuffling inside the bus. I’d be curious to see the average ride length on a NYC express bus verses a non-express bus. Either way, this is not the plan. Cuomo is putting WiFi on the bus you take four blocks after getting off the F train.

From a millennial who loves buses: don’t bother with WiFi. Focus on world-class service and the people will follow.

What do you think about WiFi on buses? Are there any sides to the argument I missed? Let me know in the comments below!

For some solid coverage of the the MTA announcement, I recommend the following:

  • Benjamin Kabak calls the network connectivity on underwhelming bus service, “putting libstick on a pig”.
  • CityLab reports on a TransitCenter study about factors that would increase transit ridership by age. Access to WiFi was 12th out of 12 among those age 30 and above and 9th out of 12 for those below 30. Number one among all ages: trips on transit taking less time.

New York City Transit Depicted With (A New Set Of) Colorful Lines

Update 3/29/16: The transit visualization has been updated! The technical details in this post are still relevant, but some of the conclusions are no longer valid. Read about the updates here!

Stop the buses! Hold the phone! I now have visual proof that buses and subways in the Big Apple run more often on Fridays than Saturdays. How insightful, right? Okay, so maybe not, but I still enjoyed making a New York City transit frequency visualization using Transitland and Mapbox.

VIEW THE TOOL HERE. Try hovering over each route and turning on different days and modes (subway versus bus) of service.

Below are a few images showing the difference in frequency of transit service on Friday and Saturday, followed by a discussion of each component of the project.

Friday service in a New York City Transit Visualization

Friday morning subway and bus frequency. The coverage and frequencies are impressive!

Saturday service in a New York City Transit Visualization

Saturday morning bus and subway service. As to be expected, the coverage is similar to on Friday, but the frequencies drop significantly.

What can we learn from this frequency visualization of New York City transit?

Some items this visualization illustrates are to be expected:

  • Transit runs with higher frequencies during the week.
  • Transit runs with higher frequencies in denser areas (Manhattan, Brooklyn) than less dense areas (Staten Island).

A few things made sense after seeing them, but were ideas I had not anticipated:

  • Even in dense areas, bus frequencies are higher in areas that have less subway service, and vice versa. While this is true in Manhattan (more subways and subway frequency) and Brooklyn (more buses and bus frequency), it is quite noticeable in Queens. When you turn the subway layer off, western Queens appears almost devoid of transit. While its subway connections do not reach to the eastern edge of Queens, they do begin to make up for a lack of bus routes in western Queens. A few images below show this.
  • The inter-borough connections between Queens and Brooklyn that are notoriously absent in all heavy rail maps of the area are almost as weak even when viewing bus data. It just isn’t easy to travel between Long Island’s two boroughs. Maybe the planned streetcar will finally help this.

One thing to keep in mind: the trips per hour numbers that appear when you hover over lines on the map are not specific to a transit route. They encompass all transit services, potentially multiple routes and even modes, between the two stops that create an edge.

Queens bus routes in a New York City Transit Visualization

Bus routes in western Queens. Doesn’t this seem like it’s missing something?

Queens subway routes in a New York City Transit Visualization

Bus and subway routes in western Queens. That’s a bit better.

The Data

Transitland is an open source project that aggregates transit feeds from across the world. You can query its JSON API to create apps and visualizations easier than directly crunching the underlying GTFS data.

I was inspired to dig into Transitland by this similar frequency visualization for San Francisco. We both use the stops and schedule_stop_pairs API endpoints to calculate how often the “edge” between any two consecutive transit stops is visited in a given time frame.

I chose an appropriate bounding box to encompass all the transit stops operated by MTA and picked a window of 7:30am to 8:00am on the mornings of Friday, January 22, 2016, and Saturday, January 23, 2016. In addition to buses and subways, ferry service is also returned by Transitland in this bounding box, which explains the trips to Staten Island and oddly-direct routes to New Jersey.

The data returned by Transitland is not real-time data of actual transit performance, only the scheduled service times on those dates. I was able to extrapolate a “trips per hour” frequency metric by dividing the edge weight by the length of my query’s time frame.

The Map

I considered publishing a map using QGIS, but I was fortunate enough to stumble upon Mapbox. Mapbox does not have the analytical tools that QGIS does, but its ease of creating interactive web-based maps is impressive.

GeoJSON is a standard JSON variant that holds geographical information, such as points and line segments. In addition to its required fields, I loaded the GeoJSON output files with styling from Mapbox’s simplestyle-spec based on the frequency for that line segment. Mapbox interprets these “properties” fields when displays the data on a map.

A good tool should be simple enough to let you spend time solving real problems and I found Mapbox to reach this goal swimmingly (is there a similar term for transit??). The small amount of code needed to plot four GeoJSON files, toggle between them, show a map legend, and allow zooming and a loading screen all on top of a satisfactory OpenStreetsMap was remarkable. I will most definitely be using Mapbox for future transit projects!

The Code

As the JSON Transitland interface language-agnostic, any scripting language could be used. Ruby is by far my favorite, so I stuck with what I know. You can view the visualization in my GitHub repository.  The code is divided into an HTML front-end and Ruby back-end, though they do not connect directly. A few ideas I have for the future of this project:

  • The TransitlandAPIReader class could be generalized into a gem with a decent test suite, similar to one Transitland used to maintain and intends to bring back.
  • The run.rb script could take a job spec input to produce GeoJSON files for multiple days and cities in a single run.
  • The Mapbox front-end could be used to visualize any arbitrary transit system’s GTFS shape data. This would likely be done using a Rails back-end, rather than the offline Ruby script I am currently using.

Other News

I spent another few hours this week getting lost reading about the Cincinnati subway. If you haven’t dove into that tunnel of information before, I’d highly recommend it. Something about using an old canal which had become economically unfeasible due to competition from railroads to build a tunnel system that was halted due to a moratorium on capital bonds during World War I and never successfully revived just fascinates me. Seriously, any single part of that last sentence would make for a good story, but all those together create a sort transit tragedy worthy of a Shakespearean drama.

In the bed of the canal née Erie

doth thou venture to lay parallel rails.

To endure and inspire they began,

ere citizens above were admonished

their Sisyphean ambitions would fail.

I’m getting cold shivers just imagining a chorus reciting that at the opening of a transit conference. Please let me know of any other examples of transit stories told in iambic pentameter.

Until next time, ride on!