Tyler A. Green

In Transit

Tag: gtfs

Graphing Transit Systems, Part III – Centrality Extended

This is the third post diving into the graph structure of the New York City subway system. Read the first two for more background!

At the start of last post, I threw out two questions:

  1. Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?
  2. Can all stations in a transit network be important?

We discussed the difference between centrality metrics and node importance metrics. The former identify the most important nodes in a network, while the latter rank every node by importance. We’ll use the node importance metrics to answer these questions.

To support our discussion, I whipped up a map showing the MTA subway ridership data by itself using Carto. Here’s the interactive map! The data is from the years 2010 to 2015 and is provided by the MTA.

Does the network structure of the New York City subway indicate Times Square is a critical station, or is that just where the most riders board?

To answer this question, I calculated the correlation between ridership and centrality. In the scatter plots below, the independent variable is the centrality score per station, and the dependent variable is the ridership at that station, averaged over the years 2010 through 2015. This might seem backwards, but I chose this because the centrality metric is a reflection of the network structure and we are studying the effect of network structure on ridership.


The correlation coefficients for these two data sets show a moderate positive correlation.

  • Closeness centrality, r = 0.43
  • Outward accessibility, r = 0.30

Remember, correlation does not imply causation, but these figures suggest that for an increase in the centrality metric, you can expect a moderate increase in ridership.
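
For anyone who wants to reproduce this kind of number, here is a minimal sketch of the Pearson correlation coefficient in Ruby. The two arrays are made-up toy values, not the MTA ridership data or my real centrality scores, and the `pearson` helper is a hypothetical name used only for illustration.

```ruby
# Pearson correlation coefficient between two equal-length numeric arrays.
def pearson(xs, ys)
  raise ArgumentError, "need two equal-length arrays" unless xs.size == ys.size && xs.size > 1

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  spread_x = xs.sum { |x| (x - mean_x)**2 }
  spread_y = ys.sum { |y| (y - mean_y)**2 }
  covariance / Math.sqrt(spread_x * spread_y)
end

# Hypothetical toy values (assumption): one centrality score and one
# average annual ridership figure, in millions, per station.
centrality = [0.10, 0.20, 0.30, 0.40, 0.50]
ridership  = [1.0, 3.0, 2.0, 5.0, 4.0]

r = pearson(centrality, ridership)
puts r.round(2)
```

A value near 1 means the two series rise together, a value near 0 means there is little linear relationship, and a value near -1 means one falls as the other rises.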

Did you notice Times Square on the scatter plots? Yep, with an average annual ridership of almost 63 million, it’s the outlier. Based on its position on the horizontal axis, closeness centrality thinks Times Square is an important station in the network, while outward accessibility does not. If you remember from last post, PageRank also finds Times Square to be important and Katz just confused us all. That answers our first question!

Before we go on, I have a theory that any outliers in these plots are the result of externalities. For example, the average ridership at 161 St – Yankee Stadium is 8.7 million, but its neighboring stations have ridership of 1.3, 3, 3, and 4.3 million each on average. What is its externality? The world-famous New York Yankees. Times Square – 42 St is a similar situation. Not only is it a transfer point for 12 NYC subway services, it also sits below the mega tourist attraction and its namesake, Times Square. I have no hard data on this outlier theory, but more research could be done on this!

Can all stations in a transit network be important?

Why would we want all stations to be “important”? If our goal is for all citizens to have equal access to quality public transportation, we would like everyone to live near a station which provides this gateway. A transit network will always have stations which are more centrally located than others, but is it possible to minimize the differences between the most connected and the least connected stations? Let’s see how our metrics evaluate the structure of another world-class network in this regard. Enter Paris, its minimal geographical constraints, and its lovely radial network.

The two histograms below show closeness centrality scores for each city and sit on the same range on the horizontal axis. The count on the vertical axis is the number of stations which fall into the horizontal range represented by its bar. As you can see, Paris has many, many stations which score higher than all of New York City’s.


There is one large caveat here: land area. Officially, the area of New York City is 302.6 square miles, while Paris is only 40.7 square miles. Another measure is the longest subway line: New York’s A train extends 31 miles, while Paris’ Line 13 is just over 15 miles. Closeness centrality uses the shortest path between station pairs, which in my graph is measured in seconds of travel time. A 31-mile subway trip will take longer than a 15-mile subway trip, so this metric is stacked against the New York City subway and the large area it covers.

Concerning our question, even though Paris’ stations score much higher than New York’s, they are not all equal. This gets back to my earlier point: there will be an importance continuum among stations, but improving the importance of the least connected stations can still provide a benefit to citizens.

Next, let’s look at the histograms for New York and Paris outward accessibility scores.

There is not as much difference between New York City and Paris in the histograms for outward accessibility. This metric is independent of network area or subway line length, so this does not surprise me. It may hint that more of the difference between the networks for closeness centrality may be due to geographical area.

If you look at the densest parts of the Paris network and see how interconnected it is, the upper bound for its accessibility distribution being higher than New York’s also will not be a surprise.

Next Stop

Now that we have evaluated the centrality of multiple transit networks and performed limited cross-network comparisons, I want to know whether these metrics can tell us the best future subway routes. For example, given the budget for a single new subway line, what is the best route for this new line? It will be a very empirical and barely human analysis, so we may have to take the results with a grain or six of salt, but hopefully the results will have value besides making shapes on maps.

See you then!

Graphing Transit Systems, Part II – Centrality

This post is the second of four looking into the graph structure of the New York City subway system. In the previous post, I discussed a frontend I built to visualize a depth-first search, breadth-first search, and shortest path algorithm. I ended with a discussion of centrality algorithms. We pick up our hero there…

Centrality metrics identify important nodes in a graph. In the gtfs-graph world, nodes represent subway stations. Why might we want to identify important stations in the NYC subway network? Honestly, my initial reason was I thought it sounded cool. I was curious to see if there are numbers (besides ridership…we’ll get to that in the next post!) to rank stations which align with our human perception of important stations in the system. Meaning: does the network structure indicate Times Square is a critical station, or is that just where the most riders board? That was the first question I wanted to explore. The next question would challenge the Lake Wobegon effect. That is: can all stations in a network be important?

To answer these questions, I created a web app for three cities and their heavy rail networks: New York City, Boston, and Paris.

Each city has results for four centrality metrics: PageRank, Katz centrality, closeness centrality, and outward accessibility. I will be discussing the results in terms of the New York City network.

It is worth noting at this point that analyzing a transit network only using stops and edges is a very simplified model. To make any real decisions on the system as it relates to the city and population it serves, we would need to consider population density and employment centers at minimum. Knowing that, let’s proceed!

PageRank

If PageRank sounds familiar to you, it’s likely because it is the algorithm used by book publishers to identify pages, and definitely not because it was invented by Google co-founders Larry Page and Sergey Brin to rank web pages for their search engine. In this algorithm, a node’s importance is derived from the importance of all the nodes which link to it. Mapped over to transit, a station’s importance is derived from the importance of all the stations which have direct connections to it.

The PageRank results look interesting and definitely pick out important stations, but they do not give us insight into the entire distribution of stations.

I was giddy while implementing this and my brain swirled with grand visions of unlocking new insights to generations-old transit networks. As it turns out, PageRank is not a great model for a transit network. Let’s look at an example.

In the NYC PageRank view, you can see that Times Square comes out on top. Let’s collectively channel our inner undergrad physics lab student and breathe a sigh of relief that the numbers show us what we expected. Phewwwwwww. However, if we look at one of its neighbors, 34 St – 11 Av AKA the 7 train extension, we see that it ranks last. Not just maybe not top ten or top 100, but dead last. PageRank is saying that the 7 train extension produced a station that is literally the least important in the NYC network.

Have no fear, Andrew Cuomo, let’s consider the model again. If you plug sample numbers into the PageRank formula, you can see that the above behavior is correct. 34 St – 11 Av has only one “link,” and while that neighbor’s PageRank is high, the neighbor also has a high out-degree, so its rank gets split many ways. Using the random surfer / random transit rider model, a rider passing through Times Square is not likely to end up at 34 St – 11 Av. Sorry 7 train, but PageRank just does not do your $2.4 billion price tag justice. Let’s see how the other centrality metrics view the subway network!
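
To make the "one link to a busy neighbor" effect concrete, here is a small sketch of PageRank by power iteration in Ruby. It is not the gtfs-graph implementation. The six-node graph is a hypothetical stand-in (a "hub" transfer station, four ordinary neighbors, and a "stub" terminal hanging off the hub), and the damping factor of 0.85 is just the commonly quoted default.

```ruby
# Power-iteration PageRank on an undirected toy graph
# (each edge counts as a link in both directions).
def pagerank(graph, damping: 0.85, iterations: 100)
  n = graph.size.to_f
  ranks = graph.keys.to_h { |node| [node, 1.0 / n] }
  iterations.times do
    # Every node starts each round with the "random jump" share.
    updated = graph.keys.to_h { |node| [node, (1.0 - damping) / n] }
    graph.each do |node, neighbors|
      # A node splits its damped rank evenly among its neighbors.
      share = damping * ranks[node] / neighbors.size
      neighbors.each { |neighbor| updated[neighbor] += share }
    end
    ranks = updated
  end
  ranks
end

# Hypothetical layout (assumption): "stub" has a single neighbor, the hub.
graph = {
  "hub"  => %w[a b c d stub],
  "a"    => %w[hub b],
  "b"    => %w[hub a c],
  "c"    => %w[hub b d],
  "d"    => %w[hub c],
  "stub" => %w[hub]
}

ranks = pagerank(graph)
ranks.sort_by { |_, score| -score }.each do |node, score|
  puts format("%-5s %.3f", node, score)
end
```

In this toy graph the hub comes out on top and the stub comes out last: the stub only ever receives one fifth of the hub's rank, while every other node also collects rank from a second or third neighbor.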

Poor 34 St – 11 Av doesn’t get any love from PageRank. The data on the right shows the top 10 stations serve several subway routes each. This is not a coincidence; PageRank picks out highly connected nodes.

Katz Centrality

Katz centrality scores a node by counting all walks that reach it from every other node in the network, with longer walks discounted by an attenuation factor, as opposed to considering only the shortest path between nodes. This appealed to me in a transit context because in a dense network such as Paris, there are often numerous routes between any two stops. A lack of this built-in redundancy has been brought up recently as a weakness of the DC Metro during the on-going two-track vs. four-track debate and how it affects the maintenance window for a major heavy rail system.
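
As a rough illustration, Katz scores can be approximated by repeatedly applying x = beta + alpha * (sum of the neighbors' scores). The sketch below does that in Ruby on a hypothetical three-stop line. The `alpha` and `beta` values are assumptions chosen for the toy graph, and this is not the code behind my results. The iteration only converges when alpha is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix; as alpha approaches that bound the scores balloon.

```ruby
# Katz centrality by fixed-point iteration on an undirected graph:
# score(node) = beta + alpha * sum(score(neighbor)).
def katz(graph, alpha: 0.1, beta: 1.0, iterations: 200)
  scores = graph.keys.to_h { |node| [node, 0.0] }
  iterations.times do
    # The block reads the previous round's scores; the assignment
    # happens only after the whole new hash has been built.
    scores = graph.to_h do |node, neighbors|
      [node, beta + alpha * neighbors.sum { |neighbor| scores[neighbor] }]
    end
  end
  scores
end

# Hypothetical three-stop line: a - b - c
line = {
  "a" => %w[b],
  "b" => %w[a c],
  "c" => %w[b]
}

scores = katz(line)
scores.each { |node, score| puts format("%s %.4f", node, score) }
```

The middle stop scores highest because every walk along the line passes through it.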

Now is a good time to mention that I would highly recommend the Wikipedia entry for Katz centrality and all the metrics in this post. The original Katz paper is insightful as well.

The results from Katz are……confusing. If you picked South Ferry as the most important MTA station, you either love platform extenders or misguidedly added the Staten Island Ferry to your subway network. The Staten Island Railway data is included in the MTA subway GTFS feed, so I kept it on my map. Closeness centrality (up next!) requires all nodes to be reachable from every other node, so I threw a fake edge into the graph to represent the ferry. Believe me: the results were just as confusing before I added the ferry route. Due to the multiplicative nature of Katz centrality, the resulting distribution ranges from 0.00244 (Ozone Park – Lefferts Blvd) to 693,246.863 (St George, just across from South Ferry on the south-bound ferry).

Here’s all the insight I can offer on Katz centrality: all traffic between two well-connected sections of the graph (Staten Island and the entire rest of the MTA subway) has to pass through two stations: South Ferry and St George. Therefore, they are “important” and “central” and I am “confused” and “ready to talk about other metrics.”

Katz says the subway network is equally unimpressive. Except for South Ferry. What a champ.

Closeness Centrality

My friend Calvin and I half made-up, half realized-it-was-already-a-thing, a centrality metric which promised a return to the fundamentals. Closeness centrality (or as Cal and I called it, the squiggly-doo) is intuitive in that the closer a node is to all other nodes, the more “central” it is. It ranks a node by the sum of its shortest-path distances to all other nodes in the network: the smaller the sum, the higher the rank (the textbook version takes the reciprocal of that sum). As you may remember from last post, the distance of each edge in our network is the number of seconds to travel via that route segment according to that system’s GTFS feed.
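
Here is a minimal sketch of the textbook version of closeness centrality in Ruby: (number of other nodes) divided by (sum of shortest travel times to them). It uses Floyd-Warshall for the all-pairs shortest paths, which is fine for a toy graph but is an assumption on my part and not necessarily what a real analysis would use. The four-stop line and its travel times are hypothetical.

```ruby
# All-pairs shortest travel times via Floyd-Warshall.
# edges is a list of [station_a, station_b, seconds], treated as two-way.
def all_pairs_shortest(nodes, edges)
  dist = Hash.new { |hash, key| hash[key] = Hash.new(Float::INFINITY) }
  nodes.each { |node| dist[node][node] = 0 }
  edges.each do |a, b, seconds|
    dist[a][b] = seconds
    dist[b][a] = seconds
  end
  nodes.each do |mid|
    nodes.each do |from|
      nodes.each do |to|
        via = dist[from][mid] + dist[mid][to]
        dist[from][to] = via if via < dist[from][to]
      end
    end
  end
  dist
end

# Closeness = (n - 1) / (sum of shortest travel times to every other node).
def closeness(nodes, edges)
  dist = all_pairs_shortest(nodes, edges)
  nodes.to_h do |node|
    total = nodes.sum { |other| dist[node][other] }
    [node, (nodes.size - 1) / total.to_f]
  end
end

# Hypothetical four-stop line with travel times in seconds.
nodes = %w[a b c d]
edges = [["a", "b", 60], ["b", "c", 120], ["c", "d", 60]]

scores = closeness(nodes, edges)
scores.each { |node, score| puts format("%s %.5f", node, score) }
```

The two middle stops score higher than the two ends. If any station is unreachable, its distance stays infinite and the score collapses to 0.0, which is why the metric wants every node to be reachable from every other node.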

At this point of confusing results from two metrics, I discovered the term “node influence metrics.” These metrics seek to answer my second question from earlier: can all nodes in a network be important? PageRank and Katz identify important nodes, but only the top of their resulting distribution should be considered. This means the metric results for the bottom half of the distribution are more or less meaningless. Technically, closeness centrality is not a node influence metric, but I treat it as such. Intuition tells me that its results have meaning for the entire distribution of nodes. Please comment if you feel otherwise!

Neapolitan ice cream anyone? Closeness centrality results have no surprises.

Manhattan stations are ranked highly by closeness centrality. This uniformity is in contrast to the Manhattan results for outward accessibility.

The closeness centrality results are extremely straightforward. Subway stations in Manhattan score higher because riders can reach all other stations in less time there than elsewhere. The opposite is true for Far Rockaway. This algorithm will play an important role in the next post!

Outward Accessibility

Outward accessibility is one of the primary node influence metrics. It produces a normalized version of diversity entropy proposed in this paper by Travençolo and Costa. A node ranks highly when many unique paths can be taken from it over a course of random walks of varying distances. Sections of a graph which rank highly by this metric are found to have high network redundancy and high accessibility from the rest of the network. Redundancy and accessibility are both critical when evaluating a transit network, so this seemed like a good fit!

Two drawbacks of the outward accessibility metric are performance and repeatability. Before calculating the actual metric, one must perform a series of random walks of varying distances from each node. For these walks to be representative, the walk count must be high, which can lengthen execution time of the analysis. Due to the nature of random calculations, the answers change every time! This could be solved by using a consistent random number generator seed when running the analysis, or by always running enough random walks for the results to converge.
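
The sketch below shows one way the random-walk estimate and the fixed-seed idea could fit together in Ruby. It reflects my reading of the metric: self-avoiding walks of a fixed number of steps, the entropy of where they end up, then exp(entropy) / (N - 1). The walk count, the seed, the choice to discard walks that dead-end early, and the star-shaped toy graph are all assumptions for illustration, not the settings behind the maps in this post.

```ruby
# Estimate outward accessibility of `start` after `steps` hops,
# using self-avoiding random walks driven by a seeded generator.
def outward_accessibility(graph, start, steps:, walks: 5_000, seed: 42)
  rng = Random.new(seed) # same seed => same walks => repeatable result
  endpoints = Hash.new(0)

  walks.times do
    current = start
    visited = { start => true }
    completed = true
    steps.times do
      options = graph[current].reject { |neighbor| visited[neighbor] }
      if options.empty? # dead end before the walk finished (discarded here)
        completed = false
        break
      end
      current = options.sample(random: rng)
      visited[current] = true
    end
    endpoints[current] += 1 if completed
  end

  total = endpoints.values.sum.to_f
  return 0.0 if total.zero?

  # Diversity entropy of the endpoint distribution.
  entropy = endpoints.values.sum do |count|
    probability = count / total
    -probability * Math.log(probability)
  end
  Math.exp(entropy) / (graph.size - 1)
end

# Hypothetical star network: one center with four spokes.
star = {
  "center" => %w[leaf1 leaf2 leaf3 leaf4],
  "leaf1"  => %w[center],
  "leaf2"  => %w[center],
  "leaf3"  => %w[center],
  "leaf4"  => %w[center]
}

puts outward_accessibility(star, "center", steps: 1)
puts outward_accessibility(star, "leaf1", steps: 1)
```

With one step, the center can land on any of four stations, so its score sits close to 1, while a leaf can only reach the center and scores 0.25. Re-running with the same seed returns the same number; changing the seed moves the center's estimate slightly.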

Outward accessibility gives us the weather map appearance similar to closeness centrality, but are its individual stations ranked similarly?

Outward accessibility picks out hot spots of importance in a graph network. These can vary slightly due to the random nature of this algorithm, but should converge over time with enough random walks.

The results for outward accessibility appear to parallel those of closeness centrality at first glance. However, a closer look at the accessibility results shows hot spots. The metric tells us these are the nodes which allow riders to traverse the most unique routes in a given distance. Translated to the real world, this is valuable to the rider’s perception of a transit network. If I can go to 20 different stations within 10 subway stops (on any route), my location is better served by public transit than if I can only go to 10 stations within 10 stops.

Accessibility also has a strange property of ranking end stations higher. The logic is that if I start from the second-to-last station, half of my random walks will go outwards and produce little diversity entropy. Conversely, if I start from the end station, all of my walks will go towards the potentially more diverse part of the graph. I emailed the paper authors to comment on this behavior, but have not heard back. If you are reading this, Travençolo or Costa, please comment with insight!

Next Stop

If you’ve hung with me this long and have noticed I haven’t answered either question posed at the start of this post, I’m going to grant you a short break. In the next post, we’ll discuss how the closeness centrality and outward accessibility results correlate to the NYC subway ridership numbers, as well as how these metrics compare between NYC and Paris. I hope you’ll stay on board!

Graphing Transit Systems

I’ve been away from the blogging world for a while! The last few months included a fantastic and inspiring trip to Transportation Camp NYC and loads of (mostly) fun weekend work on transit graphs.

In a hodgepodge effort to improve my JavaScript, learn React, create a generic graph representation of a GTFS feed, and implement a few graph algorithms, I finally have a working TRANSIT GRAPH DEMO.

Why transit graphs?

While reviewing algorithms on Jason Park’s algorithm visualizer, I thought, “WE CAN APPLY THESE TO TRANSIT.” It was a moment of pure destiny. To call it multidisciplinary intrigue would be underselling my excitement. Of course, I was not the first person to connect transit and graphs; Google Maps, Open Trip Planner, and Mapzen’s Valhalla are all built on graph representations.

My original goal was to display an animated graph traversal of the New York City subway system. I’ve ended up with a platform to study graph algorithms on transit maps. (I learned that if I’m unsure what I’m building, just call it a platform. The solutions will follow.)

As is the norm in 2016 JavaScript, I used almost as many tools and libraries as there are NYC subway stations. My goal in all projects is to use as little custom data as possible, so I stuck with my Boston model and loaded the MTA GTFS feed into an Amazon RDS Postgres instance. The backend is a Node.js server which boots up after constructing a graph. I used the new ES6 ‘class’ keyword to create a TransitGraph in the style of the object-oriented languages I was raised on. The original frontend was written using jQuery, but when I reached the point of implementing an autocomplete search box, I knew I needed to up my tool game. Enter: React. Facebook’s documentation on the library is quite comprehensive and I latched on to the object-oriented feel and state-based programming model. All the data (stops and routes/edges) is communicated via a WebSocket that persists through an entire client connection.

As you can see when using the graph demo, there are three modes. A bit on each…

Shortest Path

Dijkstra’s is the classic gateway algorithm to finding shortest paths in graphs. Wikipedia’s explanation is as clear-worded as I’ve read, so I’ll defer to them:

It picks the unvisited vertex with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller.

Fire up the algorithm visualizer to help picture this. In my graph, the edge weights are the travel time between stations. After running Dijkstra’s, we have an ordered sequence of nodes which represents the shortest path between the origin and destination, along with the time the trip would take.
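
For readers who prefer code to prose, here is a small Ruby sketch of Dijkstra’s algorithm that returns both the path and the total travel time. The demo backend is a Node.js server, so this is an illustration rather than its actual code. The station names and travel times are hypothetical, and the linear scan for the closest unvisited node stands in for the priority queue a larger graph would want.

```ruby
# Dijkstra's algorithm over a graph shaped like
# { station => { neighbor => seconds } }.
def shortest_path(graph, origin, destination)
  dist = Hash.new(Float::INFINITY)
  dist[origin] = 0
  previous = {}
  unvisited = graph.keys

  until unvisited.empty?
    # Pick the unvisited station with the lowest known travel time.
    current = unvisited.min_by { |node| dist[node] }
    break if dist[current] == Float::INFINITY || current == destination

    unvisited.delete(current)
    graph[current].each do |neighbor, seconds|
      candidate = dist[current] + seconds
      if candidate < dist[neighbor]
        dist[neighbor] = candidate
        previous[neighbor] = current
      end
    end
  end

  return nil if dist[destination] == Float::INFINITY

  # Walk the "previous" links backwards to rebuild the route.
  path = [destination]
  path.unshift(previous[path.first]) while previous.key?(path.first)
  { path: path, seconds: dist[destination] }
end

# Hypothetical network: two ways to get from A to D.
graph = {
  "A" => { "B" => 60, "C" => 120 },
  "B" => { "A" => 60, "D" => 300 },
  "C" => { "A" => 120, "D" => 120 },
  "D" => { "B" => 300, "C" => 120 }
}

result = shortest_path(graph, "A", "D")
puts result[:path].join(" -> ")
puts result[:seconds]
```

The route through B starts with the shorter hop but loses overall, which is exactly the kind of case a greedy "take the nearest stop" approach would get wrong.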

A sample shortest path from 50th St to 1 Av. The routes which serve each station are derived from the GTFS feed based on the trips that pass through that stop. This can periodically result in slightly different route listings than the official MTA map.

The user interface to pick the origin and destination nodes. I studied Pinterest’s CSS to help build the stop tokens that populate the input fields when selected. The route details at the bottom use “display: flex”, a tip I picked up from the Google Maps CSS.

Sound familiar? Google Maps transit directions do the exact same thing. And much better! Knowing when to switch trains becomes a luxury after using my tool.

Depth-First Search

A depth-first search, or DFS as the real algorithm geeks call it, is a classic traversal method for both graphs and trees. The idea of a traversal is to visit all the nodes in the graph which can be reached given a starting node. The depth-first variety is contrasted with the breadth-first procedure (up next!) in that, given a starting node, one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, then one of its neighbors is visited, and so on. Was there anything weird about that last sentence? This is a recursive algorithm! When a visited node has no unvisited neighbors, the algorithm pops back up the call stack, testing for unvisited neighbors at each level.
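
A recursive depth-first search fits in a handful of lines. This Ruby sketch is an illustration on a hypothetical six-node graph, not the demo’s code; nodes E and F form their own island so you can see that a traversal only reaches what is connected to the start.

```ruby
# Recursive depth-first search; returns nodes in the order they were visited.
def depth_first_search(graph, node, visited = [])
  visited << node
  graph[node].each do |neighbor|
    # Dive into the first unvisited neighbor before looking at the rest.
    depth_first_search(graph, neighbor, visited) unless visited.include?(neighbor)
  end
  visited # returning here is the "pop back up the call stack" step
end

# Hypothetical graph: A-B-D and A-C are connected, E-F is a separate island.
graph = {
  "A" => %w[B C],
  "B" => %w[A D],
  "C" => %w[A],
  "D" => %w[B],
  "E" => %w[F],
  "F" => %w[E]
}

order = depth_first_search(graph, "A")
puts order.join(" ")
```

From A the search runs all the way down through B to D before it backs up and picks off C. E and F are never visited.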

A snapshot of visited nodes early in a depth-first search from 161 St – Yankee Stadium. A red line segment is an edge that has been visited but not yet “unvisited” (backtracked over), while a blue line segment has already been unvisited. As the recursive function pops higher up the call stack, more edges turn blue.

We can see that at the completion of a DFS from 161 St – Yankee Stadium, the entire MTA subway system has been visited. The only nodes that have not been visited belong to the Staten Island Railway, which has no rail connections to the subway system and therefore no edges in my graph.

Breadth-First Search

A breadth-first search is another traversal variant whose lofty goal is to identify connected components of a graph while providing zero valuable info to passengers riding transit. (Now would be a good time to say that identifying connected components will play a key role in merging nodes during a later step in this project. Traversals are a necessary part of any graph analyzer’s toolkit!) As you may have guessed, a BFS goes wide before it goes deep. From a given node, all of its neighboring nodes are visited before any of their neighbors are visited. This produces a different exploration pattern, which is illustrated in the following three images.
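
Here is a short Ruby sketch of a queue-based breadth-first search, plus the connected-components use mentioned above. The graph is hypothetical and the helper names are mine; treat it as an illustration of the technique rather than the project’s code.

```ruby
# Queue-based breadth-first search; returns nodes in the order they were visited.
def breadth_first_search(graph, start)
  visited = [start]
  queue = [start]
  until queue.empty?
    node = queue.shift # oldest entry first, so the search spreads level by level
    graph[node].each do |neighbor|
      next if visited.include?(neighbor)

      visited << neighbor
      queue << neighbor
    end
  end
  visited
end

# Group every node into its connected component by running a BFS
# from each node that has not been seen yet.
def connected_components(graph)
  seen = {}
  graph.keys.each_with_object([]) do |node, components|
    next if seen[node]

    component = breadth_first_search(graph, node)
    component.each { |member| seen[member] = true }
    components << component
  end
end

# Hypothetical graph: A-B-D and A-C are connected, E-F is a separate island.
graph = {
  "A" => %w[B C],
  "B" => %w[A D],
  "C" => %w[A],
  "D" => %w[B],
  "E" => %w[F],
  "F" => %w[E]
}

order = breadth_first_search(graph, "A")
components = connected_components(graph)
puts order.join(" ")
puts components.inspect
```

Both of A’s neighbors are visited before D, the opposite of the depth-first order. There is no recursion here, just a queue that drains.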

A snapshot early in a breadth-first search from Queensboro Plaza. We see that the visited nodes are spreading outward from the source. Think: diseases. Depth-first search is how you solve a maze and breadth-first search is how you get sick.

A bit farther in the breadth-first search, we can see the disease…err…graph traversal has continued to spread outward.

The visited edges at the completion of the breadth-first search. There are no blue edges because this is not a recursive algorithm.

What’s next?

“gtfs-graph” (the GitHub project name for now – please help me come up with a better one!) is built to be system-agnostic. I have graph representations for Boston and Paris in addition to New York City. While the GTFS standard allowed me to construct all three graphs in similar ways, there were still a few quirks, resulting mainly from how the different systems represent sub-stops (parent/child or northbound/southbound).

Recently, I have been implementing centrality algorithms to see how the results varied from system to system. Paris’ RATP heavy rail lines certainly look to have higher connectivity than Boston’s hub-centric design, and I’m working to find the numbers to prove this. If I can indeed prove this, I’d like to use a genetic algorithm to efficiently enhance (add lines and stops) a system to match the connected-ness/centrality distribution/equity/whatever-metric-I-end-up-with of a higher quality system.

After implementing Google’s PageRank algorithm, I decided it is a poor model for transit. The rankings currently displayed are a modified version of closeness centrality. I really enjoyed this white paper on a node importance algorithm and plan to implement this soon. It uses random walks to calculate the entropy of a given node after a given number of steps.

I hope to have a much more detailed post on these metrics in the coming weeks! I would love to hear any thoughts or ideas you might have about any or all of this!

Let’s build awesome things to help transit, cities, and, most of all, people.

LIVE: The Boston T Party

I’m a few months late on this one, but I recently wanted to learn about WebSockets and GTFS-realtime feeds. The result: a real-time Boston transit map! I apologize if you were expecting a historical reenactment.

Try clicking on a marker for more information on the subway/bus/light rail/commuter rail vehicle it represents!

The sidebar of the application appears when you click on a vehicle. This area could be populated with tons more info from the GTFS static feed!

How It’s Built

The app runs on a Node.js server that accepts both a socket connection and an API call. Why both? Ask two-months-ago me. A new socket connection is formed when a client (web browser) connects to the server. The server periodically polls for the latest GTFS-realtime update (the MBTA posts updates to http://developer.mbta.com/lib/GTRTFS/Alerts/VehiclePositions.pb every ~18 seconds) and decodes the resulting protocol buffer using the Google gtfs-realtime-bindings. The decoded data is then broadcast to all socket connections. The frontend client is a simple AngularJS controller which manages the socket connection and updates the markers with the latest vehicle position information.

The basic architecture described until this point can operate completely independently of a GTFS static feed, but this would only produce a bunch of dots on a map which move periodically. Which, don’t get me wrong, made me ecstatic when that was all I had. But linking up a GTFS static feed gives each dot context. I decided to load the MBTA feed into a Postgres database on Amazon’s Relational Database Service using this schema. The GTFS static connection allows for two features: 1) the client issues an API call to fetch the route and headsign when you click on a vehicle, which is fulfilled by the server through a database query, and 2) the colored route lines are pre-generated into a GeoJSON file using a Node.js script which runs a database query to fetch the official MBTA color for each route.

The purple lines represent the commuter rail routes. I chuckled the first time these lines loaded and I kept having to zoom out to see where they stop. To Providence and beyond!

Up Next

The map doesn’t have nearly the feature set of NextBus, with gobs of detail about every bus and stop you click on. I do find it clumsy that you have to select routes to view in NextBus, leading me to make sure all the lines appear at load time in my map (or shortly after (#LargeGeoJSONFile)). Feel free to check out the code or even add features yourself; the code lives on GitHub!

A big maintenance issue with the app as constructed is that it requires a manual reload of the GTFS feed after each update by MBTA which changes any trip IDs. The Green Line trains do not have valid trip_ids in the GTFS-realtime feed, so I programmed the app to display any vehicle with an unknown trip_id (one that did not match with the GTFS static feed) as a Green Line trip. After a GTFS static update, you will often see many vehicle markers say they represent a Green Line train, when we really just need to load the new GTFS feed into the Postgres database. Who wants to automate this for me?
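
The fallback rule is simple enough to sketch. The Ruby below shows the lookup-with-default idea and, as one possible starting point for automating the reload, a helper that measures what share of realtime vehicles no longer match the static feed. The app itself runs on Node.js, and every name and value here (`route_for`, `unknown_share`, the trip IDs) is hypothetical.

```ruby
# Resolve a realtime vehicle to a route name, falling back to the
# Green Line when its trip_id is missing from the static feed.
def route_for(vehicle, trips_by_id, fallback: "Green Line")
  trips_by_id.fetch(vehicle[:trip_id], fallback)
end

# Share of vehicles whose trip_id is not in the static feed. A sudden
# jump could be a hint that the static feed is stale (assumption).
def unknown_share(vehicles, trips_by_id)
  return 0.0 if vehicles.empty?

  vehicles.count { |vehicle| !trips_by_id.key?(vehicle[:trip_id]) } / vehicles.size.to_f
end

# Hypothetical static-feed lookup: trip_id => route name
trips_by_id = { "trip-100" => "Red Line", "trip-200" => "Route 1" }

# Hypothetical decoded realtime vehicles
vehicles = [
  { trip_id: "trip-100" },
  { trip_id: "trip-999" }, # not in the static feed
  { trip_id: nil },        # no trip_id at all
  { trip_id: "trip-200" }
]

routes = vehicles.map { |vehicle| route_for(vehicle, trips_by_id) }
puts routes.inspect
puts unknown_share(vehicles, trips_by_id)
```

The weakness is visible right in the sketch: a genuinely unknown trip and a real Green Line train look identical, so the unmatched share is only a rough signal.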

You may not need this map to plan your commute from Back Bay to South Station, but it was certainly a fantastic learning experience for me. Many of its components will make an appearance in my next project! (Hint: it involves representing transit networks as a connectivity graph!)

Until next time, ride on!

I never get tired of staring at these colored lines until the markers all jump to their next position! The yellow is the official color specified for the bus routes in the MBTA GTFS static feed. Anyone know the reason for this? It also looks like the Silver Line goes a bit crazy right after exiting the Ted Williams Tunnel. Correct me if I’m wrong, but I think this is where Silver Line buses switch from diesel power to trolleybuses?

A Ruby Gem for GTFS to GeoJSON Conversion

I published my first Ruby gem: gtfs-geojson! You can view the source on GitHub. gtfs-geojson is a Ruby utility to convert a GTFS feed to a GeoJSON file. It’s a simple endeavor, for sure, but I’m pleased with what I learned along the way.

Let’s start out with some before-and-after views of the data. These images were created using QGIS, OpenStreetMap, Transfort’s GTFS feed, and the gtfs-geojson library.

The Transfort GTFS data loaded in QGIS before applying the Ruby gem for GTFS to GeoJSON conversion.

This map displays the shapes.txt file from Transfort’s GTFS feed loaded into QGIS. The seemingly-inconsistent shading on the lines is because there are no lines at all; each “line” is made up of a sequence of points. Each point contains a route ID and is ordered relative to the other points in its route by a point sequence value.

The Transfort GTFS data loaded in QGIS after applying the Ruby gem for GTFS to GeoJSON conversion.

After running the GTFS feed through gtfs-geojson, you now have a GeoJSON file whose features are each route from the original feed. I used “Categorized” styles in QGIS to quickly apply a unique color to each route.

As with most transit projects, the input to gtfs-geojson is a GTFS feed. GTFS is the standard format published by transit agencies worldwide to make their routes, stops, and even fares usable by developers. The data is a series of comma-separated text files. To validate a GTFS feed, I used the existing gtfs gem. gtfs will fail gracefully if the shapes.txt file is not present, which is the only file I actually need for the conversion to GeoJSON.

gtfs-geojson implements the same algorithm as the “Points to path” QGIS tool I used when looking at Transfort bus data. The main trick is that the points within each route ID must be sorted by their point sequence value. Several other QGIS plugins I tried did not do this correctly, so don’t forget this if implementing this yourself!
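Here is a rough sketch of that sorting step in Ruby. It is an illustration of the points-to-path idea, not the gem's actual source: the shapes_to_geojson helper name is hypothetical, and it assumes a standard shapes.txt whose points are keyed by shape_id and carry shape_pt_lat, shape_pt_lon, and shape_pt_sequence columns.

```ruby
require 'csv'
require 'json'

# Hypothetical helper: turn the text of a shapes.txt file into a GeoJSON
# FeatureCollection with one LineString feature per shape_id.
def shapes_to_geojson(shapes_csv)
  rows = CSV.parse(shapes_csv, headers: true).map(&:to_h)

  features = rows.group_by { |row| row['shape_id'] }.map do |shape_id, points|
    # The key step: order the points by their sequence value, compared as
    # integers so that "10" sorts after "9".
    ordered = points.sort_by { |row| row['shape_pt_sequence'].to_i }

    {
      type: 'Feature',
      properties: { shape_id: shape_id },
      geometry: {
        type: 'LineString',
        # GeoJSON positions are [longitude, latitude].
        coordinates: ordered.map { |row| [row['shape_pt_lon'].to_f, row['shape_pt_lat'].to_f] }
      }
    }
  end

  collection = { type: 'FeatureCollection', features: features }
  JSON.generate(collection)
end
```

Skipping the sort_by is exactly the mistake described above: the points still connect into a line, but in file order rather than travel order.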

While QGIS tools output shapefiles, gtfs-geojson produces a GeoJSON file, which is a JSON document describing geospatial points and polylines in a standard format. I have previously loaded GeoJSON files in Mapbox applications, and they are also useful in a GIS context. The following three lines will load the library, validate the GTFS feed, convert its shapes.txt file to GeoJSON format, and write the GeoJSON to a file.

require 'gtfs-geojson'
geojson = GTFS::GeoJSON.generate("gtfs.zip")
File.open("gtfs.geojson", "w") { |f| f.write(geojson) }

That’s it! Let me know if you have any suggestions! The README on the GitHub repo gives installation instructions.

The most valuable tip I learned while creating this gem was the use of the $RUBYLIB environment variable. This isn’t necessary when installing a gem onto your system using bundler, but it is extremely helpful during development. $RUBYLIB lets you specify extra directories to search when the require method is called. To add paths dynamically from within a Ruby program, you can push items to the $: array, which is shorthand for $LOAD_PATH. My require_relative days are over!
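For anyone who wants to try the load-path trick, here is a minimal sketch. The lib directory next to the script and the commented-out require are assumptions for illustration, so adjust them to your own project layout.

```ruby
# Assumed layout: a lib/ directory sitting next to this script.
lib_dir = File.expand_path('lib', __dir__)

# Put the development copy of the library first so it wins over any
# installed version. The include? check avoids duplicate entries.
$LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)

# With lib/gtfs-geojson.rb in place, a plain require would now find it:
# require 'gtfs-geojson'
```

The same effect is available without touching the code by running the script as `RUBYLIB=./lib ruby script.rb` or `ruby -Ilib script.rb`.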

If you are considering writing your own gem, I highly recommend RubyGems.org’s “Make Your Own Gem” guide. It is comprehensive and just generally fantastic.

I plan to use gtfs-geojson in a Rails project in the future. And speaking of gems, I’ve also been working on a Ruby API client for Transitland. I hope to have more to share on both fronts soon!

Until then, ride on!

Have any transit projects to share? Let me know!

Transfort Bus Stops Through the Lens of GIS

To better understand the Fort Collins population and what percentage of it is adequately served by Transfort bus stops, I decided to jump on board the GIS-hype train. I downloaded QGIS, read a bit at qgistutorials.com, and felt ready to dive in.

You’re talkin’ about Transfort bus stops?

You bet I am! To begin (and prove to myself this wouldn’t be the most manual project I’d ever taken on), I collected data from several sources. I have included a Links section with paths to download the data yourself. You can also jump straight to the data, though you’ll miss some sweet graphics along the way.

  1. Transfort – I could not find shapefiles for either the Transfort stops or routes, so I began with the GTFS feed. Data in this transit agency standard format is in a series of comma-separated text files. The three of interest to me were the stops.txt, shapes.txt, and routes.txt.
  2. City of Fort Collins – I used two shapefiles provided by the city, ones of city limits and street centerlines.
  3. Colorado Information Marketplace – Fortuitously, Colorado publishes population data on the census block level. These correspond to city blocks, which were necessary for analyzing the population within Fort Collins.

To visualize the population density, I began with a heatmap. The census blocks shapefile is essentially a table of polygons, each with an attribute containing the population of that block in 2010. I filtered the layer to only include blocks within Larimer County and then created a layer of the census block centroids, which turned each polygon into a point. At this point, the default layer unit was degrees. To analyze this layer in meters, I reprojected the layer to SIRGAS 2000 / UTM Zone 13N. I then created a raster heatmap with a radius of 402 meters, which corresponds to a quarter-mile radius. This is an area of approximately 0.2 square miles, which is also listed in the map legend.
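The radius and area figures are easy to double-check. This quick sketch only redoes the arithmetic from the paragraph above; the variable names are mine.

```ruby
meters_per_mile = 1609.344

radius_m   = 402.0
radius_mi  = radius_m / meters_per_mile   # heatmap radius in miles
area_sq_mi = Math::PI * radius_mi**2      # area of a circle with that radius

puts format('radius: %.3f mi, area: %.3f sq mi', radius_mi, area_sq_mi)
# prints "radius: 0.250 mi, area: 0.196 sq mi"
```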

Before analyzing bus stops, I wanted to visually present each Transfort route. This required converting the shapes.txt file into a routes shapefile. QGIS can do this with the “Points to path” tool under Vector Creation algorithms. I have uploaded the resulting shapefile, along with PDFs of the following maps, in the Downloads section.

Population Density with Transfort Bus Stops

In addition to population density, I wanted to study walking distance from Transfort bus stops. Latitude and longitude information for each stop is contained within the stops.txt file. The QGIS plugin MMQGIS allows imports of these using “Geometry Import from CSV File.” I again needed to reproject the resulting layer to SIRGAS 2000 / UTM Zone 13N to ensure the layer units were meters. I wanted to see the results of a 10-minute walk radius, so I created 804 meter (half-mile) buffers around each bus stop.

Ten-Minute Walk Radii from Transfort Bus Stops

Since the half-mile coverage seemed surprisingly complete, I created a layer of 402 meter (quarter-mile) buffers around each bus stop to show the area within a 5-minute walk.

Five-Minute Walk Radii from Transfort Bus Stops

To allow the population density layer to blend with the walk distance buffers, I changed the layer blending mode to Darken. The shades of green in the image below show dense areas overlapping with a 5-minute walk radius from a bus stop.

Population Density with Five-Minute Walk Radii from Transfort Bus Stops

Do you have any numbers I can ‘wow’ my friends with?

Fort Collins Population within Walking Distance of Transfort Bus Stops

I’m glad you asked! Another powerful feature of GIS is quantitative analysis. I used “Basic statistics” on the 2010 population field of the census block centroids layer to calculate the population of Larimer County. For the population of Fort Collins, I selected the census blocks that lie within the city limits polygon using a Spatial Query. Running statistics on these selected census block centroids produced the city population number in the table above.
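If you want to try the same kind of count without QGIS, the idea can be sketched in plain Ruby. This is a stand-in for the buffer-and-select workflow, not what I ran: it swaps the reprojected buffers for great-circle (haversine) distances on raw latitude and longitude, the helper names are hypothetical, and the coordinates and populations below are made up for illustration.

```ruby
# Great-circle distance in meters between two [lat, lon] pairs (haversine).
def haversine_m(a, b)
  earth_radius_m = 6_371_000.0
  lat1, lon1, lat2, lon2 = [*a, *b].map { |deg| deg * Math::PI / 180 }
  h = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  2 * earth_radius_m * Math.asin(Math.sqrt(h))
end

# Sum the population of every centroid within radius_m of at least one stop.
def population_within(centroids, stops, radius_m)
  centroids
    .select { |c| stops.any? { |s| haversine_m(c[:coord], s) <= radius_m } }
    .sum { |c| c[:population] }
end

# Made-up sample data, not real Fort Collins figures.
stops = [[40.5853, -105.0844]]
centroids = [
  { coord: [40.5860, -105.0850], population: 120 }, # under 100 m from the stop
  { coord: [40.5907, -105.0844], population: 80 },  # about 600 m north
  { coord: [40.6200, -105.0844], population: 300 }  # nearly 4 km north
]

puts population_within(centroids, stops, 402)  # quarter-mile radius, prints 120
puts population_within(centroids, stops, 804)  # half-mile radius, prints 200
```

Because each centroid is counted once no matter how many stops it is near, overlapping buffers do not double-count residents.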

You can see that the first half-mile buffer population is larger than the city population. I calculated the population within the walking distance buffers using two methods to adjust for this:

  1. No Flex Route – The FLEX is a commuter route operated by Transfort whose northern terminus is the South Transit Center. Several of its northernmost stops are inside Fort Collins city limits. I made the decision to remove the population near FLEX stops as commuter bus service has lower frequencies, and therefore different usage patterns, than a typical city bus. This was accomplished by joining the stops shapefile with stop_times.txt and trips.txt to give each stop a column with its route name. I then used the Query Builder to select all stops whose route name was not “FLEX”.
  2. City Limits – The northwest corner of the Transfort routes actually runs outside of Fort Collins city limits, meaning the people living close to these stops were not included in my calculated city population. I performed a Spatial Query to select the bus stops within the city limits polygon boundary. These are the only stops I calculated buffers around when selecting census blocks for this method.
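The join behind the first adjustment can be sketched outside of QGIS as well. This is a minimal illustration under a few assumptions: the helper names are hypothetical, the FLEX is assumed to be identifiable by the route_id column in trips.txt (if a feed keeps the name only in routes.txt, one more lookup is needed), and a stop served by both the FLEX and another route is kept.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: map each stop_id to the set of route_ids serving it by
# chaining stop_times.txt (stop_id -> trip_id) and trips.txt (trip_id -> route_id).
def routes_by_stop(stop_times_csv, trips_csv)
  trip_route = CSV.parse(trips_csv, headers: true)
                  .map { |t| [t['trip_id'], t['route_id']] }.to_h

  result = Hash.new { |hash, stop_id| hash[stop_id] = Set.new }
  CSV.parse(stop_times_csv, headers: true).each do |st|
    result[st['stop_id']] << trip_route[st['trip_id']]
  end
  result
end

# Keep the stops served by at least one route other than the excluded one.
def stops_excluding(stop_routes, excluded_route)
  stop_routes.reject { |_stop, routes| routes == Set[excluded_route] }.keys
end
```

Usage would look like `stops_excluding(routes_by_stop(File.read('stop_times.txt'), File.read('trips.txt')), 'FLEX')`, which returns the stop IDs to draw buffers around.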

In the figure below, the bottom center red circle shows the location of the first adjustment and top left red circle the second.

Exceptions Made When Analyzing Data for Transfort Bus Stops

What does this mean?

You can see that the east side of Fort Collins contains both fewer dense areas and fewer routes, especially north-south routes. I have seen a pre-MAX Transfort map (pre-May 2014) that contained a north-south route on Timberline Road, the easternmost arterial in Fort Collins. While it is disappointing that ridership supposedly did not justify keeping this route, the density numbers back up this service change.

The half-mile buffers confirm that the city is broken into a square mile grid. The two east-west routes (Horsetooth and Harmony) in the southeast corner of the map show the half-mile circles bumping against each other, creating a distance of a mile between the two roads.

Depending on the metric, between 60% and 63% of Fort Collins residents are within a five-minute walk of a Transfort bus stop. This is significantly higher than I would have guessed. However, being near a bus stop is only part of the story; frequency of service and driving disincentives also play a major role in whether a resident will ride the bus or not. Parking is quite easy in most of Fort Collins and the areas where it is harder, mainly Old Town, provide markedly sub-market value parking. The headway on the routes, excluding the MAX, is either 30 or 60 minutes. And there is no Sunday service. All this goes to say living within 5 minutes of a bus stop does not necessarily make for a transit heaven. It should also be noted that the block-level populations may have significantly changed since the 2010 census.

Regardless, having 87% of your population live within a 10-minute walk of a bus stop indicates an overall lack of transit deserts and a fairly comprehensive bus system. I think Fort Collins is very close to a big shift in transit culture!

For better or worse, I will now scrutinize the density surrounding my Transfort bus stops and routes even more. Here’s what I can tell you from my observations thus far: it’s not great. Here’s what the data says: it’s not great. And here’s to using data to continue to improve our city and its bus service!

Let me know if there is more I can do with this data! I’d also enjoy seeing analysis of this type in your own city.

Until next time, ride on!

Downloads

Transfort Routes Shapefile

Population Density with Transfort Bus Routes PDF

Ten-Minute Walk Radii from Transfort Bus Routes PDF

Five-Minute Walk Radii from Transfort Bus Routes PDF

Population Density with Five-Minute Walk Radii from Transfort Bus Routes PDF

Links

Transfort GTFS Feed
http://www.ridetransfort.com/developers OR
https://code.google.com/p/googletransitdatafeed/wiki/PublicFeeds OR
http://transitfeeds.com/

City of Fort Collins
http://www.fcgov.com/gis/downloadable-data.php

Colorado Information Marketplace
https://data.colorado.gov/Demographics/Census-Blocks-2010/xipb-k5bu