Friday, June 13, 2014

Intermediate GTFS: How to prevent stop_times.txt from hogging all your RAM

In a GTFS file for a large provider (example: Chicago), the stop_times.txt file can be huge.

Here's an example:

$ unzip -v 20140611.cta.gtfs

 Archive:  20140611.cta.gtfs
  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
 --------  ------  ------- ---- ---------- ----- --------  ----
      235  Defl:N      153  35% 2014-06-10 22:49 6dfdcbd7  agency.txt
     4889  Defl:N      554  89% 2014-06-10 22:49 38544274  calendar.txt
     2139  Defl:N      335  84% 2014-06-10 22:49 f69f2cb4  calendar_dates.txt
    13000  Defl:N     2279  83% 2014-06-10 22:49 59bf453e  routes.txt
 29896363  Defl:N  7158190  76% 2014-06-10 22:54 21c7d003  shapes.txt
  1332298  Defl:N   264714  80% 2014-06-10 23:00 48a736df  stops.txt
355900629  Defl:N 57437701  84% 2014-06-10 23:24 e13831ff  stop_times.txt
     2514  Defl:N      575  77% 2014-06-10 23:24 51dca8cf  transfers.txt
  5654714  Defl:N   656156  88% 2014-06-10 23:25 7056aa15  trips.txt
       42  Defl:N       39   7% 2014-06-10 22:49 87676593  frequencies.txt
    18959  Defl:N     6180  67% 2011-06-20 11:07 bcf138d1  developers_license_agreement.htm
 --------          -------  ---                            -------
392825782         65526876  83%                            11 files

Uncompressed, the stop_times.txt file is 356 MB, 91% of the entire archive.

You can see right away that loading that behemoth into memory will cause a pause (30 seconds, on my hardware) that a user will probably notice. And because of the way Python dicts use memory, 356 MB of CSV data can easily explode into 3 or 4 GB once it's loaded into a comparable Python dict. That leaves us with a problem: we can't load the file on demand because it takes too darn long, and we can't keep it in an easily accessible form in memory because it's too darn big. We could keep the raw CSV text in RAM, but that's not a format we can do anything with - every query would have to iterate through the entire file, which is sort of the worst of both worlds.




So what can we do?


First, we can filter data immediately upon load, and discard big chunks of it right away. This means the structure of the program must change from:

cal = load_table(calendar.txt)
rou = load_table(routes.txt)
tri = load_table(trips.txt)
sto = load_table(stops.txt)
tim = load_table(stop_times.txt)
svc_id = valid_service_ids_for_today(cal)
trips  = trips_that_stop_there_today(tri, svc_id, tim)

to something more like:

cal = load_table(calendar.txt)
rou = load_table(routes.txt)
tri = load_table(trips.txt)
sto = load_table(stops.txt)
tim = load_and_filter_stop_times_table(stop_times.txt, filter_criteria)
svc_id = valid_service_ids_for_today(cal)
trips  = trips_that_stop_there_today(tri, svc_id, tim)

Instead of loading the entire table and holding it in memory, a customized loader filters the data right away and chucks out the 90% we don't actually want.
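
For illustration, here's a minimal sketch of what load_and_filter_stop_times_table() from that pseudocode could look like, assuming filter_criteria is a function that takes a row dict and returns True for rows to keep. It is not the working sample mentioned under "Working examples" below; it streams rows with the csv module rather than the manual block reads described next, but the filter-on-load idea is the same.

import csv
import io
import zipfile

def load_and_filter_stop_times_table(gtfs_file, filter_criteria):
    """Stream stop_times.txt out of an open zipfile.ZipFile, keeping only
    the rows for which filter_criteria(row) returns True."""
    kept = {}
    with gtfs_file.open('stop_times.txt', mode='r') as infile:
        reader = csv.DictReader(io.TextIOWrapper(infile, encoding='utf-8'))
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            if filter_criteria(row):
                kept[row_number] = row
    return kept

# Example: keep only the rows for a handful of stop_ids.
gtfs_file = zipfile.ZipFile('20140611.cta.gtfs', mode='r')
wanted_stops = {'3661', '3698', '15026', '17433'}
tim = load_and_filter_stop_times_table(gtfs_file,
                                       lambda row: row['stop_id'] in wanted_stops)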


Second, we can use the size hint that Python's readlines() accepts to read the file in small blocks and prevent a spike in RAM usage (and the consequent slowdown of the entire system).

For example, normally you read data with something like:

with gtfs_file.open('stop_times.txt', mode='r') as infile:
    raw_data = infile.read()          # pulls the entire 356 MB table into memory
output = do_stuff_to(raw_data.decode())
return output


And we can rearrange it to something like:

eof    = False
output = {}
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    while not eof:
        # readlines() with a size hint returns whole lines totalling
        # roughly 4096 bytes, instead of the entire file at once
        input_block = infile.readlines(4096)
        if len(input_block) == 0:
            eof = True
            continue

        output_block = do_stuff_to(input_block)
        output.update(output_block)

return output

The EOF flag controls the process, and an empty read triggers the flag.
The input RAM requirements are tiny, since each block re-uses the same memory over and over. Only the filtered output grows.

The performance penalty of this kind of loading seems to be about 30% (10 seconds) longer than a single read(), but now the process can run happily in the background without slowing other processes. Filtering the huge file down to a manageable size and a searchable format also makes subsequent use of the data _much_ faster.
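
If you want to measure that trade-off on your own hardware and feed, a quick-and-dirty comparison looks something like this (a simple line count stands in for the real filtering work):

import time
import zipfile

gtfs_file = zipfile.ZipFile('20140611.cta.gtfs', mode='r')  # use your own feed

# One giant read()
start = time.time()
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    data = infile.read()
print('single read():     %.1f seconds' % (time.time() - start))

# Block-by-block readlines()
start = time.time()
line_count = 0
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    while True:
        block = infile.readlines(4096)
        if not block:
            break
        line_count += len(block)  # stand-in for the real filtering work
print('block readlines():  %.1f seconds for %d lines' % (time.time() - start, line_count))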




Working examples

Here's working sample code for a generic load_table(), which I called map_gtfs_table_to_dict().
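
In rough outline, that generic loader can look something like this (a sketch of the idea, not the sample code itself: it splits naively on commas, which quoted fields would break, and it arbitrarily keys each row on the value of its first column):

def map_gtfs_table_to_dict(gtfs_file, table_name):
    """Read one table out of an open zipfile.ZipFile and return a dict of
    row-dicts, keyed (arbitrarily, in this sketch) on the first column."""
    with gtfs_file.open(table_name, mode='r') as infile:
        lines = infile.read().decode('utf-8').splitlines()

    # The first line is the header; map each column name to its position.
    columns = {name.strip('" '): position
               for position, name in enumerate(lines[0].split(','))}

    table = {}
    for line in lines[1:]:
        if not line.strip():
            continue
        fields = line.split(',')   # naive split; a quoted comma would break it
        table[fields[0].strip('" ')] = {column: fields[position].strip('" ')
                                        for column, position in columns.items()}
    return table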

Here's working sample code for the custom load_stop_times(), which I called map_stop_times_to_dict().
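
And the load-and-filter version for stop_times.txt can combine the block-reading loop from above with a stop_id filter, roughly like this (again a sketch under my own assumptions; the actual map_stop_times_to_dict() may filter and key its rows differently):

def map_stop_times_to_dict(gtfs_file, stop_list, block_size=4096):
    """Read stop_times.txt from an open zipfile.ZipFile in small blocks,
    keeping only rows whose stop_id appears in stop_list.  Keys are a
    running row counter, since no single field is unique per row."""
    filtered = {}
    columns = None
    row_number = 0
    eof = False
    with gtfs_file.open('stop_times.txt', mode='r') as infile:
        while not eof:
            block = infile.readlines(block_size)
            if len(block) == 0:
                eof = True
                continue
            for raw_line in block:
                row_number += 1
                line = raw_line.decode('utf-8').strip()
                if columns is None:
                    # The first line of the file is the header row.
                    columns = {name.strip('" '): position
                               for position, name in enumerate(line.split(','))}
                    continue
                fields = line.split(',')   # naive split, as in the sketch above
                if fields[columns['stop_id']].strip('" ') not in stop_list:
                    continue
                filtered[row_number] = {column: fields[position].strip('" ')
                                        for column, position in columns.items()}
    return filtered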


Let's use the generic sample code to load trips.txt from a CTA GTFS file:

>>> from map_gtfs_table_to_dict import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_trips = load_table(gtfs_file, 'trips.txt')
>>>
>>> len(all_trips)
90428
>>>
>>> for trip_id in list(all_trips)[0:5]:
...      print(trip_id, all_trips[trip_id])
... 
45072053995 {'route_id': 'Y', 'direction_id': '0', 'wheelchair_accessible': '1',
'service_id': '104509', 'direction': '0', 'shape_id': '304500033',
'block_id': '45007747580', 'schd_trip_id': 'R501'}
437076941189 {'route_id': '147', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'North', 'shape_id': '4374519',
'block_id': '437008159276', 'schd_trip_id': '76941189'}
437076941185 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008158984', 'schd_trip_id': '76941185'}
437076941184 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008159008', 'schd_trip_id': '76941184'}
437076941186 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008158986', 'schd_trip_id': '76941186'}
>>> 

There we are: all 90,000 trips loaded in a few seconds, with each line of the GTFS table broken into a dict for easy access.



Now, let's try using the same generic load function to load stop_times.txt

>>> from map_gtfs_table_to_dict import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_stops = load_table(gtfs_file, 'stop_times.txt')
Traceback (most recent call last):
  File "", line 1, in 
  File "./test.py", line 56, in map_gtfs_table_to_dict
    line_dict[column] = line.split(',')[columns[column]].strip('" ')
MemoryError
>>> 

It took a few minutes to consume all that memory and fail, and everything else on the system slowed to a crawl.

Finally, let's use the custom load_stop_times() to load and filter stop_times.txt without reducing the rest of the system to a memory-starved crawl:

>>> from map_stop_times_to_dict import map_stop_times_to_dict as load_table
>>> import zipfile
>>>
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> stop_list = ['3661', '3698', '15026', '17433']
>>> filtered_stops = load_table(gtfs_file, stop_list)
>>> len(filtered_stops)
4379
>>> for line in list(filtered_stops)[0:5]:
...     print(line, filtered_stops[line])
... 
4521987 {'stop_headsign': '79th Red Line', 'departure_time': '13:53:57',
'shape_dist_traveled': '22219', 'arrival_time': '13:53:57',
'pickup_type': '0', 'stop_id': '3661', 'stop_sequence': '35',
'trip_id': '437077075442'}
4538374 {'stop_headsign': '79th Red Line', 'departure_time': '15:04:01',
'shape_dist_traveled': '37847', 'arrival_time': '15:04:01',
'pickup_type': '0', 'stop_id': '3680', 'stop_sequence': '58',
'trip_id': '437077077228'}
4325384 {'stop_headsign': '127th/Lowe', 'departure_time': '17:26:00',
'shape_dist_traveled': '27644', 'arrival_time': '17:26:00',
'pickup_type': '0', 'stop_id': '5992', 'stop_sequence': '42',
'trip_id': '437077000628'}
475149 {'stop_headsign': '95th Red Line', 'departure_time': '18:01:18',
'shape_dist_traveled': '14028', 'arrival_time': '18:01:18',
'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '20',
'trip_id': '436073915060'}
4112403 {'stop_headsign': '95th Red Line', 'departure_time': '22:08:53',
'shape_dist_traveled': '29370', 'arrival_time': '22:08:53',
'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '42',
'trip_id': '437076995010'}
>>> 

It's still slow to churn through the entire file: on my equipment, loading and filtering stop_times.txt takes about 30 seconds this way. Varying the block size has some benefit, but you can experiment with that yourself.

In the final example, the 'line' variable -- the key to each subdict describing one trip serving a stop -- is just a counter, the (approximate) CSV row number. stop_times.txt has no single field on each line that can be used as a unique dict key.
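
If a bare counter bothers you, one alternative (not what the sample code above does, just an option) is to re-key the filtered rows on the (trip_id, stop_sequence) pair, which is unique in stop_times.txt because stop_sequence must increase along each trip:

# Re-key the filtered rows on (trip_id, stop_sequence) instead of the row counter.
by_trip_and_seq = {
    (row['trip_id'], int(row['stop_sequence'])): row
    for row in filtered_stops.values()
}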