Friday, December 19, 2014

Wordpress and LAMP in an LXC Container

Container

I need to test some html running in Wordpress.
But I don't use Wordpress...obviously.

In Linux, that's easy to fix. Just install Wordpress.
Oops, not so easy: Wordpress pulls in an entire LAMP stack with it.
That's a lot to pollute my laptop with just to do some testing.

Containers to the rescue!
Let's spin up a container, install LAMP and Wordpress inside it, run the tests, then destroy the container.

Cargo

The Router


I'm going to open a whole new port on my router's firewall for this, so others can help me test.

On the router, I want to forward port 112233 to the similar port on the server.

On the server, I want to forward port 112233 to the container's port 80. (That part is later)


Creating the container


Thanks the the amazing Stephane Graber for his detailed instructional series on how to create and use a container This is a slightly different setup than he did. Instead of installing directly on, say, a laptop, I'm installing the container onto a server that I access via ssh.

Three moving pieces: Laptop (me), Server (headless), Container (added to server)

So, from Laptop, I ssh into Server normally.

On SERVER:
sudo apt-get install lxc               # Install the container system
sudo lxc-create -t ubuntu-cloud -n c1  # Download and install a 195MB cloud image of Ubuntu 14.10
sudo lxc-start -n wp1 -d               # Boot the image in the background (name is 'wp1')
sudo lxc-info -n wp1                   # Discover the IP of the image
    Name:           c1
    State:          RUNNING
    IP:             10.0.3.201         # <-- Ooh, there is the IP
ping 10.0.3.201                        # Check network connectivity on the container
    PING 10.0.3.201 (10.0.3.201) 56(84) bytes of data.
    64 bytes from 10.0.3.201: icmp_seq=1 ttl=64 time=0.081 ms
    64 bytes from 10.0.3.201: icmp_seq=2 ttl=64 time=0.085 ms

# Forward port 112233 to the container
sudo iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 112233 -j DNAT --to-destination 10.0.3.201:80
sudo iptables -A FORWARD -p tcp -d 10.0.3.201 --dport 80 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT


Set up LAMP and Wordpress within the container


The container is installed, started, and responds to ping.

On SERVER:
sudo lxc-console -n wp1    # Login to the console of wp1 (username: ubuntu, password: ubuntu)

A couple words about the container.
I'm going to skip the part where you would delete the 'ubuntu' user and add your own admin user and password. But you should do it.
Also, two pieces of information you should have handy before going farther.
  1. Server's IP address on the LAN (mine is 192.168.0.101)
  2. The real, fully qualified domain name (freds_blog.fred.com) that readers on the internet will use
The FQDN will be used to generate the MySQL account for the blog, and the create the wordpress config file. The LAN address will be used to link to the same accounts.

Special thanks to the Ubuntu community for putting together this fantastic tutorial about how to install wordpress in Ubuntu.

# On CONTAINER
sudo apt-get install mysql-server wordpress   # 49 packages, 28MB download

Let's jump over to Laptop for a moment and check that Server's port forwarding and Container's apache service are working: Open a browser window, and look for http://192.168.0.101:112233. You should get the apache default page. Success! Okay, now back to Container:

# On CONTAINER
sudo ln -s /usr/share/wordpress /var/www/html/wordpress   # Make Wordpress accessible from apache

# Create mysql account, and link it to Wordpress
sudo gzip -d /usr/share/doc/wordpress/examples/setup-mysql.gz
sudo bash /usr/share/doc/wordpress/examples/setup-mysql -n freds_blog_fred_com freds_blog.fred.com  

# Prevent a Wordpress error (can't locate config file) from within the LAN
# by linking the LAN addr to existing FQDN config file
sudo ln /etc/wordpress/config-freds_blog.fred.com.php /etc/wordpress/config-192.168.0.101.php

exit
<ctrl+a, q> to exit the console


Use Wordpress


The setup is complete. Since I'm on my LAN, I point my Laptop browser to http://192.168.1.101/wordpress/ and I get the Wordpress setup screen.

Outside, on the big wide internet, I would point it to http://freds_blog.fred.com/wordpress/ (er, that's an example - you already know that's not my real blog).

Wordpress is ready for my data.


          Bye, Container!

Destroying the container


This is one of the true joys of LXC

# On SERVER
sudo lxc-stop -n wp1
sudo lxc-destroy -n wp1


And all the work is wiped out forever....

Remember to clean up:
  • Uninstall lxc from the server
  • Delete the iptables rules on the server
  • Close the port on the router firewall

Saturday, December 13, 2014

Say, what are you doing this afternoon?

Bus Stop - Under The Rain by Leonid Afremov
Here's a happy little afternoon project for new users trying to play with a new
language or script, and getting their feet in the open-source ecosystem.

Your phone's handy weather app depends upon the goodwill of a for-profit data provider, and their often-opaque API (14 degrees? Where was it observed? When?) That's a shame because most data collection is paid for by you, the taxpayer.

Let's take the profit-from-data out of that system. Several projects have tried to do this before (including libgweather), but each tried to do too much and replicate the one-data-provider-to-rule-them-all model. And most ran aground on that complexity.


Here's where you come in

One afternoon, look up your weather service's online data sources. And knock together a script to publish them in a uniform format.

Here's the one I did for the United States:
Worldwide METAR observation sites
US DOD and NOAA weather radars
US Forecast/Alert zones

  • Looking for data on non-METAR (non-airport) observation stations, weather radar sites, and whatever forecast and alert areas your country uses.

  • Use the same format I did: Lat (deg.decimal), Lon (deg.decimal), Location Code, Long Name. Use the original source's data, even if it's wrong. Area and zones should use the lat/lon of the centroid.

  • The format is simple CSV, easy to parse and publish.

  • Publish on GitHub, for easy version control, permalinking, free storage, and uptime.

  • Here's the key: Your data must be automatically-updated. Regularly, your program must check the original source and update your published version. How I did it with a cron job. Publish both the data and your method on GitHub. 

  • When you have published, drop me an e-mail so I can link to your data and source.

If you do it right, one afternoon to setup your country's self-updating database. Not a bad little project, you learn a little, and you help make the world a better place.


My country doesn't have online weather data

Sorry to hear that. You're missing some great stuff.

If you live in a country with a reasonably free press and reasonably fair elections, make a stink about it. You are probably already paying for it through taxes, why can't you have it?

If you live somewhere else, then next time you have a revolution or coup, add 'open data' to the long list of needed reforms.


What will this accomplish?

This will create a free, sustainably updated, uniform, crowdsourced set of accurate worldwide data that will be easy to compile into a single global database. If you drop out, your online code will ensure another volunteer can step in.

This is one fundamental tool that other free-weather projects have lacked. And any weather project can use this.

The global database of locations is really small by most database standards. Small enough to easily fit on a phone. Small enough to be bundled with apps that can request data directly from original sources...once they can look up the correct source to use.


How will this change the world?

It's about simple tools that make
it easy to create free, cool software.
And it's about ensuring free access to data you already paid for.

Not bad for one afternoon's contribution.

Friday, December 12, 2014

How to set up automatic Github uploads

Back to basics

For years, I have been playing with US Weather Data.

A couple years ago, I came up with a lookup server for US weather locations, and -of course- had all kinds of plans for how to expand it to more services and more countries.

So, of course, it went nowhere.

This week, I took a look at the issue again, and came up with a simple set of lookup tables (yes, a tiny database) to replace the server. It reduced the amount of code by about 70%, and it's finally in a maintainable, expandable form. It's simple, and ready for a few drive-by patches.

Which leads to the next question...how to share it with the world?

Well, in this decade, sharing software and data means putting it on GitHub.


Creating the GitHub repositories

We have a Python 3 script that creates the database, another Python 3 example script that consumes the database, and the three CSV files that make up the database tables. Most consumers will probably just want the data for their own purposes - they really don't need the scripts (and it would be quite rude for millions of users to hit NOAA servers to produce their own databases each week)

So I created two repositories on GitHub, one for the data and another for the optional scripts.

If you haven't used GitHub, you first go online and create your account and create the repositories (with initial README file) in a web browser. Then:

cd /tmp
git clone https://github.com/ian-weisser/data   # Create a clone of the online repository
cd data
cp /home/somewhere/else/data/* ./               # Add existing data files into the local clone
git add ./*                                     # Tell git to track the new data files
git commit -m "Add initial data"
git push origin master                          # Push the changes back to GitHub
  Username for 'https://github.com':
  Password for 'https://me@github.com': 
cd /tmp 
rm -rf /tmp/data                                # Clean up

Overall, it's pretty simple. For the next iteration:
  • We must automate the upload credentials so it doesn't prompt for a username and password each time.
  • We must add a cron job that creates the data for upload.
  • Let's move this project from /tmp on my laptop to the server where it will live permanently. This also means that the server should pull the latest Python script...which is also hosted on GitHub.

Let's setup the server:


sudo apt-get install git    # On this server, about 3 MB of download
sudo apt-get install python3-httplib2  # Script dependency
mkdir uploads
cd uploads
git clone https://github.com/ian-weisser/data          # Setup the local 'data' repository
git clone https://github.com/ian-weisser/weather       # Setup the local 'software' repository
weather/nws_database_creator.py                        # Use the script to update CSVs in ~/uploads/data
cd data
git config --global user.email "my_email@example.com"  # First time setup
git config --global user.name "My Name"                # First time setup
git config credential.helper store                     # First time using this repo setup
git add -A
git commit -m "Test update"
git push origin master
  Username for 'https://github.com': my-github-login-name   # First time using this repo setup
  Password for 'https://my-github-login-name@github.com':   # First time using this repo setup


Automating the process

So now subsequently, without all the first time setup:

#!/bin/sh

dir=/home/ian/uploads
hub="https://github.com/ian-weisser"
now=$(/bin/date +%F)

cd $dir/weather
/usr/bin/git pull "$hub"/weather      # Update the database_creator script

cd $dir/data
/usr/bin/git pull "$hub"/data         # Need to pull to be able to push later

$dir/weather/nws_database_creator.py  # Do the update
/usr/bin/git add -A                   # Record any changes
/usr/bin/git commit -m "Scheduled update $now"
/usr/bin/git push origin master
exit 0


And now we can run the shell script weekly from a user cron job:

5 6 * * 7 /home/ian/uploads/cron/weekly_weather > /home/ian/uploads/weather.log 2>&1


A couple of tests, and it works.

It's completely automatic.

Once a week, my server downloads the latest make-a-database script, builds the database, and uploads the database to GitHub for the world to use.


The data


If you want to see the final product (the lat/lon lookup tables for US weather), you can get them at:
https://raw.githubusercontent.com/ian-weisser/data/master/metar.csv
https://raw.githubusercontent.com/ian-weisser/data/master/radar.csv
https://raw.githubusercontent.com/ian-weisser/data/master/zone.csv

They will, of course, be automatically updated weekly.

Tuesday, July 22, 2014

Etherape as root

This has popped up as an occasional question in the forums.
If you monitor network activity using Etherape, it's worthless as a user level application. It must be run as root. But running as root requires opening a terminal and entering a password.

Here's how to change the settings in Ubuntu 14.04 to create an Etherape (as root) launcher in Unity that won't require a password. It has only been tested with 14.04. Unity development moves fast - this method may not work for a different version of Ubuntu.

It requires a minor edit to the /etc/sudoers file, and minor edits to two .desktop files. It's easy to undo if you change your mind.


1) I like to document my changes. I have three changes to keep track of, in case I change my mind. So I'm adding a new subdirectory to keep track of this change.

mkdir my_customizations/etherape_launcher




2) Edit the sudoers file. This is the dangerous part. Pay close attention.

You need to know your login account name (fred, Jane, dog, etc.)
You need to know your hostname as shown in /etc/hostname.

$ sudo visudo


At the bottom of the file, add one line. Spacing, spelling, and CAPS are important.

account_name hostname = NOPASSWD: /usr/bin/etherape


Use your account_name and hostname, of course.

Do a CTRL+X to exit, and hit Y to confirm the exit.
Now look carefully at the screen.

If visudo is happy with your changes, it will print no output, but simply return you to a command prompt.

If the screen shows an error, type 'edit' and go right back in to fix the error. If you don't know how to fix the error, delete your added line and save. You MUST NOT exit with a broken sudoers file - a broken sudoers file is 100% guaranteed to break your system! It's very important that you pay close attention to any error messages visudo prints when you try to exit.



3) Test: Open a NEW terminal window, and try 'sudo etherape'.
It should no longer ask for a password.

If you get sudo errors instead, you have broken your system! Stop! Save all your data! Reboot into a recovery console (since you cannot use sudo) and edit /etc/sudoers to undo your changes. Go to an Ubuntu help channel and ask for help on how to do this!



4) Now that we know the sudo change works, move the sudoers line to a separate file

We used visudo to ensure the correct syntax and a working sudoers file. But it's not a great idea to leave undocumented system changes laying around in system-installed (and system-rewritable) files. So let's move the customization to a separate file so we can keep track of it.

$ sudo nano /etc/sudoers.d/etherape

File contents:

# Allows 'sudo etherape' to be run, including from a launcher, wthout password

my_account my_hostname = NOPASSWD: /usr/bin/etherape


Remember to change my_account and my_hostname to the proper values!

Test your change using 'sudo etherape'. It should work without a password, and without an error message.

Go back into visudo and delete the etherape line at the bottom.

Test your change once more using 'sudo etherape'. It should work without a password, and without an error message.


Now /etc/sudoers is back to it's original state, and our change is safely in a separate file. If we want to delete the change, we simply delete /etc/sudoers.d/etherape.



5) Let's link this new sudoers file to our documentation so we can find it again.

$ sudo ln /etc/sudoers.d/etherape \
          my_customizations/etherape_launcher/




6) Edit the etherape-root .desktop file

$ sudo nano /usr/share/applications/etherape-root.desktop


File contents:

[Desktop Entry]
Name=EtherApe (as root)
Type=Application
Categories=GNOME;System;Network;
Comment=Graphical Network Monitor
Comment[es]=Monitor Gráfico de Red
#TryExec=su-to-root                            <-- Comment out
#Exec=su-to-root -X -c /usr/bin/etherape        <-- Comment out
Exec=/usr/bin/sudo /usr/bin/etherape -i wlan0     <-- Add line
Icon=etherape
Terminal=false




7) Edit the etherape (user) .desktop file to hide it. We don't want to delete it; the package manager will expect it to be there.

$ sudo nano /usr/share/applications/etherape.desktop


#[Desktop Entry]           <-- Comment out
Version=1.0
Encoding=UTF-8
Name=EtherApe
GenericName=EtherApe
Comment=Graphical Network Monitor
Comment[es]=Monitor Gráfico de Red
Exec=etherape
Terminal=false
Type=Application
Icon=etherape
Categories=GNOME;System;Network;




8) Let's save links to the changed desktop files so I can find them again someday:

$ sudo ln /usr/share/applications/etherape-root.desktop \
          my_customizations/etherape_launcher/
$ sudo ln /usr/share/applications/etherape.desktop \
          my_customizations/etherape_launcher/




9) Test the new .desktop file.

Open Unity's Dash.
Close any existing search.
Open a new search for Etherape. There should be only one option now: Etherape (as root).
Open it. It should show all interfaces without requiring a password or prompting an error.
Lock it to the Launcher (right-click).
Close etherape. The launcher icon should remain.
Open it again from the launcher. It should work

How to recover from a broken sudo

You cannot fix a broken sudoers file while logged in. You would need access to sudo for that...and it's broken. You can't get to the root account because it's locked in normal use.

So you need to reboot into a condition where root is unlocked and available.

1) Save all your work and then reboot.

2) Hold down the left SHIFT button to get to the GRUB prompt.
If you get a pretty splash screen, reboot and hold down the button earlier.

3) At the Grub prompt, select Advanced Options

4) At the Advanced Options prompt, select any Recovery Console

5) Watch the boot technobabble run across your screen.

6) At the Recovery Console prompt, select Drop To A Root Shell

7) You should see a warning about a read-only filesystem.
The prompt should look like '#'

8) Remount the root filesystem as read-write instead of read-only. We want to make changes that stick.

# mount -o remount /


9) Try a test write. It should not cause an error:

# date > /tmp/test


10) Okay, now go into visudo and fix your mistake.

11) When complete, reboot into normal Ubuntu.

# shutdown -r now

Tuesday, July 8, 2014

Simple geolocation in Ubuntu 14.04

Geolocation means 'figuring out where a spot on the Earth is'.

Usually, it's the even more limited question 'where am I?'


GeoClue


The default install of Ubuntu includes GeoClue, a dbus service that checks IP address and GPS data. Since 2012, when I last looked at GeoClue, it's changed a bit, and it has more backends available in the Ubuntu Repositories.

 

Privacy

Some commenters on the interwebs have claimed that GeoClue is privacy-intrusive. It's not. It merely tries to figure out your location, which can be handy for various services on your system. It doesn't share or send your location to anybody else.

dbus introspection and d-feet

You would expect that a dbus application like GeoClue would be visible using a dbus introspection tool like d-feet (provided by the d-feet package).

But there's a small twist: D-feet can only see dbus applications that are running. It can only see dbus applications that active, or are inactive daemons.

It's possible (and indeed preferable in many circumstances) to write a dbus application that is not a daemon - it starts at first connection, terminates when complete, and restarts at the next connection. D-feet cannot see these when they are not running.

Back in 2012, GeoClue was an always-on daemon, and always visible to d-feet.
But in 2014 GeoClue is (properly) no longer a daemon, and d-feet won't see GeoClue if it's not active.

This simply means we must trigger a connection to GeoClue to make it visible.
Below are two ways to do so: The geoclue-test-gui application, and a Python3 example.




geoclue-test-gui


One easy way to see GeoClue in action, and to make it visible to d-feet, is to use the geoclue-test-gui application (included in the geoclue-examples package)

$ sudo apt-get install geoclue-examples
$ geoclue-test-gui





GeoClue Python3 example


Once GeoClue is visible in d-feet (look in the 'session' tab), you can see the interfaces and try them out.

Here's an example of the GetAddress() and GetLocation() methods using Python3:

>>> import dbus

>>> dest           = "org.freedesktop.Geoclue.Master"
>>> path           = "/org/freedesktop/Geoclue/Master/client0"
>>> addr_interface = "org.freedesktop.Geoclue.Address"
>>> posn_interface = "org.freedesktop.Geoclue.Position"

>>> bus        = dbus.SessionBus()
>>> obj        = bus.get_object(dest, path)
>>> addr_iface = dbus.Interface(obj, addr_interface)
>>> posn_iface = dbus.Interface(obj, posn_interface)

>>> addr_iface.GetAddress()
(dbus.Int32(1404823176),          # Timestamp
 dbus.Dictionary({
     dbus.String('locality')   : dbus.String('Milwaukee'),
     dbus.String('country')    : dbus.String('United States'),
     dbus.String('countrycode'): dbus.String('US'),
     dbus.String('region')     : dbus.String('Wisconsin'), 
     dbus.String('timezone')   : dbus.String('America/Chicago')}, 
     signature=dbus.Signature('ss')),
 dbus.Struct(                 # Accuracy
     (dbus.Int32(3),
      dbus.Double(0.0),
      dbus.Double(0.0)),
      signature=None)
)

>>> posn_iface.GetPosition()
(dbus.Int32(3),               # Num of fields
 dbus.Int32(1404823176),      # Timestamp
 dbus.Double(43.0389),        # Latitude
 dbus.Double(-87.9065),       # Longitude
 dbus.Double(0.0),            # Altitude
 dbus.Struct((dbus.Int32(3),  # Accuracy
              dbus.Double(0.0),
              dbus.Double(0.0)),
              signature=None))

>>> addr_dict = addr_iface.GetAddress()[1]
>>> str(addr_dict['locality'])
'Milwaukee'

>>> posn_iface.GetPosition()[2]
dbus.Double(43.0389)
>>> posn_iface.GetPosition()[3]
dbus.Double(-87.9065)
>>> lat = float(posn_iface.GetPosition()[2])
>>> lon = float(posn_iface.GetPosition()[3])
>>> lat,lon
(43.0389, -87.9065)

Note: Geoclue's accuracy codes



Ubuntu GeoIP Service


When you run geoclue-test-gui, you discover that only one backend service is installed with the default install of Ubuntu - the Ubuntu GeoIP service.

The Ubuntu GeoIP service is provided by the geoclue-ubuntu-geoip package, and is included with the default install of Ubuntu 14.04. It simply queries an ubuntu.com server, and parses the XML response.

You can do it yourself, too:

$ wget -q -O - http://geoip.ubuntu.com/lookup

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Ip>76.142.123.22</Ip>
  <Status>OK</Status>
  <CountryCode>US</CountryCode>
  <CountryCode3>USA</CountryCode3>
  <CountryName>United States</CountryName>
  <RegionCode>WI</RegionCode>
  <RegionName>Wisconsin</RegionName>
  <City>Milwaukee</City>
  <ZipPostalCode></ZipPostalCode>
  <Latitude>43.0389</Latitude>
  <Longitude>-87.9065</Longitude>
  <AreaCode>414</AreaCode>
  <TimeZone>America/Chicago</TimeZone>
</Response>




GeoIP


The default install of Ubuntu 14.04 also includes (the confusingly-named) GeoIP. While it has the prefix 'Geo', it's not a geolocator. It's completely unrelated to the Ubuntu GeoIP service. Instead, GeoIP is a database the IP addresses assigned to each country, provided by the geoip-database package. Knowing the country of origin of a packet or server or connection can be handy.

geoip-database has many bindings, including Python 2.7 (but sadly not Python 3). Easiest is the command line, provided by the additional geoip-bin package.

$ sudo apt-get install geoip-bin
$ geoiplookup 76.45.203.45
GeoIP Country Edition: US, United States




GeocodeGlib


Back in 2012, I compared the two methods of geolocation in Ubuntu: GeoClue and GeocodeGlib. GeocodeGlib was originally intended as a smaller, easier to maintain replacement for GeoClue. But as we have already seen, GeoClue has thrived instead of withering. The only two packages that seem to require GeocodeGlib in 14.04 are gnome-core-devel and gnome-clocks
GeocodeGlib, provided by the libgeocode-glib0 package, is no longer included with a default Ubuntu installation anymore, but it is easily available in the Software Center.

sudo apt-get install gir1.2-geocodeglib-1.0


That is the GTK introspection package for geocodeglib, and it pulls in libgeocode-glib0 as a dependency. The introspection package is necessary.

Useful documentation and code examples are non-existent. My python code sample from 2012 no longer works. It's easy to create a GeocodeGlib.Place() object, and to assign various values to it (town name, postal code, state), but I can't figure out how to get GeocoddeGlib to automatically determine and fill in other properties. So even though it seems maintained, I'm not recommending it as a useful geolocation service.

Friday, June 13, 2014

Intermediate GTFS: How to prevent stop_times.txt from hogging all your RAM

In a GTFS file for a large provider (example: Chicago), the stop_times.txt file can be huge.

Here's an example:

$ unzip -v 20140611.cta.gtfs

 Archive:  20140611.cta.gtfs
  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
 --------  ------  ------- ---- ---------- ----- --------  ----
      235  Defl:N      153  35% 2014-06-10 22:49 6dfdcbd7  agency.txt
     4889  Defl:N      554  89% 2014-06-10 22:49 38544274  calendar.txt
     2139  Defl:N      335  84% 2014-06-10 22:49 f69f2cb4  calendar_dates.txt
    13000  Defl:N     2279  83% 2014-06-10 22:49 59bf453e  routes.txt
 29896363  Defl:N  7158190  76% 2014-06-10 22:54 21c7d003  shapes.txt
  1332298  Defl:N   264714  80% 2014-06-10 23:00 48a736df  stops.txt
355900629  Defl:N 57437701  84% 2014-06-10 23:24 e13831ff  stop_times.txt
     2514  Defl:N      575  77% 2014-06-10 23:24 51dca8cf  transfers.txt
  5654714  Defl:N   656156  88% 2014-06-10 23:25 7056aa15  trips.txt
       42  Defl:N       39   7% 2014-06-10 22:49 87676593  frequencies.txt
    18959  Defl:N     6180  67% 2011-06-20 11:07 bcf138d1  developers_license_agreement.htm
 --------          -------  ---                            -------
392825782         65526876  83%                            11 files

Uncompressed, the stop_times.txt file is 356 MB, 91% of the entire archive.

You can see right away that loading that behemoth into memory will cause a pause (30 seconds, on my hardware) that a user will probably notice. And the way Python dicts use memory, 356 MB of CSV data can easily explode into 3 or 4 GB of a comparable python dict.  And that leaves us with a problem: We can't load the file on-demand because it takes too darn long, we can't keep it stored in an easily-accessible form in memory because it's too darn big. While we could keep it in CSV format in RAM, that's not a fomat we can do anything with - every query would need to iterate through the entire file, and that's sort of the worst of both worlds.




So what can we do?


First, we can filter data immediately upon load, and discard big chunks of it right away. This means the structure of the program must change from:

cal = load_table(calendar.txt)
rou = load_table(routes.txt)
tri = load_table(trips.txt)
sto = load_table(stops.txt)
tim = load_table(stop_times.txt)
svcid = valid_service_ids_for_today(cal)
trips = trips_that_stop_there_today(tri, svc_id, tim)

to something more like:

cal = load_table(calendar.txt)
rou = load_table(routes.txt)
tri = load_table(trips.txt)
sto = load_table(stops.txt)
tim = load_and_filter_stop_times_table(stop_times.txt, filter_criteria)
svcid = valid_service_ids_for_today(cal)
trips = trips_that_stop_there_today(tri, svc_id, tim)

Instead of loading the entire table and holding it in memory, a customized loader filters the data right away, and chuck out the 90% we don't actually want.


Second, we can use Python's read size parameter to prevent a spike in RAM usage (and consequent slowdown of the entire system).

For example, normally you read data with something like:

with gtfs_file.open('stop_times.txt', mode='r') as infile:
    input = infile.read()
output = do_stuff_to(input.decode())
return output


And we can rearrange it to something like:

eof    = False
output = {}
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    while eof == False:
        input_block  = infile.readlines(4096)
        if len(input_block) == 0:
            eof = True

        output_block = do_stuff_to(input_block)
        output.update(output_block)

return output

The EOF flag controls the process, and an empty read triggers the flag.
The input RAM requirements are tiny, since each block re-uses the same memory over and over. Only the filtered output grows.

The performance penalty of this kind of loading seems about 30% (10 seconds) longer than a single read(), but now the process can run happily in the background without slowing other processes. Filtering the huge file down to a manageable size and searchable format also make subsequent use of the data _much_ faster.




Working examples

Here's working sample code of a generic load_table(), which I called map_gtfs_table_to_dict()

Here's working sample code of a load_stop_times(), which I called map_stop_times_to_dict()


Let's use the generic sample code to load trips.txt from a CTA GTFS file:

>>> import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_trips = load_table(gtfs_file, 'trips.txt')
>>>
>>> len(all_trips)
90428
>>>
>>> for trip_id in list(all_trips)[0:5]:
...      print(trip_id, all_trips[trip_id])
... 
45072053995 {'route_id': 'Y', 'direction_id': '0', 'wheelchair_accessible': '1',
'service_id': '104509', 'direction': '0', 'shape_id': '304500033',
'block_id': '45007747580', 'schd_trip_id': 'R501'}
437076941189 {'route_id': '147', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'North', 'shape_id': '4374519',
'block_id': '437008159276', 'schd_trip_id': '76941189'}
437076941185 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008158984', 'schd_trip_id': '76941185'}
437076941184 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008159008', 'schd_trip_id': '76941184'}
437076941186 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008158986', 'schd_trip_id': '76941186'}
>>> 

There we are, all 90,000 trips in a few seconds. Each line of the GTFS table broken into a dict for easy access.



Now, let's try using the same generic load function to load stop_times.txt

>>> import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_stops = load_table(gtfs_file, 'stop_times.txt')
Traceback (most recent call last):
  File "", line 1, in 
  File "./test.py", line 56, in map_gtfs_table_to_dict
    line_dict[column] = line.split(',')[columns[column]].strip('" ')
MemoryError
>>> 

It took a few minutes to consume all that memory and fail, and everything else on the system slowed to a crawl.

Finally, let's use the custom load_stop_table() to load-and-filter stop_times.txt without reducing the rest of the system to a memory-starved crawl:

>>> import map_stop_times_to_dict as load_table
>>> import zipfile
>>>
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> stop_list = ['3661', '3698', '15026', '17433']
>>> filtered_stops = load_table(gtfs_file, stop_list)
4379
>>> for line in list(filtered_stops)[0:5]:
...     print(line, filtered_stops[line])
... 
4521987 {'stop_headsign': '79th Red Line', 'departure_time': '13:53:57',
'shape_dist_traveled': '22219', 'arrival_time': '13:53:57',
'pickup_type': '0', 'stop_id': '3661', 'stop_sequence': '35',
'trip_id': '437077075442'}
4538374 {'stop_headsign': '79th Red Line', 'departure_time': '15:04:01',
'shape_dist_traveled': '37847', 'arrival_time': '15:04:01',
'pickup_type': '0', 'stop_id': '3680', 'stop_sequence': '58',
'trip_id': '437077077228'}
4325384 {'stop_headsign': '127th/Lowe', 'departure_time': '17:26:00',
'shape_dist_traveled': '27644', 'arrival_time': '17:26:00',
'pickup_type': '0', 'stop_id': '5992', 'stop_sequence': '42',
'trip_id': '437077000628'}
475149 {'stop_headsign': '95th Red Line', 'departure_time': '18:01:18',
'shape_dist_traveled': '14028', 'arrival_time': '18:01:18',
'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '20',
'trip_id': '436073915060'}
4112403 {'stop_headsign': '95th Red Line', 'departure_time': '22:08:53',
'shape_dist_traveled': '29370', 'arrival_time': '22:08:53',
'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '42',
'trip_id': '437076995010'}
>>> 

It's still slow to load the entire file. On my equipment, I can load stop_times.txt in about 30 seconds this way. Varying the block size has some benefit, but you can play with that yourself.

In the final example, the 'line' variable -- the key to that subdict of one trip serving a stop -- is just a counter, the (approximate) CSV row number. stop_times.txt has no unique item on each line that can be used as a dict key.


Thursday, May 15, 2014

ssmtp to email cron output

Cron output in Ubuntu gets discarded. The default install of Ubuntu does not install a Mail Transport Agent (MTA), so cron cannot mail you the output.

You can install an MTA -several are easily available in the Ubuntu repositories- but configuration is not trivial. MTAs date from before the IMAP- and SMTP-based e-mail systems we use today, which makes them highly flexible and configurable...but not necessarily the easiest solution.

ssmtp is a limited-purpose MTA intended to replace the 'mail' command and send all output to an SMTP mailserver. ssmtp is provided by the 'ssmtp' package.

Install


sudo apt-get install ssmtp

Configuration file #1: /etc/ssmtp/ssmpt.conf

This tells ssmtp how to send email to my account.
It's not the best idea to store your e-mail password in world-readable plain text. See how to protect the password properly.

# Config file for sSMTP sendmail
#
# The person who gets all mail for userids < 1000
# Make this empty to disable rewriting.
root=me@example.com

# The place where the mail goes. The actual machine name is required no
# MX records are consulted. Commonly mailhosts are named mail.domain.com
mailhub=my_mail_server.net:465
AuthUser=me@example.com
AuthPass=my_email_password
UseTLS=Yes

# Where will the mail seem to come from?
rewriteDomain=example.com

# The full hostname
hostname=hey_its_my_computer

# Are users allowed to set their own From: address?
# YES - Allow the user to specify their own From: address
# NO - Use the system generated From: address
FromLineOverride=YES


Configuration File #2: /etc/ssmtp/

This directs root mail to my account.

# sSMTP aliases
#
# Format:       local_account:outgoing_address:mailhub
#
# Example: root:your_login@your.domain:mailhub.your.domain[:port]
# where [:port] is an optional port number that defaults to 25.
root:me@example.com:my_mail_server.net:465



And now cron output lands in my normal inbox.

Tuesday, April 1, 2014

tkinter clock

I did some playing around with tkinter after seeing this post.

Here is my solution: How to easily display a running clock (text, not graphical) using tkinter.



The Clock Class

A running clock requires two functions: One to create the tkinter widget, and one to continuously update the widget with the current time.

If we house these in a class, then creating a clock (or multiple clocks) gets much easier. All the complexity is hidden from the mainloop and the rest of your script.

A newly created Clock widget will *automatically* update itself. Both functions call tick(). Yeah, tick() calls itself upon completion. Cool little trick tkinter offers for precisely uses like this.

Here's an example class:

import tkinter
import time

class Clock():
    """ Class that contains the clock widget and clock refresh """

    def __init__(self, parent):
        """
        Create the clock widget
        It's an ordinary Label element
        """
        self.time = time.strftime('%H:%M:%S')
        self.widget = tkinter.Label(parent, text=self.time)
        self.widget.after(200, self.tick)      # Wait 200 ms, then run tick()


    def tick(self):
        """ Update the display clock """
        new_time = time.strftime('%H:%M:%S')
        if new_time != self.time:
            self.time = new_time
            self.widget.config(text=self.time)
        self.widget.after(200, self.tick)      # 200 =  millisecond delay
                                               #        before running tick() again 

And you implement it rather like this:

if __name__ == "__main__":
    """
    Create a tkinter window and populate it with elements
    One of those elements merely happens to include the clock.
    """

    # Generic window elements

    window = tkinter.Tk()
    frame  = tkinter.Frame(window, width=400, height=400 )
    frame.pack()

    # Add the frame elements, including the clock

    label1 = tkinter.Label(frame, text="Ordinary label")
    label1.pack()
    clock2  = Clock(frame)             # Create the clock widget
    clock2.widget.pack()               # Add the clock widget to frame
    label3 = tkinter.Label(frame, text="Ordinary label")
    label3.pack()

    window.mainloop()


We created and placed the clock2 widget almost like any other widget. We simply used the Clock class instead of the tkinter.Label class. There's no need to start() or stop() the clock widget - once created, it automatically updates itself.

We can place multiple clock widgets within the same frame without problems.




Advanced Placement

The Clock class works will all geometry managers, just like any widget:

    clock2.widget.pack()
    clock2.widget.grid(row=5,column=3)
    clock2.widget.place(x=150,y=200,anchor=center)

If we don't need to retain the 'clock' variable, or customize it's appearance, then we can simply place it like any other widget:

    tkinter.Label(frame, text="Ordinary label").pack()
    Clock(frame).widget.pack()





Customizing

One easy way to customize the clock's appearance is to use .configure
All Label configure options will work. The clock is simply a label.

    clock2 = Clock(frame)         # Create the clock widget
    clock2.widget.pack()          # Add the clock widget to frame
    clock2.widget.configure(bg='green',fg='blue',font=("helvetica",35))




Simplifying


There is one more simplification we can make, detailed as part of this tkinter example. We can change the class to inherit Label properties instead of creating a .widget attribute.If you're new to Python classes, the concept of inheriting from a class may take a few tries to wrap your brain around.

Attribute:

class Clock():
    def __init__(self, parent):
        self.widget = tkinter.Label(parent, text=self.time)

if __name__ == "__main__":
    foo()
    clock2.widget.do_something()
    more_foo()

Inherited:

class Clock3(tkinter.Label):
    def __init__(self, parent=None):
        tkinter.Label.__init__(self, parent)
        self.configure(text=self.time)

if __name__ == "__main__":
    foo()
    clock2.widget.do_something()
    more_foo()




Final result:


The final clock class looks like:

class Clock3(tkinter.Label):
    """ Class that contains the clock widget and clock refresh """

    def __init__(self, parent=None):
        tkinter.Label.__init__(self, parent)
        """
        Create and place the clock widget into the parent element
        It's an ordinary Label element.
        """
        self.time = time.strftime('%H:%M:%S')
        self.configure(text=self.time, bg='yellow')
        self.after(200, self.tick)


    def tick(self):
        """ Update the display clock every 200 milliseconds """
        new_time = time.strftime('%H:%M:%S')
        if new_time != self.time:
            self.time = new_time
            self.config(text=self.time)
        self.after(200, self.tick)

And you implement it rather like this:

if __name__ == "__main__":
    """
    Create a tkinter window and populate it with elements
    One of those elements merely happens to include the clock.
    """

    # Generic window elements

    window = tkinter.Tk()
    frame  = tkinter.Frame(window, width=400, height=400 )
    frame.pack()

    # Add the frame elements, including the clock

    label1 = tkinter.Label(frame, text="Ordinary label")
    label1.pack()
    clock2 = Clock(frame)             # Create the clock widget
    clock2.configure(bg='yellow')     # Customize the widget
    clock2.pack()                     # Add the clock widget to frame
    label3 = tkinter.Label(frame, text="Ordinary label")
    label3.pack()

    window.mainloop()

Wednesday, March 12, 2014

Importing GTFS files into SQLite

A great comment in my previous post about demystifying GTFS transit schedule data pointed out that the various files in a GTFS file are simply database tables. Each file can be imported into a relational database as a separate table, and queried using SQL instead of the custom scripts I used.

In fact, I found SQL to be faster and easier to maintain than the Python script...for a while. Eventually I rewrote my Python code, added a preprocessor, and found just as fast and easier to maintain than mixed Python/SQLite.
Nevertheless, thanks to Stefan, for the tip!

Here's a little more detail about exactly how to do it.

We will use the very simple, fast application SQLite for this, since our tables and queries will be rather simple and straightforward. Other possible databases include MongoDB and CouchDB. Indeed, for the very simple queries we used before, a series of good-old gdbm key-value databases could work.


Setup and importing GTFS tables into SQLite


In Ubuntu, installing SQLite3 is very simple:

sudo apt-get install sqlite3


Next, let's manually download the GTFS file for the Milwaukee County Transit System, uncompress it, create a new database, add a table to the database for the stops file, import stops file into the database, and save the database.


$ mkdir /home/me/GTFS                               # Create a working directory 
$ wget -O /home/me/GTFS/mcts.gtfs http://kamino.mcts.org/gtfs/google_transit.zip
                                                    # Download the GTFS file
$ unzip -d /home/me/GTFS /home/me/GTFS/mcts.gtfs    # Unzip the GTFS file
$ sqlite3

sqlite> attach /home/me/GTFS/mcts.db as mcts        # Create a new database
sqlite> create table stops(stop_id TEXT,stop_code TEXT,stop_name TEXT,
                           stop_desc TEXT,stop_lat REAL,stop_lon REAL,
                           zone_id NUMERIC,stop_url TEXT,timepoint NUMERIC);
sqlite> .separator ","                              # Tell SQLite that it's a CSV file
sqlite> .import /home/me/GTFS/stops.txt stops       # Import the file into a db table
sqlite> .dump                                       # Test the import
sqlite> delete from main.stops where stop_id like 'stop_id';  # Delete the header line
sqlite> select * from mcts.stops where stop_id == 5505;       # Test the import
sqlite> .backup mcts /home/me/GTFS/mcts.db          # Save the database
sqlite> .quit


Scripting imports

We can also script it. Here's a more robust script that creates multiple tables. The column names are explained on the first line of each table (which is why we must delete that line). The data types -TEXT, REAL, and NUMERIC-, and conversions from various Java, Python, C, and other datatypes are clearly explained in the SQLite documentation. The explanation of the field names, and expected datatype, is explained in the GTFS documentation. Each provider's GTFS file can include many optional fields, and may use different optional fields over time, so you are *likely* to need to tweak this script a bit to get it to work:

create table agency(agency_id TEXT,agency_name TEXT,agency_url TEXT,
                    agency_timezone TEXT,agency_lang TEXT, agency_phone TEXT);
create table calendar_dates(service_id TEXT,date NUMERIC,exception_type NUMERIC);
create table routes(route_id TEXT,agency_id TEXT,route_short_name TEXT,
                    route_long_name TEXT,route_desc TEXT,route_type NUMERIC,
                    route_url TEXT,route_color TEXT,route_text_color TEXT);
create table shapes(shape_id TEXT,shape_pt_lat REAL,shape_pt_lon REAL,
                    shape_pt_sequence NUMERIC);
create table stops(stop_id TEXT,stop_code TEXT,stop_name TEXT,
                   stop_desc TEXT,stop_lat REAL,stop_lon REAL,
                   zone_id NUMERIC,stop_url TEXT,timepoint NUMERIC);
create table stop_times(trip_id TEXT,arrival_time TEXT,departure_time TEXT,
                        stop_id TEXT,stop_sequence NUMERIC,stop_headsign TEXT,
                        pickup_type NUMERIC,drop_off_type NUMERIC);
create table trips(route_id TEXT,service_id TEXT,trip_id TEXT,
                   trip_headsign TEXT,direction_id NUMERIC,
                   block_id TEXT,shape_id TEXT);
.separator ','
.import /home/me/GTFS/agency.txt agency
.import /home/me/GTFS/calendar_dates.txt calendar_dates
.import /home/me/GTFS/routes.txt routes
.import /home/me/GTFS/shapes.txt shapes
.import /home/me/GTFS/stops.txt stops
.import /home/me/GTFS/stop_times.txt stop_times
.import /home/me/GTFS/trips.txt trips
delete from agency where agency_id like 'agency_id';
delete from calendar_dates where service_id like 'service_id';
delete from routes where route_id like 'route_id';
delete from shapes where shape_id like 'shape_id';
delete from stops where stop_id like 'stop_id';
delete from stop_times where trip_id like 'trip_id';
delete from trips where route_id like 'route_id';
select * from stops where stop_id == 5505;

And run that script using:

$ sqlite3 mcts.db < mcts_creator_script



Reading GTFS data from SQLite


Now, like in the previous GTFS post, let's find the next buses at the intersection of Howell and Oklahoma. There are four stops at that location: 658, 709, 5068, and 5152.


First, let's find the appropriate service codes for today's date:

# Query: The list of all service_ids for one date.
sqlite> SELECT service_id FROM calendar_dates WHERE date == 20140310;
14-MAR_CY-AON_0
[...long list...]
14-MAR_WN-PON_0


Scripting reads

These queries can also be scripted. Here's an example script that looks up the four stops we care about for a two-hour window:

-- Usage:  $ sqlite3 GTFS/mcts.db < GTFS/mcts_lookup.sh
-- Usage:  sqlite> .read GTFS/mcts_lookup.sh 

-- List of the valid service_id codes for the current date
CREATE VIEW valid_service_ids AS
   SELECT service_id 
   FROM calendar_dates 
   WHERE date == strftime('%Y%m%d', 'now', 'localtime')
   ;

SELECT stop_times.arrival_time, trips.route_id, trips.trip_headsign
   FROM trips, stop_times

   -- Match the trip_id field between the two tables
   WHERE stop_times.trip_id == trips.trip_id

   -- Limit selection to the stops we care about 
   AND stop_times.stop_id IN (658,709,5068,5152)

   -- Limit selection to service_ids for the correct day
   AND trips.service_id IN valid_service_ids

   -- Limit selection to the next hour from now
   AND stop_times.arrival_time > strftime(
                                 '%H:%M:%S', 'now', 'localtime', '-5 minutes')
   AND stop_times.arrival_time < strftime(
                                 '%H:%M:%S', 'now', 'localtime', '+1 hour')
   ORDER BY stop_times.arrival_time
   ;

-- Clean Up
DROP VIEW valid_service_ids;

And there are two ways to run the script:

sqlite&gt> .read lookup_script         # Within sqlite
$ sqlite3 mcts.db < lookup_script      # Shell script



Results

I found that importing GTFS files into SQLite requires a lot of memory and CPU...but is faster than Python, and the scripts are smaller and easier to maintain. SQLite is a good, fast processor or pre-processor.

File sizes:

  • mcts.gtfs: 5.6M
  • mcts.db: 79M
  • mtcs.gtfs unzipped files: 86M
I think that preprocessing or shrinking those files will be important for low-power or low-bandwidth applications.

Query times:

Here are the query times for buses within a two-hour window at Howell and Oklahoma:
  • Original Python: 13 sec
  • SQLite: 2.5 sec
  • Rewritten Python w/Preprocessor: 0.2 sec

Code size:

To do those queries,
The original Python 3 script: 206 lines.
SQLite script to import the GTFS file: 33 lines.
SQLite script to lookup the stops: 32 lines.

Friday, January 24, 2014

Transit schedule data demystified - using GTFS

General Transit Feed Specification (GTFS) is the Google-originated standard format for transit route, stop, trip, schedule, map, and fare data. Everything except realtime.

It's called a feed because it (usually) includes an RSS update for changes.
There are lists of feeds on the Google wiki, and on the separate GTFS data website.

Each organization's GTFS file includes all their services, so some agency files can get pretty big, and get updated often. Any schedule change or route adjustment means a new release of the entire GTFS file. The file itself is merely a big zipfile, containing several csv files that are strangely required to be mislabelled as .txt.

Here's the contents of Milwaukee County Transit System's GTFS file:

$ unzip -l mcts.zip 
Archive:  mcts.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      169  2014-01-10 05:01   agency.txt
    40136  2014-01-10 05:00   calendar_dates.txt
     5746  2014-01-10 05:01   routes.txt
   307300  2014-01-10 05:00   stops.txt
 35198135  2014-01-10 05:00   stop_times.txt
   650622  2014-01-10 05:01   trips.txt
  8369736  2014-01-10 05:01   shapes.txt
     3490  2014-01-10 05:01   terms_of_use.txt
---------                     -------
 44575334                     8 files

Yeah, 44MB unzipped.
But only 5MB zipped. Still not something you want to download every day to your phone.

Let's find a stop at Mitchell International Airport:

$ cat stops.txt | grep AIRPORT
7168,7168,AIRPORT,,  42.9460473, -87.9037345,,,1
7162,7162,AIRPORT & ARRIVALS TERMINAL,,  42.9469597, -87.9030569,,,0

It's right, there are two stops at the airport. Each stop has a latitude and longitude, a unique ID number, and a descriptive name. The final field designates a timepoint (1=Timepoint, 0=Not).

Let's try an intersection where two routes cross:

$ cat stops.txt | grep "HOWELL & OKLAHOMA"
709,709,HOWELL & OKLAHOMA,,  42.9882051, -87.9043319,,,1
658,658,HOWELL & OKLAHOMA,,  42.9885464, -87.9045333,,,1
$ cat stops.txt | grep "OKLAHOMA & HOWELL"
5152,5152,OKLAHOMA & HOWELL,,  42.9881561, -87.9046550,,,1
5068,5068,OKLAHOMA & HOWELL,,  42.9883466, -87.9041176,,,1

Here's a problem that will require some logic to solve. I consider the intersection to be one place (not a GTFS term). Many trips and routes can use the same stop. Multiple stops (GTFS terms) can exist at the same place. In this case, northbound, southbound, eastbound, and westbound buses each have a different stop at the same place.

This might make your job easier...or harder.

GTFS cares about trips and stops. It doesn't care that Stops #709 and #5152 are twenty meters apart, and serve different routes - that it's a transfer point. Nothing in GTFS explicitly links the two stops. Generally, you must figure out the logic to do that - you have the lat/lon and the name to work with.

GTFS does have an optional transfers.txt file, that fills in the preferred transfer locations for you. But that's for a more advanced exercise.


Let's see what stops at #709:

$ grep -m 5 ,709, stop_times.txt 
4819177_1560,06:21:00,06:21:00,709,         14,,0,0
4819179_1562,06:49:00,06:49:00,709,         14,,0,0
4819180_1563,07:02:00,07:02:00,709,         14,,0,0
4819181_1564,07:15:00,07:15:00,709,         14,,0,0
4819182_1565,07:28:00,07:28:00,709,         14,,0,0


These fields are trip_id, arrival_time, departure_time, and stop-sequence (14th).

Let's see the entire run of trip 4819177_1560:

$ grep 4819177_1560 stop_times.txt 
4819177_1560,06:09:00,06:09:00,7162,          2,,0,0  # Hey, look - stops out of sequence in the file
4819177_1560,06:09:00,06:09:00,7168,          1,,0,0  # Begin Trip
4819177_1560,06:11:00,06:11:00,7178,          3,,0,0
[...]
4819177_1560,06:20:00,06:20:00,8517,         13,,0,0
4819177_1560,06:21:00,06:21:00,709,         14,,0,0   # Howell & Oklahoma
4819177_1560,06:22:00,06:22:00,711,         15,,0,0
[...]
4819177_1560,07:17:00,07:17:00,1371,         66,,0,0
4819177_1560,07:19:00,07:19:00,6173,         67,,0,0
4819177_1560,07:20:00,07:20:00,7754,         68,,0,0  # End of trip 

We can also look up more information about trip 4819177_1560:

$ grep 4819177_1560 trips.txt 
  GRE,13-DEC_WK,4819177_1560,N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS,0,515111,13-DEC_GRE_0_12

This needs a little more explanation
  • route_id: Green Line (bus)
  • service_id (weekday/days-of-service): 13-DEC_WK
  • headsign: N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
  • direction_id (binary, 0 or 1): 0
  • block_id (useful only if the same bus changes routes): 515111
  • shape_id (useful for route maps): 13-DEC_GRE_0_12

Let's look up the route_id:

$ grep GRE routes.txt
  GRE,MCTS,  GRE,MetroEXpress GreenLine,,3,http://www.ridemcts.com/Routes-Schedules/Routes/GRE/,,

The full route name is MetroEXpress GreenLine, it's a bus (type-3 = bus) route, and we have the operator website for it.

Let's look up the service_id:

$ grep -m 10 13-DEC_WK calendar_dates.txt 
13-DEC_WK,20140113,1
13-DEC_WK,20140114,1
13-DEC_WK,20140115,1
13-DEC_WK,20140116,1
13-DEC_WK,20140117,1
13-DEC_WK,20140120,1
13-DEC_WK,20140121,1
13-DEC_WK,20140122,1
13-DEC_WK,20140123,1
13-DEC_WK,20140124,1

Ah, this specific trip is a weekday (Monday-Friday) only trip.


Let's look up the route map shapefile for the trip:

$ grep 13-DEC_GRE_0_12 shapes.txt 
13-DEC_GRE_0_12,  42.946054, -87.903810,10001
13-DEC_GRE_0_12,  42.946828, -87.903659,10002
13-DEC_GRE_0_12,  42.946824, -87.903588,10003
13-DEC_GRE_0_12,  42.946830, -87.903472,10004
[...]
13-DEC_GRE_0_12,  43.123137, -87.915431,670004
13-DEC_GRE_0_12,  43.123359, -87.915228,670005
13-DEC_GRE_0_12,  43.124016, -87.914535,670006
13-DEC_GRE_0_12,  43.124117, -87.914440,670007

The line for this trip has 520 points. That's pretty detailed.



So what do we know?

We know that Stop #709 is served by the GreenLine route, it's the 14th stop in direction 0, it's a bus line, we have all the times the stop is served, and we have the route website. We know the route map and all the other stops of any trip serving that stop.

How can we find the next scheduled bus at stop #709?

One way is to start with all trips that stop at #709 from stop_times.txt.

Since we probably know what time it is, we can filter out all the past times, and most of the future times. This leaves us with a nice, small list of, say, 10 possibles that include trips that don't run today at all (we must delve deeper to determine).

We can look up each of those trips in trips.txt, and get the route.

Each trip also includes a service_id code. The calendar_dates.txt file tells us which dates each service_id code is valid.

Right, we need to do three lookups.

The shell code gets a bit complex with three lookups, so I shifted to Python and wrote a basic next-vehicle-at-stop-lookup in about 160 lines. Python lists are handy, since it can handle all the stops at a location just as easily as a single stop. Python's zip module is also handy, so I can read data directly from the zipfile. But at 13 seconds, Python is probably too slow for this kind of application:

$ time ./next_bus.py 

Next departures from Howell & Okahoma
16:16   GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:22   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:26    51 OKLAHOMA - TO LAKE DRIVE
16:28    51 TO 124TH ST. - VIA OKLAHOMA
16:30   GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:35   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:43    51 TO 124TH ST. - VIA OKLAHOMA
16:44   GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:45    51 TO NEW YORK
16:45   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:56   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS

real 0m13.171s   # Ugh. If I had started 13 seconds sooner, I wouldn't be bored now.
user 0m10.740s
sys 0m0.260s



All that time crunching the GTFS file has not gone unnoticed.

Trip planners (like Google) pre-process the data, mapping out and caching link-node and transfer relationships, limiting the trip data to the next hour or two (as appropriate), and using rather fancy algorithms to prune the link-node map to a likely set of  possibilities before looking at trips along those links.

That's one reason Google Transit is much faster than 13 seconds.

But that's all advanced stuff.

Also advanced is how to integrate real-time data, which uses one of several different formats. Next time...



Sunday, January 5, 2014

Upstart Jobs at login

Login is not the same is startup. Let's just get that out of the way first.
  • Startup is the time between boot and the login screen. It's the habitat of system jobs.
  • Login is the time after you enter your password. It's the habitat of user jobs.

The easy way to run a task at login is to run a script from your .bashrc.
And the (deceptively not-) easy way to run a task at logout is to run a script from your .bash_logout

But today we're not doing it the easy way. Today we're going to use dbus and Upstart.

Emitting Upstart Signals from your .bashrc

It's terribly easy.

1) Emit a user-level Upstart signal by adding a line to .bashrc:

# Upstart signal that .bashrc is running
initctl emit "I_AM_LOGGING_IN"

2) Add a user-level Upstart job to ~.config/upstart/ for one user, or to /usr/share/upstart/sessions/ for all users:

# /home/$USER/.config/upstart/login_test.conf
description "login test"
start on I_AM_LOGGING_IN            # Start criteria
exec /bin/date > /tmp/login_test    # Do something

3) Open a new terminal window (to load the new .bashrc). When you open the window, the Upstart job creates the tempfile at /tmp/login_test.

Clean up: Restore your bashrc, and delete the sample Upstart job.

Can I emit system-level Upstart signals from .bashrc?

Not directly. The script runs as a user, not as root.

You can use a secondary method of triggering system-level Upstart signals, like sending a Dbus signal, or manipulating a file, or connecting to a socket.


Can I emit Upstart signals from .bash_logout?

No.

Using initctl emit in .bash_logout will merely result in an error. The user-level Upstart daemon seems to be terminated before .bash_logout is run. The command will return a cryptic "Rejected send message"error from PID 1 (system Upstart). Since .bash_logout is not running as root, it cannot emit system-level signals.

Also, GUI terminal programs do not not run .bash_logout, unless you specify compatibility (with a flag) when you start.

That easy way of doing login actions is still too hard

Boy, are you difficult to please.

Okay, there is an even easier way, but it's more complicated to explain: Instead of .bashrc emitting an Upstart event, let Upstart listen for a dbus signal.

Here is an example of the dbus message that occurs when I login via SSH to a new session. This signal is emitted by systend-logind every time a new TTY, ssh, or X-based GUI login occurs.

The signal is not emitted when you are in a GUI environment and simply open a terminal window - that's not a login, that's a spawn of your already-existing GUI environment:

signal sender=:1.3 -> dest=(null destination) serial=497 
  path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager;
  member=SessionNew
    string "4"
    object path "/org/freedesktop/login1/session/_34"


The important elements are the source, the "SessionNew" signal, and the path of the new session.

Aside, let's query systemd-logind to find if the login is to a TTY, X session, or SSH. logind has lots of useful information about each session:

$ dbus-send --system                              \ 
            --dest=org.freedesktop.login1         \
            --print-reply                         \
            --type=method_call                    \
            /org/freedesktop/login1/session/_34   \ # Path from the signal
            org.freedesktop.DBus.Properties.Get   \
            string:org.freedesktop.login1.Session \
            string:Service
method return sender=:1.3 -> dest=:1.211 reply_serial=2
   variant       string "sshd"

It's right. I did connect using ssh.

Now let's construct an Upstart job that runs when I login via a TTY, X Session, or SSH. We will use Upstart's built-in dbus listener.

# /home/$USER/.config/upstart/login_test.conf
description "login test"
start on dbus SIGNAL=SessionNew     # Listen for the dbus Signal
exec /bin/date > /tmp/login_test    # Do something

  • Now, whenever you login to a TTY, X session, or SSH session, the job will run.
  • If your job needs to tell the difference between those sessions, you know how to find out using dbus.
  • If *everybody* needs the job, place it in /usr/share/upstart/sessions/ instead of each user's .config/upstart/


What about super-easy logout jobs?

Logout jobs are harder, and generally not recommended. Not super-easy. They are hard because you can't guarantee they will run. Maybe the user will hold down the power button. Or use the "shutdown -h now" command. Or the power supply sent a message that the battery only has 60 seconds of life left. Or the user absolutely cannot miss that bus....

Here's the dbus signal that systemd-logind emits when a TTY, X, or SSH user session ends:

signal sender=:1.21 -> dest=(null destination) serial=286 
  path=/org/freedesktop/Accounts/User1000; 
  interface=org.freedesktop.Accounts.User; member=Changed

All this tells me is that User1000 now has a different number of sessions running. Maybe it's a login (yes, it emits the same signal upon login). Maybe it's a logout.

Sure, we can do a login-and-logout Upstart job...

# /home/$USER/.config/upstart/login_test.conf
description "login and logout test"
start on dbus SIGNAL=Changed INTERFACE=org.freedesktop.Accounts.User
exec /bin/date > /tmp/login_test

...but then you need logic to figure out who logged in or logged out, and whether it's an event you care about. Certainly doable, but probably not worthwhile for most users.

In other words, if you want to backup-at-logout, you need to structure it as a backup-then-logout sequence. Logout is not an appropriate trigger to start the sequence...from the system's point of view.

But I really want to do a job a logout!

Okay, here's how to do a job when you log out of the GUI environment. Logging out is the trigger. This won't work for SSH or TTY sessions.

The Upstart jobs /etc/init/lightdm.conf and/etc/init/gdm.conf emit a system-level "desktop-shutdown" signal when the X server is stopped. You can use that job as your start criteria.

# /etc/init/logoff_test.conf
description "logout test"
setuid some_username        # Your script probably doesn't need to run as root
start on stopping lightdm   # Run *before* it is stopped
exec /bin/date > /tmp/logoff_test



Friday, January 3, 2014

Searching for the right Upstart signal or job

If you want to use Upstart to start/stop a job on any of the not-obvious triggers (like "startup"), then you need to do some digging to find the right trigger.


initctl show-config


Be careful, there are TWO sets of Upstart jobs: System jobs and user jobs. Use sudo to distinguish between them.

$ sudo initctl show-config dbus   # Use sudo for system-level jobs
dbus
  start on local-filesystems
  stop on deconfiguring-networking

$ initctl show-config dbus        # Omit sudo for user-level jobs
dbus
  start on starting xsession-init 


Searching for a job or a signal using grep


The initctl show-config command without any job name prints all the jobs. That means you can use grep on the full list. Here is an example of using grep to look for all root jobs that care about the system "startup" signal:

$ sudo initctl show-config | grep -B8 startup
  start on (starting mountall or (runlevel [016] and ((desktop-shutdown or stopped xdm) or stopped uxlaunch)))
resolvconf
  start on mounted MOUNTPOINT=/run
  stop on runlevel [06]
ssh
  start on runlevel [2345]
  stop on runlevel [!2345]
udev-fallback-graphics
  start on (startup and (((graphics-device-added PRIMARY_DEVICE_FOR_DISPLAY=1 or drm-device-added PRIMARY_DEVICE_FOR_DISPLAY=1) or stopped udevtrigger) or container))
--
mountall
  emits virtual-filesystems
  emits local-filesystems
  emits remote-filesystems
  emits all-swaps
  emits filesystem
  emits mounting
  emits mounted
  start on startup
--
acpid
  start on runlevel [2345]
  stop on runlevel [!2345]
checkfs.sh
  start on mounted MOUNTPOINT=/
checkroot-bootclean.sh
  start on mounted MOUNTPOINT=/
kmod
  start on (startup and started udev)
--
  start on runlevel S
  stop on runlevel [!S]
wait-for-state
  stop on (started $WAIT_FOR or stopped $WAIT_FOR)
flush-early-job-log
  start on filesystem
friendly-recovery
  emits recovery
  emits startup
--
  start on runlevel [2345]
  stop on runlevel [!2345]
socket-test
  start on socket PROTO=inet PORT=34567 ADDR=127.0.0.1
tty2
  start on (runlevel [23] and ((not-container or container CONTAINER=lxc) or container CONTAINER=lxc-libvirt))
  stop on runlevel [!23]
udevtrigger
  start on ((startup and started udev) and not-container)
--
  emits not-container
  start on mounted MOUNTPOINT=/run
mounted-dev
  start on mounted MOUNTPOINT=/dev
tty3
  start on (runlevel [23] and ((not-container or container CONTAINER=lxc) or container CONTAINER=lxc-libvirt))
  stop on runlevel [!23]
udev-finish
  start on ((((startup and filesystem) and started udev) and stopped udevtrigger) and stopped udevmonitor)
alsa-state
  start on runlevel [2345]
cryptdisks-udev
  start on block-device-added ID_FS_USAGE=crypto
hostname
  start on startup
--
network-interface
  emits net-device-up
  emits net-device-down
  emits static-network-up
  start on net-device-added
  stop on net-device-removed INTERFACE=$INTERFACE
plymouth-ready
  emits plymouth-ready
  start on (startup or started plymouth-splash)
--
  start on (started plymouth and ((graphics-device-added PRIMARY_DEVICE_FOR_DISPLAY=1 or drm-device-added PRIMARY_DEVICE_FOR_DISPLAY=1) or stopped udev-fallback-graphics))
plymouth-upstart-bridge
  start on (started dbus or runlevel [06])
  stop on stopping plymouth
tty1
  start on (stopped rc RUNLEVEL=[2345] and ((not-container or container CONTAINER=lxc) or container CONTAINER=lxc-libvirt))
  stop on runlevel [!2345]
udevmonitor
  start on (startup and starting udevtrigger)

We found one job that emits startup (friendly-recovery).
We found seven jobs that listen for it: udev-fallback-graphics, mountall, kmod, udevtrigger, hostname, plymouth-ready, and udevmonitor


Searching for a signal using upstart-monitor


The upstart-monitor application is a handy GUI and command-line tool to listen to all the signal chatter in Upstart. The application is provided by the upstart-monitor package in the Ubuntu repositories. A bug in 13.10 prevents it from running on a non-GUI system like Ubuntu Server, but it's also easy to fix the bug yourself...

Here are the signals emitted by Upstart when I switch over to a TTY, login, wait ten seconds, and then logout. This isn't an example of monitoring logins (do that using consolekit or logind) - this is an example of monitoring the Upstart signals emitted by a change in tty2.

$ upstart-monitor --no-gui --destination=system-bus
# Upstart Event Monitor (console mode)
#
# Connected to D-Bus system bus
#
# Columns: time, event and environment

2014-01-03 23:23:43.013436 stopping JOB='tty2' INSTANCE='' RESULT='ok'
2014-01-03 23:23:43.020309 starting JOB='tty2' INSTANCE=''
2014-01-03 23:23:43.031193 starting JOB='startpar-bridge' INSTANCE='tty2--started'
2014-01-03 23:23:43.033055 started JOB='startpar-bridge' INSTANCE='tty2--started'
2014-01-03 23:23:43.040671 stopping JOB='startpar-bridge' INSTANCE='tty2--started' RESULT='ok'
2014-01-03 23:23:43.042496 stopped JOB='startpar-bridge' INSTANCE='tty2--started' RESULT='ok'
2014-01-03 23:23:43.044271 started JOB='tty2' INSTANCE=''
^C

You can see the progression of signals: starting, started, stopping, stopped.
You can also see the nesting of jobs. startpar-bridge starts on starting tty2, and runs it's entire task of starting-started-stopping-stopped for tty2 to transition from starting to started.

If you want to trigger a job when tty2 is starting or started, you now know the signals that get emitted. Your job can listen for those signals.


Drawing out relationships using dotfiles


Dot diagram of Upstart user jobs
The initctl2dot application creates dotfile graphics of xdot application. initctl2dot is included with the upstart package, part of all Ubuntu installations (even ubuntu-minimal). xdot is a separate package available in the Ubuntu repositories (Software Center).


As the name implies, initctl2dot's input is initctl's output.You can manually trim an initctl show-show-config output, and input that to initctl2dot if you really want a specific diagram.

You can easily diagram and display the entire system job tree...though it's perhaps less useful than you may expect:

$ initctl2dot --system --outfile /tmp/upstart_root_tree.dot
$ xdot /tmp/upstart_root_tree.dot


You can also diagram the user job tree:

$ initctl2dot --user --outfile /tmp/upstart_user_tree.dot
$ xdot /tmp/upstart_user_tree.dot

Limiting the dotfile size


The initctl2dot manpage includes options for showing/hiding various relationship types (emit, start on, stop on, etc) for clarity.

Another handy option is the --restrict-to-jobs flag, to draw much smaller charts.

For example, let's diagram the system "startup" signal relationships we already discovered using grep:

$ initctl2dot --system --outfile /tmp/upstart_startup_tree.dot \
              --restrict-to-jobs=friendly-recovery,udev-fallback-graphics,\
                                 mountall,kmod,udevtrigger,hostname,\
                                 plymouth-ready,udevmonitor 
$ xdot /tmp/upstart_startup_tree.dot 


And there you have it. How to search system jobs and user jobs for useful signals, and how to easily diagram out the relationships among signals and jobs.