Tag Archives: sqlite

pyspatialite – spatial queries in python, built on sqlite3

I’ve been doing some work with spatial queries recently, and I really didn’t feel like setting up a full blown PostGIS setup, which is the “traditional” way, nor did I really feel like going and trying out these trendy NoSQL stores, like CouchDb and MongoDb, both of which apparently now support spatial work. Then I heard about pyspatialite.

I like python, and sqlite is lovely all those times you want to stuff a bunch of data somewhere, and not have to think too much about it. But, pyspatialite has virtually zero documentation, other than the hint that’s basically just a sqlite3 “standard” api on pyspatialite. Ok, well, let’s give it a go anyway, using Ubuntu 10.04,

Simple installation via sudo [easy_install pyspatialite|pip install pyspatialite] will fail miserably. The distribution includes the C for spatialite, and builds it’s own extension. I have no particular opinion on that either way but it feels a bit rough around the edges. Anyway, on a clean Ubuntu, you’ll need at least the following three packages, libgeos-dev libproj-dev python-dev Try and reinstall and now you should be good to go!

Ok, but now what? As there’s no real python examples, you need to follow the raw spatialite examples. I found these ok, but I wasn’t thrilled. Too much time spent with WKB and hex dumps of geometry and explaining sql.

So, here’s a couple of examples that actually use python, and some notes I found awkward. (I’m not a geo expert, and these are just some of the ways I’m doing it.)

First, actually creating some data, you can run these commands via pyspatialite, or from the command line. (For the command line tool, spatialite you can install spatialite-bin on Ubuntu 10.04, it’s not exactly the same version as comes built into pyspatialite, but it seems to work ok. It’s virtually identical to running sqlite3 from the command line)

                            name text NOT NULL, TYPE text, geometry BLOB NOT NULL);
SELECT RecoverGeometryColumn('bars', 'geometry', 4326, 'POINT', 2);

This will create a table, and create some magic behind the scenes to tell spatialite that the column named “geometry” in the “bars” table contains geometry data, which are POINTs, in the spatial reference id 4326 (good ol WGS84) POINTs could be something else, have a good read on OGC well known text stuff if you want to get into this. You can also create the geometry column by adding the column to an existing table, but I found that actually harder to understand. However, for completeness, here’s the same table as above, created the other way….

                            name text NOT NULL, TYPE text);
SELECT AddGeometryColumn('bars', 'geometry', 4326, 'POINT', 2);

Your choice. Anyway, moving on. That’s just a table, we want to put some data in…

from pyspatialite import dbapi2 as sqlite3
conn = sqlite3.connect("/home/karl/src/karltemp.sqlite")
c = conn.cursor()
# some data from somewhere....
bar = Bar(name="the pig and whistle", type="pub", lat="64.123", lon="-21.456")
c.execute("""insert into bars (name, type, geometry) 
             values (?, ?, ?, geomFromText('POINT(%f %f)', 4326))""" % (bar.lon, bar.lat),
             (bar.name, bar.type))

Well, that’s not as pretty as we’d like. See how we used both ? substitution and python % substitution? geomFromText() is a function provided by spatialite for making the right sort of background magic happen to put spatial data into that blob column we defined on our table. It seems sqlite doesn’t know how to put params inside the quoted strings. Oh well. sucks to be us. But we move on and insert a bunch of data.

Now, to use it, first, selecting the nearest 10 bars.

from pyspatialite import dbapi2 as sqlite3
conn = sqlite3.connect("/home/karl/src/karltemp.sqlite")
conn.row_factory = sqlite3.Row
c = conn.cursor()
lon = -21.9385482
lat =  64.1475511
c.execute("""select b.name, b.type, 
                              geomfromtext('point(%f %f)', 4326))
                  as distance from bars b
                  order by distance desc limit 10""" % (lon, lat))
for row in c:
    log.info("bar: %s, type: %s, distance: %f", row['name'], row['type'], row['distance'])

4326 is the EPSG SRID for WGS84. Did that sentence mean nothing to you? You’re not alone. Suffice to say, if you want to work with data all over the world, and share your data with people who know even less about this, just stick with WGS84. (FIXME? Actually, after playing with this a bit more, this seems to have NO affect whatsoever for me)

Firstly, again note that we can’t use regular sql parameter escapes via the ? character. Secondly, this actually has terrible performance. The order by distance, even if limited to 10, means that it has to actually calculate the distance for all the bars in the table, then throw away all but the last 10 results.

Thirdly, the distances returned, what are those in? A little bit of probing with some fake data, and it seems that at least how I’m using it so far, this is NOT geospatial at ALL! This is purely cartesian!

spatialite> select distance(geomfromtext('point(60 60)', 4326), geomfromtext('point(59 60)', 4326));
spatialite> select distance(geomfromtext('point(60 0)', 4326), geomfromtext('point(59 0)', 4326));

I’m clearly doing something wrong. The distance between 60N 60E and 60N 59E is most definitely NOT the same distance as between 0N 60E and 0N 59E. It is however exactly 1 unit on a cartesian plane.

So, this post will have to be continued! Even if it’s only cartesian, if it does the indexes right, this could still be very useful, stay tuned, or, if you know what I’m missing, please feel very free to comment :)

Database size, column types, and retention policies

So, earlier I mentioned that I had a 700meg database of samples, and wasn’t even looking at them.

I started trying to look at them, and quickly had to add some indexes, ballooning the db out to about 950meg. This was clearly going in the wrong direction.

I looked at the table schema, and quickly realised that keeping the sensor sample time (a unix timestamp) as a string instead of a number was pretty wasteful, and likewise for the sensor type and node id. Although both of those can be given meaningful names, all firmware on the nodes have those as numbers as well, because it has to go over the wire(less)

So, simply changing the database schema from some string columns to some integer columns, and my calculator tells me I’ve gone from 90 bytes per sample, down to 32 bytes per sample.

Much better. And now that the date sample is a proper number, searching the db for date windows is much faster, without even needing the index.

And for added deliciousness, the python consumer, that writes into this database didn’t have to change at all, is it a string, is it an integer? It’s both!

That’s all well and good, but it’s still going to just keep on growing forever and ever, and with samples every 5 seconds, and (currently) 3 nodes with 2-3 sensors on each one, this isn’t going to slow down.

Do I really really care about that resolution of data much longer than a few days/weeks/month in the past? Shouldn’t I really just be averaging the older data out to get some general trends? Perhaps :) And that’s exactly what RRDtool is for, I just find it a little obtuse to use at times. (Mostly, I have to decide, permanently and forever, right now, exactly how long I do want to keep things. I didn’t want to have to decide that)

Footnote: For the curious, the schemas for sqlite3 were

CREATE TABLE karlnet_sensor (sampleTime text, node text, sensorType text, sensorRaw real, sensorValue real);
CREATE TABLE karlnet_sensor2 (sampleTime integer, node integer, sensorType integer, sensorRaw real, sensorValue real);

And 32bytes per sample, with 4.7 million samples for the last three months, is still 150meg (I tossed some old data, when the node firmware was all in flux anyway)