Archive for technology

CSW and repository thoughts

CSW allows for querying various metadata models (e.g. Dublin Core, ISO).  In pycsw, our current model is to manage one repository per metadata model (or ‘typename’ in CSW speak).  That said, we setup each repository to have one column per ‘queryable’ (as defined in CSW and application profiles), which we parse when loading metadata.  We also store the full metadata record as is (for GetRecords ElementSetName=’full’ requests).

Complexity increases as we start thinking about support for more information models, and transforming to/from requested information models (via CSW GetRecords/GetRecordById ‘outputSchema’ parameter).  Having said this, I’ve started to think about a core, agnostic information model which any metadata format could map to (for lowest common denominator).  This way, pycsw will always know the core information model queryables, which could be stored in columns as we currently do now.  The underlying queries would always query against the queryable columns.  Aside: it would be great to have a GDAL for metadata (MDAL anyone?).

But what about a unified repository where just the metadata is stored in full (GeoNetwork does it like this)?  In this scenario, we would need heavy use of XPath queries on the full XML document in realtime.  The advantage would be a.) less parsing on metadata loading b.) one repository is always loaded/queried c.) less configuration for the catalog administrator.

I like the use of XPath, but wonder about how this scales as additional databases are supported.  We currently support SQLite, which is great for simplicity (and Python SQLite bindings allow for mapping Python functions).  SQLite has no XPath support (but we could support this with Python bindings).  PostgreSQL does (if you build with libxml2), as does MySQL.  As well, I’m not sure about the performance implications (and how deep XPath queries are in the database fetching, i.e. the entire XML document would have to be serialized before XPath queries are executed).

Thoughts on a Friday morning.  Anyone have any advice/insight?

 

validating XML requests with Python and lxml

While working on pycsw, we found that there was a significant amount of code involved in processing the HTTP POST requests coming across as XML.  Since lxml is used as for XML support, why not use its native XML validation facilities?  We implemented this rather quickly, but found validation was taking up to 10 seconds.  Why?

In lxml, you have to specify an XML Schema to parse against, even if it is specified in xsi:schemaLocation.  Being a purist, I set this to fetch the schema on the fly from http://schemas.opengis.net.  The fetch was causing much of the bottleneck, so I decided to download all required OGC CSW schemas locally and have them as part of the implementation.  That should work right?  Validation was down to about 6 seconds.

The issue here was that even though the schemas were local, many xs:import definitions within them were pointing back to absolute URLs at schemas.opengis.net.  After modifying the schemas to point to relative locations, validation was extremely fast (way under a second).

Lesson learned: just because XML schemas are local, doesn’t mean they don’t point to remote URLs (though I’m not exactly sure why one would build a schema with non-local imports if they don’t have to).

help wanted: baking a CSW server in Python

Seemingly buried in geospatial metadata and discovery, I’ve been developing a my share of CSW/ISO/Dublin Core parsers, generators and clients.  OWSLib is able to interact with CSW servers, handling csw:Record, ISO 19139:2007, as well as DIF.  OWSLib is also the underlying library used by the NextGIS folks in developing a QGIS CSW Client (big thanks to Maxim and Alex for contributing the code back to qgcsw).  I’ve also used genshi to generate ISO 19139:2007 and North American Profile.

Part of this adventure has involved testing these metadata within various OGC CSW server implementations.  What I quickly noticed is that many foss4g CSW servers are written in Java.  Wouldn’t it be great to have a trimmer CSW server in Python?  Which can be used easily with an existing Apache install type of thing?

Enter pycsw.  I started with the following goals:

  • lightweight and easy to stand up: a standalone catalogue, no GUI or metadata editing front end, designed for the use case of exposing ready-to-go metadata (files or in existing DB) through a CSW interface, with as little heavy lifting as possible.  Plug and play
  • extensible: the ability to add metadata formats and mapping them to a common information model and core / additional queryables
  • OGC compliant: against the CITE test assertions

Technology bits (thanks to Sean for the initial inspiration):

  • Python: code is written as CGI for now.  Welcome to ideas for WSGI, etc.
  • Database:  SQLite3 is used as the underlying database.  No reason why things couldn’t be abstracted enough to handle other DB’s
  • DB API: SQLAlchemy makes it easy to bind database models to Python classes, and especially easy to do transparent queries
  • XML: lxml is used to parse requests, traverse XPath nodes and marshall responses.  lxml’s Schematron support will make it easy for Harvest/Transaction operations / validation
  • Spatial predicates: I originally supported ogc:BBOX, which is easy enough to code by hand.  Shapely gives access to the full suite of predicates, and will be the way forward

Progress: I’m using the OGC CITE tests here as the benchmark.  So far it passes 91/103 assertions.

Todo:

  • fully pass the CITE assertions
  • support of ISO Application Profile
  • firm up core information model to allow easier extensibility
  • fix spatial queries to fully use Shapely
  • harmonize GetRecords and GetRecordById response handler (for writing out csw:Record)
  • documentation: install / setup / configuration / testing

pycsw is up on Sourceforge and is open source.  It would be great to have more hands here.  If you are interested, and enjoy contributing to foss4g, don’t hesitate and get in touch!

OWSLib CSW action

(sorry, I’ve been busy in the last few months, hence no blogposts)

Update: almost at the same time I originally posted this, Sean set me up on the http://gispython.org blog, so I’ve moved this post there.

Geoprocessing with OGR and PyWPS

PyWPS is a neat Python package supporting the OGC Web Processing Service standard.  Basic setup and configuration can be found in the documentation, or Tim’s useful post.

I’ve been working on a demo to expose the OGR Python bindings for geoprocessing (buffer, centroid, etc.).

Here’s an example process to buffer a geometry (input as WKT), and output either GML, JSON, or WKT:

from pywps.Process import WPSProcess
import osgeo.ogr as ogr

class Buffer(WPSProcess):
 def __init__(self):
  WPSProcess.__init__(self,
  identifier='buffer',
  title='Buffer generator',
  metadata=['http://www.kralidis.ca/'],
  profile='OGR / GEOS geoprocessing',
  abstract='Buffer generator',
  version='0.0.1',
  storeSupported='true',
  statusSupported='true')

  self.wkt = self.addLiteralInput(identifier='wkt', \
   title='Well Known Text', type=type('string'))
  self.format = self.addLiteralInput(identifier='format', \
   title='Output format', type=type('string'))
  self.buffer = self.addLiteralInput(identifier='buffer', \
   title='Buffer Value', type=type(1))
  self.out = self.addLiteralOutput(identifier='output', \
   title='Buffered Feature', type=type('string'))

 def execute(self):
  buffer = ogr.CreateGeometryFromWkt( \
   self.wkt.getValue()).Buffer(self.buffer.getValue())
  self.out.setValue(_genOutputFormat(buffer, self.format.getValue()))
  buffer.Destroy()
 def _setNamespace(xml, prefix, uri):
  return xml.replace('>', ' xmlns:%s="%s">' % (prefix, uri), 1)
 def _genOutputFormat(geom, format):
 if format == 'gml':
  return _setNamespace(geom.ExportToGML(), 'gml', \
     'http://www.opengis.net/gml')
 if format == 'json':
  return geom.ExportToJson()
 if format == 'wkt':
  return geom.ExportToWkt()

Notes:

  • _setNamespace is a workaround, as OGR’s ExportToGML doesn’t declare a namespace prefix / uri in the output, which would make the ExecuteResponse XML choke parsers
  • _genOutputFormat is a utility method, which can be applied to any OGR geometry object

As you can see, very easy to pull off, integrates and extends easy.  Kudos to the OGR and PyWPS teams!

Displaying GRIB data with MapServer

I recently had the opportunity to prototype WMS visualization of meteorological data.  MapServer, GDAL and Python to the rescue!  Here are the steps I took to make it happen.

The data, (GRIB), is a GDAL supported format, so MapServer can handle processing as a result.  The goal here was to create a LAYER object.  First thing was to figure out the projection, then figure out the band pixel values/ranges and correlate to MapServer classes (in this case I just used a simple greyscale approach).

Here’s the hack:

import sys
import osgeo.gdal as gdal
import osgeo.osr as osr

if len(sys.argv) < 3:
 print 'Usage: %s <file> <numclasses>' % sys.argv[0]
 sys.exit(1)

cvr = 256  # range of RGB values
numclasses = int(sys.argv[2])  # number of classifiers

ds = gdal.Open(sys.argv[1])

# get proj4 def and write out PROJECTION object
p = osr.SpatialReference()
s = p.ImportFromWkt(ds.GetProjection())
p2 = p.ExportToProj4().split()

print '  PROJECTION'
for i in p2:
 print '   "%s"' % i.replace('+','')
print '  END'

# get band pixel data ranges and classify
band = ds.GetRasterBand(1)
min = band.GetMinimum()
max = band.GetMaximum()

if min is None or max is None:  # compute automagically
 (min, max) = band.ComputeRasterMinMax(1)

# calculate range of pixel values
pixel_value_range = float(max - min)
# calculate the intervals of values based on classes specified
pixel_interval = pixel_value_range / numclasses
# calculate the intervals of color values
color_interval = (pixel_interval * cvr) / pixel_value_range

for i in range(numclasses):
 print '''  CLASS
  NAME "%.2f to %.2f"
  EXPRESSION ([pixel] >= %.2f AND [pixel] < %.2f)
  STYLE
   COLOR %s %s %s
  END
 END''' % (min, min+pixel_interval, min, min+pixel_interval, cvr, cvr, cvr)
 min += pixel_interval
 cvr -= int(color_interval)

Running this script outputs various bits for MapServer mapfile configuration.  Passing more classes to the script creates more CLASS objects, resulting in a smoother looking image.

Here’s an example GetMap request:

Meteorological data in GRIB format via MapServer WMS

Users can query to obtain pixel values (water temperature in this case) via GetFeatureInfo.  Given that these are produced frequently, we can use the WMS GetMap TIME parameter to create time series maps of the models.

OWSLib CSW Updates and Implementation Thoughts

I’ve had some time to work on CSW support in OWSLib in the last few days.  Some thoughts and updates:

FGDC Support Added

Some CSW endpoints out there serve up GetRecords responses in FGDC CSDGM format.  This has now been added to trunk (mandatory elements + eainfo).  Note that csw:Record (DMCI + ows:BoundingBox) and ISO 19139 are already supported.  One tricky bit here is that FGDC was/is mostly implemented without a namespace, which CSW requires as an outputSchema parameter value.  I’ve used http://www.fgdc.gov for now.

Both FGDC and ISO (moreso ISO) have deep and complex content models, so if there are elements that you don’t see supported in OWSLib, please file an feature request ticket in trac, and I’ll make sure to implement (and add to the doctests).

Metadata Identifiers are Important!

When parsing GetRecords requests, we store records in a Python dict, using /gmd:MD_Metadata/gmd:fileIdentifier (for ISO), /csw:Record/dc:identifier (for CSW’s baseline) or /metadata/idinfo/datasetid (for FGDC).  Some responses return metadata without these ids for whatever reason.  This sets the dict key to Python’s None type, which ends up overwriting the dict key’s entry.  Not good.  I implemented a fix to set a random, non-persistent identifier so as not to lose data.

Of course, the best solution here would be for providers to set identifiers accordingly from the start.  Then again, what happens when CSW endpoints harvest other CSWs and identifiers are the same?  Perhaps a namespace of some sort for the CSW would help.  This would be an interesting interoperability experiment.

Harvest Support Added

I implemented a first pass of supporting Harvest operations.  CSW endpoints usually require authentication here, which may vary by the implementation.  Therefore, I’ve left this logic out of OWSLib as it’s not part of the standard per se.

Transaction Support Coming

Sebastian Benthall of OpenGeo has indicated that they are using OWSLib’s CSW support for some of their projects (awesome!), and has kindly submitted a patch for an optional element issue (thanks Sebastian!).  He also indicated that Transaction support would be of interest, so I’ve started to think about this one.  As with the Harvest operation, Transaction support also requires some sort of authentication, which we’ll leave to the client implementation.  Most of the work will be with marshalling the request, as response handling is very similar to Harvest responses.

Give it a go, submit bugs and enhancements to the OWSLib trac.  Enjoy!

Batch Centroid Calculations with Python and OGR

I recently had a question on how to do batch centroid calculations against GIS data. OGR to the rescue again!

Using OGR’s Python bindings (GDAL/OGR needs to be built –with-geos=yes), one can process, say, an ESRI Shapefile, and calculate a centroid for each feature.

The script below does exactly this, and writes out a new dataset (any input / output format supported by OGR).

import sys
import osgeo.ogr as ogr

# process args
if len(sys.argv) < 4:
 print 'Usage: %s <format> <input> <output>' % sys.argv[0]
 sys.exit(1)

# open input file
dataset_in = ogr.Open(sys.argv[2])
if dataset_in is None:
 print 'Open failed.\n'
 sys.exit(2)

layer_in = dataset_in.GetLayer(0)
feature_in = layer_in.GetNextFeature()

# create output
driver_out = ogr.GetDriverByName(sys.argv[1])
if driver_out is None:
 print '%s driver not available.\n' % sys.argv[1]
 sys.exit(3)

dataset_out = driver_out.CreateDataSource(sys.argv[3])
if dataset_out is None:
 print 'Creation of output file failed.\n'
 sys.exit(4)

layer_out = dataset_out.CreateLayer(sys.argv[3], None, ogr.wkbPoint)
if layer_out is None:
 print 'Layer creation failed.\n'
 sys.exit(5)

# setup attributes
feature_in_defn = layer_in.GetLayerDefn()

for i in range(feature_in_defn.GetFieldCount()):
 field_def = feature_in_defn.GetFieldDefn(i)
 if layer_out.CreateField(field_def) != 0:
  print 'Creating %s field failed.\n' % field_def.GetNameRef()

layer_in.ResetReading()
feature_in = layer_in.GetNextFeature()

# loop over input features, calculate centroid and output features
while feature_in is not None:
 feature_out_defn = layer_out.GetLayerDefn()
 feature_out = ogr.Feature(feature_out_defn)
 for i in range(feature_out_defn.GetFieldCount()):
  feature_out.SetField(feature_out_defn.GetFieldDefn(i).GetNameRef(), \
  feature_in.GetField(i))
  geom = feature_in.GetGeometryRef()
  centroid = geom.Centroid()
  feature_out.SetGeometry(centroid)
  if layer_out.CreateFeature(feature_out) != 0:
   print 'Failed to create feature.\n'
   sys.exit(6)
  feature_in = layer_in.GetNextFeature()

# cleanup
dataset_in.Destroy()
dataset_out.Destroy()

easy CSW with eXcat

If you have existing metadata and simply want a CSW interface as a means to search and discover your geospatial metadata, eXcat provides a simple solution.  Following the installation steps, it’s quite simple to populate your CSW:

$ cd excat/csw/WEB-INF/harvest
$ tar zxf my_metadata_files.tgz
$ for i in *.xml
> do
> lwp-download "http://localhost/excat/csw?request=Harvest&service=CSW&\
> version=2.0.2\&namespace=xmlns(csw=http://www.opengis.net/cat/csw)&\
> source=$i&resourceFormat=application/xml\
> &resourceType=http://www.isotc211.org/2005/gmd"
> done

That’s pretty much it.  Lightweight simple approach, particularly for those who already have metadata management tools in place and need CSW.

/me thinks it would be nice to have a Python port of this for those in Java-less environments (why does it seem most CSW server implementations are in Java?)

NYC Sprint is Upon Us

Building on the Toronto Code Sprint 2009 (I had the honour of helping Paul set this up), this year MapServer, GDAL, PostGIS, etc. devs are headed to to the Big Apple for the New York Code Sprint 2010.  Having participated in last year’s event, I can say that it is a fun, spirited and productive event.  Though I won’t be able to make it there in person this year, I will be among those ‘present in spirit’ on #tosprint over the weekend.

Keep an eye on Paul’s blog for sprint updates.  Have fun guys!

Modified: 19 February 2010 15:28:07 EST