It’s been almost two years since GeoHealthCheck was initially developed (en route to FOSS4G in PDX). Since then, GHC has been deployed in numerous environments in support of monitoring of (primarily) OGC services (canonical demo at http://geohealthcheck.osgeo.org).
Project communications have been relatively low key, with GitHub issues being the main discussion. The project has setup a Gitter channel as a means to discuss GeoHealthCheck in a public forum more easily. It’s open and anyone can join.
CSW has a good presence on the server side (pycsw, GeoNetwork Opensource, deegree, ESRI Geoportal are some FOSS packages). From the client side, OWSLib is the go to library for Python folks. QGIS has MetaSearch (which uses OWSLib).
– geospatial libs like OpenLayers, Leaflet
– web frameworks like jQuery, AngularJS, and so on
As the number of pycsw deployments increase, we’ve started to keep a living document of live deployments on the pycsw wiki. Being a geogeek, naturally I said to myself, “hmm, would be cool to plot these all on a map”. Embedding maps has become easier than ever, and projects like MapServer and GeoServer have cool maps right on their homepages, which demo their maps against a theme like the next FOSS4G conference, etc.
pycsw is a bit different in that it doesn’t do maps, but certainly catalogues them and makes them discoverable via OGC:CSW, OpenSearch and SRU. And putting a sample GetRecords output on the website as a demo is boring. So mapping live deployments seemed like a cool idea for a quick hack with reproducible workflow so it doesn’t become a pain to keep things up to date.
The pycsw website is managed using reStructuredText and Sphinx; source code, issue tracker and wiki are hosted on GitHub. The first thing was to update each deployment on the wiki page with a lat/long pair (the lat/long pair being loosely based the location of the CSW itself, or the content of the CSW. Aside: it would be cool if CSW Capabilities XML specified a BBOX like WMS does to give folks an idea of the location of records).
After this, I wrote a Python script to fetch (and cache) the raw wiki page content. Then, using Leaflet, setup a simple map and create markers foreach live deployment.
That’s pretty much it. So now whenever the live deployment page is updated, a simple make clean && make html will keep things up to date. Reproducible workflow!
UPDATE 26 January 2012: the benchmarks on the improvements below were done against my home dev server (2.8 GHz, 1GB RAM). Benchmarking recently on a modern box yielded 3.6 seconds with maxrecords=10000 (!).
pycsw does a pretty good job of implementing OGC CSW. All CITE tests pass, configuration is painless, and performance is great. To date, testing has been done on repositories of < 5000 records.
Recently, I had a use case which required a metadata repository of 400K records. After loading the records, I found that doing GetRecords searches against 400K records brought things to a halt (Houston, we have a problem). So off I went on a performance improvement adventure.
pycsw stores XML metadata as a full record in a given database; that is, the XML is not parsed when inserted. Queries are then done using XPath queries using lxml and called as embedded SQL functions (for SQLite, these are realized using connection.create_function(); for PostgreSQL, we declare the same functions via plpythonu. SQLAlchemy is used as the DB abstraction layer.
Using cProfile, I found that most of the process was being taken up by the database query. I started thinking that the Python functions being called from the database got expensive as volume scaled (init’ing an XML parser to evaluate and match on each and every row).
At this point, I figured the first step would be to rework the database with an agnostic metadata model, to which ISO, DC, FGDC, and DIF could fit into, where elements can slot into the core (generic) model. Each profile then maps the queryables to (instead of an XPath) a database column in the codebase.
At this point, I loaded 16000 Dublin Core documents as a first test. Results:
– GetCapabilities and GetDomain were instant, and I mean instant (these use the underlying database as well)
– GetRecords: I tried with and without filters. Performance is improved (5 seconds to return 15700 records matching a query [title = ‘%Lor%’], presenting 5 records)
This is a big improvement, but still I thought this would have been faster. I profiled the code again. The cost of the SQL fetch was reduced.
I then ran tests without using sqlalchemy in the codebase (i.e. SQL scripting as opposed to the SQLAlchemy way). I used the Python sqlite3 module, and that’s it. Queries got faster.
Still, this was only 16000 records. As well, I started thinking/worrying about taking away sqlalchemy; it does give us great abstraction into different underlying databases, and helps us greatly with transactional (insert/update/delete).
Then I started thinking more about bottlenecks and the fetch of data. How can we have fast queries and keep sqlalchemy for ease of interacting with the underlying repo??
Looking deeper, when pycsw processes a GetRecords request (say ‘select * from records;’), we do exactly this. So say the DB has 100K records, sqlalchemy gets ALL 100K records. When I bring them back from server/repository.py to server/server.py, that’s an sqlalchemy object with 100K members we’re working with. Then, in that code, I page through the results using maxrecords and startposition as requested by the client / set by the server processing.
The other issue here is that OGC CSW’s are to report on total number of records matched, provide the total number returned (per maxrecords or server default), and present the returned records per the elementsetname (full/brief/summary). So applying a paging approach without getting the number of records matched was not an option.
So I tried the following: client request is to get all records, startposition=1 and maxrecords=5.
– one query which ONLY gets the COUNT of records which satisfy the query (i.e. ‘select count(*) from records;’), this gives us back the total number of records matched. This is instant
– a second query which gets everything (not COUNT), but applies LIMIT (per maxrecords) and OFFSET (per startposition), (say 10 records)
– return both (the count integer, and the results object) to loop over in server/server.py:getrecords()
So the slicing is now done in the SQL which is more powerful. So on 100K records, this approach only pushes back the results per LIMIT and OFFSET (10 records).
Results come back in less than 1 second. Of course, as you increase maxrecords, this is more work for the server to return the records. But still good performance; even when maxrecords=5000, the response is 3 seconds.
So the moral of the story is that smart paging saves us here.
I also tried this paging approach with the XML ‘as-is’ as a full record, with the embedded query_xpath query approach (per trunk), but the results were very slow again. So the embedded xpath queries were hurting us there too.
At this point, the way forward was clearer:
– keep using sqlalchemy for flexibility; yes, if we remove sqlalchemy it will improve performance, but I think the flexibility it gives us, as well as we still get good performance, makes sense for us to keep it at this point
– update data model to deconstruct the XML and put into columns
– use paging techniques to query and present results
– XML databases: looking for a non-Java solution, I found Berkeley DB XML to be interesting. I haven’t done enough pycsw integration yet to assess the pros/cons. Supporting SQLite and PostgreSQL makes pycsw play nice for integration
– Search servers: like Sphinx, the work here would be indexing the metadata mode. Again, the flexibility of using an RDBMS and SQLAlchemy was still attractive
Perhaps the above approaches could be supported as additional db stores. Currently, pycsw code has some ties to what the underlying data model looks like. We could add layer of abstraction between the DB model and the records object model.
I think I’ve exhausted the approaches here for now. These changes are committed to svn trunk. None of these changes will impact end user configuration, just a bit more code behind the scenes.
CSW allows for querying various metadata models (e.g. Dublin Core, ISO). In pycsw, our current model is to manage one repository per metadata model (or ‘typename’ in CSW speak). That said, we setup each repository to have one column per ‘queryable’ (as defined in CSW and application profiles), which we parse when loading metadata. We also store the full metadata record as is (for GetRecords ElementSetName=’full’ requests).
Complexity increases as we start thinking about support for more information models, and transforming to/from requested information models (via CSW GetRecords/GetRecordById ‘outputSchema’ parameter). Having said this, I’ve started to think about a core, agnostic information model which any metadata format could map to (for lowest common denominator). This way, pycsw will always know the core information model queryables, which could be stored in columns as we currently do now. The underlying queries would always query against the queryable columns. Aside: it would be great to have a GDAL for metadata (MDAL anyone?).
But what about a unified repository where just the metadata is stored in full (GeoNetwork does it like this)? In this scenario, we would need heavy use of XPath queries on the full XML document in realtime. The advantage would be a.) less parsing on metadata loading b.) one repository is always loaded/queried c.) less configuration for the catalog administrator.
I like the use of XPath, but wonder about how this scales as additional databases are supported. We currently support SQLite, which is great for simplicity (and Python SQLite bindings allow for mapping Python functions). SQLite has no XPath support (but we could support this with Python bindings). PostgreSQL does (if you build with libxml2), as does MySQL. As well, I’m not sure about the performance implications (and how deep XPath queries are in the database fetching, i.e. the entire XML document would have to be serialized before XPath queries are executed).
Thoughts on a Friday morning. Anyone have any advice/insight?
While working on pycsw, we found that there was a significant amount of code involved in processing the HTTP POST requests coming across as XML. Since lxml is used as for XML support, why not use its native XML validation facilities? We implemented this rather quickly, but found validation was taking up to 10 seconds. Why?
In lxml, you have to specify an XML Schema to parse against, even if it is specified in xsi:schemaLocation. Being a purist, I set this to fetch the schema on the fly from http://schemas.opengis.net. The fetch was causing much of the bottleneck, so I decided to download all required OGC CSW schemas locally and have them as part of the implementation. That should work right? Validation was down to about 6 seconds.
The issue here was that even though the schemas were local, many xs:import definitions within them were pointing back to absolute URLs at schemas.opengis.net. After modifying the schemas to point to relative locations, validation was extremely fast (way under a second).
Lesson learned: just because XML schemas are local, doesn’t mean they don’t point to remote URLs (though I’m not exactly sure why one would build a schema with non-local imports if they don’t have to).
Seemingly buried in geospatial metadata and discovery, I’ve been developing a my share of CSW/ISO/Dublin Core parsers, generators and clients. OWSLib is able to interact with CSW servers, handling csw:Record, ISO 19139:2007, as well as DIF. OWSLib is also the underlying library used by the NextGIS folks in developing a QGIS CSW Client (big thanks to Maxim and Alex for contributing the code back to qgcsw). I’ve also used genshi to generate ISO 19139:2007 and North American Profile.
Part of this adventure has involved testing these metadata within various OGC CSW server implementations. What I quickly noticed is that many foss4g CSW servers are written in Java. Wouldn’t it be great to have a trimmer CSW server in Python? Which can be used easily with an existing Apache install type of thing?
lightweight and easy to stand up: a standalone catalogue, no GUI or metadata editing front end, designed for the use case of exposing ready-to-go metadata (files or in existing DB) through a CSW interface, with as little heavy lifting as possible. Plug and play
extensible: the ability to add metadata formats and mapping them to a common information model and core / additional queryables