pygeometa is a handy little metadata generator tool which is flexible, extensible, and composable. Whether from the command line or via the API, users can generate config files, or pass plain old Python dicts, ConfigParser objects, etc.
We’ve just released 0.2.0 which supports WMO Core Metadata Profile output, as well as better multilingual support. At this point we’re embarking on breaking changes in master led by moving to YAML as the configuration format.
Given that pygeometa is pre-1.0, changes can in theory be breaking without any backwards-compatibility support. Still, I’ve cut a 0.2 branch in case anyone’s existing workflows depend on the (now) old pygeometa functionality.
As always, bug reports and feature requests are more than welcome. Hopefully the new enhancements will make metadata management even easier for agile workflows.
There was lots of discussion on refactoring pycsw’s filter support to enable NoSQL backends. While we are still in discussion, this enhancement should open the doors for any backend (ElasticSearch, SOLR, a GitHub repository, another API, etc.). In addition, Frank Warmerdam started writing a pycsw OGR backend to support CSW exposure of the Planet Scenes API via OGR. This also presents exciting possibilities given OGR’s support of numerous underlying formats. Frank also provided valuable advice and feedback on interacting with pycsw as a developer/contributor. Thank you Frank!
There has been long discussion on a next generation GHC including a renewed architecture with core work on the model as well as an API. A basic architecture has surfaced as a result which focuses on having the UI exclusively work with the API, as well as a plugin framework which Just van den Broecke has started working on. I also worked on tagging which will be the last piece before cutting a release and forging ahead on the new architecture.
While I couldn’t get to everything I planned for, I think significant steps were made in moving the above projects forward along their respective roadmaps. It was also great to see some familiar faces as well as new contributors and projects!
I sat in shock for the remainder of the presentation thinking of the complexity and all the math involved. After their presentation, I mentioned this to the presenter offline, who replied “it’s very hard and complex work, yes”.
Fast forward to around 2002 and it turns out they were indeed using proj.4, which initially made me think, “ah, that’s easy, then”.
These days, I would say: well, it’s not that easy. Integration, upstream changes, versions, packaging and deployment. Moving parts. Different issues. It’s smart, strategic and preferable not to re-invent the wheel and to use existing libs, but the work certainly doesn’t end there.
(For what it’s worth, the vendor [it doesn’t matter who they are] and their product are still around and going strong)
It’s been almost two years since GeoHealthCheck was initially developed (en route to FOSS4G in PDX). Since then, GHC has been deployed in numerous environments in support of monitoring of (primarily) OGC services (canonical demo at http://geohealthcheck.osgeo.org).
Project communications have been relatively low key, with GitHub issues being the main discussion channel. The project has set up a Gitter channel as a means to discuss GeoHealthCheck in a public forum more easily. It’s open and anyone can join.
A sincere thanks to Richard Duivenvoorde, Angelos Tzotsos, Alexander Bruy, Tim Sutton and the rest of the QGIS developers/community for helping bring MetaSearch into QGIS to help move the search / discovery workflow forward!
As far as a roadmap, here’s a laundry list of future items:
OWSLib dependency cleanup: currently we manage a copy of OWSLib in QGIS proper. This is because there is a gap in packaging across supported platforms. It would be great to have approved OWSLib packages (see issue)
Metadata publishing and management: it would be great to manage and publish better metadata directly from MetaSearch. The end result will be a more streamlined, deeper integration and support of metadata within QGIS. No movement on these yet, but there are QEPs proposed
ISO based servers: MetaSearch supports the OGC Core CSW model. Most CSWs implement the CSW ISO Application Profile which supports more detailed metadata
add data functionality: it would also be great to directly add raw data from a metadata record’s access links into QGIS. We already support this for OGC services, and supporting direct data downloads to visualize in QGIS would complete the “publish/find/bind” workflow
Do you have any enhancements you would like to see in MetaSearch? Feel free to raise them in the MetaSearch issue tracker or on the QGIS mailing lists! Do you have fixes or features to contribute? Feel free to fork and send pull requests!
CSW has a good presence on the server side (pycsw, GeoNetwork Opensource, deegree, ESRI Geoportal are some FOSS packages). From the client side, OWSLib is the go-to library for Python folks. QGIS has MetaSearch (which uses OWSLib).
– geospatial libs like OpenLayers, Leaflet
– web frameworks like jQuery, AngularJS, and so on
It’s great to see QGIS rising to fame in terms of a great desktop GIS tool. Part of what makes QGIS so great is the vast ecosystem of plugins. And Python support makes it easy to write plugins fast, especially atop existing libraries.
CSW client support in QGIS has been via the excellent CSWClient plugin. The MetaSearch project forks CSWClient and will make the following initial improvements:
QGIS 2.0 support
added Catalogue types in addition to CSW (JSON APIs, OpenSearch, etc.)
documentation using Sphinx
i18n/continuous localization for both UI and docs, using Transifex
code maintenance (easy to deploy for developers, automated build, packaging and dependency management)
As the number of pycsw deployments increases, we’ve started to keep a living document of live deployments on the pycsw wiki. Being a geogeek, naturally I said to myself, “hmm, would be cool to plot these all on a map”. Embedding maps has become easier than ever, and projects like MapServer and GeoServer have cool maps right on their homepages, which demo their maps against a theme like the next FOSS4G conference, etc.
pycsw is a bit different in that it doesn’t do maps, but certainly catalogues them and makes them discoverable via OGC:CSW, OpenSearch and SRU. And putting a sample GetRecords output on the website as a demo is boring. So mapping live deployments seemed like a cool idea for a quick hack with reproducible workflow so it doesn’t become a pain to keep things up to date.
The pycsw website is managed using reStructuredText and Sphinx; source code, issue tracker and wiki are hosted on GitHub. The first thing was to update each deployment on the wiki page with a lat/long pair (the lat/long pair being loosely based on the location of the CSW itself, or the content of the CSW. Aside: it would be cool if CSW Capabilities XML specified a BBOX like WMS does to give folks an idea of the location of records).
After this, I wrote a Python script to fetch (and cache) the raw wiki page content. Then, using Leaflet, I set up a simple map and created markers for each live deployment.
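To give a rough idea of the parsing half of such a script, here is a minimal sketch. The wiki line format, the regular expression and the function name deployments_to_geojson are all assumptions made up for illustration (the real wiki page may look quite different), and fetching/caching is left out. It turns matching lines into GeoJSON, which Leaflet can turn into markers.

```python
import json
import re

# Assumed (hypothetical) wiki line format:
#   * Example CSW, http://example.org/csw (45.42, -75.69)
LINE = re.compile(
    r"^\*\s*(?P<name>[^,]+),\s*(?P<url>\S+)\s*"
    r"\((?P<lat>-?\d+(?:\.\d+)?),\s*(?P<lon>-?\d+(?:\.\d+)?)\)\s*$"
)


def deployments_to_geojson(wiki_text):
    """Turn matching wiki lines into a GeoJSON FeatureCollection."""
    features = []
    for line in wiki_text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip headings and lines without a lat/long pair
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude]
            "geometry": {
                "type": "Point",
                "coordinates": [float(match["lon"]), float(match["lat"])],
            },
            "properties": {"name": match["name"].strip(), "url": match["url"]},
        })
    return {"type": "FeatureCollection", "features": features}


sample = (
    "== Live deployments ==\n"
    "* Example CSW, http://example.org/csw (45.42, -75.69)\n"
)
geojson = deployments_to_geojson(sample)
print(json.dumps(geojson, indent=2))
```

Writing the result to a file as part of the docs build would let the map page pick it up without any manual steps.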
That’s pretty much it. So now whenever the live deployment page is updated, a simple make clean && make html will keep things up to date. Reproducible workflow!
UPDATE 26 January 2012: the benchmarks on the improvements below were done against my home dev server (2.8 GHz, 1GB RAM). Benchmarking recently on a modern box yielded 3.6 seconds with maxrecords=10000 (!).
pycsw does a pretty good job of implementing OGC CSW. All CITE tests pass, configuration is painless, and performance is great. To date, testing has been done on repositories of < 5000 records.
Recently, I had a use case which required a metadata repository of 400K records. After loading the records, I found that doing GetRecords searches against 400K records brought things to a halt (Houston, we have a problem). So off I went on a performance improvement adventure.
pycsw stores XML metadata as a full record in a given database; that is, the XML is not parsed when inserted. Queries are then done as XPath queries using lxml, called as embedded SQL functions (for SQLite, these are realized using connection.create_function(); for PostgreSQL, we declare the same functions via plpythonu). SQLAlchemy is used as the DB abstraction layer.
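For readers unfamiliar with the SQLite side of this, here is a minimal sketch of the idea, not the actual pycsw code. To keep it dependency-free it swaps lxml for the standard library's xml.etree.ElementTree (which only supports a subset of XPath), and the table layout, sample record and the query_xpath helper shown here are assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def query_xpath(xml_text, xpath):
    """Return the text of the first element matching xpath, or None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    node = root.find(xpath, DC_NS)
    return node.text if node is not None else None


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL as query_xpath(xml, xpath)
conn.create_function("query_xpath", 2, query_xpath)

conn.execute("CREATE TABLE records (identifier TEXT, xml TEXT)")
record = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Lorem ipsum</dc:title></record>"
)
conn.execute("INSERT INTO records VALUES (?, ?)", ("rec-1", record))

# The XML is parsed once per row scanned, so the cost grows with row count
rows = conn.execute(
    "SELECT identifier FROM records WHERE query_xpath(xml, ?) LIKE ?",
    ("dc:title", "%Lor%"),
).fetchall()
print(rows)
```

Note that the database calls back into Python for every row it evaluates, which is convenient but hints at where the time goes on large repositories.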
Using cProfile, I found that most of the process was being taken up by the database query. I started thinking that the Python functions being called from the database got expensive as volume scaled (init’ing an XML parser to evaluate and match on each and every row).
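As a rough illustration of this kind of profiling run (a sketch only; run_query is a hypothetical stand-in for the real request handler, and the table is a toy one):

```python
import cProfile
import io
import pstats
import sqlite3


def run_query(conn):
    """Hypothetical stand-in for the request handler being profiled."""
    return conn.execute("SELECT count(*) FROM records").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (title TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?)",
    [("record %d" % i,) for i in range(1000)],
)

profiler = cProfile.Profile()
profiler.enable()
matched = run_query(conn)
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Sorting by cumulative time is what makes a slow database call stand out, since everything it triggers is counted against it.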
At this point, I figured the first step would be to rework the database with an agnostic metadata model into which ISO, DC, FGDC, and DIF could fit, with elements slotting into the core (generic) model. Each profile then maps its queryables to a database column (instead of an XPath) in the codebase.
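A minimal sketch of what such a queryable-to-column mapping could look like. The QUERYABLES dict, the column names and build_query are hypothetical and only illustrate the idea; they are not the pycsw data model.

```python
import sqlite3

# Hypothetical profile mappings: queryable name -> column in the core model
QUERYABLES = {
    "dc": {"dc:title": "title", "dc:subject": "keywords"},
    "iso": {"apiso:Title": "title", "apiso:Subject": "keywords"},
}


def build_query(profile, queryable):
    """Build a parameterised LIKE query for one queryable of one profile."""
    column = QUERYABLES[profile][queryable]  # KeyError for unknown names
    # The column name comes from the mapping, never from user input
    return "SELECT identifier FROM records WHERE %s LIKE ?" % column


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (identifier TEXT, title TEXT, keywords TEXT)"
)
conn.execute("INSERT INTO records VALUES ('rec-1', 'Lorem ipsum', 'test')")

sql = build_query("iso", "apiso:Title")
rows = conn.execute(sql, ("%Lor%",)).fetchall()
print(sql, rows)
```

Both profiles end up querying the same plain column, so the database never has to touch the XML to answer a filter.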
At this point, I loaded 16000 Dublin Core documents as a first test. Results:
– GetCapabilities and GetDomain were instant, and I mean instant (these use the underlying database as well)
– GetRecords: I tried with and without filters. Performance is improved (5 seconds to return 15700 records matching a query [title = ‘%Lor%’], presenting 5 records)
This is a big improvement, but still I thought this would have been faster. I profiled the code again. The cost of the SQL fetch was reduced.
I then ran tests without using sqlalchemy in the codebase (i.e. SQL scripting as opposed to the SQLAlchemy way). I used the Python sqlite3 module, and that’s it. Queries got faster.
Still, this was only 16000 records. As well, I started thinking/worrying about taking away sqlalchemy; it does give us great abstraction into different underlying databases, and helps us greatly with transactional (insert/update/delete).
Then I started thinking more about bottlenecks and the fetch of data. How can we have fast queries and keep sqlalchemy for ease of interacting with the underlying repo?
Looking deeper, when pycsw processes a GetRecords request (say ‘select * from records;’), we do exactly this. So say the DB has 100K records, sqlalchemy gets ALL 100K records. When I bring them back from server/repository.py to server/server.py, that’s an sqlalchemy object with 100K members we’re working with. Then, in that code, I page through the results using maxrecords and startposition as requested by the client / set by the server processing.
The other issue here is that OGC CSWs are required to report the total number of records matched, provide the total number returned (per maxrecords or server default), and present the returned records per the elementsetname (full/brief/summary). So applying a paging approach without getting the number of records matched was not an option.
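To make the reporting requirement concrete, here is a small sketch of the arithmetic involved. The function name is hypothetical; the keys follow the CSW 2.0.2 SearchResults attribute names as I understand them, with nextRecord set to 0 when there is nothing left to page through.

```python
def search_results_summary(matched, startposition, maxrecords):
    """Compute the paging attributes for a GetRecords response.

    startposition is 1-based, as in CSW.
    """
    remaining = max(matched - (startposition - 1), 0)
    returned = min(maxrecords, remaining)
    next_record = startposition + returned
    # nextRecord=0 signals that there are no more records to fetch
    if next_record > matched:
        next_record = 0
    return {
        "numberOfRecordsMatched": matched,
        "numberOfRecordsReturned": returned,
        "nextRecord": next_record,
    }


print(search_results_summary(15700, 1, 5))
print(search_results_summary(12, 11, 5))
```

The point is that the matched total is an input to all of this, so the server needs it even when it only returns a handful of records.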
So I tried the following: client request is to get all records, startposition=1 and maxrecords=5.
– one query which ONLY gets the COUNT of records which satisfy the query (i.e. ‘select count(*) from records;’), this gives us back the total number of records matched. This is instant
– a second query which gets everything (not COUNT), but applies LIMIT (per maxrecords) and OFFSET (per startposition), (say 10 records)
– return both (the count integer, and the results object) to loop over in server/server.py:getrecords()
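The two-query approach above can be sketched as follows. This uses the plain sqlite3 module rather than SQLAlchemy to stay self-contained, and the table, columns and get_records function are assumptions for illustration rather than the code in server/repository.py.

```python
import sqlite3


def get_records(conn, startposition=1, maxrecords=10):
    """Return (total matched, one page of rows); startposition is 1-based."""
    # Query 1: only the count of matching records
    matched = conn.execute("SELECT count(*) FROM records").fetchone()[0]
    # Query 2: only the requested page, sliced in SQL
    page = conn.execute(
        "SELECT identifier, title FROM records "
        "ORDER BY identifier LIMIT ? OFFSET ?",
        (maxrecords, startposition - 1),
    ).fetchall()
    return matched, page


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (identifier INTEGER PRIMARY KEY, title TEXT)"
)
conn.executemany(
    "INSERT INTO records (title) VALUES (?)",
    [("record %d" % i,) for i in range(1, 101)],
)

matched, page = get_records(conn, startposition=1, maxrecords=5)
print(matched, page)
```

The ORDER BY is there so that successive pages are stable; without an explicit ordering, LIMIT/OFFSET pages are not guaranteed to line up between requests.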
So the slicing is now done in SQL, which is more powerful. On 100K records, this approach only pushes back the results per LIMIT and OFFSET (10 records).
Results come back in less than 1 second. Of course, as you increase maxrecords, this is more work for the server to return the records. But still good performance; even when maxrecords=5000, the response is 3 seconds.
So the moral of the story is that smart paging saves us here.
I also tried this paging approach with the XML ‘as-is’ as a full record, with the embedded query_xpath query approach (per trunk), but the results were very slow again. So the embedded xpath queries were hurting us there too.
At this point, the way forward was clearer:
– keep using sqlalchemy for flexibility; yes, removing sqlalchemy would improve performance, but given the flexibility it gives us, and that we still get good performance, it makes sense to keep it at this point
– update data model to deconstruct the XML and put into columns
– use paging techniques to query and present results
– XML databases: looking for a non-Java solution, I found Berkeley DB XML to be interesting. I haven’t done enough pycsw integration yet to assess the pros/cons. Supporting SQLite and PostgreSQL makes pycsw play nice for integration
– Search servers: like Sphinx; the work here would be indexing the metadata model. Again, the flexibility of using an RDBMS and SQLAlchemy was still attractive
Perhaps the above approaches could be supported as additional db stores. Currently, pycsw code has some ties to what the underlying data model looks like. We could add a layer of abstraction between the DB model and the records object model.
I think I’ve exhausted the approaches here for now. These changes are committed to svn trunk. None of these changes will impact end user configuration, just a bit more code behind the scenes.
CSW allows for querying various metadata models (e.g. Dublin Core, ISO). In pycsw, our current model is to manage one repository per metadata model (or ‘typename’ in CSW speak). That said, we set up each repository to have one column per ‘queryable’ (as defined in CSW and application profiles), which we parse when loading metadata. We also store the full metadata record as is (for GetRecords ElementSetName=’full’ requests).
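As a rough sketch of this load-time step (assuming a toy Dublin Core repository; the COLUMNS mapping, table layout and load_record are hypothetical, and the standard library's ElementTree stands in for whatever XML parser is actually used):

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = {"dc": "http://purl.org/dc/elements/1.1/"}
# Hypothetical queryable -> column mapping for a Dublin Core repository
COLUMNS = {"dc:identifier": "identifier", "dc:title": "title"}


def load_record(conn, xml_text):
    """Parse the queryables once, then store them next to the full XML."""
    root = ET.fromstring(xml_text)
    values = {
        column: root.findtext(path, default=None, namespaces=NS)
        for path, column in COLUMNS.items()
    }
    conn.execute(
        "INSERT INTO records (identifier, title, xml) VALUES (?, ?, ?)",
        (values["identifier"], values["title"], xml_text),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (identifier TEXT, title TEXT, xml TEXT)")

record = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:identifier>rec-1</dc:identifier>"
    "<dc:title>Lorem ipsum</dc:title></record>"
)
load_record(conn, record)

# Filters hit the plain columns; the full XML is only read for 'full' output
row = conn.execute(
    "SELECT identifier, xml FROM records WHERE title LIKE ?", ("%Lor%",)
).fetchone()
print(row[0])
```

The parsing cost is paid once per record at load time instead of once per record per query.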
Complexity increases as we start thinking about support for more information models, and transforming to/from requested information models (via CSW GetRecords/GetRecordById ‘outputSchema’ parameter). Having said this, I’ve started to think about a core, agnostic information model which any metadata format could map to (for lowest common denominator). This way, pycsw will always know the core information model queryables, which could be stored in columns as we currently do now. The underlying queries would always query against the queryable columns. Aside: it would be great to have a GDAL for metadata (MDAL anyone?).
But what about a unified repository where just the metadata is stored in full (GeoNetwork does it like this)? In this scenario, we would need heavy use of XPath queries on the full XML document in realtime. The advantages would be (a) less parsing on metadata loading, (b) one repository is always loaded/queried, and (c) less configuration for the catalog administrator.
I like the use of XPath, but wonder about how this scales as additional databases are supported. We currently support SQLite, which is great for simplicity (and Python SQLite bindings allow for mapping Python functions). SQLite has no XPath support (but we could support this with Python bindings). PostgreSQL does (if you build with libxml2), as does MySQL. As well, I’m not sure about the performance implications (and how deep XPath queries are in the database fetching, i.e. the entire XML document would have to be serialized before XPath queries are executed).
Thoughts on a Friday morning. Anyone have any advice/insight?