Archive for web

Cheers to 2022

That’s a bit more like it, 2022! We finally saw some COVID restrictions lifted and a sense of normalcy (including a new normal) arose. It was fantastic to once again meet with people in person (for dinner, for a visit, for a meeting, you name it!). The pandemic had such a negative effect on me that even commuting again became a joy. Here’s hoping next year’s year-end blog post has even fewer COVID references 🙂

Having said this, 2022 proved to be a busy year; here are some highlights.

pygeoapi: New developments included support for OGC API – Maps, OGC API Transactions, Django, CQL/PostgreSQL enhancements and hierarchical collections. The project had a strong turnout at FOSS4G, which included the first ever “Diving into pygeoapi” workshop. Oh, and pygeoapi is now an official OSGeo project!

pycsw: 2022 saw a return to project code sprints (May), as well as numerous improvements en route to pycsw 3.0 (XSLT support, JSON storage, Solr backend). pycsw continues to be an early implementer of OGC API – Records, with increasing STAC support, as well as improvements to contacts and templating (thanks to great work by Paul van Genuchten!).

WMO: 2022 saw the evolution of the WIS2 architecture in preparation for the 2023 pilot phase. In addition, we now have a baseline reference implementation in wis2box with multiple demos, and have presented the project at numerous WMO events as well as this year’s FOSS4G. Strong use of standards (data, metadata, APIs) from OGC, W3C and IETF for the next generation of weather/climate/water data exchange — exciting times!

OGC: lots of activity this year in the OGC API – Records SWG (coupled with a Metadata Code sprint), as well as in the MetOceanDWG on moving EDR and search/metadata forward.

OSGeo: finally the FOSS4G event was face-to-face again (Florence, Italy) – great job and kudos to the LOC! It was a busy week of numerous presentations, workshops and a keynote, but I would not have had it any other way. The face-to-face energy made it all worth it, whether it was meeting up with longtime friends or making new ones. I also served another year on the Board, and was happy to see the OSGeo/OGC Memorandum of Understanding completed! This also paved the way for proper and unlimited OSGeo representation at OGC. I’m also fortunate to have been elected to serve on the Board again through 2024. Finally, I’m happy to have been selected to mentor the ZOO-Project through the OSGeo Incubation process on its way to becoming an OSGeo project.

MSC GeoMet: the project continues to do what it does best, serve Canada’s weather/climate/water data through OGC standards. Yup, powered by MapServer and pygeoapi.

Health: another year (since 2012) of not smoking. I took off considerable weight in 2022 and put a third of it back on, but am now progressing again.

Looking forward to 2023:

  • pygeoapi: as we inch towards a 1.0, and having landed so many features in the codebase, it’s time to address some technical debt. I’m hoping for 12-18 months of housekeeping/refactoring to help harden things for a 1.0 release (target 2024) and a sustainable future. The “Diving into pygeoapi” workshop will hopefully be accepted and given again at FOSS4G in 2023, along with a possible dedicated code sprint.
  • pycsw: we are targeting a 3.0 this year, pending progress on OGC API – Records. Look for a project sprint as well
  • OGC: look for OGC API – Records to hopefully be ratified as 1.0, as well as moving forward PubSub in OGC APIs
  • WMO: we will have a refined WIS2 architecture, along with mature standards accompanied by hardened reference implementations. WCMP2 should be mature in its definition and implementation (pywcmp, pygeometa), as well as the WIS2 notification message standard (pywis-pubsub). Look for a wis2box 1.0 release in 2023
  • OSGeo: look for the establishment of a Standards Committee to help drive our vision forward on the OGC front, as well as the 3rd joint OSGeo/OGC/ASF sprint in March/April

Wishing everyone a safe and happy 2023!

Sayonara 2021

So 2021 wasn’t much better than 2020. Another year of endless virtual meetings and the 24-hour office. Here are some updates from WFH life:

pygeoapi: both OGC API – Records and OGC API – Environmental Data Retrieval support were added to the codebase. The project also saw both CQL and i18n support, which is a positive indicator of contributions from various developers. Thanks Sander Schaminee and Francesco Bartoli!

pycsw: OGC API – Records and STAC API were both implemented. In addition, CQL support was added with the help of the impressive pygeofilter package — great work by Fabian Schindler!

QGIS MetaSearch: standards implementation needs both servers and clients, and so OGC API – Records support made it into MetaSearch. A nice by-product of this enhancement is the implementation in OWSLib, which MetaSearch uses as its discovery library.

OGC API (Records, EDR): EDR is now an adopted standard! Records also made great strides in 2021, and helping clarify the relationship with STAC has proved valuable for all communities involved.

WMO: Lots of fun work this year on the Task Team on WIS Metadata: new KPIs, an update to the WIS Guide, the metadata search pilot, and we backed it all up with tools (pywcmp, pywiscat). In addition, the Expert Team on Architecture and Transition (W2AT) was formed to move forward technical regulations for WIS 2.0.

MSC GeoMet: our weather/climate/water OGC API platform continues to crank out millions of maps, features and metadata on the daily for everyone. Happy to report that real-time / event-driven data support was added this year to our pygeoapi instance.

FOSS4G: between 7 presentations and the Geopython workshop, there was lots of action at this year’s virtual FOSS4G global event. I was fortunate to deliver these alongside some really talented folks in the Geopython community. Kudos to the BALOC for putting on such a great event under some difficult circumstances!

OSGeo Board of Directors: I was happy to help with the first ever OSGeo / OGC / Apache joint sprint, as well as helping move forward the OSGeo / OGC MOU renewal.

Health: another year (since 2012) of not smoking. The pandemic continues to challenge the scale, although there has been some recent progress.

For 2022:

  • OGC API: the critical path for me this year is helping with the adoption of Records and Coverages
  • WMO: WIS 2.0 continues to evolve, lowering the barrier to weather/climate/water data. I recently signed on as lead architect/dev of the WIS 2.0 in a box project, which will be a reference implementation and publishing pipeline aligned with WIS 2.0 principles. Under the hood: Geopython and PubSub. Look for an initial release in 2022
  • OSGeo: 2022 will mark the year that the OSGeo / OGC MOU is officially updated, along with a shiny new Associate Membership. Rolling this into the OSGeo standards community will be key, along with moving forward the renewal of OGC CITE tooling
  • pycsw: key items this year include XSLT transformation pipelines, virtual collections and deeper JSON support. We are also planning a sprint in Q1, come join us!
  • pygeoapi: look for deeper support of OGC EDR as well as some refactoring that will help with extensibility (primarily for output formats)

Wishing everyone a safe and happy and better 2022!

20 years later – first website

20 years ago I was living in Ottawa, in GIS school, and had started working with Natural Resources Canada.  Fast forward to a few weeks back: scanning through old CD-ROMs, lo and behold, there was my first ever website.  I sat back for a few minutes remembering the details:

  • made with Microsoft FrontPage followed by HotDog Express (WYSIWYG HTML editors)!  At the time, I was convinced this was the only way to be an HTML programmer
  • the website first made it to the Internet in March 1998 and bounced around a few places:
    • http://alqonquinc.on.ca/~kral0003 (Algonquin College account)
    • http://chat.carleton.ca/~279186 (Carleton University account)
    • http://nrcan.gc.ca/~tkralidi/ (work account)
    • http://www.storm.ca/~tommy (Storm Internet who provided awesome service)
  • Concerned that this wasn’t enough, I was motivated to host the site on my own, with a real domain and so on.  I bought Red Hat Linux 6 Server by Mohammed J. Kabir (great book!) and learned how to put up a server and website from the ground up (DNS, firewall, services, etc.), killing an entire weekend
  • the website then finally found a permanent home at http://kralidis.ca

A few months later, having learned Linux, I was motivated to rewrite the site in pure HTML, by hand.  From there I added a picture gallery, source code, blog, and so on.

I continue to post to the blog, but things like GitHub, Twitter, Facebook, etc. provide similar capabilities without the hosting maintenance/hassle.

Anyways, I’ve posted it at http://kralidis.ca/misc/firstwebsite/ — enjoy!

Do you have your first website?  Still online?  Feel free to share memories and experiences!

GeoUsage: Log Analyzer for OGC Web Services

Continuing with the UNIX philosophy, here’s another little tool to help with your OWS workflows.

GeoUsage attempts to support the use case of metrics and analysis of OWS service usage.  How many users are hitting your OWS?  Which layers/projections are the most popular?  How much bandwidth?  How many maps vs. data downloads?

A pure Python package, GeoUsage doesn’t have strong opinions beyond OWS-specific parsing and analysis of web server logs.  GeoUsage is composable, i.e. frequency, log management, and storage of results are totally up to the user.  Having said this, a simple and beautiful command line interface is available for eyeballing results.
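
To give a flavour of the kind of log parsing involved, here’s a minimal sketch (generic Python, not GeoUsage’s actual API; the log path and format are assumptions) that counts WMS GetMap requests per layer from a typical Apache/nginx access log:

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlparse

layer_hits = Counter()

with open('access.log') as f:  # log path is an assumption
    for line in f:
        m = re.search(r'"(?:GET|POST) (\S+)', line)
        if m is None:
            continue
        # OWS KVP parameters are case-insensitive, so normalize the keys
        params = {k.lower(): v[0] for k, v in
                  parse_qs(urlparse(m.group(1)).query).items()}
        if params.get('request', '').lower() == 'getmap':
            layer_hits.update(params.get('layers', '').split(','))

for layer, hits in layer_hits.most_common(10):
    print(layer, hits)
```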

As always, GeoUsage is free and open source.

It’s early days, so feedback, bug reports, suggestions are appreciated.  Contributors are most welcome!

GeoHealthCheck support on Gitter

It’s been almost two years since GeoHealthCheck was initially developed (en route to FOSS4G in PDX).  Since then, GHC has been deployed in numerous environments in support of monitoring of (primarily) OGC services (canonical demo at http://geohealthcheck.osgeo.org).

Project communications have been relatively low key, with GitHub issues being the main discussion venue.  The project has set up a Gitter channel as a means to discuss GeoHealthCheck in a public forum more easily.  It’s open and anyone can join. Come join us at https://gitter.im/geopython/GeoHealthCheck!

CSW Client Library for JavaScript: the Adventure Begins

CSW has a good presence on the server side (pycsw, GeoNetwork Opensource, deegree and ESRI Geoportal are some FOSS packages).  On the client side, OWSLib is the go-to library for Python folks.  QGIS has MetaSearch (which uses OWSLib).

At the same time, it’s been a while since I’ve delved deep into JavaScript.  These days, we have things like JavaScript on the server, more emphasis on testing, building/packaging, and so on.  You can do it all with JavaScript if you want.

Wouldn’t it be great to have a generic CSW JavaScript client?  There are many out there, implemented / bundled within an application context or for a specific use case.  But what about a generic lib?  Kind of like OWSLib, but for JavaScript.
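
For context, here is the kind of workflow OWSLib provides on the Python side, which a JavaScript equivalent would roughly mirror (the endpoint and query below are just examples):

```python
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

# connect to a CSW endpoint (example URL)
csw = CatalogueServiceWeb('https://demo.pycsw.org/cite/csw')

# search for records whose title matches a keyword
query = PropertyIsLike('dc:title', '%birds%')
csw.getrecords2(constraints=[query], maxrecords=10)

print(csw.results)  # matched / returned counts
for record in csw.records.values():
    print(record.title)
```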

Say hello to csw4js.  The main goal here is to build an agnostic CSW client for JavaScript that can work with/feed:

– geospatial libs like OpenLayers, Leaflet

– web frameworks like jQuery, AngularJS, and so on

– browser applications, node.js, etc.

Todo:

– Unit tests (QUnit?)

– Build routines (using Grunt initially)

– JavaScript muscle for namespacing, structure, etc.

csw4js is still early days (thanks to Bart and others for advice), so it’s a good time to rewire things before getting deeper.  Interested in helping out?  Get in touch!

Mapping pycsw Deployments

As the number of pycsw deployments increases, we’ve started to keep a living document of live deployments on the pycsw wiki. Being a geogeek, naturally I said to myself, “hmm, would be cool to plot these all on a map”.  Embedding maps has become easier than ever, and projects like MapServer and GeoServer have cool maps right on their homepages, which demo their maps against a theme like the next FOSS4G conference, etc.

pycsw is a bit different in that it doesn’t do maps, but it certainly catalogues them and makes them discoverable via OGC:CSW, OpenSearch and SRU.  And putting a sample GetRecords output on the website as a demo is boring.  So mapping live deployments seemed like a cool idea for a quick hack, with a reproducible workflow so it doesn’t become a pain to keep things up to date.

The pycsw website is managed using reStructuredText and Sphinx; source code, issue tracker and wiki are hosted on GitHub.  The first thing was to update each deployment on the wiki page with a lat/long pair (the lat/long pair being loosely based on the location of the CSW itself, or the content of the CSW.  Aside: it would be cool if CSW Capabilities XML specified a BBOX like WMS does to give folks an idea of the location of records).

After this, I wrote a Python script to fetch (and cache) the raw wiki page content.  Then, using Leaflet, I set up a simple map and created markers for each live deployment.
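
The script itself only needs a few lines; here’s a simplified sketch (the raw wiki URL and the assumption that each deployment line ends with a lat,long pair are illustrative, not the exact format used):

```python
import re
import urllib.request

# raw content of the Live-Deployments wiki page (illustrative URL)
WIKI_RAW = 'https://raw.githubusercontent.com/wiki/geopython/pycsw/Live-Deployments.md'

with urllib.request.urlopen(WIKI_RAW) as response:
    content = response.read().decode('utf-8')

markers = []
for line in content.splitlines():
    # assume each deployment line carries a trailing "lat,long" pair
    m = re.search(r'(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*$', line)
    if m is not None:
        markers.append((float(m.group(1)), float(m.group(2))))

# emit a Leaflet snippet to be embedded in the Sphinx-built page
print('var map = L.map("map").setView([0, 0], 2);')
print('L.tileLayer("https://tile.openstreetmap.org/{z}/{x}/{y}.png").addTo(map);')
for lat, lon in markers:
    print('L.marker([%s, %s]).addTo(map);' % (lat, lon))
```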

So now I have a JavaScript snippet; how do I add this to a page?  Using the Sphinx Makefile, I update the html target to run the Python script and save its output to an area where I embed it using an rST include.

That’s pretty much it.  So now whenever the live deployment page is updated, a simple make clean && make html will keep things up to date.  Reproducible workflow!

I’ve published this to the pycsw community page.  Do you have a pycsw install?  Add it to https://github.com/geopython/pycsw/wiki/Live-Deployments and we’ll put it on the map!

pycsw performance improvements

UPDATE 26 January 2012: the benchmarks on the improvements below were done against my home dev server (2.8 GHz, 1GB RAM).  Benchmarking recently on a modern box yielded 3.6 seconds with maxrecords=10000 (!).

pycsw does a pretty good job of implementing OGC CSW.  All CITE tests pass, configuration is painless, and performance is great.  To date, testing has been done on repositories of < 5000 records.

Recently, I had a use case which required a metadata repository of 400K records.  After loading the records, I found that doing GetRecords searches against 400K records brought things to a halt (Houston, we have a problem).  So off I went on a performance improvement adventure.

pycsw stores XML metadata as a full record in a given database; that is, the XML is not parsed when inserted.  Queries are then done as XPath evaluations using lxml, called as embedded SQL functions (for SQLite, these are realized using connection.create_function(); for PostgreSQL, we declare the same functions via plpythonu).  SQLAlchemy is used as the DB abstraction layer.
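
As a rough illustration of the embedded-function approach for SQLite (the table, column and function names here are simplified, not pycsw’s actual code):

```python
import sqlite3
from lxml import etree

def query_xpath(xml, path):
    """Evaluate an XPath expression against a stored XML record."""
    # an XML parse happens for every row the query touches
    doc = etree.fromstring(xml.encode('utf-8'))
    matches = doc.xpath(path, namespaces={'dc': 'http://purl.org/dc/elements/1.1/'})
    return matches[0].text if matches else None

conn = sqlite3.connect('records.db')  # illustrative database
conn.create_function('query_xpath', 2, query_xpath)

# the Python function is now callable from SQL, once per candidate row
rows = conn.execute(
    "SELECT identifier FROM records "
    "WHERE query_xpath(xml, '//dc:title') LIKE '%Lor%'"
).fetchall()
```

That per-row parser initialization is exactly what gets expensive as the repository grows.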

Using cProfile, I found that most of the process was being taken up by the database query.  I started thinking that the Python functions being called from the database got expensive as volume scaled (init’ing an XML parser to evaluate and match on each and every row).

At this point, I figured the first step would be to rework the database with an agnostic metadata model into which ISO, DC, FGDC and DIF could fit, where elements slot into the core (generic) model.  Each profile then maps its queryables to a database column (instead of an XPath) in the codebase.
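
Purely as an illustration of the idea (the names below are made up, not pycsw’s actual mapping tables), each profile’s queryables end up pointing at columns rather than XPath expressions:

```python
# illustrative mapping: queryable -> column in the generic model
QUERYABLE_COLUMNS = {
    'dc:title': 'title',
    'dc:creator': 'creator',
    'dct:abstract': 'abstract',
    'dc:type': 'type',
    'ows:BoundingBox': 'wkt_geometry',
}

# a filter such as title LIKE '%Lor%' then becomes a plain column
# comparison, with no XML parsing at query time
```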

Next, I loaded 16000 Dublin Core documents as a first test.  Results:

– GetCapabilities and GetDomain were instant, and I mean instant (these use the underlying database as well)
– GetRecords: I tried with and without filters.  Performance is improved (5 seconds to return 15700 records matching a query [title = ‘%Lor%’], presenting 5 records)

This is a big improvement, but I still thought it would be faster.  I profiled the code again; the cost of the SQL fetch was reduced.

I then ran tests without using sqlalchemy in the codebase (i.e. SQL scripting as opposed to the SQLAlchemy way).  I used the Python sqlite3 module, and that’s it.  Queries got faster.

Still, this was only 16000 records.  As well, I started thinking/worrying about taking away sqlalchemy; it does give us great abstraction over different underlying databases, and helps us greatly with transactions (insert/update/delete).

Then I started thinking more about bottlenecks and the fetch of data.  How can we have fast queries and keep sqlalchemy for ease of interacting with the underlying repo?

Looking deeper: when pycsw processes a GetRecords request, it effectively runs ‘select * from records;’.  So if the DB has 100K records, sqlalchemy fetches ALL 100K records.  When I bring them back from server/repository.py to server/server.py, that’s an sqlalchemy object with 100K members we’re working with.  Then, in that code, I page through the results using maxrecords and startposition as requested by the client / set by the server.

The other issue here is that OGC CSWs are required to report the total number of records matched, provide the total number returned (per maxrecords or the server default), and present the returned records per the elementsetname (full/brief/summary).  So applying a paging approach without getting the number of records matched was not an option.

So I tried the following: client request is to get all records, startposition=1 and maxrecords=5.

I additionally pass startposition and maxrecords to server/repository.py:query()

In repository.query(), I then do two queries:

– one query which ONLY gets the COUNT of records which satisfy the query (i.e. ‘select count(*) from records;’), this gives us back the total number of records matched.  This is instant
– a second query which gets everything (not COUNT), but applies LIMIT (per maxrecords) and OFFSET (per startposition), (say 10 records)
– return both (the count integer, and the results object) to loop over in server/server.py:getrecords()

So the slicing is now done in SQL, which is more powerful.  On 100K records, this approach only pushes back the results per LIMIT and OFFSET (10 records).
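
In SQLAlchemy terms, the two-query approach boils down to something like this (a sketch with illustrative names, not pycsw’s actual code):

```python
def query(session, Record, constraint, startposition=1, maxrecords=10):
    """Return (total matched, page of results) for a GetRecords query."""
    base = session.query(Record).filter(constraint)

    matched = base.count()  # SELECT COUNT(*) only; no rows are fetched

    # let the database do the slicing via LIMIT/OFFSET
    results = base.limit(maxrecords).offset(startposition - 1).all()

    return matched, results

# e.g. query(session, Record, Record.title.like('%Lor%'), 1, 5)
```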

Results come back in less than 1 second.  Of course, as you increase maxrecords, this is more work for the server to return the records.  But still good performance; even when maxrecords=5000, the response is 3 seconds.

So the moral of the story is that smart paging saves us here.

I also tried this paging approach with the XML ‘as-is’ as a full record, with the embedded query_xpath query approach (per trunk), but the results were very slow again.  So the embedded xpath queries were hurting us there too.

At this point, the way forward was clearer:

– keep using sqlalchemy for flexibility; yes, removing sqlalchemy would improve performance, but given the flexibility it gives us, and that we still get good performance, it makes sense for us to keep it at this point
– update data model to deconstruct the XML and put into columns
– use paging techniques to query and present results

Other options:

– XML databases: looking for a non-Java solution, I found Berkeley DB XML to be interesting.  I haven’t done enough pycsw integration yet to assess the pros/cons.  Supporting SQLite and PostgreSQL makes pycsw play nice for integration
– Search servers: like Sphinx, the work here would be indexing the metadata model.  Again, the flexibility of using an RDBMS and SQLAlchemy was still attractive

Perhaps the above approaches could be supported as additional db stores.  Currently, pycsw code has some ties to what the underlying data model looks like.  We could add a layer of abstraction between the DB model and the records object model.

I think I’ve exhausted the approaches here for now.  These changes are committed to svn trunk.  None of these changes will impact end user configuration, just a bit more code behind the scenes.

CSW and repository thoughts

CSW allows for querying various metadata models (e.g. Dublin Core, ISO).  In pycsw, our current model is to manage one repository per metadata model (or ‘typename’ in CSW speak).  That said, we set up each repository to have one column per ‘queryable’ (as defined in CSW and its application profiles), which we parse when loading metadata.  We also store the full metadata record as is (for GetRecords ElementSetName=’full’ requests).

Complexity increases as we start thinking about support for more information models, and transforming to/from requested information models (via the CSW GetRecords/GetRecordById ‘outputSchema’ parameter).  Having said this, I’ve started to think about a core, agnostic information model to which any metadata format could map (a lowest common denominator).  This way, pycsw will always know the core information model queryables, which could be stored in columns as we currently do now.  The underlying queries would always query against the queryable columns.  Aside: it would be great to have a GDAL for metadata (MDAL anyone?).

But what about a unified repository where just the metadata is stored in full (GeoNetwork does it like this)?  In this scenario, we would need heavy use of XPath queries on the full XML document in real time.  The advantages would be a) less parsing on metadata loading, b) one repository is always loaded/queried, and c) less configuration for the catalog administrator.

I like the use of XPath, but wonder about how this scales as additional databases are supported.  We currently support SQLite, which is great for simplicity (and Python SQLite bindings allow for mapping Python functions).  SQLite has no XPath support (but we could support this with Python bindings).  PostgreSQL does (if you build with libxml2), as does MySQL.  As well, I’m not sure about the performance implications (and how deep XPath queries are in the database fetching, i.e. the entire XML document would have to be serialized before XPath queries are executed).

Thoughts on a Friday morning.  Anyone have any advice/insight?

validating XML requests with Python and lxml

While working on pycsw, we found that there was a significant amount of code involved in processing the HTTP POST requests coming across as XML.  Since lxml is used for XML support, why not use its native XML validation facilities?  We implemented this rather quickly, but found validation was taking up to 10 seconds.  Why?

In lxml, you have to specify an XML Schema to validate against, even if one is specified in xsi:schemaLocation.  Being a purist, I set this to fetch the schema on the fly from http://schemas.opengis.net.  The fetch was causing much of the bottleneck, so I decided to download all required OGC CSW schemas locally and make them part of the implementation.  That should work, right?  Validation was down to about 6 seconds.

The issue here was that even though the schemas were local, many xs:import definitions within them were pointing back to absolute URLs at schemas.opengis.net.  After modifying the schemas to point to relative locations, validation was extremely fast (way under a second).
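
The resulting validation step is only a few lines; a minimal sketch (the file paths below are illustrative):

```python
from lxml import etree

# local copy of the OGC CSW schemas, with xs:import locations rewritten
# to relative paths so validation never reaches out to the network
schema = etree.XMLSchema(etree.parse('schemas/ogc/csw/2.0.2/CSW-discovery.xsd'))

request = etree.parse('getrecords-request.xml')  # incoming HTTP POST body

if not schema.validate(request):
    for error in schema.error_log:
        print(error.message)
```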

Lesson learned: just because XML schemas are local, doesn’t mean they don’t point to remote URLs (though I’m not exactly sure why one would build a schema with non-local imports if they don’t have to).
