UPDATE 26 January 2012: the benchmarks on the improvements below were done against my home dev server (2.8 GHz, 1GB RAM). Benchmarking recently on a modern box yielded 3.6 seconds with maxrecords=10000 (!).
Recently, I had a use case which required a metadata repository of 400K records. After loading the records, I found that doing GetRecords searches against 400K records brought things to a halt (Houston, we have a problem). So off I went on a performance improvement adventure.
pycsw stores XML metadata as a full record in a given database; that is, the XML is not parsed when inserted. Queries are then done using XPath queries using lxml and called as embedded SQL functions (for SQLite, these are realized using connection.create_function(); for PostgreSQL, we declare the same functions via plpythonu. SQLAlchemy is used as the DB abstraction layer.
Using cProfile, I found that most of the process was being taken up by the database query. I started thinking that the Python functions being called from the database got expensive as volume scaled (init’ing an XML parser to evaluate and match on each and every row).
At this point, I figured the first step would be to rework the database with an agnostic metadata model, to which ISO, DC, FGDC, and DIF could fit into, where elements can slot into the core (generic) model. Each profile then maps the queryables to (instead of an XPath) a database column in the codebase.
At this point, I loaded 16000 Dublin Core documents as a first test. Results:
- GetCapabilities and GetDomain were instant, and I mean instant (these use the underlying database as well)
- GetRecords: I tried with and without filters. Performance is improved (5 seconds to return 15700 records matching a query [title = '%Lor%'], presenting 5 records)
This is a big improvement, but still I thought this would have been faster. I profiled the code again. The cost of the SQL fetch was reduced.
I then ran tests without using sqlalchemy in the codebase (i.e. SQL scripting as opposed to the SQLAlchemy way). I used the Python sqlite3 module, and that’s it. Queries got faster.
Still, this was only 16000 records. As well, I started thinking/worrying about taking away sqlalchemy; it does give us great abstraction into different underlying databases, and helps us greatly with transactional (insert/update/delete).
Then I started thinking more about bottlenecks and the fetch of data. How can we have fast queries and keep sqlalchemy for ease of interacting with the underlying repo??
Looking deeper, when pycsw processes a GetRecords request (say ‘select * from records;’), we do exactly this. So say the DB has 100K records, sqlalchemy gets ALL 100K records. When I bring them back from server/repository.py to server/server.py, that’s an sqlalchemy object with 100K members we’re working with. Then, in that code, I page through the results using maxrecords and startposition as requested by the client / set by the server processing.
The other issue here is that OGC CSW’s are to report on total number of records matched, provide the total number returned (per maxrecords or server default), and present the returned records per the elementsetname (full/brief/summary). So applying a paging approach without getting the number of records matched was not an option.
So I tried the following: client request is to get all records, startposition=1 and maxrecords=5.
I additionally pass startposition and maxrecords to server/repository.py:query()
In repository.query(), I then do two queries:
- one query which ONLY gets the COUNT of records which satisfy the query (i.e. ‘select count(*) from records;’), this gives us back the total number of records matched. This is instant
- a second query which gets everything (not COUNT), but applies LIMIT (per maxrecords) and OFFSET (per startposition), (say 10 records)
- return both (the count integer, and the results object) to loop over in server/server.py:getrecords()
So the slicing is now done in the SQL which is more powerful. So on 100K records, this approach only pushes back the results per LIMIT and OFFSET (10 records).
Results come back in less than 1 second. Of course, as you increase maxrecords, this is more work for the server to return the records. But still good performance; even when maxrecords=5000, the response is 3 seconds.
So the moral of the story is that smart paging saves us here.
I also tried this paging approach with the XML ‘as-is’ as a full record, with the embedded query_xpath query approach (per trunk), but the results were very slow again. So the embedded xpath queries were hurting us there too.
At this point, the way forward was clearer:
- keep using sqlalchemy for flexibility; yes, if we remove sqlalchemy it will improve performance, but I think the flexibility it gives us, as well as we still get good performance, makes sense for us to keep it at this point
- update data model to deconstruct the XML and put into columns
- use paging techniques to query and present results
- XML databases: looking for a non-Java solution, I found Berkeley DB XML to be interesting. I haven’t done enough pycsw integration yet to assess the pros/cons. Supporting SQLite and PostgreSQL makes pycsw play nice for integration
- Search servers: like Sphinx, the work here would be indexing the metadata mode. Again, the flexibility of using an RDBMS and SQLAlchemy was still attractive
Perhaps the above approaches could be supported as additional db stores. Currently, pycsw code has some ties to what the underlying data model looks like. We could add layer of abstraction between the DB model and the records object model.
I think I’ve exhausted the approaches here for now. These changes are committed to svn trunk. None of these changes will impact end user configuration, just a bit more code behind the scenes.
Written from Starbucks Laval, QC: