One of the main goals of the GovData.de prototype is to unite as many open data sets from Germany as possible in a single catalogue. The biggest part is automatically imported by so-called harvesters. In this article we provide an overview of which tools have been used and how useful they have proven.
In earlier articles we showed that a metadata structure based on CKAN is used. We also reported on a workshop with the operators of the catalogues which are to be harvested. In this context, four different import techniques were presented: JSON import, CKAN-CKAN harvesting, CSW-ISO19115-harvesting and CKAN-REST-API. In practice, the first three approaches have proven to be the most useful.
With the JSON import, the operators of the remote catalogues just name an HTTP URL under which we can retrieve a JSON file that is updated on a daily basis and contains all of the data sets. This procedure has been used in Bremen, Bayern and Moers. With a few feedback loops, the providers were able to optimize their individual JSON export tools to the extent that a smooth integration of the metadata is possible. The metadata were initially downloaded with the Unix tool “wget”, adjusted where necessary with a Python script, and finally uploaded to the GovData.de CKAN with the Python library ckanclient. We are currently integrating these steps into our own CKAN harvesting plug-in, which makes regular harvesting easier.
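The middle step of this pipeline, the basic adjustment of a downloaded JSON file, can be sketched roughly as follows. This is only an illustration and not the script used at GovData.de: the function names, the slug rule and the assumption that the export is a JSON list of CKAN-like records with an "extras" dictionary are all hypothetical. Download and upload are deliberately left out.

```python
import json
import re


def slugify(title):
    """Derive a lowercase, URL-safe name from a title (hypothetical rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:100]


def adjust_dataset(raw, portal):
    """Apply basic adjustments to one record of a provider's JSON export."""
    dataset = dict(raw)
    # Make sure every record carries a name usable as an identifier.
    if not dataset.get("name"):
        dataset["name"] = slugify(dataset.get("title", ""))
    # Record the source portal; storing it under "extras" is an assumption.
    extras = dict(dataset.get("extras") or {})
    extras.setdefault("metadata_original_portal", portal)
    dataset["extras"] = extras
    return dataset


def prepare_import(json_text, portal):
    """Parse a complete export file and return the adjusted records."""
    records = json.loads(json_text)
    return [adjust_dataset(record, portal) for record in records]
```

The result of `prepare_import` would then be handed to whatever upload mechanism is in use. Keeping the adjustment separate from download and upload makes it easy to rerun it after a provider changes the export.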
CKAN-CKAN harvesting is used for the data portals of Hamburg, Berlin, Rostock and Rhineland-Palatinate. In theory, the CKAN harvesting extension ckanext-harvest can be used for this task without further development or configuration, as the providers follow the suggested metadata structure. In practice, however, several details have to be taken into account: the adoption of the categories (CKAN: “groups”) only works with minor tricks; the mapping CKAN.author ↔ “publishing authority” is not always applied consistently; the use of CKAN.name and CKAN.id, which are only unique locally, has to be considered thoroughly; and capital letters and special characters in the tags, or keywords, are not transferred properly. In addition, keywords and titles sometimes have to be supplemented, since the Hamburg metadata catalogue, for example, naturally does not tag all of its data sets with the word “Hamburg”. Technically, however, this approach is very elegant, since, among other things, each update transfers only those data sets that have changed in the meantime.
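Two of the details mentioned here, the handling of capital letters and special characters in tags and the supplementing of keywords, can be illustrated with a small sketch. The normalization rule and both function names are assumptions made for this example. They do not reproduce what ckanext-harvest or the GovData.de harvesters actually do.

```python
import re

# Transliteration of German special characters (illustrative choice).
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalize_tag(tag):
    """Lowercase a tag and replace characters outside a-z, 0-9, '_', '.', '-'."""
    tag = tag.strip().lower()
    for char, replacement in UMLAUTS.items():
        tag = tag.replace(char, replacement)
    tag = re.sub(r"[^a-z0-9_.-]+", "-", tag)
    return tag.strip("-")


def supplement_tags(tags, extra):
    """Normalize the harvested tags, append extra ones, and drop repeats."""
    result = []
    for tag in list(tags) + list(extra):
        normalized = normalize_tag(tag)
        if normalized and normalized not in result:
            result.append(normalized)
    return result
```

For a data set harvested from the Hamburg catalogue, `supplement_tags(dataset_tags, ["Hamburg"])` would add the missing regional keyword while keeping the existing tags in their original order.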
Importing geo metadata which are coded according to the ISO 19115 standard is somewhat more complicated (see Working Group metadata site). In my opinion, this is because geo data are distributed and (should be) consumed very differently from the normal approach with open data. In this context, data are called ‘products’, frequently CDs or paper maps, which are gathered and found on the basis of the metadata; usually, however, a contract is then signed and the data are handed over directly by the provider to the contractual partner. Thus, the details ‘online resource’ and ‘licence’, which are of key importance for Open Data, only have very limited relevance, both in the standard and in its use by the data providers. In addition, the very detailed (meta) data model is used with differing profiles from federal state to federal state, which makes it difficult, for instance, to identify the publishing authority in all the data sets of Geoportal.de, which covers the whole of Germany.
For this reason, the import of Geoportal.de and PortalU.de has been put on hold. It has been possible, however, to partially import destatis, the Regional Database and the Open Data offering of the Environment Office of Lower Saxony. Here, the standard was implemented very consistently and the question of the licences has been partially clarified (DL-DE-BY and/or UDL). For the harvesting, we have developed a branch of the CKAN extension ckanext-spatial. This adapts the standard CSW client (Catalog Service for the Web) to destatis and the regional statistics: here, instead of CSW, zipped XML files are distributed via HTTP. At the Environment Office of Lower Saxony, the relevant data sets are found through a CSW enquiry for ‘opendataident’. Hamburg also uses the CSW harvester to transfer metadata from the Hamburg metadata catalogue to a Hamburg CKAN.
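How zipped XML files can be read in place of CSW responses is shown in the following sketch. It is not taken from the ckanext-spatial branch mentioned above. It assumes that each XML file in the archive has a gmd:MD_Metadata root element in the usual ISO 19139 encoding, and it only extracts the file identifier and the title. The function name is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces of the ISO 19139 XML encoding of ISO 19115 metadata.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

TITLE_PATH = (
    "gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/"
    "gmd:CI_Citation/gmd:title/gco:CharacterString"
)


def read_iso_records(zip_bytes):
    """Return id and title for every XML metadata file in a zip archive."""
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".xml"):
                continue  # skip anything that is not an XML record
            root = ET.fromstring(archive.read(member))
            records.append({
                "id": root.findtext(
                    "gmd:fileIdentifier/gco:CharacterString",
                    default="", namespaces=NS),
                "title": root.findtext(
                    TITLE_PATH, default="", namespaces=NS),
            })
    return records
```

The archive bytes would come from a plain HTTP download. The records returned here would still have to be mapped to the catalogue's own metadata structure, including the licence and the publishing authority.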
The first two importers are based only on the ckanext-harvest extension and have therefore been installed directly in the production CKAN of GovData.de. The ISO harvester, however, is based on the quite comprehensive ckanext-spatial extension and therefore runs on a separate machine. In a next step, the data sets are then transferred to the actual data catalogue.
In the further development of these and new harvesters, we think that it is necessary for the following problems to be addressed:
- Differing semantics: What exactly are data, documents, apps? How are services aligned? What is the meaning of time stamps when metadata is harvested from several different catalogues?
- The standardisation of key words: How do we resolve the problem of ambiguous and differing designations for identical meanings (homonyms and synonyms)? How do we summarize similar tags (tag curation)?
- Recognizing duplicates: Until now, duplicate data at GovData.de were the exception for organizational reasons. However, the more catalogues are networked with each other, the more it is necessary to ensure that duplicates are reliably recognized using the fields metadata_original_portal and metadata_original_id.
- Synchronization instead of harvesting: Nowadays it is usually clear from where to where the harvesting is taking place, yet challenges are also foreseeable here: Berlin, for example, is interested in the datasets of destatis that contain the keyword Berlin; the university library centre of the federal state of North Rhine-Westphalia (hbz) has registered its Open Data exports at thedatahub.org, yet they also belong to GovData.de.
That final point is also clearly evident today: those who harvest are also harvested. The GovData.de metadata are found at offenedaten.de. It is possible that there will be harvesting towards the EU and into further specialised catalogues. We hope that we are able to support these processes by exposing our CKAN API and maintaining the metadata structure.
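The duplicate recognition named in the list above could, in its simplest form, look like the following sketch. It groups data sets by the two fields metadata_original_portal and metadata_original_id. The assumption that both fields sit in an "extras" dictionary of each record, and the function name, are made only for this example.

```python
from collections import defaultdict


def find_duplicates(datasets):
    """Group data set names by (original portal, original id) and keep
    only those groups that contain more than one data set."""
    groups = defaultdict(list)
    for dataset in datasets:
        extras = dataset.get("extras") or {}
        portal = extras.get("metadata_original_portal")
        original_id = extras.get("metadata_original_id")
        # Records without both fields cannot be compared and are skipped.
        if portal and original_id:
            groups[(portal, original_id)].append(dataset["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Such a check only works if every catalogue in a harvesting chain passes the two fields on unchanged, so that a data set harvested via several routes still points to the same origin.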
In conclusion, harvesting accounts for a key part of the work at GovData.de and clearly offers corresponding added value. To keep improving in this area, a lot of small-scale work is necessary. The cooperation between the providers and catalogue operators should ideally lead to a subsequent standardization of the metadata structure and the catalogue interfaces.
This work and its content are licensed under a Creative Commons Attribution 3.0 Unported Licence.