General principles

Content Registry is a search engine for structured data. It has a list of URLs that it downloads at regular intervals and then builds indexes on.

We usually call CR an object-oriented search engine. The characterisation is not entirely fair, as CR is more than that, but we use the term to explain that, just as Google is not seen as owning data even though it has a copy of everything published on the Internet, CR respects the principle that data is stored where it is maintained. CR has a copy, which is refreshed at regular intervals by reharvesting the source.

The main difference between Google and CR is that while Google deals with semi-structured data in the form of webpages, CR works on structured data. And just as Google relies on an agreed file format, HTML, CR relies on one called RDF.

Working on structured data opens up entirely new possibilities. You’ll be able to join information from different sources as if it were all already in one large database.

Since 2004 Reportnet has given XML preferential treatment. This is no accident: XML is self-describing to a degree, which made it possible to create a quality assessment system that has had no small impact on the quality of the deliveries from national level to EEA. XML has a lot of potential.

The next question was how to bring the benefits of XML into the later stages of the production pipeline: the aggregation of data to European level. We investigated using an XML database, but this didn’t work very well. What finally worked was semantic web technology. We decided to turn CR into a Semantic Web search engine.

The Semantic Web derives from Tim Berners-Lee’s vision of the Web as a universal medium for data, information, and knowledge exchange.

Harvesting

CR finds datasets when someone tells it about a site to harvest. We have a simple mechanism called a manifest file that CR fetches to find out what the remote site contains. It will later be extended to also use the semantic web sitemap protocol prototyped by the National University of Ireland.

The files are then added to CR’s list of URLs to harvest, and CR will fetch them. If a file is in RDF format, CR loads it directly. If it is in XML, CR checks whether it has a conversion for the file’s XML schema identifier and, if it does, transforms the XML to RDF and then imports it. CR will periodically check whether the file has changed. Most of the XML deliveries stored on CDR will have conversions.
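The decision flow just described can be sketched as follows. This is only an illustration of the control flow taken from the text; the conversion registry and the return values are hypothetical stand-ins for CR internals.

```python
# Sketch of CR's harvesting decision: load RDF directly, convert known
# XML schemas, skip everything else. The conversions mapping (schema
# identifier -> transformation) is a hypothetical stand-in.

def harvest(url, content_type, schema_id, conversions):
    """Decide how to import a fetched file into the triple store."""
    if content_type == "application/rdf+xml":
        return "load RDF directly"
    if content_type == "text/xml":
        if schema_id in conversions:
            return "transform XML to RDF, then import"
        return "skip: no conversion registered for this schema"
    return "skip: unsupported format"

schema = "http://air-climate.eionet.europa.eu/o3excess/stations.xsd"
print(harvest("http://cdr.example/at/stations.xml", "text/xml",
              schema, {schema: None}))
# -> transform XML to RDF, then import
```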

Illustration 2: Harvesting

We expect that in the future, publishers of structured data will develop extraction scripts that create RDF directly from a database. This is the mechanism we intend to use for EUNIS and several other EEA systems.

Aiding the discovery

The remote site can improve the findability of the datasets by adding metadata. This is done by providing RDF files that describe other files on the site as identified by their URL.

Read more at wiki:ContentRegistry/MetadataDeclaration

Understanding the dataset

How does harvesting work exactly? Let’s imagine that Austria reports two new Airbase stations, numbers 32301 and 32302. They could store the information in an XML file called stations.xml like this:

<stations xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="http://air-climate.eionet.europa.eu/o3excess/stations.xsd">
  <station>
    <local_code>32301</local_code>
    <name>St. Pölten - Eybnerstraße</name>
    <longitude_decimal>15.63166</longitude_decimal>
    <latitude_decimal>48.21139</latitude_decimal>
    <altitude>270</altitude>
    <station_type_eoi>Industrial</station_type_eoi>
    <area_type_eoi>urban</area_type_eoi>
  </station>
 ...
</stations>

It will be simpler to explain if we show the XML file as a table, as shown below. The second row of the top table corresponds to the XML snippet above. The transformation takes the XML content and stores it in a database table with only four columns, called Subject, Predicate, Object and Source. Every record gets a type; this one is called “Station”.
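A minimal sketch of such a transformation, using Python’s standard XML parser on a shortened version of the stations.xml above. The subject scheme (“#” plus local code), the “type” predicate spelling and the source URL are illustrative assumptions, not CR’s actual vocabulary.

```python
# Turn each <station> element into Subject/Predicate/Object/Source rows.
import xml.etree.ElementTree as ET

xml_data = """<stations>
  <station>
    <local_code>32301</local_code>
    <name>St. Pölten - Eybnerstraße</name>
    <longitude_decimal>15.63166</longitude_decimal>
  </station>
</stations>"""

source = "http://example.org/austria/stations.xml"  # hypothetical source URL

triples = []
for station in ET.fromstring(xml_data).findall("station"):
    subject = "#" + station.findtext("local_code")
    triples.append((subject, "type", "Station", source))  # every record gets a type
    for child in station:  # one triple per XML element
        triples.append((subject, child.tag, child.text, source))

for row in triples:
    print(row)
```

Each XML element becomes one row, so the fixed four-column layout can absorb any element structure without schema changes.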

Figure: Station structure as a table

Figure: Triples

Now imagine what will happen when CR loads files of the same format from several locations. The data is automatically aggregated. You can even load files with conflicting information on the same stations. They are kept separate because the Source values will be different.

The mechanism makes it possible to have other sources add columns (predicates) to a table if the key (subject) is the same.
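Both effects can be shown with a few rows in the four-column layout: two hypothetical sources describe the same subject, one adding a new predicate and one giving a conflicting value. The URLs and the conflicting altitude are invented for illustration.

```python
# Triples about the same station harvested from two hypothetical sources.
# The Source column keeps conflicting values apart, while the shared
# subject lets other sources add new predicates ("columns").
triples = [
    ("#32301", "name",     "St. Pölten - Eybnerstraße", "http://a.example/stations.xml"),
    ("#32301", "altitude", "270",                       "http://a.example/stations.xml"),
    ("#32301", "altitude", "271",                       "http://b.example/stations.xml"),
    ("#32301", "operator", "Umweltbundesamt",           "http://b.example/extra.xml"),
]

def values(subject, predicate, triples):
    """Return every (object, source) pair, so conflicts stay visible."""
    return [(o, src) for s, p, o, src in triples if s == subject and p == predicate]

print(values("#32301", "altitude", triples))
# -> [('270', 'http://a.example/stations.xml'), ('271', 'http://b.example/stations.xml')]
```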

Querying the database

The question is: how do we make use of it? Here we must implement a query language in CR. If we start with something simple, such as a query for all records of type Station, CR will find all rows with predicate=”type” and object=”Station”. For each subject it has found, it will then look up which predicates exist and show a table:

Identifier   Local code   Name                        ...
#32301       32301        St. Pölten - Eybnerstraße   ...
#32302       32302        Europaplatz                 ...
...          ...          ...                         ...

The data could then be exported as MS-Excel, MS-Access or whatever else is convenient for the user. In chapter 6 we’ll describe queries in detail.
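The two-step query above can be sketched over the four-column rows. The triples and the source label are invented sample data matching the table; CR’s real query language is described later.

```python
# Step 1: find subjects typed "Station"; step 2: gather each subject's
# predicates into a row, reproducing the table above.
triples = [
    ("#32301", "type",       "Station",                   "src"),
    ("#32301", "local_code", "32301",                     "src"),
    ("#32301", "name",       "St. Pölten - Eybnerstraße", "src"),
    ("#32302", "type",       "Station",                   "src"),
    ("#32302", "local_code", "32302",                     "src"),
    ("#32302", "name",       "Europaplatz",               "src"),
]

# Step 1: all rows with predicate="type" and object="Station".
subjects = [s for s, p, o, _ in triples if p == "type" and o == "Station"]

# Step 2: for each subject found, look up which predicates exist.
table = {s: {p: o for s2, p, o, _ in triples if s2 == s and p != "type"}
         for s in subjects}

for subject, row in table.items():
    print(subject, row)
```

From here the rows could be handed to any export format the user prefers.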

Building up a federated table structure

The effect of loading everything into CR is that the systems slowly coalesce.

One surprise we encountered when investigating the possibilities was that EUNIS knows which species are listed on the annexes of various legal instruments. It is the sort of information that is otherwise locked away in our systems. By linking the annexes in EUNIS to their respective legal instruments in ROD we can increase the usability. For instance, if you browse ROD and find the Habitats Directive, you’ll see that it has four reporting obligations referring to it. CR will have the full overview and report four obligations and three annexes (II, IV and V). Those annexes come from EUNIS, and if you follow one of them you’ll get to the list of species. If you follow one of the obligations you’ll arrive at CDR and find assessments linking to a country and obligation in ROD as well as a species in EUNIS.

By establishing these links in CR we can now improve the information in EUNIS, because we can create a webpage asking “What else do we know about Bufo bufo?” The webpage queries CR for the assessments and shows them in a friendly format.

This is only possible because we have all the information as a copy in a central database that has the necessary query interface.

GeoNames

An example involving another organisation: the GeoNames database (http://www.geonames.org/) contains 6.5 million features: dams, lakes, administrative units, parks, cities, towns and so on. Every feature has a feature code describing what it is. GeoNames has aligned its feature codes with GEMET, thereby immediately getting translations, generalisations and definitions for them.

Since GeoNames follows Semantic Web principles, we can import the database into CR, and when you look up the term “park” in GEMET, you will be able to see instances of parks on a map.

European Environment Agency (EEA)
Kongens Nytorv 6
1050 Copenhagen K
Denmark
Phone: +45 3336 7100