From data model to HTTP

When creating a data model, you already implicitly decide which questions are going to be asked. A data model is in most cases not created by data experts, who cost a lot of money, but by a civil servant who decided to start a new spreadsheet, by an application developer who needed access to certain data, or by someone who simply needed an answer to a couple of questions.

An example? Well, here’s a small spreadsheet, initiated by Antti Poikola, which collects app competitions on top of Open (Government) Data:

The CSV version of this dataset can be found here.

Antti chose a pragmatic approach to the data model: we can see a contest name, the link to the contest, the year(s) in which the contest is held, the country, the city or region, the level, the organization that organizes the event and, finally, the theme of the contest. The questions that can be answered with this dataset are already visible in the data: which events are organized in a certain year? Which events are organized around the theme of transport? And so forth.

It gets a bit more complex when we ask combined questions, such as: which events are organized in 2010 around transport? You can quickly answer these by answering the two separate questions and taking the intersection, by doing two filter actions in your spreadsheet program, or by using a query language such as SQL and doing a SELECT with a WHERE clause.

SELECT * FROM csv_file WHERE year = 2010 AND theme = 'transport';
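
The same intersection can be expressed in code as two boolean filters on the CSV file. Here is a minimal sketch in Python using pandas; the file name contests.csv and the column names year and theme are assumptions for illustration, not the actual headers of Antti’s spreadsheet:

# Minimal sketch: file name and column names are assumed, not taken
# from the actual spreadsheet.
import pandas as pd

contests = pd.read_csv("contests.csv")

# Two filters, intersected: events held in 2010 around transport.
result = contests[(contests["year"] == 2010) & (contests["theme"] == "transport")]
print(result)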

Open Data on the Web

Data as such is nothing more than a collection of facts or statements. These statements can be structured in various ways: hidden in full-text documents, structured in tables, or structured using graphs. Just like language and conversations, data and data exchange suffer from many problems: facts can be misinterpreted, can misrepresent reality, can be incomprehensible to third parties, and so on.

On the Web we don’t talk to only one other person: we publish blog posts, for example, which we want to be read by as large an audience as possible. That’s probably why I chose to write this blog post in English.

Open Data is the name we give to a dataset that is openly licensed: everyone is free to use, reuse and redistribute the data. In my work today I focus on an even smaller subset: Open Data where publishers want to see the data used by as large an audience of people and machines as possible. That’s why I will talk about “Open Data on the Web” and not just about data that had to be opened by some law and is only available if you browse for it very carefully. “Open Data on the Web” is not something I invented: the W3C standardization body has a working group about it.

In order to explain how targeting as large an audience as possible with data works, I need to explain a bit more about how I see data. In datasets, we still use human language to express facts. For instance, in the spreadsheet above, you can recognize a fact as a row in the table: it describes one event with several properties. The table is data, because it is a collection of facts. What do I find the most workable form the data can be in? The form from which the maximum number of atomic facts can be extracted.

Data in its most fundamental form

The most concise way you can express a fact using language is with three words: a subject, a predicate and an object. Creating a list of these triples allows you to represent any dataset in its most fundamental form.

For example, the spreadsheet above could become:

<http://webarchive.nationalarchives.gov.uk/20100402134053/showusabetterway.com/> name "Show us a better way" .
<http://webarchive.nationalarchives.gov.uk/20100402134053/showusabetterway.com/> year 2008 .
<http://webarchive.nationalarchives.gov.uk/20100402134053/showusabetterway.com/> country "UK" .
# And so forth...

This is the most fundamental way to represent this dataset, and it makes studying it a lot easier. For instance, what are the identifiers and words used within this dataset? And how well can it be understood by machines that are crawling the Web? We can see the words and compare facts much more quickly, without having to worry about the serialization of the data (XML, JSON, CSV…).
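
To make the flattening concrete, here is a small Python sketch of how one row of the spreadsheet can be turned into such atomic triples. The row dictionary and its keys are illustrative assumptions:

# Hypothetical example: turn one row of the contests table into triples.
row = {
    "name": "Show us a better way",
    "year": 2008,
    "country": "UK",
}
subject = "http://webarchive.nationalarchives.gov.uk/20100402134053/showusabetterway.com/"

# One atomic fact per cell: subject, predicate (the column), object (the value).
triples = [(subject, predicate, value) for predicate, value in row.items()]

for s, p, o in triples:
    print(s, p, repr(o), ".")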

We have used this triple structure as the starting point to quantify the interoperability of datasets published on the Web, in an article for the journal Computer. The article is going to be published in October 2014 (you can always request a preprint by sending me an e-mail):

[bibtex file=refs.bib key=colpaert_computer_2014]

In the paper we added a section about Linked Open Data as the next logical step. When you want a machine to understand what “year”, “country”, “UK” and “name” mean in your context, the easiest thing to do is to create a lookup service that shows more information about how to interpret the data. So instead of using words, which can be very ambiguous, we use HTTP URIs: third parties can look up more information about a term using a browser, and machines can request more machine-readable information. Furthermore, URIs cannot be ambiguous, as there is one party in charge of maintaining the meaning of every URI.
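
As an illustration of such a lookup, a machine can dereference a URI and ask for a machine-readable representation through HTTP content negotiation. A sketch using Python’s requests library, with the SKOS vocabulary URI as an example; whether a server honours a given Accept header depends on its configuration:

import requests

# Dereference a vocabulary URI and ask for RDF (Turtle) instead of HTML.
# The server may also offer other RDF serializations.
uri = "http://www.w3.org/2004/02/skos/core"
response = requests.get(uri, headers={"Accept": "text/turtle"})
print(response.headers.get("Content-Type"))
print(response.text[:500])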

For instance, we have introduced a number of HTTP URIs to describe opening hours on the Web. You can read the paper, visit the website for more explanation, or cite it as follows:

[bibtex file=refs.bib key=pc_openinghours]

From data model to HTTP

Now comes a very difficult question: do we need to publish all data using Linked Data techniques and this triple structure? Did the creator of this spreadsheet make the wrong decision and should he have gone for a Linked Data architecture instead?

Well, for the time being: certainly not. It is way too expensive and way too difficult to start doing that, while the return is not high enough. What is the goal of this spreadsheet? To create a collaborative list of app competitions, which can answer questions such as when a contest is held and around what theme. The creator succeeded in this goal using a collaborative spreadsheet.

Yet there are datasets that are much more difficult to publish than a spreadsheet. Take, for instance, all the businesses in Belgium, assembled in the Belgian Crossroads Bank for Enterprises. The things people want to use this dataset for are beyond imagination, and thus the government service has chosen to publish data dumps from time to time, which you can download as a zip archive. This makes the dataset accessible over HTTP, but not queryable over the Web (or HTTP): you can only ask the dataset questions once you have downloaded it and stored it in your own local datastore.

Yet Paul Hermans and Tom Geudens have created something really interesting as a hobby project: they republish the data directly over HTTP, mapping the data to a URI structure. Now you can get an overview of every company in Belgium at a URI. For instance, the Open Knowledge Foundation Belgium can be found at http://data.kbodata.be/organisation/0845_419_930#id, and you can also get a JSON representation of the data about this company.
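
A program could, for example, request that JSON representation over HTTP. A sketch; the post says a JSON representation exists, but that it is served through content negotiation on the Accept header is an assumption here:

import requests

# The fragment (#id) is never sent over HTTP, so we request the document URI.
# Assumption: the server returns JSON when we ask for it via Accept.
uri = "http://data.kbodata.be/organisation/0845_419_930"
response = requests.get(uri, headers={"Accept": "application/json"})
print(response.text)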

A benefit of Paul and Tom’s work of mapping the data to these triples is that, using Linked Data Fragments, small questions can be answered. For instance: “What is the preferred label of http://data.kbodata.be/organisation/0845_419_930#id?” The answer to this question is available at http://data.kbodata.be/fragments?subject=http://data.kbodata.be/organisation/0845_419_930%23id&predicate=http://www.w3.org/2004/02/skos/core%23prefLabel.
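
In code, such a question is just an HTTP GET on a fragment URL with the subject and predicate filled in as percent-encoded query parameters. A sketch that reconstructs the URL above:

from urllib.parse import urlencode
import requests

# Build the fragment URL for one triple pattern: ?subject=...&predicate=...
# urlencode takes care of percent-encoding the # as %23.
params = urlencode({
    "subject": "http://data.kbodata.be/organisation/0845_419_930#id",
    "predicate": "http://www.w3.org/2004/02/skos/core#prefLabel",
})
fragment_url = "http://data.kbodata.be/fragments?" + params

response = requests.get(fragment_url)
print(response.text)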

When we want to solve a difficult question on the Web, our programs can divide the difficult question into simple questions, fire them at the right servers on the Web, and assemble the response for you. You can check out a demo of this at client.linkeddatafragments.org or read the paper:
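
The idea can be sketched in a few lines: split the difficult question into simple triple patterns, fetch a fragment per pattern, and combine the answers client-side. This is only a caricature of what an actual Linked Data Fragments client does (real clients parse the RDF responses and plan the joins), and the helper below is hypothetical:

import requests

def fetch_fragment(subject=None, predicate=None, obj=None):
    # Hypothetical helper: ask the server for one simple triple pattern.
    params = {key: value for key, value in
              [("subject", subject), ("predicate", predicate), ("object", obj)]
              if value is not None}
    return requests.get("http://data.kbodata.be/fragments", params=params).text

# A difficult question becomes several simple ones, answered separately
# and assembled by the client rather than by one big server-side query.
part_one = fetch_fragment(predicate="http://www.w3.org/2004/02/skos/core#prefLabel")
# ... fetch more patterns, parse the triples, and intersect the results.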

[bibtex file=refs.bib key=verborgh_iswc_2014]

Make your data discoverable

Finally, whether your data is stored in a PDF file or a spreadsheet, or opened up using Linked Data Fragments, you still need to advertise your data if you want to maximize its reuse. This is typically done using an Open Data Portal, of which I have outlined the essentials in a paper called “The 5 stars of Open Data Portals”:

[bibtex file=refs.bib key=colpaert20135]

This blog post is part of the deliverables for an Open Data project for the Flemish Government (EWI). If you want help setting up an Open Data policy, convincing your colleagues, or organizing an event, contact me at our not-for-profit organization, Open Knowledge Belgium. If you want technical help publishing your data on the Web, contact me at Ghent University; we are always eager to have a chat.