
Linked Open Data: We Need Cool Tools


A disturbing trend emerged at both LODLAM 2017 in Venice, Italy and Digital Humanities 2017 in Montreal, Canada concerning Linked Open Data and the semantic web in general. Both conferences were chock-full of projects that were either creating data or thinking of publishing their data as Linked Open Data, but very few projects dealt with the consumption of that data. When the topic of consuming LOD is discussed, it is in the context of faceted search or schema.org-style discovery. This is problematic because we are not leveraging the linkages in the data and the work done within the project ontologies. And... who really wants to look at triples? The Linked Open Data tools available so far (Pelagios Commons, the Trench Map Converter) tend to be tightly integrated with the data of the organization that created them; if we are to move forward with this technology, we need tools that apply to multiple datasets.

[Photo: Scrambling for sessions at LODLAM 2017]

The LODLAM 2017 session on cool tools to consume LOD was well attended, with over 30 people crowding around the tables in the Salone Degli Arazzi. Oddly, most of the tools discussed were still of the backend or engineering variety. With production getting so much attention, the lack of thought about consumption is concerning: what do we expect end-users and scholars to do with this data? When asked what tools they would like to see, the session members still talked about workflow toolchains and backend-facing processes. This isn't unexpected, as LAM practitioners worry about their day-to-day responsibilities first and foremost. Enrichment and the creation of linkages were similarly popular topics, as people wanted to cross-link their datasets on a larger scale than is possible with manual methods. For all the work it entails, this is primarily a straightforward engineering problem.

If we expect end-users to make use of the data, then tools must be available for them to do so.

A similar pattern emerged in the 1A workshop "From Production to Consumption (Tools)" at DH2017 late this summer. The discussions about the entire process revolved primarily around production and distribution tools rather than consumption. Participants were repeatedly asked what specific research problem they were trying to solve when using Linked Open Data; few answers were forthcoming. The notion of dataset exploration from an ontological / vocabulary perspective was discussed, but this is really a backend view of the problem that a data "wrangling" specialist would worry about.

[Photo: Susan Brown introducing the workshop at DH 2017]

More worrisome was the enthusiasm for training scholars to write SPARQL queries to access data. It initially appears to be a reasonable thing to do: there is value in being able to write ad-hoc queries about the number of Oscar-winning female actors born in 1965 or the proportion of university-educated parliamentarians. In the end, it may be a frustrating waste of effort, not because scholars aren't capable but because they should not have to. Few forensic accountants write their own SQL statements against their own accounting systems. Why should we expect scholars to do so using a complex graph query language that was primarily meant for machine-to-machine data interchange?
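To make the cost concrete, here is a rough sketch of what answering the first of those questions might look like in Python, using the SPARQLWrapper library against the public Wikidata query service. The award identifier is an assumption for the sake of illustration; the point is how much knowledge of the graph model and its vocabulary the "simple" question already demands.

```python
# A minimal sketch, assuming Wikidata's data model; the award Q-identifier
# below is an assumption for illustration, not a vetted value.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="lod-example/0.1")
sparql.setQuery("""
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT (COUNT(DISTINCT ?person) AS ?n) WHERE {
  ?person wdt:P31  wd:Q5 ;         # instance of: human
          wdt:P21  wd:Q6581072 ;   # sex or gender: female
          wdt:P166 wd:Q103618 ;    # award received: Academy Award for Best Actress (assumed Q-id)
          wdt:P569 ?dob .          # date of birth
  FILTER(YEAR(?dob) = 1965)
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results["results"]["bindings"][0]["n"]["value"])
```

Every clause in that query encodes a modelling decision (what counts as an "Oscar winner", which award, which property holds the birth date) that the scholar would otherwise have to discover and debug on their own.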

The topic of a natural language interface for building SPARQL queries was similarly touched on. Historically, these tools have not done very well. An early (earliest?) example was Hermes in 1998 [1] for SQL databases; more recent examples are Siri and Alexa. These tools are backed by huge amounts of development time dedicated to handling exceptions, everyday user requests and the odd embedded Monty Python joke. Your mileage in solving research problems with these tools may vary, and the underlying reason is the same as with any other layer of abstraction: it's hard to keep out of expert users' way while simultaneously helping novice users get started. Add to this the complexity of selecting "correct" answers to non-trivial questions ("Why did the Roman Empire fall?") and the tool breeds distrust among domain specialists who are unaware of its inner workings. Paradoxically, the communities that do understand how these tools work find them cumbersome because they don't need the mediation in the first place!
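To illustrate where that development time goes, consider a deliberately naive template-based question mapper (a toy sketch, not how Hermes, Siri or Alexa actually work, and using made-up vocabulary terms): every supported question shape needs its own hand-written rule, and anything outside those shapes simply fails.

```python
import re

# A toy natural-language-to-SPARQL mapper: one regular expression per
# supported question shape. Every new phrasing, synonym or follow-up
# question needs another hand-written rule, and the vocabulary terms
# (:birthYear, :title, :author) are placeholders, not a real ontology.
TEMPLATES = [
    (re.compile(r"how many (?P<thing>[\w ]+) were born in (?P<year>\d{4})\??", re.I),
     "SELECT (COUNT(?s) AS ?n) WHERE {{ ?s a :{thing} ; :birthYear {year} . }}"),
    (re.compile(r"who wrote (?P<title>[\w ]+)\??", re.I),
     "SELECT ?author WHERE {{ ?work :title \"{title}\" ; :author ?author . }}"),
]

def to_sparql(question: str) -> str:
    for pattern, template in TEMPLATES:
        match = pattern.fullmatch(question.strip())
        if match:
            return template.format(**match.groupdict())
    # Open-ended questions ("Why did the Roman Empire fall?") have no template.
    raise ValueError(f"No template matches: {question!r}")

print(to_sparql("How many astronauts were born in 1965?"))
```

Expert users end up fighting the templates while novices quickly hit questions the templates cannot express, which is exactly the abstraction trade-off described above.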

John Simpson brought forward the notion that a Linked Open Data tool should be like a car with a steering wheel: whether it be a family car, a farm tractor or a heavy quarry truck, the interface remains the same. The analogy is nice, but materializing it into an actionable design strategy isn't straightforward. We are now approaching a period where reviewing (or close reading) the raw data is no longer possible for a human being; software agents are needed as data mediators. It is unclear how this will develop in the end, but web browsers are the primitive materialization of these tools. So... where does the steering wheel go? The only way to find out is to experiment.

The question that needs to be asked is:

If you had all of the data you needed as LOD, what is the research question that you would want to answer?

So far, there is an uncomfortable silence after this question. To move forward, we need applications that are generic enough to be applied across multiple datasets and that can leverage the richness of the underlying data. These will be the LOD Killer Apps.

 


References

  1. Rivera, C. B. and Cercone, N., Hermes: Natural Language Access to a Medical Database, Department of Computer Science, University of Regina, Regina, 1998.