Authority, Statistics and RDF

The recent discussion on the LOD-LAM website has prompted me to begin writing about some of the data available on Muninn and how it is generated. Creating a database from the contents of war archives nearly a century old presents some special challenges, some old, some new. Jonathan Rochkind's flippant remark 'Sorry Linked Open Data people' was made in jest and drew a lot of responses. But in the end you can't say that text-mining is better than linked open data any more than apples are better than submarines, as they don't perform the same function. Similarly, the problem of interpreting archival contents is central to Muninn's role, as is the question of what is an authoritative source (at least in the knowledge sense). There is no question that "applying statistical analysis text-mining ‘best guess’ type techniques, provides more relationships than dbpedia alone does".

Generating relationships is easy, since anything can be a relationship; the hard part is figuring out which classes of relationship actually mean something, even before the accuracy of their instances is judged. DBpedia, on the other hand, creates its relationships from very specific markup within Wikipedia. Both the creation of the Wikipedia RDF and the Wikipedia infoboxes are human-driven, and the number of classes is limited by the manpower available. There is still a possibility of errors: someone might have written in Bern as the dbpedia:capital of (dbpedia:populated_place) Germany, but there is no question that this is the class that was intended to hold that information.

Muninn uses linked open data to publish its processed, cleaned-up results. Behind the curtain, many domain-specific statistical and logical models (including statistical mining) look for particular patterns. However, as opposed to free-running relationship mining, these are constrained to the specific linked open data tags that they actually fill. For example, Muninn automatically fills in foaf:knows tags if it calculates that two people know each other within a .95 confidence using a statistical 'best guess'. But the rel:friendOf tag requires specific evidence before it is instantiated, such as a record of a communication or an entry in a diary.

Most statistical data-mining algorithms won't understand the relationship between the Dominion of Canada (oddly, it currently redirects to Canada instead of Canada under Imperial control) as a location and the Dominion of Newfoundland as a political entity. They will happily ignore the untidy 'Dominion', 'location' and 'political' tokens and simply file one as an owl:partOf the other, since that is the strongest signal on the graph. It's a simple, easy-to-understand, wrong answer that causes all sorts of confusion down the line because it 'looks right' but isn't.

Should we consider Wikipedia an 'authority'?
It is an outstanding resource for general background data and as a statistical resource for designing things like tag clouds, auto-tagging HTML text and creating specialized dictionaries. It does well by the law of averages, since the crowd eventually fixes things. This also means that at any given time something will be broken at the page level, and that is a problem for an authority. Do you really want your project to catalogue Japan as a rogue legislative element of the European Union because someone messed up a page edit last night? Or watch your reasoner spin madly because there is an edit war on whether Henry Kissinger is_a war criminal? The United States is part of the British Empire, right? This is like some of the problems that OpenStreetMap is having with bits of countries getting flooded after bad edits.

Some data lends itself to statistical or social consensus-making for a variety of reasons, such as ease of observation. It's easy to fix a bad restaurant location, and it only causes limited nuisance for downstream users. Fixing a bad coastline is a bit harder, since few people are willing to go out with a transit to survey it. When possible, Muninn links to the appropriate DBpedia triple using rdfs:seeAlso for background documents. It's not clear that it is a good idea to link Muninn triples with the DBpedia ones using owl:sameAs at this point, since we don't know how authority and stability of triples work.

Creating a database from the documents requires some interpretation of the information beyond normal indexing or finding aids. Interpretation can range from codifying referencing standards to hard-core detective work. Right now, Muninn focuses on extracting basic facts and trying to link them across documents and entities, cleaning up ranks and names to a point where there is enough of a basic database to add more information. Eventually, there will be enough room for very high-level analysis of the data, and some interpretation will be done automatically.
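
The publishing rules described above (a statistical threshold for foaf:knows, documentary evidence for rel:friendOf, and a seeAlso link to DBpedia rather than owl:sameAs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Muninn's actual code: the `Candidate` class, the `publish` and `background_link` functions and the example.org URIs are hypothetical, "within a .95 confidence" is read here as "confidence of at least 0.95", the `rel:` prefix is assumed to be the RELATIONSHIP vocabulary at purl.org, and the seeAlso property is taken to be the RDF Schema one (rdfs:seeAlso).

```python
from dataclasses import dataclass, field

# Property URIs. REL_FRIEND_OF assumes that "rel:" is the RELATIONSHIP vocabulary.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
REL_FRIEND_OF = "http://purl.org/vocab/relationship/friendOf"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

# Assumption: ".95 confidence" means a score of at least 0.95.
KNOWS_THRESHOLD = 0.95


@dataclass
class Candidate:
    """A possible link between two people (hypothetical structure)."""
    subject: str
    other: str
    confidence: float  # output of some statistical 'best guess' model
    evidence: list = field(default_factory=list)  # e.g. a letter or diary entry


def triple(s, p, o):
    """Format one N-Triples statement whose object is a URI."""
    return f"<{s}> <{p}> <{o}> ."


def publish(candidate):
    """Return only the triples that the rules allow for this candidate."""
    triples = []
    # foaf:knows may be filled from statistics alone, above the threshold.
    if candidate.confidence >= KNOWS_THRESHOLD:
        triples.append(triple(candidate.subject, FOAF_KNOWS, candidate.other))
    # rel:friendOf is never inferred statistically; it needs specific evidence.
    if candidate.evidence:
        triples.append(triple(candidate.subject, REL_FRIEND_OF, candidate.other))
    return triples


def background_link(muninn_uri, dbpedia_uri):
    """Point at DBpedia as background reading without asserting identity."""
    return triple(muninn_uri, RDFS_SEE_ALSO, dbpedia_uri)
```

A candidate scored at 0.97 with no documents behind it would yield a single foaf:knows triple, one scored at 0.60 would yield nothing, and a diary entry would add rel:friendOf regardless of the score. The point of the design is that the statistics only ever fill a tag whose meaning was decided beforehand.
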
The nice thing about the Great War is the large number of forms in use, even if not all of them are typed, since this pre-classifies some of the information for us. Yet there are many cases where the forms are not responsive to the user's needs, so the clerk crosses out fields and replaces them with his own. It will be interesting in the long term to track when and why people do so. Text analysis is a little more demanding, in that it gives good results when customized to specific document or text types, but that requires some hand-holding and hand-coding by a human who is already doing too many things at once.

There is an experimental text search interface that searches the human-readable parts of the linked open data as well as other texts in the collection, along with linguistic mood and style. The interface and presentation are still being worked on, and comments are welcome.

An interesting occurrence is when faulty data, such as an impossible date of birth, is used within a document and is propagated to the organization's other documents. The date is obviously wrong, but is still useful as a means of referencing other documents within the archive. It is impossible to trick most databases into storing this information since it fails basic consistency checks, though we are able to force RDF into supporting it through creative use of markup. Thus we need to be able to provide information that is clearly wrong from a temporal aspect because it is still useful for linkage purposes.
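
The post does not say which markup is used to hold impossible dates, so the following is only one possible approach, sketched under assumptions. A date that passes a calendar check is emitted as a typed xsd:date literal. One that fails is kept as a plain string literal, so the value as recorded still survives and can be matched across documents. The `recordedBirthDate` predicate and the example.org URIs are hypothetical and are not part of any Muninn vocabulary.

```python
from datetime import date

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"
# Hypothetical predicate, used for illustration only.
RECORDED_BIRTH = "http://example.org/vocab/recordedBirthDate"


def is_valid_date(year, month, day):
    """True if the calendar accepts this date (e.g. 30 February is rejected)."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def birth_date_triple(person_uri, year, month, day):
    """Emit the date as recorded: typed when valid, a plain string when not."""
    lexical = f"{year:04d}-{month:02d}-{day:02d}"
    if is_valid_date(year, month, day):
        literal = f'"{lexical}"^^<{XSD_DATE}>'
    else:
        # Keep the impossible value verbatim so other documents that repeat
        # the same error can still be linked to this one.
        literal = f'"{lexical}"'
    return f"<{person_uri}> <{RECORDED_BIRTH}> {literal} ."
```

Under this sketch, two records that both give a birth date of 30 February 1890 end up with the same plain literal, which is enough to use the shared error as a linkage key while signalling to consumers that the value is not a usable date.
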