Why you need a SPARQL server

The original title of this blog post was supposed to be "Hardening SPARQL Servers in the wild". But I've since changed it to "Why you need a SPARQL server" after reading a number of articles critical of SPARQL while at the same time juggling RDF/OWL sources without SPARQL stores and a multitude of APIs. Machine-readable export formats are gaining traction with data providers as a data-delivery model. However, the lack of support for search and discovery is still hampering data delivery. If you are serious about delivering RDF/OWL data, having a SPARQL server is the best way to make your data available to the broadest possible audience while keeping your bandwidth costs down.

Classic Web Search Engines and Data Dumps

Many data providers still think of data delivery in terms of the classic web search model: external search engines crawl the web site and then answer queries from end-users. Certainly that is one of the underlying philosophies behind schema.org and RDFa: make each webpage easier to parse semantically by search engines so that they can answer queries or aggregate information accurately. This approach makes sense for a general information retrieval engine, but if your content is niche, a consumer-grade search engine is unlikely to support you or your users. The query language is likely to be 'human friendly', which means the results will be tweaked to the search engine's preferences instead of your own specific request. Content people tend to think in terms of "The Document" being an HTML page while data people tend to think of the individual nodes within the RDF as the document. Not all data out there fits the "One Node, One Document, One HTML Page" paradigm. You can make the two ideas coexist; just be aware that some webpages that get indexed will look a bit odd.

Another approach to data delivery is the data dump where you make the entirety of your data downloadable from a link on your site. This works especially well for data that is stable or only periodically updated. Of course, if your users only need one item out of the entire site, they are downloading a lot of data for no other reason than to search for it. If your dataset is extremely popular, your bandwidth utilization will go up. 

Eric Mill writes in an Oct 2, 2013 blog post that Government APIs Aren't A Backup Plan, his point being that during the US government shutdown most of the data APIs it provided were likely to be offline, and thus data dumps were the only way of ensuring appropriate retention of the data. This is obviously a problem if there is only one source for the data, that source is in danger of disappearing and the only copy is on the website.

The problem is that there is a limit to our capacity to manage the Download, Extract, Transform, Load processes that this model entails. If the data is that valuable, you may want to keep a complete copy of the collection, but in practice only a small portion of any given dataset is likely to be of value to you. Perhaps a better way to look at data transfer and retention within the linked open data world is through longer-term web caching. Most retrievals (SPARQL queries too, actually) in the linked open data world go through an HTTP connection. Through the use of (smarter) SPARQL servers, or inline web caching proxies, the data is retained for as long as it is needed or until it expires. You can even calculate the Expires and Last-Modified headers based on the information contained within the dcterms:accrualPeriodicity and dcterms:modified properties of the VoID dataset description. Since only the URIs or query results that are used or repeated get kept, the overhead on the infrastructure is minimal. There is precedent for this in that most of the building blocks of the HTML, RDF and XML schemas are themselves stored on the W3C servers: it is a single point of failure, but the data is so prevalent on the web that it becomes a non-issue.
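
As a rough illustration of the header calculation described above, here is a minimal Python sketch. It assumes that the dcterms:accrualPeriodicity value is one of the Dublin Core frequency terms under http://purl.org/cld/freq/ and that dcterms:modified is an ISO 8601 timestamp. The function name cache_headers, the lifetime chosen for each term and the one-hour fallback are illustrative assumptions, not something VoID prescribes.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

FREQ_NS = "http://purl.org/cld/freq/"

# Assumed cache lifetimes per frequency term (illustrative values only).
LIFETIMES = {
    FREQ_NS + "daily": timedelta(days=1),
    FREQ_NS + "weekly": timedelta(weeks=1),
    FREQ_NS + "monthly": timedelta(days=30),
    FREQ_NS + "annual": timedelta(days=365),
}
DEFAULT_LIFETIME = timedelta(hours=1)  # assumed fallback for unknown terms


def cache_headers(modified_iso, periodicity_uri, now=None):
    """Derive Last-Modified and Expires values from VoID metadata."""
    now = now or datetime.now(timezone.utc)
    modified = datetime.fromisoformat(modified_iso)
    if modified.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    modified = modified.astimezone(timezone.utc)
    lifetime = LIFETIMES.get(periodicity_uri, DEFAULT_LIFETIME)
    # Expire at the next expected update: the first whole period after `now`.
    periods = max(1, (now - modified) // lifetime + 1)
    expires = modified + periods * lifetime
    return {
        "Last-Modified": format_datetime(modified, usegmt=True),
        "Expires": format_datetime(expires, usegmt=True),
    }
```

A caching proxy or the SPARQL server itself could attach the returned values to responses for that dataset, so that clients and intermediate caches know how long a retrieved node or query result is expected to stay valid.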

Dump versus Store remains a challenge: what information should be retrieved in its entirety and what information should simply be queried? Linked Open Data does offer some solutions by providing a data delivery mechanism that allows datasets to be split up for partial retrieval through multiple URLs. But this still leaves the problem of searching the data for the items of interest to the user. Two camps currently exist: the use of custom APIs or standardized SPARQL endpoints.

REST API versus SPARQL

Data APIs are currently popular on websites with queryable datasets as they are a quick (for the programmer) means of exposing the website data for external use. A debate that occasionally flares up even within the linked open data community is the creation of a customized API for searches versus a standard SPARQL endpoint.

Dave Rog makes a number of criticisms in his June 4th, 2013 blog post The Enduring Myth of the SPARQL Endpoint. Both REST APIs and SPARQL endpoints share many of the same problems: after all, an API is simply a wrapper around code and databases for a specific purpose. For example, the Muninn Trench Map API internally makes use of 2 SPARQL endpoints and 3 different APIs with appropriate caching of intermediate results. Very clever use of SPARQL could probably replicate the functionality of the API in pure SPARQL, but given the specificity of the service (converting Great War coordinate systems is a bit of a niche area that very few people care about), a custom API is appropriate.

SPARQL endpoints support a subset of the full SPARQL query language, with a full secondary vocabulary dedicated to documenting the level of support by the endpoint and another one documenting oddities such as the maximum number of rows returned by a query. To the best of my knowledge, the SOAP/UDDI/WSDL stack is the only other querying mechanism that has that amount of machine-readable documentation. This machine-readable documentation is important because it can be read using the same SPARQL queries as any other data, yet it describes what the client can expect from the server. With such a setup, a client can optimize its querying of the server by managing the complexity and style of the queries being sent to the server.
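
To make this concrete, here is a minimal Python sketch of how a client might read such documentation. It assumes the endpoint publishes its capabilities with the W3C SPARQL 1.1 Service Description vocabulary (the sd: namespace) and makes that description queryable like any other data, which not every endpoint does. The names CAPABILITY_QUERY and summarize are hypothetical, and the input to summarize is assumed to be a result set in the standard SPARQL JSON results format.

```python
SD = "http://www.w3.org/ns/sparql-service-description#"

# Query the endpoint would be asked to run against its own description.
CAPABILITY_QUERY = f"""
PREFIX sd: <{SD}>
SELECT ?language ?feature ?format WHERE {{
  ?service a sd:Service .
  OPTIONAL {{ ?service sd:supportedLanguage ?language }}
  OPTIONAL {{ ?service sd:feature ?feature }}
  OPTIONAL {{ ?service sd:resultFormat ?format }}
}}
"""


def summarize(results):
    """Collect the distinct values of each variable from SPARQL JSON results."""
    summary = {"language": set(), "feature": set(), "format": set()}
    for row in results["results"]["bindings"]:
        for var in summary:
            if var in row:  # OPTIONAL variables may be unbound
                summary[var].add(row[var]["value"])
    return summary


def supports_sparql11(summary):
    """True if the endpoint advertises SPARQL 1.1 Query support."""
    return SD + "SPARQL11Query" in summary["language"]
```

A client could run this once per endpoint and then decide, for example, whether to send SPARQL 1.1 constructs or fall back to simpler queries.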

As Dave Rog points out, there exists a great opportunity to use a SPARQL server on the client side as well as on the server side. For example, the SPARQL LOAD command is useful for this purpose since nodes relevant to the query can be pre-fetched or in some SPARQL implementations fetched on an as-needed basis when doing joins.
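
A minimal sketch of that pre-fetching idea, in Python: it builds a SPARQL 1.1 Update request that LOADs a list of documents into a local store. The function name prefetch_update and the optional target graph are assumptions for illustration; how the request is submitted depends on the store being used.

```python
def prefetch_update(source_uris, graph_uri=None):
    """Build a SPARQL 1.1 Update request that LOADs each source document."""
    operations = []
    for uri in source_uris:
        # Refuse characters that would break out of the <...> IRI reference.
        if any(ch in uri for ch in '<>" \n\t'):
            raise ValueError(f"not a safe IRI: {uri!r}")
        operation = f"LOAD <{uri}>"
        if graph_uri is not None:
            operation += f" INTO GRAPH <{graph_uri}>"
        operations.append(operation)
    # Several update operations may be sent in one request, separated by ';'.
    return " ;\n".join(operations)
```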

Response times to queries over the Internet can be a problem. A dedicated API designer can control both the query being created and the design of the underlying database to obtain a statistical guarantee of returning a result within a time frame. This is directly related to the specificity of the API approach: it does one thing and does it well. A SPARQL endpoint is a full-blown database engine that is answering ad-hoc queries from different clients, with result sets that may or may not be cacheable across clients. Setting a time limit on query runtime will encourage clients to keep their queries manageable, but it does require smarter clients that send reasonable queries.

Interestingly, response time is not necessarily dependent on bandwidth - it's the latency that grinds things down because an API model functions with individual query response pairs. In certain situations a SPARQL endpoint can improve timing by aggregating a number of requests in a single query where an API would need to process each case serially.
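
As a sketch of that aggregation, the Python function below folds many single-item lookups into one query using the SPARQL 1.1 VALUES clause, so the client pays the network latency once rather than once per item. The function name batched_label_query and the choice of rdfs:label as the property being looked up are assumptions for illustration, and the endpoint is assumed to support SPARQL 1.1.

```python
import re

# Characters that may not appear inside a SPARQL IRI reference (<...>).
_FORBIDDEN = re.compile(r'[<>"{}|^`\\\s]')


def batched_label_query(uris):
    """Build one SELECT that looks up labels for many URIs at once."""
    for uri in uris:
        if _FORBIDDEN.search(uri):
            raise ValueError(f"not a safe IRI: {uri!r}")
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?item ?label WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item rdfs:label ?label .\n"
        "}"
    )
```

With an API offering one lookup per request, fifty items mean fifty round trips; here they are a single request whose result set has one row per matching item and label.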

SPARQL endpoints scale

When people start talking about "scale" they usually mean a) the hard drive on their desktop is bigger than yours or b) they have more computers than you do. The problem here isn't that you have billions of bytes to search, it is that accessing them requires dozens, if not hundreds, of different interfaces. Mia Ridge has a large list of Museum APIs to browse through that is impressive in its breadth.

What happens when you are looking for something across several hundred APIs? Where can I find a museum with an 18th-century brooch of the type that would have been traded in North America? Even if the APIs have primitive support for a query parameter q for a keyword search, aggregating the data is nearly impossible. A standardized query language is the only means of ensuring that the client is getting what it wanted in the format that it wanted.

APIs are useful for one-off, specific query problems. This has to do with their need to implement their own basic query language to communicate requirements using URL parameters and values. This works well for small sets of key-value parameters with names such as 'keyword'. As the number of parameters and the complexity of the query increase, tracking the parameters and their expected values requires some serious documentation.

For reference, the Geonames API has about 19 endpoints (URLs) with about 3 parameters each and the Trove API has 3 different endpoints with 4-5 parameters each. This is still manageable and, with the good documentation provided, the required data can be retrieved quickly. This does not hold if what we are aspiring to is the full promise of the semantic web: there will always be one more API to write code for.

It's the Internet!

As with everything in the WWW stack, there are no guarantees, and given the complexity and the number of institutions involved in getting a browser to load a webpage from a server across the world, it is a wonder that anything works. SPARQL endpoints can time out on queries just like APIs do; this is the price that we pay for running flexible queries with no program completion guarantee. As with APIs, SPARQL queries can also retrieve entirely too much data and completely drown the client or create a Denial of Service attack on the server. These problems are not new and SPARQL is just another application that needs to be protected.

What we do have with SPARQL, Linked Open Data and HTTP headers are opportunities to negotiate with the client so that an informed decision can be made. There exists a Pareto-like tradeoff between bandwidth utilization, query complexity and processing power which all data providers need to think about when designing systems. The advantage of SPARQL and Linked Open Data is that there exist at present standardized, machine-readable interfaces that can negotiate content format (XML, Turtle, JSON, etc.) and endpoint result set parameters, as well as negotiate workload between server and client.
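
The format negotiation part can be sketched in a few lines of Python. The example builds (but does not send) a SPARQL protocol GET request whose Accept header asks for a particular result format. The endpoint URL, the function name build_request and the short format keys are assumptions for illustration; the media types are the registered ones for SPARQL results and Turtle.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Short keys are hypothetical; the media types are the registered ones.
FORMATS = {
    "json": "application/sparql-results+json",  # SELECT / ASK results
    "xml": "application/sparql-results+xml",    # SELECT / ASK results
    "ttl": "text/turtle",                       # CONSTRUCT / DESCRIBE graphs
}


def build_request(endpoint, query, fmt="json"):
    """Prepare a GET request that asks the endpoint for a given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    # Assumes `endpoint` has no query string of its own.
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": FORMATS[fmt]})
```

Passing the returned object to urllib.request.urlopen would send it; the server then either answers in the requested format or signals that it cannot.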

Hardening a SPARQL Server

When you are running a SPARQL endpoint you are really letting someone else's limited-instruction-set program run on your computer. That is a scary proposition for a lot of conservative administrators out there. The irony is that we let the JavaScript of hundreds of webpages run in our web browsers every day without thinking about it too much.

Here are a few tips learned from running the Muninn SPARQL endpoint:

  1. Set a reasonable query runtime limit -  Likely you will want to let queries run for a few seconds to enable a client to do some useful work. Muninn will run queries for about 20 seconds before stopping. Virtuoso Triple Store will try to estimate the runtime and preemptively refuse to run the query if the runtime is over the limit. Consider using the triple store vocabulary to make this limit machine readable by the client so that the query can be modified by the client if needed.
  2. Set a maximum number of triples to be retrieved per query - A SPARQL endpoint isn't meant to be a data dump facility and too much buffer space could be taken up in memory before the result set is streamed to the network. Documenting the maximum result set with the triple store vocabulary is a good way of informing the client of the maximum number of triples that can be retrieved and it should also avoid unfortunate {?s ?p ?o} queries by beginners. Clients that really need a large result set will send a series of LIMIT / OFFSET queries and you can expect many clients to use this technique.
  3. Monitor bandwidth and connections - Using firewall rules, web server modules or SPARQL server configuration options, implement connection and bandwidth limits. This prevents one runaway client from overwhelming your server with multiple concurrent requests. A favorite among poorly written clients is to do a series of LIMIT / OFFSET queries while assuming that the maximum number of returned triples is 50. The resulting series of queries is able to send your server into cardiac arrhythmia as the same result set is recomputed and thrown away over and over again. Limiting the number of concurrent connections to 3 or 4 will encourage clients to change their evil ways. Returning an HTTP status of 429 - Too Many Requests or 509 - Bandwidth Limit Exceeded will signal that they are the problem and not you. You may want to set a Retry-After header that will promote exponential backoff by the client - the objective isn't to punish clients but to promote the sharing of resources by communicating server load.
  4. Set a limit on the system load - Since your data is very valuable and everyone wants it, you can expect your endpoint to be very busy. After a certain machine load limit is reached, have the SPARQL endpoint return HTTP 503 Service Unavailable. It will signal that it's you and not them. You can encourage a retry at a later date with a Retry-After header. This will signal the client to back off from additional queries for a few moments, long enough for the server to catch up with its workload. Of course, some clients will be disappointed but your machine will still be usable and the clients that get through will still get service. An interesting proposal by Bryce Nesbitt in 2011 suggested allowing Retry-After with successful HTTP 20x responses, which would allow servers to suggest that bots wait for the next "quiet period". That may gain some traction with linked open data as we can try to move automated SPARQL queries to times of the day when the server is underutilized.
  5. Consider a reverse proxy - Large internet sites sometimes do this to cache dynamic content. As mentioned in the previous section, there is plenty of information within the linked open data to auto-configure the proxy with the appropriate Expires and Last-Modified parameters. In high-volume situations, a small cache that lasts for a few minutes can ensure that often-requested queries are cached transparently for most SPARQL clients.
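
Tips 3 and 4 boil down to a small admission decision made before a query is run. The Python sketch below shows one way this could look; it is not how Muninn or any particular triple store implements it. The function name admit, the load threshold and the Retry-After delay are assumed values, and the connection limit of 4 follows the "3 or 4" suggested above.

```python
def admit(client_connections, load_average,
          max_connections=4, max_load=8.0, retry_after=30):
    """Decide whether to run a query; returns (HTTP status, extra headers)."""
    headers = {"Retry-After": str(retry_after)}  # seconds the client should wait
    if load_average >= max_load:
        # The server is the bottleneck: "it's you and not them".
        return 503, headers
    if client_connections >= max_connections:
        # This client already holds too many concurrent connections.
        return 429, headers
    return 200, {}
```

On a Unix-like system the load figure could come from os.getloadavg(), and the per-client connection count from whatever the web server or firewall already tracks. Checking load first means an overloaded server answers 503 to everyone, so well-behaved clients are not told that they are the problem.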