First data dump from Library and Archives Canada

The first data dump from Library and Archives Canada has been shipped to the Sharcnet data-center and loaded onto the cluster for processing. The data contains scanned images of the enlistment papers of Canadian Expeditionary Force soldiers (about a million images) and the full personnel files of about 200 soldiers (about twenty thousand images). The hard drive was picked up in Ottawa and then traveled with a Muninn staffer to one of the Sharcnet machine rooms in Waterloo, Ontario.

The contents were copied directly to the disk array of one of the computer clusters to be worked on. The first step will be to catalog every image and link it to its subject. Since the contents of an image are not always known in advance, we have to identify the form that was scanned and the information it contains before we can extract that information.

We have been asked why we use hard drives to move the data from a donor institution to Sharcnet instead of just sending it over the Internet. This mostly has to do with the practical considerations of moving and managing large amounts of data among different organizations and systems. Donor institutions do not always have the facilities to transfer large amounts of data over the wire, and they may not be comfortable doing so for IT security reasons. Another reason has to do with backing up what is essentially primary source data in the distributed computing system that the Muninn back-end has become: a hard drive on a shelf is an insurance policy against computing mishaps.

The data is being worked on now and should be visible on the online catalog system shortly. We will announce results and extracted data-sets on the blog.
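
For readers curious what the cataloging step mentioned above might look like, here is a minimal sketch in Python. It is not the code running on the cluster: the function names (`sha256_of`, `build_catalog`), the list of image file extensions and the record layout are all assumptions made for illustration. It walks a directory tree, records each image's relative path, size and SHA-256 checksum, and leaves the subject empty until the scanned form has been identified.

```python
import hashlib
from pathlib import Path

# Assumed image formats; the actual formats in the data dump are not stated.
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_catalog(root):
    """Walk `root` and return one record per image file found (hypothetical layout)."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            records.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
                "subject": None,  # unknown until the scanned form is identified
            })
    return records
```

The checksum also gives a simple way to confirm that a copy on the disk array still matches the hard drive kept on the shelf.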