Transcribed CEF Medical Files as Linked Open Data on the Canada Open Data Portal

Word cloud from the transcribed contents of the medical case sheets

One of the collections that Library and Archives Canada has been digitizing and making available online is the personnel records of the soldiers of the Canadian Expeditionary Force (CEF) sent to Europe during the Great War. A typical personnel file is a folder containing about 100 pages of documentation about the soldier and sometimes includes his medical records in the form of temperature charts, dental records and medical case sheets. This project focuses on the contents of the "Medical Case Sheet", a lined form used by hospital staff to record information about their patients.

Late last year, a partnership was struck between Library and Archives Canada and the Muninn Project to explore the use of crowd-sourcing to extract the content of digitized historical documents and create Linked Open Data from it. The goals of the project were to foster a relational network of resources on information retrieval research, to make documentary heritage information discoverable and searchable, and to contribute to the pool of Great War resources freely available to the public during the commemoration period of the Great War. The result of this trial project is a partial transcription of the Medical Case Sheets from a sample of the personnel files of the CEF, which Library and Archives Canada has released on the Canada Open Data Portal.

The dataset is released in RDF/XML format with links to Muninn, Library of Congress (LOC) Subject Headings and DBpedia. As an additional bonus, this is the first dataset on the Canada Open Data Portal that ranks five stars (✭✭✭✭✭) on the Tim Berners-Lee data deployment scheme! A VoID description of the data is here.

Making sense of scanned historical documents

One of the problems with imaged historical documents is that searching them in a large collection is difficult when the meta-data identifying the image contents is missing. Creating that meta-data manually is time-consuming: a person has to look at each image and type in meaningful descriptors for the document. Digital image processing is a lower-cost alternative that can help generate this meta-data [1,2] by extracting information from the raw images themselves.

In the case of a CEF soldier's file, the individual pages in the folder are unordered and scanned directly to a PDF file. The breadth of information available within the scanned personnel files of the CEF collection includes everything from handwritten notes to hospital temperature charts. Finding the relevant imaged documents within a file was the first problem: it is not known in advance which pages of the PDF file contain the form of interest. Using an image analysis script, a sample of 1,000 CEF files was analyzed to locate these forms within the service files so that they could be transcribed. Out of the 1,000 sampled soldiers, and the resulting 10,000 images, about 500 Medical Case Sheet forms were found and submitted for transcription.
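The image analysis script itself is not described here, so the following is only a minimal sketch of one way a lined form could be told apart from other pages: count the horizontal rules on a binarized page. The function names, the thresholds and the synthetic pages are all assumptions made for illustration and are not taken from the project.

```python
def count_ruled_lines(page, dark_fraction=0.8):
    """Count horizontal rules on a binarized page (1 = dark pixel, 0 = light).

    A pixel row belongs to a rule when at least `dark_fraction` of its pixels
    are dark. Consecutive dark rows are merged into a single rule, so a thick
    printed line is counted once.
    """
    rules = 0
    in_rule = False
    for row in page:
        is_dark_row = len(row) > 0 and sum(row) >= dark_fraction * len(row)
        if is_dark_row and not in_rule:
            rules += 1
        in_rule = is_dark_row
    return rules


def looks_like_lined_form(page, min_rules=5):
    """Hypothetical decision rule: a page with many rules is a candidate form."""
    return count_ruled_lines(page) >= min_rules


def synthetic_page(height, width, rule_rows):
    """Build a toy page with fully dark rows at the given row indices."""
    return [[1] * width if y in rule_rows else [0] * width for y in range(height)]


# Toy data standing in for scanned pages (not real CEF images).
lined = synthetic_page(40, 20, rule_rows={5, 10, 15, 20, 25, 30})
thick = synthetic_page(40, 20, rule_rows={5, 6})
blank = synthetic_page(40, 20, rule_rows=set())
```

A real scan would need deskewing and binarization first, and handwriting crossing the rules would call for a more tolerant threshold. Candidates flagged this way would still need a human check before transcription.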

An unreadable text from the medical files

The data is quite noisy: at times the handwriting (as shown above) has bled into the paper and is no longer legible. The handwriting was not particularly neat when it was first written, and time has made it harder still to transcribe. In keeping with the archival direction of the project, the text is not edited, nor is the shorthand used within the medical files expanded or translated. As an example, the handwriting might be transcribed from the document as "TXH GSW of Leg rt". What this actually means is "TXH(?) Gun Shot Wound Of Leg Right". Keep in mind that the contents of the files are not a compelling narrative that tells a good story, but a record-keeping tool used by medical professionals trying to get a soldier back to health.
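Because the transcriptions are kept verbatim, any expansion of the shorthand has to live alongside the original text rather than replace it. The sketch below shows one way a reader of the data might do that. The abbreviation table holds only the two expansions given in the example above, and the `interpret` function and its output layout are hypothetical.

```python
# Only the two expansions mentioned in the example; a real table would be
# much larger and would need review by someone who knows period medical shorthand.
ABBREVIATIONS = {"gsw": "Gun Shot Wound", "rt": "Right"}


def interpret(transcription):
    """Return the verbatim transcription together with a tentative expansion."""
    tokens = transcription.split()
    expanded = [ABBREVIATIONS.get(token.lower(), token) for token in tokens]
    # All-caps tokens missing from the table are probably unresolved shorthand.
    unresolved = [
        token for token in tokens
        if token.isupper() and len(token) > 1 and token.lower() not in ABBREVIATIONS
    ]
    return {
        "transcription": transcription,          # never modified
        "interpretation": " ".join(expanded),    # one possible reading
        "unresolved": unresolved,
    }


result = interpret("TXH GSW of Leg rt")
```

Keeping the interpretation in a separate field means several competing readings can coexist without touching the archival text.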


Of course, this is a Linked Open Data set, which makes it easy to annotate and link different interpretations of the same transcribed material. The dataset uses the NLP Interchange Format (NIF) ontology, the W3C Provenance ontology, the VoID dataset vocabulary and the FOAF ontology to mark up the contents of the transcriptions, and is serialized in RDF/XML. The provenance information in the dataset records not only which page of which document the specific image was taken from, but also the individual transcriptions that were submitted and the ones that were recognized as being valid. The sub-images that show the specific phrases being transcribed are not included in this dataset but will eventually be distributed through another mechanism. This was done to keep the dataset at a manageable size; that sub-dataset will likely use Open Annotation to record the locations that the images were cut out from.
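As a rough illustration of what reading such a file might look like, the sketch below parses a tiny hand-written RDF/XML fragment with Python's standard library. The fragment is not taken from the dataset: the example.org URIs are placeholders, and the use of `nif:isString` for the transcribed text and `prov:wasDerivedFrom` for the source page is an assumption about the modelling, not a description of it.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
PROV = "http://www.w3.org/ns/prov#"

# Hypothetical record: placeholder URIs, assumed choice of properties.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:nif="http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
    xmlns:prov="http://www.w3.org/ns/prov#">
  <rdf:Description rdf:about="http://example.org/transcription/1">
    <nif:isString>TXH GSW of Leg rt</nif:isString>
    <prov:wasDerivedFrom rdf:resource="http://example.org/file/42/page/17"/>
  </rdf:Description>
</rdf:RDF>
"""


def read_transcriptions(rdf_xml):
    """Extract the text and source page of each rdf:Description element."""
    root = ET.fromstring(rdf_xml)
    records = []
    for node in root.findall(f"{{{RDF}}}Description"):
        source = node.find(f"{{{PROV}}}wasDerivedFrom")
        records.append({
            "subject": node.get(f"{{{RDF}}}about"),
            "text": node.findtext(f"{{{NIF}}}isString"),
            "derived_from": None if source is None else source.get(f"{{{RDF}}}resource"),
        })
    return records


records = read_transcriptions(SAMPLE)
```

RDF/XML allows the same triples to be written in several equivalent ways, so a plain XML parser like this only works on fragments of a known shape. For the published dataset a full RDF parser would be the safer choice.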

What were the most common complaints and/or injuries?


The contents of the medical case sheets represent all of the incidents that one would expect to occur in a large body of men involved in the conduct of war. Not all injuries were directly related to combat: besides the usual problems of everyday life (such as measles), venereal diseases occurred often, as did the respiratory ailments that affect people in cold, wet trenches.


Pie chart of different diseases and injuries seen in the transcribed medical case sheets


The figure above is a count of the incidents recorded in all of the transcribed medical case sheets. As the project only dealt with 1,000 service records, this represents only a small sample of the experiences of the more than 500,000 CEF soldiers. Note that some afflictions, such as firearms-related injuries, occurred multiple times to the same soldier over the course of his service and are counted each time. Hence, this chart is not a true statistical representation of the injuries sustained by CEF troops.
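The difference between counting incidents and counting affected soldiers can be seen in a small sketch. The (soldier, condition) pairs below are invented for illustration and do not come from the dataset.

```python
from collections import Counter

# Hypothetical records: one entry per medical case sheet incident.
incidents = [
    ("soldier-A", "gunshot wound"),
    ("soldier-A", "gunshot wound"),   # same soldier, wounded again later
    ("soldier-A", "influenza"),
    ("soldier-B", "influenza"),
    ("soldier-C", "gunshot wound"),
]

# What the chart reports: every incident counts, repeats included.
incidence_counts = Counter(condition for _, condition in incidents)

# What a per-soldier statistic would need: each (soldier, condition) pair once.
soldier_counts = Counter(condition for _, condition in set(incidents))
```

In this toy data gunshot wounds appear three times by incident but affect only two soldiers, which is why incident counts overstate how widespread a repeatable injury was.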


Not especially surprising is the high number of cases of coughs, pneumonia and influenza, which were pervasive in a trench warfare environment and could spread quickly through large groups of men in dugouts or barracks in less than hygienic conditions. Influenza, or the Spanish Flu, would eventually kill more people than the Great War did. Venereal diseases were also common: according to some sources, 1 in 9 soldiers was affected. At that time, venereal disease was listed as a self-inflicted injury and would result in a soldier's pay being docked while he got treatment.


The data is available on the Canada Open Data Portal in Linked Open Data and plain text formats.