A technical narrative

November 16, 2012

1. The brief

Following the work of the open metadata pathways project, the step-change project sought to develop a data service for the AIM25 collection descriptions and their indices. These indices are a superset of the UK Archival Thesaurus (UKAT) that has built up over the 13 years of AIM25 thanks to the input of a great number of archivists. As well as subject terms (drawn from UKAT), the AIM25-UKAT dataset includes additional terms for personal names, place names and corporate names. The step-change project sought to represent all these resources as linked data, assigning a URI to each and publishing them as RDF. The step-change service also provided an API that allowed for the searching and listing of the resources. An editing tool (ALiCat) was also developed, ostensibly to allow archivists to assign and amend the linkages between the resources, but also to demonstrate at least one use for the data service.

Trenches to triples (T3) was conceived as a sister project that would harness both the step-change data service and ALiCat to enhance data from an archival collection outside of AIM25. The project started midway through the step-change project and the outcomes of T3 were essential in guiding the work of step-change.

The King’s College archive includes a broad range of material relating to the First World War and this material was selected for the project. An archivist would use the ALiCat tool to index the collection metadata with the AIM25-UKAT resources. This would both provide essential feedback from real-world use of the ALiCat tool and enhance the KCL metadata so that it could provide a richer user experience when presented to researchers and the public via the kingscollection.org front-end.

 

2. Exchanging data  – KC -> ALiCat with EAD

The first job was for the developers at KCL to allow the consumption of their archival metadata by ALiCat (and indeed any other interested clients). To do this there was some recreation of the work done as part of step-change to publish the AIM25 archival records.
In designing an interface between ALiCat and KCL Archive Catalogues (KCLAC), we initially assessed the best method of providing:

  • A structured information schema
  • Persistent authentication through the lifecycle of an update
  • Markup and ‘identified terms’ in a standardised way

An API was built at KCLAC to provide information about catalogues, and summary level information in EAD, via URIs. An initial enquiry returns a list of catalogues, and then further requests for summary level information about a specific catalogue can be made using the supplied URIs. Summary information is then supplied in EAD.
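
A minimal sketch of the two-step enquiry described above, in Python. The XML shapes, the `catalogue` element with its `id` and `uri` attributes, and the example.org URLs are all assumptions for illustration; the actual KCLAC response format is not documented here. The sample responses are inlined so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical response to the initial enquiry: a list of catalogues with URIs.
CATALOGUE_LIST = """
<catalogues>
  <catalogue id="cat-1" uri="https://example.org/api/catalogues/cat-1"/>
  <catalogue id="cat-2" uri="https://example.org/api/catalogues/cat-2"/>
</catalogues>
"""

# Hypothetical summary-level response for one catalogue, as a tiny EAD document.
EAD_SUMMARY = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example collection</unittitle>
      <unitid>EX/1</unitid>
    </did>
  </archdesc>
</ead>
"""

def list_catalogue_uris(xml_text):
    """Return {catalogue id: URI} from the catalogue-list response."""
    root = ET.fromstring(xml_text)
    return {c.get("id"): c.get("uri") for c in root.iter("catalogue")}

def summary_title(ead_text):
    """Pull the collection title out of a summary-level EAD record."""
    root = ET.fromstring(ead_text)
    return root.findtext("./archdesc/did/unittitle")

uris = list_catalogue_uris(CATALOGUE_LIST)
title = summary_title(EAD_SUMMARY)
```

In a real client the two strings would come from HTTP requests: first to the catalogue list, then to one of the returned URIs.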

The main reasons for adopting EAD in this area were:

  • EAD already defines an industry-standard schema
  • ALiCat ‘understands’ EAD
  • The API can also be used by other enquirers of the KCLAC system that work in EAD


In order to make sure that update requests from ALiCat are genuine, we designed a ‘simple’ handshake process to verify user credentials supplied by the user in the ALiCat interface. After authentication, the KCLAC API supplies a token that is maintained across all ALiCat update requests in a particular session. As a safety-net, KCLAC also backs up any change made by ALiCat, and enables the restoration of previous versions of catalogue entries if required.
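
The handshake, session token and backup behaviour can be sketched roughly as follows. This is an in-memory toy under stated assumptions, not the KCLAC implementation: the class name, method names and token format are hypothetical, and a real service would never store passwords in plain text.

```python
import secrets

class CatalogueStore:
    """Toy stand-in for the KCLAC side: token sessions plus version backups."""

    def __init__(self, users):
        self._users = users      # {username: password}; plain text only for illustration
        self._tokens = {}        # {token: username}
        self._records = {}       # {record id: current text}
        self._history = {}       # {record id: [previous versions]}

    def authenticate(self, username, password):
        """Handshake: verify credentials and hand back a session token."""
        if self._users.get(username) != password:
            raise PermissionError("invalid credentials")
        token = secrets.token_hex(16)
        self._tokens[token] = username
        return token

    def update(self, token, record_id, new_text):
        """Apply an update only for a known token, backing up the old version."""
        if token not in self._tokens:
            raise PermissionError("unknown or expired token")
        if record_id in self._records:
            self._history.setdefault(record_id, []).append(self._records[record_id])
        self._records[record_id] = new_text

    def restore_previous(self, record_id):
        """Roll a record back to its most recent backup."""
        self._records[record_id] = self._history[record_id].pop()
        return self._records[record_id]

store = CatalogueStore({"archivist": "secret"})
token = store.authenticate("archivist", "secret")
store.update(token, "EX/1", "first version")
store.update(token, "EX/1", "second version")
restored = store.restore_previous("EX/1")
```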

Summary level information contains ISADG field data, as well as a list of all detailed referenced items within the collection. Each detailed item contains a unique identifier (see below).

 

3. Going deeper – handling detail level catalogues

Some of the catalogues in KCLAC contain thousands of detailed items…each individually referenced. In this area, we adopt EAD for the most part; however, we supply a slightly ‘cut-down’ version of the schema to avoid the need to return ‘title’ references for EVERY detailed item (as specified in the strict EAD schema definition). That is, when ALiCat requests information about a specific detailed item, information is returned in EAD about that item…including its hierarchical level…but we do not return a complete list of other detailed (sibling/child/parent) items.

In tests, we found that returning ‘full-bore’ EAD structures for a requested detailed item (containing just a few lines of description) resulted in several megabytes of XML information in some cases. So in order to minimise the impact on bandwidth, the server and ALiCat’s user interface, we return only the relevant fragments of EAD, and only for these detailed item requests.
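
The ‘cut-down’ response can be illustrated with a small Python sketch that extracts one component from a full EAD tree and drops everything around it. The sample document and its `id` values are invented for illustration, and the real KCLAC fragment may keep or omit different elements.

```python
import copy
import xml.etree.ElementTree as ET

# Invented sample: one series containing two items.
FULL_EAD = """
<ead>
  <archdesc level="collection">
    <dsc>
      <c level="series" id="s1">
        <did><unittitle>Series one</unittitle></did>
        <c level="item" id="i1"><did><unittitle>Item one</unittitle></did></c>
        <c level="item" id="i2"><did><unittitle>Item two</unittitle></did></c>
      </c>
    </dsc>
  </archdesc>
</ead>
"""

def item_fragment(ead_text, item_id):
    """Return one component as a standalone fragment, without siblings or children."""
    root = ET.fromstring(ead_text)
    for comp in root.iter("c"):
        if comp.get("id") == item_id:
            fragment = copy.deepcopy(comp)
            # Drop nested components so only the item's own description remains.
            for child in list(fragment.findall("c")):
                fragment.remove(child)
            return fragment
    raise KeyError(item_id)

fragment = item_fragment(FULL_EAD, "i2")
fragment_xml = ET.tostring(fragment, encoding="unicode")
```

The fragment still carries the `level` attribute, so the client knows where the item sits in the hierarchy without receiving the rest of the tree.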

As ALiCat was initially designed to handle AIM25 records and these records only extend to the collection level, we had to implement a new interface to handle extra item and file level descriptions. Luckily these descriptions tend to only involve a few elements and we were able to accommodate them in a single tab.



4. Enhancing the records (ALiCat’s approach to inserting linked data into existing archival records)

After retrieving a requested record, the ALiCat operator is able to view ISADG and detailed information, and analyse for known terms by interrogating external data sources (AIM25-UKAT for example). When an external record of interest is found, the ALiCat operator can choose to insert a reference to it:

  • In main content fields (ISADG), operator selected terms are tagged with the URI pointing to the external source.
  • For processed index terms (Subject, Place, Corporation and Name), ALiCat returns the URI, and also indicates the types of data available at the remote source.


All the index terms inserted are represented both separately in a list of index terms (within the <controlaccess> tags in EAD) and also in-line (wrapping the relevant text in any EAD element with <span> tags). Attributes are added to these tags to record the URI of the resource. Presently these attributes are somewhat homecooked, but below there is some discussion on the use of RDFa for the inline index tags.
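
As a rough illustration of this dual representation, the Python sketch below tags a term in-line and adds a matching entry to a `<controlaccess>` element. The attribute name `uri`, the sample sentence and the example.org URI are assumptions; the attributes ALiCat actually writes are not specified in this post.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def tag_inline(text, term, uri):
    """Wrap the first occurrence of term in a <span> carrying the resource URI."""
    before, found, after = text.partition(term)
    if not found:
        return escape(text)
    return f"{escape(before)}<span uri={quoteattr(uri)}>{escape(term)}</span>{escape(after)}"

def add_index_term(controlaccess, element_name, term, uri):
    """Append an index entry (e.g. <geogname>) to a <controlaccess> element."""
    entry = ET.SubElement(controlaccess, element_name, {"uri": uri})
    entry.text = term
    return entry

URI = "https://example.org/place/hamrin"  # placeholder, not a real data.aim25.ac.uk URI
scope = tag_inline("Papers relating to service at Hamrin, Iraq.", "Hamrin", URI)
controlaccess = ET.Element("controlaccess")
add_index_term(controlaccess, "geogname", "Hamrin", URI)
index_xml = ET.tostring(controlaccess, encoding="unicode")
```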

ALiCat’s analysis allows terms to be tagged with resources external to the AIM25-UKAT dataset. At present these include geonames, LOC and openCalais, with potential for more external services to be added. In some cases, notably geonames, ALiCat uses 3rd party data and the RESTful interface developed for the AIM25-UKAT data service to create new resources within the data.aim25.ac.uk domain.

 

5. Exchanging data – ALiCat -> KC with ‘qualified’ EAD

When the ALiCat operator is satisfied with their analysis, the updated information is collated by ALiCat and compiled in an EAD format.

In this phase of development, references (to external data sources) are embedded in the content using standard HTML tags, with some additional (non-standard) attributes. This data is passed back to KCLAC using the API and, after session authentication, is stored in the KCLAC system.

In the next phase of system evolution, and as a result of RDFa 1.1 (http://en.wikipedia.org/wiki/RDFa) reaching recommendation status in June 2012, we will shift to using inline RDFa tagging for content, instead of non-standard HTML tag attributes. This will enable content to still be supplied to, and received from ALiCat in EAD without changing the underlying method of delivery/authentication etc…and will significantly improve usefulness in the front-end presentation of the data (see below), as well as supply RDFa markup in the page that can be crawled by various web spiders/bots (the GoogleBot for example).
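
One possible shape for that conversion is sketched below in Python: a non-standard `uri` attribute on a `<span>` is rewritten as the RDFa attributes `property` and `resource`. The source attribute name and the choice of `dcterms:subject` as the predicate are assumptions made for illustration, not the mapping the project settled on.

```python
import xml.etree.ElementTree as ET

def to_rdfa(fragment_xml, predicate="dcterms:subject"):
    """Rewrite <span uri="..."> as <span property="..." resource="...">."""
    root = ET.fromstring(fragment_xml)
    for span in root.iter("span"):
        uri = span.attrib.pop("uri", None)
        if uri is not None:
            span.set("property", predicate)
            span.set("resource", uri)
    return ET.tostring(root, encoding="unicode")

before = (
    '<scopecontent><p>Service at '
    '<span uri="https://example.org/place/hamrin">Hamrin</span>.</p></scopecontent>'
)
after = to_rdfa(before)
```

Because only attributes change, the surrounding EAD and the delivery mechanism stay exactly as they were.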

 

6. Making use of it all?

Currently, the KCLAC site displays tagged items in two ways:

  • Content (ISADG) fields indicate the presence of a related external source visually, and supply the URI if requested.
  • Index terms (Subject, Place etc.) indicate the presence of an external related data source and, if requested, will retrieve the URI content in multiple formats (depending on the capability of the remote source site).


This, at present, isn’t particularly useful for a human site visitor – unless they understand JSON (http://en.wikipedia.org/wiki/JSON) formatted RDF (http://en.wikipedia.org/wiki/Resource_Description_Framework)…however…the next phase of development will expand this capability, and will attempt to interrogate the remote source dynamically using a combination of SPARQL (http://en.wikipedia.org/wiki/SPARQL) and AJAX (http://en.wikipedia.org/wiki/Ajax). This will enable the site visitor to continue reading the page while a selected term is queried at the remote site, with related items displayed as and when they become available. In order for this to happen, the remote source site(s) will need a SPARQL endpoint.
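
To give a flavour of the kind of request involved, the Python sketch below builds a SPARQL query for resources that link to a given term URI and encodes it as a GET URL using the `query` parameter of the SPARQL protocol. The endpoint address, the term URI and the shape of the query are assumptions; on the site itself the request would be issued from the browser via AJAX rather than from Python.

```python
from urllib.parse import urlencode

def related_items_query(term_uri, limit=10):
    """Build a SPARQL query for resources that link to the given term URI."""
    return (
        "SELECT DISTINCT ?item ?predicate WHERE { "
        f"?item ?predicate <{term_uri}> . "
        f"}} LIMIT {int(limit)}"
    )

def endpoint_url(endpoint, query):
    """Encode the query as a GET request URL for a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})

# Placeholder URIs; neither is a real endpoint or term.
query = related_items_query("https://example.org/concept/somme")
url = endpoint_url("https://example.org/sparql", query)
```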

KCLAC will also (shortly) feature its own SPARQL endpoint and triple store, which will enable remote systems to query records. This work will be undertaken after conversion of tagging to RDFa has been completed.


Licensing and other legal issues

September 27, 2012

This post will discuss the licensing issues that have emerged during the project.

In addition to creating new content in the form of Linked Data, this project will be making available the software that was used to process the data. The software will be made available under a GNU General Public Licence.

As for our data, whilst we intend to make it openly available (in keeping with Trenches to Triples’ obligations as a JISC-funded project), we have encountered a number of interesting issues that have prevented us from placing all of our data under one particular licence.

During the course of this project, we have produced nearly 1,500 new index terms, which have been imported into AIM25. As part of the process of marking-up our selected catalogue content, we will be making use of this dataset. We will also be using some of the many other authority records that exist in AIM25-UKAT.

By submitting our data to AIM25-UKAT, we have tacitly accepted that our terms will be licensed in the same way as the rest of AIM25-UKAT’s data. As it happens, the authority records that are held in AIM25-UKAT are currently unlicensed. The founding of AIM25 pre-dates the advent of Creative Commons and Open Data Commons; however, it would appear that AIM25 approves of its data being used for a multitude of research purposes.  The only discernible caveat is the following statement, which appears on the UKAT website:

… UKAT data should not be used for commercial purposes or sold without prior permission from the UKAT project.

This statement expresses the sentiments of the Creative Commons Attribution Non-Commercial Licence (BY-NC), although it is not, of course, legally binding.

Having incorporated our dataset of authority records into AIM25, we have surrendered the right to license our data separately. This is not necessarily a problem, as AIM25 shares our ideals of openness. However, it does mean that our set of First World War-related terms remains unlicensed.

The main aim of this project has been to make a selection of First World War-related catalogue content available in the form of Linked Data. The catalogues themselves fall under the copyright of King’s College London. Therefore, in order to make our data openly available, we have chosen to apply the Open Data Commons Attribution Licence (ODC-BY) to all of our catalogues. Those who are familiar with the principles of Creative Commons and Open Data Commons will understand that this does not amount to a surrender of copyright; it simply means that we are giving legal consent for the reuse of our data.

We chose the Open Data Commons Attribution Licence after carefully considering a number of licences. Following the guidance offered by Naomi Korn and Professor Charles Oppenheim’s Licensing Open Data: a Practical Guide, we looked at the following licences: the Creative Commons Zero Licence (CC0), the Open Data Commons Public Domain Dedication & Licence (ODC-PDDL), the Open Data Commons Open Database Licence (ODbL), and the Open Data Commons Attribution Licence (ODC-BY).

We decided against using either the Creative Commons Zero licence or the Open Data Commons Public Domain Dedication & Licence, since both of these licences include no restrictions, and we wanted to ensure that we would be attributed as the creator of our data. The Open Data Commons Open Database Licence is similar to the Open Data Commons Attribution Licence, except that it also stipulates that adaptations of the licensed database must be made available under the same licence – a condition that we regarded as too restrictive. Thus it became apparent that the Open Data Commons Attribution Licence was the most appropriate for our requirements. Each of our catalogues will include a statement confirming that the content has been made available under the Open Data Commons Attribution Licence.

There was some concern that out-of-date versions of our catalogues might be disseminated long after we have updated our catalogues. In an attempt to prevent this from happening, we have opted to include with our licence statement an additional statement, which notifies our users that our catalogues may be updated from time to time, in order to reflect any additional material and/or emergence of new information regarding material.

Users and use cases: part two

September 27, 2012

This is a follow-up to an earlier post on users and use cases. That post discussed the needs of our users and the ways in which we have accounted for those needs during this project. This post will consider the requirements that archivists have as users of the cataloguing tool, Alicat (Archival Linked-data Cataloguing).

Alicat, the tool that we have been pilot testing during this project, allows archivists to process catalogue content as Linked Data, as part of the cataloguing process. It enables cataloguers to identify terms within their own descriptions and define each term as a concept, place, person, or organisation. This is done by highlighting the relevant text within a chosen field (e.g. Scope and Content) and when prompted, verifying in which of the four categories the term belongs. The term can then be added to an index of access points.

For example, to tag Hamrin, Iraq as a new place name, simply highlight the word ‘Hamrin’. Alicat will provide a list of suggested locations from Geonames. You can choose from one of these suggested place names, or alternatively, you can define a new place name by pinpointing a location in Google Maps:

The creation of index terms can be achieved by other means also. Eventually, archivists will be able to use Alicat to import data from a number of external systems (during our pilot test, this function was only available using data stored in AIM25-UKAT and Geonames). This function allows archivists to browse their own descriptions for pre-defined terms that exist as personal, corporate, place, and subject names in AIM25-UKAT. By clicking on the relevant ISAD (G) field, then moving the cursor away from that field and clicking once more, the archivist instructs Alicat to perform an analysis of that particular body of text. After a few seconds, those terms that already exist in AIM25-UKAT will be coloured according to their categories (blue for people, brown for organisations, red for concepts, and green for places). In order to mark up these terms, users can simply click and drag the relevant coloured words from the catalogue description and into the index on the right hand side of the screen.
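
The analysis step can be pictured with a small Python sketch that scans a description for known terms and assigns each hit the colour of its category. The matching here is plain exact-string search over a tiny hand-made term list, which is an assumption; how Alicat actually matches text against AIM25-UKAT is not described in this post.

```python
import re

# Colours per category, as described above.
COLOURS = {"person": "blue", "organisation": "brown", "concept": "red", "place": "green"}

# Hypothetical slice of the term list; the real AIM25-UKAT data is far larger.
KNOWN_TERMS = {
    "Gallipoli": "place",
    "29th Division": "organisation",
    "Battle of the Somme (1916)": "concept",
}

def analyse(text, known_terms=KNOWN_TERMS):
    """Return (start, end, term, category, colour) for each known term found in text."""
    hits = []
    for term, category in known_terms.items():
        for match in re.finditer(re.escape(term), text):
            hits.append((match.start(), match.end(), term, category, COLOURS[category]))
    return sorted(hits)

# Invented sample description.
sample = "Diary kept at Gallipoli while serving with 29 Div."
hits = analyse(sample)
```

In this sample only ‘Gallipoli’ is found; the abbreviation ‘29 Div’ does not match ‘29th Division’, which is the kind of miss that manual intervention by an archivist has to cover.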

When enriching catalogues with index terms, it is likely that most archivists who use Alicat either will draw on the data found in AIM25-UKAT (or another external CMS), or will use the tool to identify and define new terms. Since AIM25-UKAT does not have an exhaustive set of terms, it is inevitable that archivists will need to spend some time defining new terms.

Archivists who are using this tool during the cataloguing process should find it a great benefit to be able to create authority records either by defining new terms, or by drawing on the vast amount of data that is housed in AIM25-UKAT. Archivists wishing to edit descriptions in existing catalogues will find that Alicat is useful in this regard also. When accessed through Alicat, existing catalogue descriptions are not read-only but in fact can be altered. For instance, inconsistencies such as variations of the same personal, corporate, place and subject names can be amended manually.

The testing of Alicat by archivists has allowed Alicat’s developer to respond to problems and suggestions in order to make the tool both more user-friendly and more effective.

For instance, during our first test, we instructed Alicat to analyse the Scope and Content field of one of our catalogues and to highlight any existing, pre-defined terms. Alicat failed to identify more than a couple of AIM25-UKAT terms that were not already present in the catalogue’s index. We could see that there were a further eight or nine terms that had not been identified – terms that we knew had been added to AIM25-UKAT.

This initial test revealed an issue that was already apparent to Alicat’s developer. He acknowledged that what was needed was a facility that enabled users to highlight terms that Alicat had missed – i.e., terms that were known to be in AIM25-UKAT – and to select such terms from a list of AIM25-UKAT suggestions, in a similar way to how, when users choose to define a new place name using Alicat, it presents them with a list of suggested place names from Geonames (and where applicable, from AIM25-UKAT also).

Clearly, this is a very important function. It is not essential that Alicat finds all of the relevant terms from AIM25-UKAT at the first time of asking (although of course, that would be ideal), but it is essential that archivists can highlight within bodies of text terms that they suspect are in AIM25-UKAT, so that they can then select these terms from AIM25-UKAT and mark them up as index terms.

This is necessary not least because there are some terms (such as abbreviations or alternative names) that only humans (as opposed to machines) could be expected to identify. For instance, in the example pictured above, ‘Gallipoli’ appears highlighted in green in the Scope and Content, denoting it as a place. However, as we pointed out in our earlier post on users, ‘Gallipoli’ also exists in AIM25-UKAT as a concept, as the non-preferred term for ‘Dardanelles’. It is understandable that Alicat did not make this connection, but it is important that at this point, an archivist is able to intervene and select the terms ‘Dardanelles’ and ‘Gallipoli’ from the AIM25-UKAT data. Another example is the abbreviated term ‘29 Div’: only an archivist with the necessary background knowledge would be able to recognise this as referring to ‘29th Division’, a corporate name that we have recently added to AIM25-UKAT.

In order to overcome this problem, Alicat’s developer installed a mechanism that allows archivists to dictate their own search terms. So, when we came to test Alicat again, we found that a box had been added to the search function: the highlighted term appeared in this box, and we were able to edit the term and ask Alicat to search for a word or phrase that was more likely to return the desired term. In the case of ‘29 Div’, we knew that it was expressed in AIM25-UKAT as ‘29th Division’, so we changed the search term accordingly, and Alicat retrieved the correct entry:

In the case of our chosen topic, the First World War, this facility has allowed us to locate specific battle names in the AIM25-UKAT data. For instance, the Scope and Content field of one of our collections includes the phrases ‘Battle of the Somme, 1 Jul 1916’ and ‘Battle of the Somme, 4 Jul 1916’. A search for the term ‘Somme’ under the ‘concepts’ category returned the following suggestions from AIM25-UKAT:

Somme

Battle of the Somme (1916)

Actions at the Somme Crossings (24-25 March, 1918)

Operations on the Somme (1 July-18 November, 1916)

Thiepval Memorial to the Missing of the Somme

One of these terms, ‘Operations on the Somme (1 July-18 November, 1916)’, was added to the index. However, we also wanted to include the broader term, ‘Battles of the Somme, 1916’. The search edit function made this straightforward: it allowed us to change our search term to ‘Battles of the Somme, 1916’. Alicat retrieved an exact match, and we dragged the term into the index.
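
The editable search can be sketched in Python as a lookup over preferred and non-preferred labels. The entries below reuse terms mentioned in this post, but the data structure, the substring matching and the function name are assumptions made for illustration.

```python
# Hypothetical entries: preferred label, category, and non-preferred (alternative) labels.
ENTRIES = [
    {"label": "Dardanelles", "category": "concept", "alt": ["Gallipoli"]},
    {"label": "29th Division", "category": "organisation", "alt": []},
    {"label": "Battles of the Somme, 1916", "category": "concept", "alt": []},
    {"label": "Operations on the Somme (1 July-18 November, 1916)", "category": "concept", "alt": []},
]

def search(term, category=None, entries=ENTRIES):
    """Return preferred labels whose preferred or non-preferred label contains term."""
    needle = term.casefold()
    results = []
    for entry in entries:
        if category and entry["category"] != category:
            continue
        labels = [entry["label"], *entry["alt"]]
        if any(needle in label.casefold() for label in labels):
            results.append(entry["label"])
    return results

highlighted = search("29 Div")             # the text as it appears in the description
edited = search("29th Division")           # the archivist's edited search term
via_alt = search("Gallipoli", "concept")   # found through a non-preferred label
somme = search("Somme", "concept")
```

The highlighted abbreviation returns nothing, while the edited term finds the entry, and a search on a non-preferred label leads back to the preferred one.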

A separate issue that we encountered during the testing of Alicat was the problem of updating old index terms. When a catalogue is viewed in Alicat, any existing index terms appear in the ‘Index (access points)’ column on the right hand side of the screen. In our case, the index terms dated from when the catalogues were first created. We required a function that would allow us to tag these terms so that they appeared on our website with URIs attached. However, the usual method of dragging these same terms from an ISAD (G) field and into the index resulted in the creation of duplicates. For instance, the term, ‘World War One (1914-1918)’, was already listed in the index of one of our catalogues, but we wanted to create a tagged version of this term, one that would appear on our website with an attached URI. We followed the usual process of highlighting the text and selecting the right match from the list of AIM25-UKAT suggestions. We then dragged the term into our index. The index in Alicat now appeared to have two entries for ‘World War One (1914-1918)’: presumably one with a URI and one without.

We reported this issue to Alicat’s developer and he duly provided a new feature that solved the problem. Those terms in the index that had not yet been tagged now had exclamation marks attached to them. When we clicked on the exclamation marks next to the index term, ‘World War One (1914-1918)’, we were given the option of searching AIM25-UKAT for that term. We could then select the term from the list of suggestions and drag it into the index, thereby replacing the untagged ‘World War One (1914-1918)’ with a tagged version.

The term now appeared in the index without exclamation marks – a sign that it had been tagged.
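
A minimal Python sketch of this replace-rather-than-duplicate behaviour follows. The index is modelled as a list of dictionaries and the URI is a placeholder; both are assumptions for illustration rather than Alicat’s internal representation.

```python
def tag_index_term(index, label, uri):
    """Attach a URI to an existing untagged entry, or append a new tagged entry."""
    for entry in index:
        if entry["label"] == label:
            entry["uri"] = uri  # replace in place instead of duplicating
            return index
    index.append({"label": label, "uri": uri})
    return index

def display(index):
    """Render the index, flagging untagged entries with an exclamation mark."""
    return [e["label"] + ("" if e.get("uri") else " !") for e in index]

index = [{"label": "World War One (1914-1918)", "uri": None}]
before = display(index)
tag_index_term(index, "World War One (1914-1918)", "https://example.org/concept/ww1")
after = display(index)
```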

By performing the functions described above, Alicat enables archivists to enhance their catalogues simply and efficiently. No doubt, as further refinements are made, additional features will appear.

 

 

 

Users and use cases: part one

This project can be said to focus on two broad categories of users. The first is archive professionals who would like to incorporate Linked Data into their online finding aids simply and within the usual constraints such as budget, lack of specialist technical knowledge and so on. The second is our archives’ end users: those researchers who depend on our finding aids to help them find the resources relevant to their research interests. We hope that the Linked Data output produced during the project will open our resources up and make it easier for our researchers to find what they want. In this post we’re focussing on the latter group.

The Liddell Hart Centre for Military Archives  attracts a broad range of users including undergraduates, postgraduates, academics, archivists, and genealogists. In most cases, users will search our catalogues using a combination of terms from the following categories: personal names, corporate names, place names, and subjects. It follows, then, that the more structured our catalogue data is, the more likely it is that our users will find what they are looking for. Focusing on the topic of the First World War, we are in the process of creating hundreds of additional authority terms, which will provide a new level of granularity to a large number of our catalogues (including one very important legacy catalogue). This will benefit all of our current users who have an interest in this subject.

As we create our new set of authority terms, it is important that we bear in mind the varying needs of our users, as these will determine the range of granularity that is required. Let’s take the example of First World War battles. An undergraduate may be asked to examine the conflict in a rather broad way, perhaps by identifying trends and patterns across different theatres of war. A genealogist, on the other hand, may be concerned only with the battles at which a particular soldier was present. An academic researcher might wish to study in great detail one important incident that occurred during a particular battle, whilst at the same time remaining aware of the context of that incident, i.e. how the incident related to (in ascending order) the battle, the operation, and the theatre of war of which it was a part.

Each of these scenarios demands a different degree of granularity: the undergraduate would be content with a set of quite general terms (e.g. France and Flanders, or Egypt and Palestine); the genealogist might be satisfied with a list of names of battles; the academic would be best served by a comprehensive hierarchy of events, beginning with the theatre of war (e.g. France and Flanders) and descending to specific tactical incidents (e.g. Second Defence of Givenchy, 1918).

In order to satisfy all of our users, we have to find an acceptable range from the general to the specific. In the case of First World War battles, we have opted to follow the guidance of the Battles Nomenclature Committee’s report, The Official Names of the Battles and other Engagements Fought by the Military Forces of the British Empire during the Great War, 1914-1919, and the Third Afghan War, 1919. The report tabulates all of the on-land battles and engagements according to a descending scale of levels. We have decided that we will include all of the information from ‘theatre of war’ down to and including ‘actions.’

 

Another thing that we must bear in mind is the fact that different users may use different terms to describe the same subjects. In any area of study, there will never be a complete consensus regarding which definitions are ‘official’ and which are not. Moreover, terminology may change over time as trends come and go. For a subject as widely studied as the First World War, there may be some contrast between how certain events or concepts are understood by academic researchers, and how they are known by amateur enthusiasts.  For instance, ‘Dardanelles Campaign (1915-1916)’ is the UKAT-approved term for that military campaign, but to many people it is better known as the ‘Gallipoli Campaign.’ When it comes to creating a subject term for that campaign, we can reflect the needs of both camps by including ‘Gallipoli Campaign’ as a non-preferred term.

These improvements to our catalogue data should give all of our users a better understanding of what our First World War-related records contain. The next stage of the project, the transformation of selected catalogue content into Linked Data, will go one step further:  by making this important body of First World War-related catalogue content available as Linked Data, we are trying to reach out to a wider community of users. As was mentioned in an earlier post, Trenches to Triples is strongly motivated by the concern that too few potential users of archives know the location of the sources that might be of interest to them. It is hoped that this project will demonstrate the potential that Linked Data has to attract new users to archives.

The problem we are addressing and why

Recognition of the potential uses of Linked Data has been comparatively slow within the archive sector, although this has changed in recent years, following a number of successful projects, such as LOCAH, SALDA, and Linking Lives, which have shown the opportunities that are available. However, there remain certain obstacles that may prevent institutions from beginning to use Linked Data as a way of increasing accessibility to their catalogues.

One obstacle has been the lack of means by which archivists can convert existing catalogue data into Linked Data, or indeed create Linked Data as part of the cataloguing process. This issue was initially addressed during the Open Metadata Pathway project, through the development of a workflow tool that should enable archivists to create Linked Data at the same time as cataloguing. The workflow tool is currently being refined as part of the Step change project; one of the objectives of the Trenches to Triples project is to provide a demonstration of this workflow tool in use.  As was stated in the previous post, RDFa data will be created both from World War One related entries in the Liddell Hart Centre for Military Archives’ military catalogues, and from entries found in one of its legacy catalogues. The Trenches to Triples project is therefore supplementary to Step change: Step change aims to provide Linked Data architecture for the archive sector, while Trenches to Triples hopes to be an exemplar of how this architecture can be used effectively.

Trenches to Triples also aims to address another problem, which is that any institution wishing to embark on a similar project is likely to be put off from doing so by the lack of an existing precedent: an example on which to base estimations of time, cost, appropriate scale etc. By creating a toolkit, the Trenches to Triples project will provide the necessary guidance for future projects. The toolkit will draw on lessons learned during the project in order to give guidelines regarding workload, time, cost, technical requirements, and potential pitfalls. It is hoped that if these problems are successfully addressed, then there should be nothing to discourage other institutions from using Linked Data to enhance their catalogues.