A technical narrative

November 16, 2012

1. The brief

Following the work of the open metadata pathways project, the step-change project sought to develop a data service for the AIM25 collection descriptions and their indices. These indices are a superset of the UK Archival Thesaurus (UKAT) that has built up over the 13 years of AIM25 thanks to the input of a great number of archivists. As well as subject terms (drawn from UKAT), the AIM25-UKAT dataset includes additional terms for personal names, place names and corporate names. The step-change project sought to represent all these resources as linked data, assigning a URI to each and publishing them as RDF. The step-change service also provided an API that allowed for the searching and listing of the resources. An editing tool (ALiCat) was also developed, ostensibly to allow archivists to assign and amend the linkages between the resources, but also to demonstrate at least one use for the data service.

Trenches to triples (T3) was conceived as a sister project that would harness both the step-change data service and ALiCat to enhance data from an archival collection outside of AIM25. The project started midway through the step-change project, and the outcomes of T3 were essential in guiding the work of step-change.

The King’s College archive includes a broad range of material relating to the First World War, and this material was selected for the project. An archivist would use the ALiCat tool to index the collection metadata with the AIM25-UKAT resources. This would both provide essential feedback from real-world use of the ALiCat tool and enhance the KCL metadata so that it could be used to provide a richer user experience when presented to researchers and the public via the kingscollection.org front-end.

 

2. Exchanging data – KC -> ALiCat with EAD

The first job was for the developers at KCL to allow the consumption of their archival metadata by ALiCat (and indeed any other interested clients). Doing this meant re-creating some of the work done as part of step-change to publish the AIM25 archival records.
In designing an interface between ALiCat and KCL Archive Catalogues (KCLAC), we initially assessed the best method of providing:

  • A structured information schema
  • Persistent authentication through the lifecycle of an update
  • Markup and ‘identified terms’ in a standardised way

An API was built at KCLAC to provide information about catalogues, and summary level information in EAD, via URIs. An initial enquiry returns a list of catalogues, and further requests for summary level information about a specific catalogue can then be made using the supplied URIs. Summary information is then supplied in EAD.

The main reasons for adopting EAD in this area were:

  • EAD already defines an industry-standard schema
  • ALiCat ‘understands’ EAD
  • The API can also be used by other enquirers of the KCLAC system that work in EAD


In order to make sure that update requests from ALiCat are genuine, we designed a ‘simple’ handshake process to verify the credentials supplied by the user in the ALiCat interface. After authentication, the KCLAC API supplies a token that is maintained across all ALiCat update requests in a particular session. As a safety net, KCLAC also backs up any change made by ALiCat, and enables the restoration of previous versions of catalogue entries if required.
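The handshake, session token and safety-net backup described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the KCLAC implementation: the class name `SessionStore`, its method names, the record identifiers and the plain-text password table are all hypothetical and chosen only to keep the example short.

```python
import secrets


class SessionStore:
    """Hypothetical in-memory stand-in for the catalogue side of the handshake."""

    def __init__(self, users):
        self._users = users   # username -> password (plain text only for brevity)
        self._tokens = {}     # session token -> username
        self.records = {}     # record id -> current EAD text
        self.history = {}     # record id -> list of previous versions

    def authenticate(self, username, password):
        """Check credentials and return a session token, or None on failure."""
        expected = self._users.get(username)
        if expected is None:
            return None
        if not secrets.compare_digest(expected.encode(), password.encode()):
            return None
        token = secrets.token_hex(16)
        self._tokens[token] = username
        return token

    def update(self, token, record_id, new_ead):
        """Apply an update if the token is valid, backing up the old version first."""
        if token not in self._tokens:
            raise PermissionError("unknown or expired session token")
        if record_id in self.records:
            self.history.setdefault(record_id, []).append(self.records[record_id])
        self.records[record_id] = new_ead

    def restore(self, record_id):
        """Roll a record back to its most recent backed-up version."""
        self.records[record_id] = self.history[record_id].pop()


# One session: authenticate once, reuse the token for every update.
store = SessionStore({"archivist": "s3cret"})
token = store.authenticate("archivist", "s3cret")
store.update(token, "rec-1", "<ead>v1</ead>")
store.update(token, "rec-1", "<ead>v2</ead>")
store.restore("rec-1")  # the record is back at its first version
```

The point of the design is that credentials cross the wire once per session, while every later update carries only the token, and no update can overwrite a record without leaving a restorable copy behind.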

Summary level information contains ISADG field data, as well as a list of all detailed referenced items within the collection. Each detailed item contains a unique identifier (see below).
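As a rough illustration of what a client might do with a summary level response, the sketch below pulls the collection title and the identifiers of the detailed items out of a small EAD document. The sample XML, its identifiers and the `summarise` helper are invented for the example; the real KCLAC responses may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical summary level response; the content and ids are illustrative only.
SUMMARY_EAD = """\
<ead>
  <archdesc level="collection">
    <did>
      <unitid>EXAMPLE/1</unitid>
      <unittitle>Example collection</unittitle>
    </did>
    <scopecontent><p>Papers relating to the First World War.</p></scopecontent>
    <dsc>
      <c id="item-001" level="file"/>
      <c id="item-002" level="item"/>
    </dsc>
  </archdesc>
</ead>
"""


def summarise(ead_xml):
    """Return the collection title and the identifiers of its detailed items."""
    root = ET.fromstring(ead_xml)
    title = root.findtext("./archdesc/did/unittitle")
    item_ids = [component.get("id") for component in root.iter("c")]
    return title, item_ids


title, item_ids = summarise(SUMMARY_EAD)
```

Each identifier in `item_ids` could then be used to request the detail for that item on its own.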

 

3. Going deeper – handling detail level catalogues

Some of the catalogues in KCLAC contain thousands of detailed items, each individually referenced. In this area we adopt EAD for the most part; however, we supply a slightly ‘cut-down’ version of the schema to avoid the need to return ‘title’ references for every detailed item (as specified in the strict EAD schema definition). That is, when ALiCat requests information about a specific detailed item, information is returned in EAD about that item, including its hierarchical level, but we do not return a complete list of other detailed (sibling/child/parent) items.

In tests, we found that returning ‘full-bore’ EAD structures for a requested detailed item (containing just a few lines of description) resulted in several megabytes of XML in some cases. So, in order to minimise the impact on bandwidth, the server and ALiCat’s user interface, we return just the relevant fragments of EAD, but only for these detailed item requests.
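The ‘cut-down’ response can be pictured as extracting one component from the full tree and dropping everything around it. The following is a sketch under assumptions: the sample EAD, the ids and the `item_fragment` helper are hypothetical, and the real API may trim the structure differently.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical detail level catalogue with nested components.
FULL_EAD = """\
<ead><archdesc level="collection"><dsc>
  <c id="s1" level="series"><did><unittitle>Series one</unittitle></did>
    <c id="f1" level="file"><did><unittitle>File one</unittitle></did>
      <c id="i1" level="item"><did><unittitle>Item one</unittitle></did></c>
    </c>
    <c id="f2" level="file"><did><unittitle>File two</unittitle></did></c>
  </c>
</dsc></archdesc></ead>
"""


def item_fragment(ead_xml, item_id):
    """Return the EAD for one component only, without parents, siblings or children."""
    root = ET.fromstring(ead_xml)
    for component in root.iter("c"):
        if component.get("id") == item_id:
            fragment = copy.deepcopy(component)
            fragment.tail = None  # discard whitespace that followed the element
            # Remove nested components so child items are not returned.
            for child in list(fragment):
                if child.tag == "c":
                    fragment.remove(child)
            return ET.tostring(fragment, encoding="unicode")
    return None  # no component with that identifier


fragment_xml = item_fragment(FULL_EAD, "f1")
```

The fragment still carries the `level` attribute, so the client knows where the item sits in the hierarchy without receiving the rest of the tree.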

As ALiCat was initially designed to handle AIM25 records and these records only extend to the collection level, we had to implement a new interface to handle extra item and file level descriptions. Luckily these descriptions tend to only involve a few elements and we were able to accommodate them in a single tab.



4. Enhancing the records (ALiCat’s approach to inserting linked data into existing archival records)

After retrieving a requested record, the ALiCat operator is able to view ISADG and detailed information, and analyse it for known terms by interrogating external data sources (AIM25-UKAT for example). When an external record of interest is found, the ALiCat operator can choose to insert a reference to it:

  • In main content fields (ISADG), operator selected terms are tagged with the URI pointing to the external source.
  • For processed index terms (Subject, Place, Corporation and Name), ALiCat returns the URI, and also indicates the types of data available at the remote source.


All the index terms inserted are represented both separately in a list of index terms (within the <controlaccess> tags in EAD) and also in-line (wrapping the relevant text in any EAD element with <span> tags). Attributes are added to these tags to record the URI of the resource. Presently these attributes are somewhat homecooked, but there is some discussion below on the use of RDFa for the inline index tags.
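To make the dual representation concrete, here is a small sketch that wraps a term in a <span> and adds a matching entry under <controlaccess>. The attribute name `data-uri`, the example URI path and the `tag_term` helper are assumptions made for illustration; the post does not specify the actual attribute names ALiCat uses.

```python
import xml.etree.ElementTree as ET

RECORD = """\
<ead><archdesc level="collection">
  <scopecontent><p>Letters describing trench warfare in 1916.</p></scopecontent>
</archdesc></ead>
"""


def tag_term(ead_root, element_path, term, uri, index_tag="subject"):
    """Wrap the first occurrence of `term` in a span and list it in controlaccess.

    Simplification: only the text before any existing child element is searched.
    """
    target = ead_root.find(element_path)
    text = target.text or ""
    start = text.find(term)
    if start == -1:
        return False
    # In-line form: split the text around a new span element.
    span = ET.Element("span", {"data-uri": uri})  # hypothetical attribute name
    span.text = term
    span.tail = text[start + len(term):]
    target.text = text[:start]
    target.insert(0, span)
    # List form: one entry per index term under controlaccess.
    archdesc = ead_root.find("./archdesc")
    control = archdesc.find("controlaccess")
    if control is None:
        control = ET.SubElement(archdesc, "controlaccess")
    entry = ET.SubElement(control, index_tag, {"data-uri": uri})
    entry.text = term
    return True


root = ET.fromstring(RECORD)
uri = "http://data.aim25.ac.uk/example/trench-warfare"  # illustrative URI only
tagged = tag_term(root, "./archdesc/scopecontent/p", "trench warfare", uri)
```

After the call, the paragraph reads the same to a human, but the term is addressable in-line and also appears once in the index list.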

ALiCat’s analysis also allows terms to be tagged with resources external to the AIM25-UKAT dataset. At present these include geonames, LOC and openCalais, with potential for more external services to be added. In some cases – notably geonames – ALiCat uses 3rd party data and the RESTful interface developed for the AIM25-UKAT data service to create new resources within the data.aim25.ac.uk domain.

 

5. Exchanging data – ALiCat -> KC with ‘qualified’ EAD

When the ALiCat operator is satisfied with their analysis, the updated information is collated by ALiCat and compiled in an EAD format.

In this phase of development, references (to external data sources) are embedded in the content using standard HTML tags, with some additional (non-standard) attributes. This data is passed back to KCLAC using the API and, after session authentication, is stored in the KCLAC system.

In the next phase of system evolution, and as a result of RDFa 1.1 (http://en.wikipedia.org/wiki/RDFa) reaching recommendation status in June 2012, we will shift to using inline RDFa tagging for content instead of non-standard HTML tag attributes. This will allow content to still be supplied to, and received from, ALiCat in EAD without changing the underlying method of delivery or authentication. It will also significantly improve usefulness in the front-end presentation of the data (see below), and supply RDFa markup in the page that can be crawled by various web spiders and bots (the GoogleBot for example).
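One way to picture the planned shift is a pass over the stored content that swaps a home-grown attribute for RDFa 1.1 attributes. The sketch below is speculative: the old attribute name `data-uri`, the choice of `dcterms:subject` as the property, and the `to_rdfa` helper are all assumptions, since the eventual markup had not been decided at the time of writing. `property` and `resource` are genuine RDFa 1.1 attributes.

```python
import xml.etree.ElementTree as ET


def to_rdfa(fragment_xml, old_attr="data-uri", prop="dcterms:subject"):
    """Replace a non-standard URI attribute on spans with RDFa attributes."""
    root = ET.fromstring(fragment_xml)
    for span in root.iter("span"):
        uri = span.attrib.pop(old_attr, None)
        if uri is not None:
            span.set("property", prop)   # what the link means (assumed property)
            span.set("resource", uri)    # what the link points at
    return ET.tostring(root, encoding="unicode")


before = (
    '<p>Letters describing '
    '<span data-uri="http://data.aim25.ac.uk/example/trench-warfare">'
    'trench warfare</span> in 1916.</p>'
)
after = to_rdfa(before)
```

The visible text is untouched, which is why the change need not affect how content travels to and from ALiCat.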

 

6. Making use of it all?

Currently, the KCLAC site displays tagged items in two ways:

  • Content (ISADG) fields indicate the presence of a related external source visually, and supply the URI if requested.
  • Index terms (Subject, Place etc.) indicate the presence of an external related data source and, if requested, will retrieve the URI content in multiple formats (depending on the capability of the remote source site).


This, at present, isn’t particularly useful for a human site visitor unless they understand JSON (http://en.wikipedia.org/wiki/JSON) formatted RDF (http://en.wikipedia.org/wiki/Resource_Description_Framework). However, the next phase of development will expand this capability, and will attempt to interrogate the remote source dynamically using a combination of SPARQL (http://en.wikipedia.org/wiki/SPARQL) and AJAX (http://en.wikipedia.org/wiki/Ajax). This will enable the site visitor to continue reading the page while a selected term is queried at the remote site, with related items displayed as and when they become available. In order for this to happen, the remote source site(s) will need a SPARQL endpoint.
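The kind of request the front-end would send can be sketched as follows. In the planned system the call would be made asynchronously from the browser; this example only shows how the request could be assembled, using the `query` parameter of the SPARQL protocol and the standard JSON results media type. The endpoint address, the term URI and the `build_sparql_request` helper are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_sparql_request(endpoint, term_uri, limit=20):
    """Build a GET URL and headers asking what a remote source says about a term."""
    if any(ch in term_uri for ch in '<>" '):
        raise ValueError("term URI contains characters not allowed in an IRI reference")
    query = (
        f"SELECT ?property ?value "
        f"WHERE {{ <{term_uri}> ?property ?value }} "
        f"LIMIT {int(limit)}"
    )
    url = endpoint + "?" + urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return url, headers


url, headers = build_sparql_request(
    "http://example.org/sparql",                       # hypothetical endpoint
    "http://data.aim25.ac.uk/example/trench-warfare",  # illustrative term URI
)
```

Because the page issues this in the background, the visitor can keep reading while results arrive; the query itself is deliberately generic, since the shape of the remote data is not known in advance.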

The KCLAC site will also (shortly) feature its own SPARQL endpoint and triple store, which will enable remote systems to query records. This work will be undertaken after the conversion of tagging to RDFa has been completed.
