Australian Scholarly Editions Centre Projects (ASEC)

THE PROBLEM WITH E-TEXTS

Authentication and Collaborative Interpretation
A presentation for NLA staff, 21 June 2005
Paul Eggert and Phill Berrie
UNSW at ADFA, Canberra

Our presentation is about a technique for authenticating the accuracy of texts that are held electronically but need to be interpreted over time, collaboratively, by scholars and other readers. We have been working on it for some years, as intermittent funding has allowed. It depends on standoff markup files that are applied to the text file Just In Time. I'll explain that later, and Phill will demonstrate the system at work. He's going to show you one of three sample editions we are assembling in collaboration with scholars at other universities. The other two sample editions will be of Waltzing Matilda and of clips from two film versions of His Natural Life. Chris Tiffin and Graham Barwell are gathering the materials for them. What Phill will show us today is Ned Kelly's famous Jerilderie Letter, which I have edited afresh. Our JITM (or Just In Time Markup) system was originally designed for the manipulation of ordinary text; so far that's what it does best; and so that's what we'll show you today.

But first I need to give a little background, for, surprisingly, the very specialised aims that we started with have turned out to have an application that fits in with emerging agendas at DEST and the ARC for the provision of digital infrastructure to support what is being called e-research. And having heard Chris Rusbridge's paper here at the NLA on 9 June about the shift in how libraries and archives in the UK are understanding their responsibilities (that is to say, a shift from digital preservation to what he called digital curation), I have realised that our system is very much part of that swim too.

So first the background. It makes some sense to talk about two stages of humanities computing, which in fact parallel the shift Chris Rusbridge was talking about. Electronic scholarly editions designed in the 1990s were conceived as archives, but finished ones whose accuracy and completeness remained under the control of their creators. Those few editions that have survived had creators who realised that the use of a standard markup language at least gave independence from the unpredictabilities of proprietary software. Nevertheless, they were still stuck with the fact that what they had provided users with was essentially something that the physical book could have provided anyway: facsimiles and other images, transcriptions and ancillary information. Its digital form lent it new power, it is true, but unless you were a linguist or a computational stylistician the word-search and word-frequency tools helped you very little indeed. You as user still stood essentially outside the archive, just as you had done with the book. This was necessary if the ongoing accuracy of these electronic editions was to be guaranteed. The William Blake Archive at the University of Virginia is one of the few surviving, marvellous examples of this first stage of humanities computing. But it will nevertheless soon be seen as first-stage.

Our project was originally conceived in this mode, although more modestly. We were working on Marcus Clarke's novel His Natural Life. It was a spin-off from the Academy Editions of Australian Literature project. At first we thought we were creating an electronic equivalent to the hardback scholarly edition but with more information.

You all know what I mean by scholarly editions, yes? Here is one. They are based on a comparison of versions of the work in question. The work can be literary, biblical or historical. A great deal of care is taken to report the variant readings of the text in the various versions and to work out which was copied from which, to figure out who must have made the changes in the text in each version, and then, armed with this set of inferences, to establish either how the author wanted the text to read or its earliest retrievable form. But books, even big books, can only contain so much; tables of variants are hard to read; and in any case they are not the same thing as facsimiles. So our aim was a typically 1990s one: to create a CD or a website containing all the versions in facsimile and in transcription, together with an automatic collation of the versions and other information. If the reader did not agree with the editorial policy then at least all the information on which the edited text had been based would be readily available. Creating this textual assembly turned out to be a much bigger job than we thought.
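To give a concrete, if deliberately toy, sense of what automatic collation involves, here is a minimal sketch in Python. It is not the collation software used for the edition, and the function name and the sample readings are invented for illustration; it simply reports, word by word, where two transcriptions of the same passage diverge.

    import difflib

    def collate(version_a, version_b):
        """Report variant readings between two versions, word by word."""
        a, b = version_a.split(), version_b.split()
        matcher = difflib.SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                print(f"{tag}: {' '.join(a[i1:i2])!r} -> {' '.join(b[j1:j2])!r}")

    # Invented example readings, purely for illustration
    collate("the text as the author wanted it to read",
            "the text as the editor wished it to read")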

Then Phill hit on the brilliant idea that we could guarantee the accuracy of all our text versions via a checksum algorithm if only we divided up the digital information into content on the one hand and markup on the other. This was counter-intuitive because the 1990s paradigm – into which, it has to be said, digital text collections around the world have invested a lot of energy and money – was and still is that markup should be stored in-line within the text file.
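A minimal sketch of the principle, assuming Python and an off-the-shelf hash function (this is not the actual JITM checksum routine): because the checksum is computed over the textual content alone, the stand-off markup can change freely without disturbing it, and any accidental change to the content is detected at once.

    import hashlib

    def content_checksum(text):
        """Checksum of the verbal text only; all markup lives in separate files."""
        return hashlib.sha1(text.encode("ascii")).hexdigest()

    # Illustrative sample text, recorded at creation time
    stored = content_checksum("Dear Sir, I wish to acquaint you ...")
    # Verified every time the text is used, whichever markup is applied to it
    assert content_checksum("Dear Sir, I wish to acquaint you ...") == stored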

We went out on a limb. Fortunately the Academy of the Humanities and the Australian Research Council were prepared to go there with us and back our successive experiments. What we realised, late in the day, once the authentication routine had been perfected, was that we had done something that might have a wider application than we had originally thought. By splitting off text from interpretation of it (this is what markup essentially is) we had laid the groundwork for collaborative interpretation of the text that could be recorded separately from it and would not endanger its accuracy.

Could this be one building block in the new era of digital humanities research that is constantly being envisioned but never actually achieved? Well, we don't know the answer for sure yet, and especially we don't know to what extent some JITM-based system will undergird a new form of scholarly communication. We have not properly articulated how JITM will supplement and link to discovery databases such as AUSTLIT, which itself supports publication of specialised project datasets, or to other databases such as Picture Australia or Music Australia.

Ideally scholars want a working environment where humanities databases are interoperable, where data types are integrated and data is reliable. There has to be trust in digital resources before the community of scholars will change their working habits. It is not just a matter of 'more is better': new opportunities have to be trustworthy. Humanities scholars are used to working alone. But if an integrated environment can be created where data can be harvested as well as contributed, then a new, collaborative sense of community may spring up amongst humanities scholars. This, I believe, is going to be the second stage of the e-humanities, in which JITM can play a role. For this to happen, there are probably going to have to be joint ventures between libraries and the relevant scholarly communities. This next step needs to be taken, but on a bigger, program-like scale. I am not advocating big digitisation projects for their own sake (that won't wash with the funding bodies any more); I am talking about facilitating the collaborative interpretation of significant numbers of texts. I am talking, that is, about digital curation, to use Chris Rusbridge's term.

To get to this second stage we have to cater to the culture of research. What do humanities scholars typically publish? Their essays, as published in journals and books, often range over several or many works, discussing general themes, theories or discourses. What role might JITM have here? User-generated Perspectives on a text in JITM can be cited in another electronic document and produced and authenticated on request, essentially as if the document were conventionally quoting the text. The writer chooses the relevant markup file and then copies into his or her own essay the link that generates the JITM Perspective. Phill will show this happening in a minute. Having harvested, the scholar might also want to deposit their interpretation of the text, or some local aspects of it, as an additional tagset that may be of use to the next reader who comes along.
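By way of illustration only, here is how such a citation link might be put together; the server address, script name and parameter names below are hypothetical, not those of our actual CGI, but the principle is the same: the link carries the identity of the base text and of the chosen tagsets, and the Perspective is regenerated, and re-authenticated, whenever the link is followed.

    from urllib.parse import urlencode

    def perspective_link(server, text_id, tagsets):
        """Build a hypothetical citation link that asks the server for a Perspective."""
        query = urlencode({"text": text_id, "tagsets": ",".join(tagsets)})
        return f"{server}/jitm.cgi?{query}"

    print(perspective_link("http://example.edu/jitm", "jerilderie-letter",
                           ["page-breaks", "editorial-annotations"]))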

Anyway, that's your traditional, wide-ranging essay. That leaves: (1) the literary-critical sort that focuses on the interpretation of an individual work; and (2) the glossing or elucidation of difficult passages in works, or commentary on their textual complications. Traditions of interpretation of canonical texts are a well-known feature of humanities scholarship. Here JITM comes into its own, because the ongoing, collaborative interpretation of a constantly authenticated text is its central feature.

It also becomes feasible to allow copies of the text and of the markup to be duplicated safely in many locations. Copies of the text can be separately interpreted without danger because every markup file authenticates any copy, or reveals where it has been corrupted. So there is no disaster-recovery problem for the text. Markup files are not yet separately authenticatable. But an extension of JITM called JITAM can authenticate existing text files that are already heavily marked up and that are then subject to further interpretation.

JITAM works this way. By treating a marked-up file as a series of content characters on the one hand and readily identifiable markup on the other, as you can with SGML-like languages, it becomes possible to add or modify markup while verifying that the textual content of the file has not been accidentally changed. This will be so despite changes in platform, file format or added interpretation for new functions: say, to suit an e-book reader or to create a typesetting file. What we are positing, and now experimenting with, is the idea of the logical independence of text from its presentation and rendering on screen, whatever form that may take.
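A simplified sketch of that idea, on the crude assumption that anything between angle brackets is markup and everything else is content (the real routine is more careful than this): the content checksum then survives re-tagging or conversion to another format, and flags any accidental change to the words themselves.

    import hashlib
    import re

    def content_only(marked_up):
        """Strip SGML/XML-like tags, keeping only the content characters."""
        return re.sub(r"<[^>]*>", "", marked_up)

    def content_checksum(marked_up):
        return hashlib.sha1(content_only(marked_up).encode("ascii")).hexdigest()

    original = "<p>Dear Sir<lb/>I wish to acquaint you</p>"
    retagged = "<div class='letter'><p>Dear Sir<br/>I wish to acquaint you</p></div>"
    assert content_checksum(original) == content_checksum(retagged)  # same words, new markup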

Now I want to recapitulate the features of JITM via a diagram, so you'll understand what is happening in the background later as Phill drives the machine. Please look at the diagram. JITM uses stand-off markup stored in any number of external tagsets. The user chooses which of them will present or interpret the base transcription file. The chosen tagsets are inserted into the transcription file upon the user's call ('Just In Time'). The base transcription file itself consists only of the verbal text contained within uniquely identified text elements. JITM automatically authenticates the content of these text elements against a stored checksum value both before and after the embedding of the markup into the text.
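For those who like to see the moving parts, here is a much-simplified sketch of that flow; the data structures and the single tagset below are invented for the purpose, and the real base files are TEI/SGML rather than Python dictionaries, but the order of operations is the one shown in the diagram: authenticate, embed the chosen stand-off markup, authenticate again.

    import hashlib

    def checksum(s):
        return hashlib.sha1(s.encode("ascii")).hexdigest()

    # Base transcription: uniquely identified text elements, plus stored checksums
    base = {"te1": "Dear Sir I wish to acquaint you with some of the occurrences"}
    stored = {eid: checksum(text) for eid, text in base.items()}

    # A stand-off tagset: (element id, character offset, tag to insert)
    tagset = [("te1", 0, "<hi rend='italic'>"), ("te1", 8, "</hi>")]

    def build_perspective(base, stored, tagset):
        for eid, text in base.items():                        # authenticate before
            assert checksum(text) == stored[eid], f"{eid} corrupted"
        out = dict(base)
        for eid, offset, tag in sorted(tagset, key=lambda t: -t[1]):
            out[eid] = out[eid][:offset] + tag + out[eid][offset:]
        for eid, marked in out.items():                        # authenticate after
            stripped = marked.replace("<hi rend='italic'>", "").replace("</hi>", "")
            assert checksum(stripped) == stored[eid], f"{eid} altered by embedding"
        return out

    print(build_perspective(base, stored, tagset)["te1"])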

The practical effect of this authentication is that JITM allows a base transcription file to be annotated or augmented with analytical or structural markup, in parallel and continuously, while retaining its textual integrity. Different or conflicting structural markups can be applied to the same base file in separate passes because they are in different stand-off files and can be applied to the base file selectively. In all of this, the base transcription file remains as simple as possible (thereby greatly easing its portability into future systems) and the authentication mechanism remains non-invasive. Because JITM separates the textual transcription from the markup, copyright is also simplified: since the markup is necessarily interpretative, copyright in it can be clearly attributed, even if the work itself is out of copyright. Although the output of the JITM CGI is a TEI-conformant SGML file, we convert that to a rendering in XHTML for display on all commonly available web browsers. In this way we don't disadvantage users by forcing them to buy SGML-capable applications.
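The conversion step can be imagined along these lines; the handful of element mappings below are examples only, not our actual stylesheet, but they show the kind of substitution involved in turning TEI-style markup into XHTML that any browser can display.

    import re

    # Illustrative mappings only: TEI-style elements to their XHTML equivalents
    TEI_TO_XHTML = [
        (r"<hi rend=['\"]italic['\"]>(.*?)</hi>", r"<em>\1</em>"),
        (r"<lb\s*/>", "<br />"),
    ]

    def render_xhtml(tei_fragment):
        out = tei_fragment
        for pattern, replacement in TEI_TO_XHTML:
            out = re.sub(pattern, replacement, out)
        return out

    print(render_xhtml("<p><hi rend='italic'>Dear Sir</hi><lb/>I wish to acquaint you</p>"))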

JITM is itself character-set independent, as is JITAM. However, because we are for now using the Text Encoding Initiative's DTD, and because it requires ISO-646, we are employing a workable, minimalist definition of the verbal text, restricted for the time being to the ASCII character set, with the addition of entity references for characters not so accommodated. The base transcription does not contain the graphic features, such as italicising, that we are all used to considering as an intrinsic aspect of a text; but then neither is this base file the textual representation that non-expert users will first read. They will typically read a perspective on the base transcription created by its conjunction with separate tagsets that render its page breaks, line breaks, italics, the long s and so on, and that will, in the act of rendering, authenticate it. Over to Phill.
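As a toy illustration of the entity-reference restriction mentioned above (this is not our production code), any character falling outside ASCII can be replaced by a numeric character reference, so that the verbal text itself stays within ISO-646.

    def to_ascii_with_entities(text):
        """Replace non-ASCII characters with SGML/XML numeric character references."""
        return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

    print(to_ascii_with_entities("naïve résumé"))   # -> na&#239;ve r&#233;sum&#233;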

FURTHER READING

There are papers about JITM and JITAM on the ASEC website. Our most considered pieces to date are in press: an article by Paul Eggert on JITM and theories of the text, 'Text-encoding, theories of the text, and the "work-site"', and one by Phill Berrie about JITAM, both forthcoming in Literary and Linguistic Computing; and a chapter, 'Authenticating Electronic Editions', by Phill Berrie, Paul Eggert, Graham Barwell and Chris Tiffin, in Electronic Textual Editing, ed. Katharine O'Brien O'Keeffe, Lou Burnard and John Unsworth (New York: Modern Language Association, forthcoming 2005).