3 Review of Background Material
This chapter reviews some background material necessary to understand how the JITM paradigm differs from current systems for creating electronic editions. It also reviews previous research on aspects of interest to the creation of a working JITM system. The sections of this chapter are:

3.1 Current Practices in Electronic Editions
3.2 Edition Authentication
3.3 Representation of Multiple Structures
3.4 Abstracted Markup & Reference Mechanisms
The fields of Information Science and Scholarly Literature are very active on the Internet and, as indicated in the introduction, some of the references included are to Internet resources, as the material may not yet have been published in physical form.

3.1 Current Practices in Electronic Editions

This section briefly covers some of the different methods in use for creating electronic editions and discusses some of their shortcomings.

3.1.1 HTML Editions

The HyperText Markup Language (HTML) is an application of SGML which has gained much popularity because of its use on the World Wide Web (WWW). Its popularity has arisen mainly through the availability of free, platform-independent HTML browsers. Starting with "Mosaic", developed by the U.S. National Center for Supercomputing Applications (NCSA), this led to "Netscape Navigator" from Netscape Communications Corporation and "Internet Explorer" from Microsoft. These free browsers and associated plug-in modules have contributed to the richness of the World Wide Web as it is experienced today.

HTML is used for a number of reasons. First, it is a technology which is available now and requires little effort and expense on the part of potential users to access the edition. Secondly, the HTML DTD provides access to a rich multimedia environment, allowing the edition's creator to combine different data types seamlessly in order to provide a richer experience for their users. Finally, the development environment is not difficult to master, and there is a great deal of HTML expertise available from amongst the ranks of web page developers working on the World Wide Web. These are all very compelling reasons for an editor to adopt HTML as the platform of choice for developing a simple electronic edition. The proponents of the use of HTML recognise that it might have some deficiencies in terms of its capabilities, but are more interested in getting down to the task of creating their editions than in tackling the steeper learning curve of some of the other, more rigorous systems (e.g. SGML and the TEI DTD).

The problem of most concern with the use of HTML for the creation of an electronic edition is standardisation, and the longevity of usefulness of the encoded files. The language is currently at version 3, with a working draft of version 4 released for comment by the Internet community on the 8th of July 1997. Netscape® and Microsoft®, the publishers of the two most commonly used browsers, are both pushing for more features to be added to the language and are adding support for these features to their own browsers without much regard for standardisation. This means that for an editor to reach the greatest number of potential users, they must use the subset of features common to all browsers.

The issue of the longevity of the usefulness of an HTML-based edition is emphasised by the release of a Recommendation for the XML 1.0 (eXtensible Markup Language) Specification by the World Wide Web Consortium (W3C) on the 10th of February, 1998. XML, as proposed by the W3C Working Group that created and developed it, is being presented as a version of SGML suitable for use on the World Wide Web. At the SGML/XML Asia Pacific 97 Conference, XML was being touted as a replacement for HTML, with the capabilities of SGML and the extra advantage of being able to specify exactly the appearance of a delivered document on the user's browser.
Since Netscape® and Microsoft® have both announced that they will support XML in their browser products, and it is recognised that the HTML standard lacks the rigour and features promised for XML, this does not bode well for long-term support for documents created using the current version of HTML. These problems are not insurmountable as long as the edition is maintained so that it supports, or at least is compatible with, the technologies used to access it. However, without long-term institutional support for the edition, this would appear to constrain the edition's longevity to that of its creator's interest in it. This support would also need to cover all copies of the edition if a proliferation of different states of the edition is to be prevented. To be fair, this is a criticism that can be levelled at all electronic editions because of their nature; but HTML, because of its changing nature and the lack of rigour in its DTD, is particularly vulnerable to this problem.

3.1.2 PDF Editions

Another option which has appeal to literary scholars wishing to create electronic editions is the use of Adobe's® "pdf" format. Once again there are viewers available for most major platforms, and the pdf format supports most of the media types which users have become accustomed to on the World Wide Web. The big benefit in using pdf for the creation of a scholarly edition is its adherence to PostScript for its presentation. The Adobe® Acrobat viewer effectively gives the user a PostScript display engine on their workstation, and so the creator of the edition can have almost as much control over the presentation of the edition as they can with a hard copy version. It is also possible for the electronic edition to be based directly on the hard copy version of the edition by converting the PostScript printer files to the pdf format. Adobe® has provided a number of ways to convert existing documents prepared for the print medium into the pdf format. This capability must be very attractive to the more traditional editor, who tends to think in terms of the printed edition. If they can create an electronic edition from the printed version by effectively reprinting it, then this technology would appear very attractive. There are also some theoretical issues which support this multiple use of a single set of files. By reusing the same files to generate the pdf files, there is some guarantee that the creation of the electronic edition has not involved the introduction of any transcription errors. However, it must be noted that creating the electronic edition directly from the print files used for a printed edition limits the features available in the electronic edition, as these files will not include structural information and other meta-data such as hypertext links.

The major problem involved with the use of the pdf format is its proprietary nature. The MLA's recently released Guidelines for Scholarly Electronic Editions [sect. I.E, MLA, 1997] specifically restrict the use of proprietary formats for the creation of electronic editions. The section concerned deals with the archival format of the edition, where it states:
The reasoning behind this decision is not detailed, but it is unlikely that a proprietary format will ever be defined as an international standard. Therefore, if Adobe® ceases to support the file format, there is little that can be done to stop any electronic editions created in this format slowly becoming unusable as the number of viewers that can access the files dwindles.

3.1.3 SGML Editions

The current SGML paradigm is for the editor of the edition to create an accurate, authenticated transcription and then add meta-data to the electronic edition in the form of markup incorporated into the text of the transcription. For work in the field of literature or the transcription of historical documents, the MLA-prescribed DTD is that developed by the Text Encoding Initiative (TEI) [sect. I.B, MLA, 1997]. The marked-up documents are viewed using an SGML browser which is compliant with the TEI DTD and can give the user access to the rich environment promised by the specifications of the DTD.

The first and foremost goal of an electronic edition of an historical document should be to ensure that the transcription of the original document is accurate. Without this basic requirement any edition lacks scholarly authority, because in essence it is not referring to the work that it is supposed to represent. An SGML-based electronic edition tends to consist of documents with extremely complex markup, and a large part of the work of creating such an edition must involve proofreading the content to ensure that it is still an accurate transcription. To add to this complexity, the TEI DTD is surely one of the most complex and convoluted markup schemes devised. It not only attempts to define tagging for the normal document structure of all forms of literature, markup for editorial comments and links to support files, but also includes the means for recording information about physical aspects of the original documents (e.g. lineation and pagination, details of corrections and marginalia in manuscripts, etc.). This last feature is unusual for a DTD, as SGML is generally more concerned with the logical structure of a document than with its physical form.

The complexity of the TEI DTD does not derive only from the proliferation of features supported by the DTD. The TEI DTD itself is written in a complex hierarchical manner employing an indirect referencing scheme which makes it difficult for readers trying to understand it. An example of this complexity is provided by the sophistication of the top-level DTD used by all document types covered by the DTD (i.e. documents from all forms of literature). This top-level DTD refers to other fragments of the DTD that deal with specific feature sets using the mechanism of "parameter entity references". The values of these parameter entity references are defined in the markup declaration at the beginning of the document instance, thereby activating the appropriate feature set. The mechanism is illustrated below.
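The fragment below is a simplified sketch of this mechanism rather than a verbatim extract from the DTD; the parameter entity name follows the Guidelines, while the file name "teiprose.dtd" and the comments are illustrative only.

    <!-- by default the prose feature set is switched off -->
    <!ENTITY % TEI.prose 'IGNORE' >

    <!-- the marked section below is processed only when TEI.prose
         has been redefined as 'INCLUDE' in the document instance -->
    <![ %TEI.prose; [
    <!ENTITY % TEI.prose.dtd SYSTEM 'teiprose.dtd' >
    %TEI.prose.dtd;
    ]]>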
To include the parts of the DTD that deal with prose, the user must define the parameter entity "TEI.prose" in the markup declaration part of the document instance to be the string "INCLUDE".
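In practice this is done in the internal subset of the document type declaration at the top of the document instance; a minimal sketch, in which the system identifier "tei2.dtd" is illustrative, is:

    <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
    <!ENTITY % TEI.prose 'INCLUDE' >
    ]>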
Any part of the DTD that does not have its parameter entity reference so defined is not included in the instantiation of the DTD for the specific document being parsed. This mechanism, while elegant and efficient with respect to instantiating the DTD at parse time, does not make the overall DTD easy to comprehend. Unfortunately, use of this mechanism is repeated throughout the DTD. This "feature", and the sheer size of the combined files of the DTD, reduce its readability for human readers, who are forced to refer to multiple files simultaneously in order to understand it. The computer code-like nature of SGML would also reduce readability for academics who may not be familiar with such formal schemas.

It is a contention of this report that these difficulties have limited the adoption of SGML for the creation of electronic editions. The only institutions adopting it have been scholarly institutions which have developed their cross-discipline skills enough to be able to tag literary meta-data in SGML in accordance with the TEI DTD. This is not necessarily a bad thing, but experience with the ASEC has indicated that the overheads involved are not inconsiderable and that the combined skills of literature scholar and SGML encoder are not readily found in the same person. In support of this claim is the fact that the TEI has produced its own simplified version of its DTD, called "TEILite". The TEILite DTD is a smaller version of the full TEI DTD which contains only the most commonly used elements of the full DTD. It is also completely defined in one file, by including all the contents of the entity references used in the full DTD. This simplification provides a more comprehensible DTD, and allows SGML browsers like SoftQuad's "Author/Editor" to work with a TEI-endorsed DTD. On a page on the TEI web site, Lou Burnard, one of the editors of the Guidelines for Electronic Text Encoding and Interchange P3 [Sperberg-McQueen et al., 1994], says that "Author/Editor" and "FrameMaker" (another SGML application, from Adobe®) cannot handle the "indirect nature" of the TEI DTD.

Another reason why the adoption of SGML for the creation of electronic editions has been poor is the scarcity of freely available SGML browsers that can parse the TEI DTD on low-end desktop computers. Proprietary browsers have been available, but they generally require high-end machines and cost in the hundreds of dollars. Things may be changing with regard to SGML-capable browsers, however: SoftQuad has just released a cheap version of their "Panorama" SGML browser, implemented as a plug-in for the common World Wide Web browsers. It remains to be seen, though, whether this new version of Panorama can parse documents created with the TEI or TEILite DTDs.

The systems described in this section all have one thing in common: they exist in a digitally encoded form which must be interpreted and displayed using software before the user can make use of them. Because the information may not be otherwise available, there has to be some level of trust assumed by the user that the information displayed is what it purports to be. Unlike with physical editions, the clues that this may not be the case are not so obvious. The following section looks at the issue of authentication of electronic editions.

3.2 Edition Authentication

As already discussed, the accuracy of the transcription of the contributing states is paramount if any authority is to be claimed by the editor for scholarly assertions in a scholarly edition.
Since the original source documents may not be easily accessible, it could be difficult for users of an edition to check the accuracy of the transcription. It therefore becomes a matter of trust that the edition is based on accurate transcriptions of the original states. For any edition, it is generally the reputation of the editor that defines the level of trust that a user has in the accuracy of the transcription. In a physical edition, modifications to the transcription after publication are fairly obvious because of the nature of the medium used (i.e. ink on a page). In an electronic edition, however, modification after publication can occur quite easily and has a far greater chance of going unnoticed. This vulnerability of electronic editions to invisible modification creates yet another area of concern for the scholarly user. Can they trust that the copy of the edition they are working on has not been modified? Can they be sure that their copy of the edition is the same as that published by the editor?

Another aspect of this capability for invisibly modifying electronic editions is that it allows the editor to quickly add to or enhance the edition. While this may be a useful capability, adding any new material to an edition has two potential consequences for the authenticity of the edition. The first consequence is that any editing of the file brings with it the chance of introducing an accidental variant into the file. Ignoring such gross occurrences as file corruption or accidental deletion, subtler changes can occur. An example of a subtle error involves the mistaken correction of misspelled words. Words misspelled in the original document, which should be recorded as originally printed (according to scholarly editing principles), are corrected (sometimes unconsciously) in the transcription by the person working on the file. The second consequence is that inserting any new material (even meta-data encoding) into the transcription file creates a new state of the work. In the case of meta-data it can be argued that the changed file is a new state of the work that is the electronic edition (i.e. the work that is the combination of the original transcriptions and the meta-data encoded into it) but that the literary work is unchanged. That distinction is not relevant to the problem that inserting material of any type might cause, because unless it can be guaranteed that all previous versions of the electronic edition can be replaced with the new version, there is the potential for the proliferation of any number of "slightly" different versions of the electronic edition. For future scholars relying on these electronic editions, this could result in a situation analogous to the proliferation of different states produced by scribes when transcribing manuscripts by hand.

This section looks at the current standard of authentication in electronic editions. It then briefly discusses some of the technology available for electronic document authentication with regard to its suitability for use in electronic editions.

3.2.1 Current Standards (or lack thereof)

It became apparent during this literature review that the issue of authentication of electronic editions after publication has only recently come under consideration. Considering the importance of this issue, this omission is hard to understand unless one takes into account the "technological" distance between the fields of bibliography and computer security.
Even Peter Shillingsburg's seminal work, Scholarly Editing in the Computer Age [Shillingsburg, 1996], first released in 1986 and now in its third edition, fails to mention the authentication of files after publication, while stressing heavily the need for exhaustive proofreading in the creation of the edition [p. 134, Shillingsburg, 1996]. More evidence of neglect of this missing feature is found in its omission from the latest draft of the TEI Guidelines [Sperberg-McQueen et al., 1994], released in May 1994. The TEI P3 guidelines restrict their discussion of revision control to maintaining a list of revisions (i.e. by hand) in the "header" of the document. This method is inadequate mainly because it only records intentional changes and ignores the possibility of accidental or malicious changes occurring to the document and going undetected. It has no way of checking that the "body" element of the document corresponds with what is described in the header, other than through rigorous proofreading whenever there is an indication that the document may have been changed. The TEI appears to be heading in a dangerous direction when it comes to maintaining the accuracy of transcriptions. It seems to favour experimentation with the new medium, ignoring the possible consequences of a proliferation of different states of an electronic edition. The guidelines even go so far as to state the following:
and continue on to talk about updates to an edition, based on the number of substantive changes that have to be made before the document is to be considered a new edition. Perhaps this arises because the TEI guidelines take the view that meta-data is not part of the original transcription. This is a fallacy (as discussed earlier): as long as the meta-data is incorporated into the body of the transcription file in the form of embedded tags, there are dangers to the authenticity of the transcription and therefore to the authority of the edition.

It is only in the recently released Guidelines for Electronic Scholarly Editions by the Modern Language Association of America that we find mention of a mechanism for authenticating an edition's content [sect. I.C.4a, MLA, 1997]. In the section on revision control, the guidelines recognise how easy it is to accidentally manipulate the contents of a digital file, suggesting that the editor of an electronic edition should incorporate into the edition some means for checking whether the files have been modified. The guidelines do not go into detail as to how this should be done.

3.2.2 Digital Authentication Technology

How does one verify that the contents of a digital document have not been changed? The file size and modification date attributes used in computer file systems are the simplest and most commonly used means available. However, these characteristics of a file are not definitive indicators that a file is as it is expected to be. For example, a file can be modified without its size being changed. Similarly, the use of a file's modification date is subject to a number of problems. To start with, the modification date of a file is initially recorded and maintained by the computer's file system, on the assumption that the host computer's internal clock is correct, so that the modification date stored for the file is approximately the real time at which the file was last modified. Other problems occur because of the way different computer systems handle the file modification date attribute. For example, in file systems that only maintain a modification date for a file (i.e. Windows and UNIX) there is no way to determine whether the file has been modified, except by personal experience of previous versions of the file.

The worst examples of the untrustworthiness of the file modification date occur when files are copied from one operating system to another. On the dominant personal computer platforms of today, a copy of a file has the same modification date as the file that was copied. This is the preferred result in this instance, since the copying process has not modified the contents of the file. However, under UNIX, the predominant operating system used for server systems on the World Wide Web, the modification date of a copied file is the date at which the copy was created. This difference in how the operating systems time-stamp copied files becomes a major problem when copying a file from a personal computer to a UNIX machine, because different file transfer mechanisms produce different modification dates for the copied file. For example, copying a file from Windows 95 to an NFS file structure on a UNIX machine gives the copied file the modification date of the original file, while copying a file from a Macintosh to an AppleShare volume running under CAP results in the file having the modification date of the time at which it was copied.
If the file is "ftped" from either type of personal computer to the UNIX machine, it is given the copying date as its modification date on the UNIX machine. Clearly, other mechanisms must be used to rigorously authenticate digital documents.

Authentication is a major area of study in the field of computer security, covering a range of issues such as user authentication, private keys for encrypting and decrypting messages, and plain-text message authentication. It is this last aspect which is of interest to this project. Used mainly for commercial transactions, plain-text message authentication is employed in situations where the content of the message is not necessarily confidential, but the receiver of the message needs to be able to trust that the message is from whom it claims to be and that it has not been modified in transit. The technologies used for such systems would seem to suit the requirements for an electronic edition authentication scheme, and they will now be investigated.

In the trusted plain-text communications system described above, the message authentication scheme generally involves the inclusion of a Message Authentication Code (MAC) with the plain-text message. The MAC is necessarily generated from the text of the message, so that there is a one-to-one relationship between the two. The recipient, knowing the plain-text message, the accompanying MAC and the method of MAC generation, can validate the message [p. 220, Seberry et al., 1989]. The sophistication of this basic idea varies greatly in implementation and depends on the level of secrecy involved in creating the MAC. The simplest form of this mechanism was originally called a Manipulation Detection Code (MDC) [p. 9, Christoffersson et al., 1988], and a description of how an MDC scheme is implemented will be used to introduce some of the basic ideas involved in computer security.

The MDC is generated by passing the plain-text message through a one-way hash function. The MDC is supplied with the message, and the recipient can check that the message has not been altered by passing it through the same hash function to see whether the same MDC is calculated. If the results agree, then it is very likely that the file has not been altered. A hash function is defined as a cryptographic function for computing a fixed-length digest of a message [p. 12, Pieprzyk et al., 1993]. A digest in this definition is a condensed summary of the message. It is not human readable, but it can be used to identify and authenticate the plain-text message. The function is defined such that the likelihood of two messages creating the same digest is very small. If the size of the digest string is limited to 128 bits (approximately 16 characters), then there are only 2^128 different digests possible for a theoretically infinite number of arbitrary-length messages. Since the number of possible messages is greater than the number of possible digests for any fixed digest length, there must be a many-to-one relationship between the elements of the set of all possible messages and the set of possible digests. Therefore a useful hash function for creating an MDC is one that will calculate very different digests from similar messages, so that the chance of two similar messages generating the same output string is very small. A one-way hash function is a hash function which is very difficult to reverse.
This means that the digest is easy to compute, but it is difficult to determine an input message that will generate a specific digest [sect. 2.4, Christoffersson et al., 1988]. The security of the MDC depends on the secrecy of the hash function used and on the difficulty of determining what the hash function is from plain-text/MDC pairs. There is no verification of the authenticity of the sender built into the system, so if the hash function becomes public then false messages that would verify as authentic could be generated by anyone knowing the function. For these reasons MDCs have lost favour for document authentication in secure computer communications. The current state of the art for authentication using a MAC is the digital signature. Digital signatures were first mentioned by Diffie and Hellman in 1976 [Diffie et al., 1976]. They based their development of digital signatures on the following basic assumptions:
In a digital signature system the sender incorporates a digital signature with the message to be sent. The signature is created from the message text using a publicly known function, but it is individualised using the sender's secret key. The recipient, on receiving the message, uses a public key (specific to the sender) not only to verify that the message has not been modified, but also to verify who the original sender was. The biggest benefit of this system is that its secrecy depends on the sender's secret key, which is unique to each individual. Gaining someone's secret key would allow an attacker to pose as the key holder, but this would not put anyone else using the system at risk. This form of message authentication is seen as being secure enough to form the basis for Internet-based financial transactions [sect. 2.6, Schneier, 1994].

The use of an authentication system based on a MAC dependent on the content of the plain-text message has obvious attractions for the authentication of electronic editions. However, there are some possible disadvantages to such a system. The relationship between the plain-text message and the MAC is normally represented by including the MAC in the same file as the message. The inclusion of a MAC in the files of an edition could limit the long-term use of those files, as the authentication technology used and supported by the software industry will no doubt change. Such changes would require the creation of a new state of the electronic edition even though there had been no changes to the original transcriptions. The proposed JITM paradigm makes use of some of the technologies mentioned in this section, but uses them in a way which does not modify the transcription files. Section 4.3 "Content Authentication" details how a JITM system would abstract the authentication mechanism away from the transcription files, and some of the advantages this mechanism brings to future-proofing a JITM-based electronic edition.

3.3 Representation of Multiple Structures

The implicit logical structure of a document created under the SGML paradigm is that of the hierarchy tree, where elements can be progressively expressed as collections of smaller elements bounded by their parent element. When recording the content and features of an existing document, there may be a number of structures which could be recorded. The document fragment below shows two conflicting structures.
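A minimal sketch of such a fragment is given below; the element and attribute names, and the placeholder text, are illustrative only.

    <page n="42">
      <p>
        <line>The first line of a paragraph that begins</line>
        <line>near the foot of one page and is completed</line>
    </page>
    <page n="43">
        <line>on the page that follows it.</line>
      </p>
      <p>
        <line>The opening line of the next paragraph.</line>
      </p>
    </page>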
The page and line elements represent the typography of the existing state of the work, which is useful for referring back to the original from the electronic edition (especially for proofreading), whereas the paragraph tagging defines the logical structure of the document. These two structures conflict in that paragraphs cross page boundaries. Using the normal rules of SGML, a single DTD could not be defined to handle this situation.

Three of the four methods described in this section are mentioned in a paper in Computers and the Humanities [Barnard et al., 1988], and so this paper is used as the basis for this section. Only two of the methods are in common use today; the others are mentioned because of their similarity to the JITM paradigm. The first method makes use of "milestone" tags; the second is the CONCUR feature of SGML; the third is called markup filtering; and the last method uses entity references.

3.3.1 Milestones

The relevant section of the TEI Guidelines, Transcription of Primary Sources [sect. 18, Sperberg-McQueen et al., 1994], suggests the use of milestone tags for the recording of physical phenomena in the source document. These phenomena include unusual spaces in the document, page and line breaks, and changes of manuscript hand or ink. Such phenomena do not have any bearing on the content of the document, but can be useful as references to the source material. Electronic editions where the transcription is accompanied by facsimiles of the original source document could find the addition of such markup especially useful for reference purposes. Milestone tags in the TEI DTD are defined as empty elements which mark boundaries between elements in a non-nesting structure [sect. 31, Sperberg-McQueen et al., 1994]. These empty elements cannot be further parsed and are therefore designated by single tags, using the SGML tag minimisation rules; they can only find use as reference markings in the marked-up document or for controlling the presentation of a page in a browser that supports this feature. The two main milestone tags used in the transcription of source documents are the line break and page break tags, "<lb>" and "<pb>" respectively. The following example shows how the previous example could be rewritten using milestone tags so that both structures can be recorded in the same document and still have a valid DTD hierarchy.
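Recast with the milestone tags, the same placeholder fragment might look like this; only the paragraph elements now need to nest, while the page and line breaks are recorded as empty tags at the points where they occur.

    <pb n="42">
    <p><lb>The first line of a paragraph that begins
    <lb>near the foot of one page and is completed
    <pb n="43"><lb>on the page that follows it.</p>
    <p><lb>The opening line of the next paragraph.</p>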
Milestone tags can contain attributes, a feature which is normally used to indicate their relationship to the source. For example, a page break tag may have an attribute indicating the number of the new page, as shown in the example above. Line break tags could also be identified with attributes, but this may not be a good idea, as it depends on what document reference system is in use. For example, are lines numbered from the beginning of a page or from the beginning of a paragraph? Both are legitimate reference points. To avoid ambiguity it is better to leave the line break tags unidentified and let the reader or browser do the simple task of counting lines from the specified reference point.

The section Boundary Marking with Milestone Elements of the Guidelines [sect. 31.2, Sperberg-McQueen et al., 1994] details how milestones can also be used to mark up more complicated multiple structures. In an example similar to the one shown below, the guidelines demonstrate how complicated dialogue can be, and how some form of multiple-structure markup may be required to represent the structure involved.
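The fragment below is an invented sketch of the kind of passage the guidelines discuss: a quotation that contains whole paragraphs, one of which itself contains a further quotation.

    <p>The stranger leaned forward and began:
    <q>
    <p>I will tell you exactly what happened. It began, as these
    things do, with a letter.</p>
    <p>The letter said only, <q>Come at once.</q> So I came.</p>
    </q>
    </p>
    <p>We waited for him to go on.</p>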
This piece of dialogue contains quotations within a quote, which makes it difficult to maintain a sequential ordering of the paragraphs in the text, as the paragraphs that occur inside the quotation are not at the same level as the other paragraphs in the document's hierarchy tree. The following example shows how handling the dialogue with milestone tags can get around this problem.
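Recast with quotation-boundary milestones (the "<qb>" and "<qe>" tags discussed below; the text is the same invented placeholder), every paragraph sits at the same level, at the cost of leaving the pairing of the quotation markers implicit.

    <p>The stranger leaned forward and began:</p>
    <qb>
    <p>I will tell you exactly what happened. It began, as these
    things do, with a letter.</p>
    <p>The letter said only, <qb>Come at once.<qe> So I came.</p>
    <qe>
    <p>We waited for him to go on.</p>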
The " <qb>" and "<qe>" milestone tags are used to indicate sections of dialogue in the text and a specifically designed browser could be used to handle the dialogue marked up in this fashion. Because each milestone tag is complete in of itself, a standard SGML parser cannot be used reliably to extract content from between milestones. In the previous example, there is no way for a standard browser to recognise which "<qe>" tag belongs to which "<qb>" tag.The milestone method for marking up physical typography (i.e. lineation and pagination) of source documents is a very useful feature of the TEI DTD. The proposed JITM paradigm would use these milestone tags in the creation of authenticated transcription files because of the assistance they provide in proofreading. However this type of meta-data is not required in the final transcription file, and automated filters will be used to extract these milestone tags leaving the basic logical structure tags. The extraction process will be used to create a meta-data tag set containing the information required to regenerate the physical layout of the original source document as a particular perspective of the transcription if required. 3.3.2 The CONCUR FeatureThe CONCUR feature of SGML as defined in the SGML Handbook [p. 177, Goldfarb, 1990] allows for the encoding of two or more document structures in the one document. The distinct encodings do not interact in any way and the document can be parsed against any of the represented structures as if the other structures were non-existent. The following example show an extract of a document that conforms to two concurrent DTDs and is encoded using the CONCUR feature.
Qualification of the tags is done using a document type specification: the name of the appropriate DTD is given in parentheses before the tag identifier. The markup qualified with "(phys)" is used to record information about the typography of the original source document, while the "(log)" markup shows the logical structure of the content of the document. The "<bold>" elements belong to both DTDs and therefore do not need to be qualified. A document marked up in this fashion needs multiple DTDs, one for each document structure to be recorded. The DTD appropriate to the user's requirements would need to be specified in the document's declaration so that the document can be parsed for presentation on the user's browser. Markup tags that are qualified for another DTD are ignored in parsing, and access to their meta-data is not available. The different structures of the example above could be presented in parallel by parsing the document twice and displaying it in separate windows. A big disadvantage of the CONCUR method is the increase in the size of the markup caused by the qualified tagging. This decreases the readability of the file and therefore makes proofreading harder. The CONCUR mechanism is the preferred method of representing multiple structures in the TEI Guidelines. However, it is only an optional feature of the SGML standard and, according to the electronic paper Lessons for the World Wide Web from the Text Encoding Initiative [Barnard et al., 1996], has not been supported by many SGML-compliant browsers.

3.3.3 Markup Filtering

Barnard's filter method [Barnard et al., 1988] looks like a combination of the CONCUR scheme and the JITM paradigm in reverse. Like the CONCUR method, the transcription file is marked up for all structures of interest, but unlike CONCUR the tags are not qualified to indicate ownership by a particular DTD. The following familiar example shows how a transcription file might look when encoded for this system.
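A sketch of the same placeholder fragment encoded for the filter method follows; both structures are present, but none of the tags carries a document type qualifier.

    <page n="42">
    <p><line>The first line of a paragraph that begins</line>
    <line>near the foot of one page and is completed</line>
    </page>
    <page n="43">
    <line>on the page that follows it.</line></p>
    <p><line>The opening line of the next paragraph.</line></p>
    </page>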
This document will not parse because of conflicting structures. The user must decide which structure they are interested in viewing and then pre-process the transcription file through a filter which will eliminate any markup which is not part of the DTD for the view of interest. For example, filtering out the physical elements from the transcription file would result in the following file being sent to the SGML browser.
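Continuing the illustrative tagging used above, that file would be:

    <p>The first line of a paragraph that begins
    near the foot of one page and is completed
    on the page that follows it.</p>
    <p>The opening line of the next paragraph.</p>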
If the other DTD were chosen and the logical structure of the transcription filtered out, then the following encoding would result.
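With the paragraph tagging removed instead, the file sent to the browser would be:

    <page n="42">
    <line>The first line of a paragraph that begins</line>
    <line>near the foot of one page and is completed</line>
    </page>
    <page n="43">
    <line>on the page that follows it.</line>
    <line>The opening line of the next paragraph.</line>
    </page>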
Parallel views of the document emphasising the different structures could be viewed in separate windows simultaneously, with appropriate parallel filtering of the original source document. The biggest problem with this method is that the original transcription file cannot be parsed to check its consistency against the contributing DTDs. This consistency check therefore has to be done by hand as part of the proofreading of the transcription file, a job which is made especially difficult by the extra markup. No further references to this method were found in the literature. The most likely reason is that the next method for discussion achieves the same result with a much more elegant mechanism.

3.3.4 Markup by Entity Reference

This method was also found in the Barnard paper [Barnard et al., 1988]. As with the markup filter method described previously, all structures of interest are encoded into the transcription file. Instead of element tagging, however, entity references are used, as shown in the example below.
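A sketch of the same placeholder fragment encoded with entity references follows; the entity names ("pg", "ln", "para" and their ".e" end-markers) are invented for illustration. Note that the page-number attribute of the earlier examples cannot be carried by a fixed entity, a limitation discussed below.

    &pg;&para;&ln;The first line of a paragraph that begins&ln.e;
    &ln;near the foot of one page and is completed&ln.e;
    &pg.e;&pg;&ln;on the page that follows it.&ln.e;&para.e;
    &para;&ln;The opening line of the next paragraph.&ln.e;&para.e;&pg.e;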
When this document is scanned by the parser's lexical scanner, whatever is defined as the string substitution value of each entity reference is substituted into the file for parsing. With the appropriate entity declarations defined, any set of tagging can be activated in the document to generate the desired view of the transcription. The following example shows how the physical typography elements can be eliminated from the transcription file with the right set of entity declarations.
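Continuing the illustrative entity names used above, declarations of the following kind would suppress the physical tagging while activating the paragraph tagging:

    <!ENTITY pg     "" >
    <!ENTITY pg.e   "" >
    <!ENTITY ln     "" >
    <!ENTITY ln.e   "" >
    <!ENTITY para   "<p>" >
    <!ENTITY para.e "</p>" >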
These entity declarations would result in the following:
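Assuming those declarations, the text presented to the parser would be equivalent to:

    <p>The first line of a paragraph that begins
    near the foot of one page and is completed
    on the page that follows it.</p>
    <p>The opening line of the next paragraph.</p>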
This method is elegant and results in a file similar to the output file from the JITM system, but it suffers from some of the same problems as the other methods. A great deal of extra text is added to the transcription file in the form of entity references, which will make proofreading difficult. Another major problem is that attributes that change, such as a page number, cannot be represented by a single entity declaration, and so some of the referencing of the transcription file is lost using this method.

The methods for representing multiple structures discussed in this section have varying degrees of success in fulfilling this requirement for a sophisticated electronic edition. All suffer from one major problem: complexity. The complexity of markup for two structures makes the file difficult to read, and this can only get worse if more structures are added. The JITM paradigm solves this problem by abstracting the markup for an edition away from the transcription of the document, and only applying the markup required for a perspective when necessary. The following section looks at other abstracted markup mechanisms used in SGML-based systems.

3.4 Abstracted Markup & Reference Mechanisms

For this report, the term "abstracted markup" is defined as the creation and maintenance of meta-data for a document in a separate file. The JITM paradigm uses this concept to maintain a simple-to-create, easy-to-read, easy-to-authenticate, unbiased transcription file, and merges this transcription file with a user-specified set of markup options to create a specifically encoded version of the transcription file, a perspective, to be viewed using an SGML browser. This technique allows a number of different perspectives of the document to be created without having to store and maintain multiple states of the edition. The previous section on multiple structures dealt with a similar issue in terms of the logical structure of a document, but all the techniques mentioned in that section required a point of reference within the document (generally in the form of a string of embedded characters) that was to be replaced or reinterpreted by the system to enable the required logical structure. To achieve its goal of being able to embed markup without requiring a specific internal point of reference, the JITM paradigm has to have an external reference system which allows it to refer to any span of the transcription file, down to an individual character. The JITM paradigm's use of a combination of abstracted markup and a referencing scheme to do just-in-time markup of structural elements is unique in terms of a document creation system. References to abstracted markup in the literature are generally limited to those dealing with hypertext linking. This section will therefore look at some of these schemes.

3.4.1 Transclusions

In 1965 Ted Nelson coined the term "hypertext" and proposed the Xanadu universal document system, or "docuverse". The concepts behind the Xanadu system were very advanced and beyond the technology of the day. Nelson has been continually developing the Xanadu concept, and an operational system was first reported in BYTE Magazine in 1988 [Nelson, 1988].
With regard to reference schemes for hypertext systems, one of the main features of the Xanadu system reported in the BYTE article was the concept of an "inclusion" (since renamed a "transclusion"), whereby any part of a document on the system could be referred to from any other document and the two sources would be seamlessly integrated on the user's document viewer. This mechanism was intended to solve a number of potential problems that Nelson foresaw with hypertext environments. Firstly, using a transclusion meant that there could be no transcription errors involved in quoting from another document. Secondly, since the document that contains the transclusion reference does not contain any of the text of the other document, it does not break copyright, and it is only in the instantiation of the transclusion that royalties become payable. The Xanadu system was envisioned as having an automated system that would handle the calculation and payment of any royalties for the referenced material as users accessed a document through transclusion references.

In the proposed Xanadu system, any document entered into the system could be referenced at any level. Transclusion references could be to data of any type and had to be able to refer to a document down to its smallest meaningful components (i.e. characters, with regard to textual information). Another requirement of the transclusion paradigm was that a document, once entered into the system, could not be removed or changed in any way, since there was no way to know whether such an action would invalidate a transclusion reference from another document. The basis for the Xanadu system was a universal reference schema which could define any span of any document on the system. This schema used a complicated composite number system based on entities called "tumblers", which the system used for referencing documents [pp. 231-235, Nelson, 1988]. The article stressed that the reference scheme was only to be used by the Xanadu system itself, which is understandable since the tumblers could not be understood without access to a master index to give meaning to the numerical components.

The Xanadu project is now a commercial project being developed jointly by Xanadu Australia™ and Ted Nelson, who is now Professor of Environmental Information at the Shonan Fujisawa Campus of Keio University in Japan. As the following extract from the Xanadu Australia FAQ [Pam, 1997] indicates, they have handled most of the technical aspects of the project and are now dealing with some of the legal aspects of the system:
where transpublishing and transcopyright deal with the payment of royalties arising from the instantiation of transclusions. The Xanadu system is not based on SGML and, since it is proprietary, little is known about the way the transclusion mechanism is implemented in the current version of the system. However, the ideas expressed in Ted Nelson's original vision have appeared in many document systems, some of which are based on SGML applications [Pam, 1997].

3.4.2 Pointers and Links

Although they fall short of the accuracy and sophistication of transclusions, one of the basic advantages of today's hypertext environments is the ability to associate different digital resources using links. The most common example of this is the hypertext link in HTML, where documents available anywhere on the World Wide Web can be accessed by the click of a mouse in the user's browser. This is a simple but powerful paradigm, and is the basis of the enormous popularity of the World Wide Web, where individual web site authors can leverage off each other by creating hypertext links to other sites of interest to their subject. A limitation of the HTML anchor tag is that it downloads the entire document referred to in its Uniform Resource Locator (URL). An anchor reference can be added to the "href" attribute's URL using the "#" separator, as shown in the example below.
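For instance (the URL, file name and anchor name here are purely illustrative):

    <!-- in the linking document -->
    <a href="http://www.example.edu/edition/chap02.html#note17">See note 17</a>

    <!-- in the target document, chap02.html -->
    <a name="note17">17. The note referred to above.</a>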
Such a reference will scroll the document in the user's browser to a specific location in the document, helping to maintain the relevance of the link. However, to achieve even this level of control, the anchor tag references have to exist in the target document. If they do not exist, then the external link has no automatic means of controlling access to a particular section of the linked document.

In electronic editions, hypertext links are seen as providing facilities which cannot be achieved in physical editions. They can provide the ability to associate parts of the edition with multimedia, contextual materials that give readers unfamiliar with the period extra information to help with their understanding of the work. This capability is certainly possible in HTML, as was demonstrated by an ARGC Pilot Project which the ASEC put together and presented in 1996. The demonstration was a small fragment of an electronic edition of The Term of His Natural Life by Marcus Clarke. This fragment included hypertext links to maps, audio tracks of period songs and video clips from movie adaptations of the book. However, it is hard to see how HTML, with its limited linking facility, could implement a usable textual apparatus, which (as specified in the MLA's Guidelines for Scholarly Electronic Editions [MLA, 1997]) is a required part of a scholarly edition. A textual apparatus would require the reading text of the edition to have one-to-many links from variants in the reading text to the appropriate spans in the transcriptions of the contributing states that were in variance, so that the reader could see how the editor made their selection of the reading text. This is beyond HTML's linking capabilities.

The SGML standard (ISO 8879) does not contain any references to hypertext linking: this is left up to the individual applications of SGML. To assist in establishing some standards in this area, the International Standards Organisation (ISO) produced a standard extension to SGML, called the Hypermedia/Time-based Structuring Language (HyTime), which has recently gone into its second edition (1st August 1997 [ISO, 1997]). The formal definition of the term HyTime is quite large [def. 3.49, ISO, 1997]. The following is a brief definition taken from the book Special Edition: Using SGML [p. 164, Colby et al., 1996]:
This reference goes on to define the goals of HyTime as follows:
It can be seen from points one, five and six above that HyTime has many of the same features and goals as the JITM paradigm. Since HyTime is not a markup language in itself, but a standard like SGML, it is fortunate that the TEI DTD incorporates some of the features of HyTime. Of most relevance to the current discussion are those concerned with advanced linking capabilities, which will now be discussed in detail. The TEI DTD has a set of four global attributes (i.e. attributes possessed by all elements); the attribute concerned with linking capabilities is the "id" attribute [sect. 3.5, Sperberg-McQueen et al., 1994]. This attribute's formal definition according to the guidelines is that it:
It is formally defined with the value "IMPLIED", which means it does not need to be supplied with a value in any element and can be ignored if there is no requirement to use it. Since the attribute requires a unique value, its main use in the TEI DTD is to provide the means for unambiguously referring to a specific element. The TEI DTD has a set of elements to represent associations between different parts of a document. This set of elements is collectively called "pointers"; as well as the id attribute, they share another common attribute, called "target", which takes as its value an "IDREF". An IDREF is any valid id of an element within the same document. Examples of these pointer elements are the "<ptr>" and "<link>" elements, which can be used for creating hypertext links between different parts of a document. The "<ptr>" element is directly analogous to the anchor tag of HTML in that it is a one-way association and requires its target to have an id attribute so that it can refer to this id in its target attribute. The "<link>" element is more interesting in that it can create a two-way association between two places in a document, allowing the user to navigate to the other end of the link from either end. As shown below, the two locations are linked by an "id" attribute pair recorded at a separate location.
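A minimal sketch of this arrangement is given below. The element names, ids and the grouping element are illustrative; the sketch assumes the TEI "targets" attribute on "<link>", which takes a pair of IDREF values.

    <p id="p12">... a passage of the reading text ...</p>
    ...
    <note id="n3">... a related note elsewhere in the same document ...</note>
    ...
    <linkGrp type="annotation">
      <link targets="p12 n3">
    </linkGrp>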
However, since this mechanism relies on the global "id" attribute, it can only be used within the confines of a single document. This is because the definition of the attribute is such that its value only has to be unique within the scope of a single document. The TEI DTD also defines a set of "extended pointers", which are based heavily on HyTime. These pointers can refer to parts of other documents and, more importantly from the point of view of this section, do not require their destination element to be uniquely identified. Of even greater interest is that they can refer to a span of text within, and even across, element boundaries. The elements are "<xptr>" and "<xref>", and their common attributes of interest are the "from" and "to" attributes, which are defined as having values of type "location pointer specification". A location pointer specification is a sequence of "location terms", called a "location ladder", which defines the span of the document the pointer refers to. The following is a list of the location terms taken from the Guidelines [sect. 14.2.2.2, Sperberg-McQueen et al., 1994]:
As can be seen from the list above, there is a very wide selection of location terms available for these extended pointers. Briefly, the location terms allow the target span to be referred to by: element id; walking the structural hierarchy tree of the document; word tokens within the document; and string searching. The location ladder of location terms is defined formally as:
so a number of different location terms can be used in the same location pointer specification, as demonstrated in the examples below.
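The first sketch uses two id location terms; the lexical form shown is indicative of the Guidelines' extended-pointer syntax rather than quoted from them.

    <xptr from="id (d1.1)" to="id (d1.2)">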
This would point to the span of the document which starts with the element with the id "d1.1" and ends with the element with the id "d1.2". For an example with relevance to string insertion, consider the following.
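A sketch of such a pointer follows; again the location-term syntax is indicative only, and the document entity name "state1" is invented for the example.

    <xptr doc="state1" from="id (d1.1) str (3)">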
This would point to the 3rd character of the element with the identifier "d1.1" in the document referred to by the "doc" attribute. The value of the doc attribute is most likely a public or system identifier fully identifying the appropriate file, either on the local file system or as a URL. There is no "to" attribute listed because, in the formal specification of an extended pointer, only the "from" attribute is "REQUIRED"; the "to" attribute is assumed to be the same as the "from" attribute if it is left empty. Creating a two-way association between two target locations in different documents using the TEI DTD requires the use of uniquely identified extended pointers embedded somewhere in the associated documents. These extended pointers act as targets for link elements in a third document that defines the two-way associations. The main limitation of this mechanism is that the source documents have to be pre-prepared for linking, and this may be undesirable or impossible (e.g. if they are on CD-ROM). Barnard et al. [Barnard et al., 1996] list the following types of sophisticated associations as being available through the TEI DTD using the mechanisms described above:
The abstraction of the detail of the association away into a third file allows the modification of the associations between the two documents without modifying the two source documents. In this way the authenticity of the source documents can be protected, and it is only when the location or names of the source documents change that the extended pointer elements within them would have to be changed. However, by relying on embedded markup of any form, the TEI DTD weakens the long-term usefulness of the source documents: any change to the extended pointers required to make inter-document associations possible changes the document, possibly creating a new state of the original work. The HyTime standard has more powerful linking features [p. 386, Colby et al., 1996], but they are not supported directly by the current TEI DTD and are left for discussion in the next section as examples of truly abstracted meta-data, where the associated documents require no embedded markup to facilitate the association.

3.4.3 Abstracted Links & Hypertexts

There are likely to be types of associations between documents in an electronic edition that one would not want to be handled by any mechanism that used embedded markup, for example automatically linking arbitrary words within a document to dictionary definitions or language translations. This type of linking, called an "implicit intensional link" by DeRose [p. 255, DeRose, 1989], would be created on the fly by an automated mechanism, where the destination of the link is determined using the searching capabilities of the user's browser based on the text selected. Another, similar type of link, called an "isomorphic link" [p. 256, DeRose, 1989], of particular relevance to electronic editions, would implicitly link two or more states of a work together so that users of the edition could automatically jump from one state to another and maintain the same context. This isomorphic linking would require some means of associating text elements in different documents without explicitly defining a link. One way of doing this is by hand-coding a name attribute in isomorphic text elements, so that the user's browser can search for and find a text element with the same attribute value in the target document, based on a specified text element in the source document. This type of link is considered highly desirable when trying to compare two states of a work whose respective formats mean that contextually similar pieces of text have very different locations with regard to other referencing schemes (i.e. page and line number).

The automatically generated hyperlinks mentioned above are characterised by the fact that they are one-to-one associations and can therefore be determined simply when required. Verbyla and Ashman [Verbyla et al., 1994] show examples of more sophisticated calculated links using a scripting language built into the linking mechanism of the user's browser. Scripting of a link means that the target of a link could be based on the user's choice from a list of possible targets, or may even be the result of a query to a remote database.

Perhaps the most powerful application of abstracted links in the field of electronic editions is in the development of customised "webs" of meta-data and associations between spans of text within a corpus of digital documents.
According to the Groups of Links section of the TEI Guidelines [sect. 14.1.3, Sperberg-McQueen et al., 1994], the term "web" was first introduced as part of Brown University's FRESS Project in 1969. FRESS stands for "File Retrieval and Editing System", and according to the article Reading and Writing the Electronic Book by Andries van Dam et al., printed in Hypermedia and Literary Studies [Delaney et al., 1991], it was unique for its time in that it allowed the users of the system to create their own associations between different parts of the corpus of documents available on the system. This type of construct is now generally referred to as a hypertext. Coined by Nelson in 1965 [Pam, 1997], the term hypertext is defined as "non-sequential writing" [Nelson, 1981], and hypertexts can be further classified as "hyperdocuments" and "hyperbases" depending on the amount of authorial coherence imposed [Carr et al., 1996]. A hyperdocument has a significant amount of coherence, constraining the reader to a more or less directed path through the material according to the author's point of view. A hyperbase is less directed, acting more as a starting point for browsing. The types of reports delivered by the various World Wide Web search engines can be seen as hyperbases: the coherence of the generated hypertext is provided by relevancy matching of the user's original query against the indexes of the search engines, but the direction of inquiry is left to the user. Another way of differentiating between hyperdocuments and hyperbases is by looking at the granularity of the hypertext. Moulthrop defines the grain of a hypertext as its basic level of coherence [Moulthrop, 1992]. A hyperdocument would tend to have a larger grain than a hyperbase.

Given the constraints in the TEI DTD mentioned above (i.e. the necessity of embedding extended pointers into the source documents to provide targets for the association links), it is possible to create a web for documents created using the TEI DTD. One of the best examples of an electronic edition using SGML and the TEI DTD, the Wife of Bath's Prologue on CD-ROM [Robinson, 1996], demonstrates this capability. The edition uses DynaText™ as its SGML browser, and DynaText™ supports the use of predefined and user-defined annotations and webs.

The full HyTime specification includes a mechanism which completely abstracts the linking mechanism from the source documents. Called the "ilink" [p. 388, Colby et al., 1996], it supports many-to-many associations between spans of text in any number of documents, using the HyTime location ladder mechanism described previously. The ilink specification does not need to be encoded in proximity to any of the target locations, and in the field of electronic editions this would be the best mechanism for the implementation of hypertexts, as its use does not require modification of the source documents. However, since the TEI DTD, the standard document type definition for use with electronic editions, does not itself support the "<ilink>" tag, such non-invasive links are not available without alteration to the DTD. The TEI Guidelines [Sperberg-McQueen et al., 1994] give detailed instructions on how the TEI DTD can be modified and extended; however, since the DTD already includes elements for linking, the best way of achieving full HyTime compatibility is by using Architectural Forms. HyTime capabilities can be added to any pre-existing SGML application that supports the notion of hyperlinks in its DTD using the SGML-based modelling technique known as Architectural Forms.
Architectural Forms allow two DTDs that have similar semantics but different syntax to be processed in an equivalent manner by reference to a meta-DTD (in this case HyTime). In a manner very similar to the concepts used in Object Oriented Programming, Architectural Forms can act as abstract base classes for all DTDs which share the same general functionality [McGrath, 1998]. Converting an existing DTD using Architectural Forms is done by modification of the DTD, not of the source documents, and so the capabilities of an electronic edition can be extended to include hypertext linking without modification of the source documents. It must be pointed out that although the technology exists for the development of hypertexts, not a lot of work has been done in this area for scholarly electronic editions. A study of the current state of electronic texts in the humanities [Burrows, in press] indicates that although these editions make significant use of hyperlinks for annotations and links to ancillary material, there is little use of hypertext tools for implementing and publishing differing critical views of the same work.