LOC Workshop on Etexts. Author: Library of Congress

problems in dealing with service bureaus, for example, their inability to perform text conversion from the kind of microfilm that LC uses for preservation purposes.

But quality control, in ERWAY’s experience, was the most time-consuming aspect of contracting out conversion. AM has been attempting to perform a 10-percent quality review, looking at either every tenth document or every tenth page to make certain that the service bureaus are maintaining 99.95 percent accuracy. But even if they are complying with the requirement for accuracy, finding errors produces a desire to correct them and, in turn, to clean up the whole collection, which defeats the purpose to some extent. Even a double entry requires a character-by-character comparison to the original to meet the accuracy requirement. LC is not accustomed to publishing imperfect texts, which makes attempting to deal with the industry standard an emotionally fraught issue for AM. As was mentioned in the previous day’s discussion, going from 99.95 to 99.99 percent accuracy usually doubles costs and means a third keying or another complete run-through of the text.
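To make the accuracy targets concrete, the following minimal sketch converts a character-accuracy percentage into expected errors per page and shows one way of drawing an every-tenth sample. It is an illustration only, not AM's actual review procedure: the function names are hypothetical, and the page size is borrowed from the 2,700-character average page that ERWAY cites later in the discussion.

```python
# Illustrative sketch only; not AM's actual review procedure.
# CHARS_PER_PAGE is the average page size ERWAY quotes later in the discussion.
CHARS_PER_PAGE = 2700


def expected_errors(chars, accuracy):
    """Expected number of wrong characters at a given character accuracy."""
    return chars * (1.0 - accuracy)


def sample_every_tenth(items):
    """Pick every tenth item (document or page) for a 10-percent review."""
    return items[::10]


if __name__ == "__main__":
    for accuracy in (0.9995, 0.9999):
        errors = expected_errors(CHARS_PER_PAGE, accuracy)
        print(f"{accuracy:.2%} accuracy: about {errors:.2f} errors per page")
```

Under these assumptions, 99.95 percent accuracy still leaves more than one wrong character on an average page, while 99.99 percent leaves roughly one wrong character every four pages.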

Although AM has learned much from its experiences with various collections and various service bureaus, ERWAY concluded pessimistically that no breakthrough has been achieved. Incremental improvements have occurred in some of the OCR technology, some of the processes, and some of the standards acceptances, which, though they may lead to somewhat lower costs, do not offer much encouragement to many people who are anxiously awaiting the day that the entire contents of LC are available on-line.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

ZIDAR * Several answers to why one attempts to perform full-text conversion * Per page cost of performing OCR * Typical problems encountered during editing * Editing poor copy OCR vs. rekeying *

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), offered several answers to the question of why one attempts to perform full-text conversion: 1) Text in an image can be read by a human but not by a computer, so of course it is not searchable and there is not much one can do with it. 2) Some material simply requires word-level access. For instance, the legal profession insists on full-text access to its material; with taxonomic or geographic material, which entails numerous names, one virtually requires word-level access. 3) Full text permits rapid browsing and searching, something that cannot be achieved in an image with today’s technology. 4) Text stored as ASCII and delivered in ASCII is standardized and highly portable. 5) People just want full-text searching, even those who do not know how to do it. NAL, for the most part, is performing OCR at an actual cost per average-size page of approximately $7. NAL scans the page to create the electronic image and passes it through the OCR device.

ZIDAR next rehearsed several typical problems encountered during editing. Praising the celerity of her student workers, ZIDAR observed that editing requires approximately five to ten minutes per page, assuming that there are no large tables to audit. Confusion among the three characters I, 1, and l constitutes perhaps the most common problem encountered. Zeroes and O’s also are frequently confused. Double M’s create a particular problem, even on clean pages. They are so wide in most fonts that they touch, and the system simply cannot tell where one letter ends and the other begins. Complex page formats occasionally fail to columnate properly, which entails rescanning as though one were working with a single column, entering the ASCII, and decolumnating for better searching. With proportionally spaced text, OCR can have difficulty distinguishing the spaces between letters from the spaces between words, and therefore will merge text or break up words where it should not.
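Some of the character confusions described above can be flagged mechanically before a human editor looks at the page. The sketch below is a hypothetical heuristic, not NAL's editing procedure: it marks any token in which a digit 0 or 1 sits directly next to a letter, which often signals an O/0 or l/1 mix-up. It cannot catch a capital I substituted for a lowercase l, since both are letters, and it will raise false alarms on legitimate tokens such as "B12".

```python
import re

# Hypothetical heuristic: a 0 or 1 directly adjacent to a letter inside a token.
DIGIT_NEXT_TO_LETTER = re.compile(r"[A-Za-z][01]|[01][A-Za-z]")


def suspicious_tokens(text):
    """Return whitespace-separated tokens that mix letters with 0 or 1."""
    return [tok for tok in text.split() if DIGIT_NEXT_TO_LETTER.search(tok)]


if __name__ == "__main__":
    sample = "The s0il in p1ot 12 was tested"
    print(suspicious_tokens(sample))
```

A list like this only directs the editor's attention; deciding whether "p1ot" should read "plot" still requires a person or a dictionary check.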

ZIDAR said that it can often take longer to edit a poor-copy OCR than to key it from scratch. NAL has also experimented with partial editing of text, whereby project workers go into the text and clean up the format, removing stray characters but not running a spell-check. NAL corrects typos in the title and authors’ names, which provides a foothold for searching and browsing. Even extremely poor-quality OCR (e.g., 60-percent accuracy) can still be searched, because numerous words are correct, while the important words are probably repeated often enough that they are likely to be found correct somewhere. Librarians, however, cannot tolerate this situation, though end users seem more willing to use this text for searching, provided that NAL indicates that it is unedited. ZIDAR concluded that rekeying of text may be the best route to take, in spite of numerous problems with quality control and cost.
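The argument that repeated words remain findable in poor OCR can be illustrated with a small calculation. The sketch below rests on two assumptions that are not in the source: that the 60-percent figure is treated as the chance that any single occurrence of a word is recognized correctly, and that errors on different occurrences are independent. The function name is hypothetical.

```python
def chance_found(word_accuracy, occurrences):
    """Probability that at least one occurrence of a word is recognized
    correctly, assuming independent errors (an assumption, not a source claim)."""
    return 1.0 - (1.0 - word_accuracy) ** occurrences


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} occurrence(s): {chance_found(0.6, n):.3f}")
```

Under these assumptions a word that appears three times has about a 94 percent chance of being correct at least once, which is why an important, frequently repeated term can still lead a searcher to the document.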

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DISCUSSION * Modifying an image before performing OCR * NAL’s costs per page * AM’s costs per page and experience with Federal Prison Industries * Elements comprising NATDP’s costs per page * OCR and structured markup * Distinction between the structure of a document and its representation when put on the screen or printed *

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

HOOTON prefaced the lengthy discussion that followed with several comments about modifying an image before one reaches the point of performing OCR. For example, in regard to an application containing a significant amount of redundant data, such as form-type data, numerous companies today are working on various kinds of form renewal, prior to going through a recognition process, by using dropout colors. Thus, acquiring access to form design or using electronic means is worth considering. HOOTON also noted that conversion usually makes or breaks one’s imaging system. It is extremely important, extremely costly in terms of either capital investment or service, and determines the quality of the remainder of one’s system, because it determines the character of the raw material used by the system.

Concerning the four projects undertaken by NAL, two inside and two performed by outside contractors, ZIDAR revealed that an inhouse service bureau executed the first at a cost between $8 and $10 per page for everything, including building of the database. The project undertaken by the Consultative Group on International Agricultural Research (CGIAR) cost approximately $10 per page for the conversion, plus some expenses for the software and building of the database. The Acid Rain Project, a two-disk set produced by the University of Vermont and consisting of Canadian publications on acid rain, cost $6.70 per page for everything, including keying of the text, which was double keyed, scanning of the images, and building of the database. The inhouse project offered considerable convenience and greater control of the process. On the other hand, the service bureaus know their job and perform it expeditiously, because they have more people.

As a useful comparison, ERWAY revealed AM’s costs as follows: $0.75 to $0.85 per thousand characters, with an average page containing 2,700 characters. Requirements for coding and imaging increase the costs. Thus, conversion of the text, including the coding, costs approximately $3 per page. (This figure does not include the imaging and database-building included in the NAL costs.) AM also enjoyed a happy experience with Federal Prison Industries, which precluded the necessity of going through the request-for-proposal process to award a contract, because it is another government agency. The prisoners performed AM’s rekeying just as well as other service bureaus and proved handy as well. AM shipped them the books, which they would photocopy on a book-edge scanner. They would perform the markup on photocopies, return the books as soon as they were done with them, perform the keying, and return the material to AM on WORM disks.
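The per-page keying cost implied by ERWAY's figures can be worked out directly. The sketch below uses only the rates per thousand characters and the 2,700-character average page quoted above; the function name is hypothetical, and the result covers keying alone, before the coding that brings the total to approximately $3 per page.

```python
def keying_cost_per_page(rate_per_thousand_chars, chars_per_page=2700):
    """Keying cost in dollars for one page at a per-thousand-character rate."""
    return rate_per_thousand_chars * chars_per_page / 1000


if __name__ == "__main__":
    low = keying_cost_per_page(0.75)
    high = keying_cost_per_page(0.85)
    print(f"Keying alone: ${low:.2f} to ${high:.2f} per page")
```

The result, a little over $2 per page for keying, is consistent with the stated total of roughly $3 once coding is added.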

ZIDAR detailed the elements that constitute the previously noted cost of approximately $7 per page. Most significant are the editing, correction of errors, and spell-checking, which, though they may sound easy to perform, in fact require a great deal of time. Reformatting text also takes a while, but a significant amount of NAL’s expenses are for equipment, which was extremely expensive when purchased because it was one of the few systems on the market. The costs of equipment are being amortized over five years but are still quite high, nearly $2,000 per month.

HOCKEY raised a general question concerning OCR and the amount of editing required (substantial in her experience) to generate the kind of structured markup necessary for manipulating the text on the computer or loading it into any retrieval system. She wondered if the speakers could extend the previous question about the cost-benefit of adding or inserting structured markup. ERWAY noted that several OCR systems retain italics, bolding, and other spatial formatting. While the material may not be in the format desired, these systems possess the ability to remove the original materials quickly from the hands of the people performing the conversion, as well as to retain that information so that users can work with it. HOCKEY rejoined that the current thinking on markup is that one should not say that something is italic or bold so much as why it is that way. To be sure, one needs to know that something was italicized, but how can one get from one to the other? One can map from the structure to the typographic representation.

FLEISCHHAUER suggested that, given the 100 million items the Library holds, it may not be possible for LC to do more than report that a thing was in italics as opposed to why it was in italics, although that may be desirable in some contexts. Promising to talk a bit during the afternoon session about several experiments OCLC performed on automatic recognition of document elements, and which they hoped to extend, WEIBEL said that in fact one can recognize the major elements of a document with a fairly high degree of reliability, at least as good as OCR. STEVENS drew a useful distinction between standard, generalized markup (i.e., defining for a document-type definition the structure of the document), and what he termed a style sheet, which had to do with italics, bolding, and other forms of emphasis. Thus, two different components are at work, one being the structure of the document itself (its logic), and the other being its representation when it is put on the screen or printed.
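The distinction between structure and representation can be sketched in a few lines. In the illustration below, the element names and the style sheet are invented for the example and do not come from any document-type definition discussed at the workshop. It shows why mapping from structure to typography is straightforward, while the reverse is ambiguous: more than one structural role may be printed in italics.

```python
# Hypothetical style sheet: structural element name -> typographic style.
STYLE_SHEET = {
    "title": "bold",
    "foreign": "italic",
    "emphasis": "italic",
}


def render(elements, style_sheet):
    """Map (element_name, text) pairs to (text, style); default is roman."""
    return [(text, style_sheet.get(name, "roman")) for name, text in elements]


def roles_for(style, style_sheet):
    """List every structural role that a given typographic style could mean."""
    return sorted(name for name, s in style_sheet.items() if s == style)


if __name__ == "__main__":
    document = [
        ("title", "Acid Rain"),
        ("para", "Samples were measured "),
        ("foreign", "in situ"),
    ]
    print(render(document, STYLE_SHEET))
    print(roles_for("italic", STYLE_SHEET))
```

A text that records only "italic" cannot say whether the phrase was a foreign term or an emphasized one, which is the loss HOCKEY describes; a text that records the role can be rendered by any style sheet.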

******

SESSION V. APPROACHES TO PREPARING ELECTRONIC TEXTS

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

HOCKEY * Text in ASCII and the representation of electronic text versus an image * The need to look at ways of using markup to assist retrieval * The need for an encoding format that will be reusable and multifunctional

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Susan HOCKEY, director, Center for Electronic Texts in the Humanities (CETH), Rutgers and Princeton Universities, announced that one talk (WEIBEL’s) was moved into this session from the morning and that David Packard was unable to attend. The session would attempt to focus more on what one can do with a text in ASCII and the representation of electronic text rather than just an image, what one can do with a computer that cannot be done with a book or an image. It would be argued that one can do much more than just read a text, and from that starting point one can use markup and methods of preparing the text to take full advantage of the capability of the computer. That would lead to a discussion of what the European Community calls REUSABILITY, what may better be termed DURABILITY, that is, how to prepare or make a text that will last a long time and that can be used for as many applications as possible, which would lead to issues of improving intellectual access.

HOCKEY urged the need to look at ways of using markup to facilitate retrieval, not just for referencing or to help locate an item that is retrieved, but also to put markup tags in a text to help retrieve the thing sought either with linguistic tagging or interpretation. HOCKEY also argued that little advancement had occurred in the software tools currently available for retrieving and searching text. She pressed the desideratum of going beyond Boolean searches and performing more sophisticated searching, which the insertion of
