2 Creation of bibliographic records

The bibliographic records maintained by the ADS project consist of a corpus of structured documents describing scientific publications. Each record is assigned a unique identifier in the system, and all data gathered about the record are stored in a single text file named after that identifier. The set of all bibliographic records available to the ADS is partitioned into four main data sets: Astronomy, Instrumentation, Physics, and Astronomy Preprints (DATA). This division of documents into separate groups reflects the discipline-specific nature of the ADS databases, as discussed in DATA and Sect. 3.2.

Since we receive bibliographic records from a large number of different sources and in a variety of formats (DATA), the creation and management of these records require a system that can parse, identify, and merge bibliographic data in a reliable way. In this section we describe the framework used to implement such a system and some of its design principles. Section 2.1 details the methodology behind our approach. Section 2.2 describes the file format adopted to represent the bibliographic records. Section 2.3 outlines the procedures used to automate data exchange between our system and our collaborators. Details about the pragmatic aspects of creating and managing the bibliographic records are described in DATA.

  
2.1 Methodology

When the ADS abstract service was first introduced to the astronomical community ([Kurtz et al. 1993]), the system was built on bibliographic data obtained from a single source (the NASA STI project, also known as RECON) and in a well-defined format (structured ASCII records). The activity of entering these data into the ADS database consisted simply of parsing the individual records, identifying the different bibliographic fields in them, and reformatting the contents of these fields into the ones used in our system. Bibliographic records were created as text files named after STI's accession numbers (DATA), which the project used to uniquely identify records in the system.

As the desire for greater interoperability with other data services grew (OVERVIEW), the ADS adopted the bibliographic code ("bibcode" from here on) as the unique identifier for a bibliographic entry (DATA). This permitted immediate access to the astronomical databases maintained by the Strasbourg Data Center (CDS), and allowed integration of SIMBAD's object name resolution ([Egret & Wenger 1988]) within the ADS abstract service (OVERVIEW).
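
As a concrete illustration, the sketch below splits a bibcode into its conventional fixed-width fields (year, journal abbreviation, volume, qualifier, page, first author initial). This is a minimal Perl example, not the actual ADS module; the precise bibcode conventions are described in DATA.

    #!/usr/bin/perl
    # Minimal sketch, not the actual ADS code: split a 19-character
    # bibcode such as "2000A&AS..143...41K" into its conventional
    # fixed-width fields ("." is the padding character).
    use strict;
    use warnings;

    sub parse_bibcode {
        my ($bibcode) = @_;
        return unless length($bibcode) == 19;
        my ($year, $journal, $volume, $qual, $page, $initial) =
            unpack('A4 A5 A4 A1 A4 A1', $bibcode);
        $journal =~ s/\.+$//;    # strip padding dots
        $volume  =~ s/^\.+//;
        $page    =~ s/^\.+//;
        return { year => $year, journal => $journal, volume => $volume,
                 qualifier => $qual, page => $page, initial => $initial };
    }

    my $rec = parse_bibcode('2000A&AS..143...41K');
    printf("%s vol. %s, p. %s (%s)\n",
           @$rec{qw(journal volume page year)}) if $rec;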

As more journal publishers and data centers became providers of bibliographic data to our project, a unified approach to the creation of bibliographic records became necessary. What makes the management of these records challenging is the fact that we often receive data about the same bibliographic entry from different sources, in some cases with incomplete or conflicting information (e.g. ordering or truncation of the author list). Even when the data received is semantically consistent, there may be differences in the way the information has been represented in the data file. For instance, while most journal publishers provide us with properly encoded entities for accented characters and mathematical symbols, the legacy data currently found in our databases and provided to us by some sources only contain plain ASCII characters. In other, more subtle and yet significant cases, the slightly different conventions adopted by different groups in the creation of bibcodes (DATA) make it necessary to have "special case" provisions in our system that take these differences into account when matching records generated from these sources.

The paradigm currently followed for the creation of bibliographic records in our system is illustrated in Fig. 1. The different action boxes and tests displayed in the diagram represent modular procedures, most of which have been implemented as PERL ([Wall et al. 1996]) software modules. More details about each of the software components can be found in DATA.


Figure 1: Paradigm used for the creation of bibliographic records in the ADS

As the holdings of the ADS databases have grown over time, additional metadata about the literature covered in our databases has been collected and is currently being used by many of our software modules for a variety of tasks. Among these, two activities are worth mentioning as significant in the context discussed here:

1) Identification of publication sources. This is the activity of associating the name of the publication with the standard abbreviation used to compose bibliographic codes, and allows us to compute a bibcode for each record submitted to our system.

2) Data consistency checks. For all major serials and conference series in our databases, we maintain tables correlating the volume, issue, and page ranges with publication dates. We have also recently started to maintain "completeness" tables describing in analytical form what ranges of years or volumes are completely abstracted in our system for each publication. This allows us to flag as errors those records referring to publications for which the ADS has complete coverage but which do not match any entry in our system (a minimal sketch of such a check appears below). The availability of this feature is particularly significant for reference resolution, as discussed later in this paper.
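
The following sketch shows what such a volume/year consistency check might look like; the table contents are invented for illustration, and the actual correlation tables and checking code are described in DATA.

    # Illustrative volume/year consistency check.  The table values
    # below are invented; the real correlation tables are maintained
    # per publication as described in DATA.
    my %volume_years = (
        # journal => [ [first_vol, last_vol, year], ... ]
        'ApJ' => [ [480, 491, 1997], [492, 509, 1998] ],
    );

    sub volume_matches_year {
        my ($journal, $volume, $year) = @_;
        my $ranges = $volume_years{$journal}
            or return 1;                  # no table: nothing to check
        for my $r (@$ranges) {
            my ($first, $last, $yr) = @$r;
            return $yr == $year
                if $volume >= $first && $volume <= $last;
        }
        return 0;                         # outside all known ranges: flag
    }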

  
2.2 Data representation

From the inception of the ADS databases until recently, each bibliographic record has been represented as a single entity consisting of a number of different fields (e.g. authors, title, keywords). This information was stored in the database as an ASCII file containing pairs of field names and values. While this model has allowed us to keep a structured representation of each record, over the years its limitations have become apparent.
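
A record in this legacy tagged format would look roughly like the following (the field tags shown are only illustrative; the actual format is described in DATA):

    %R 2000A&AS..143...41K
    %A Kurtz, M. J.; Eichhorn, G.; Accomazzi, A.
    %T The NASA Astrophysics Data System: Overview
    %K bibliographic databases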

First of all, the issue of dealing with multiple records referring to the same bibliographic entry arose. As previously mentioned, while much of the information present in these records is the same, certain fields may appear in only one of them (for example, keywords assigned by the publisher). Therefore the capability of managing bibliographic fields supplied by different sources became desirable, something that could not easily be accomplished with the file format in use.

Secondly, the problem of maintaining ancillary information about a particular bibliographic entry or even an individual bibliographic field surfaced. Information such as the time-stamp indicating when a bibliographic entry was created or modified, which data provider submitted it, and what identifier was assigned to the record by the publisher can be used to decide how the data should be merged into our system or how hyperlinks to the resource should be created. Even more importantly, it is often necessary to attach semantic information to individual records. For instance, if keywords are assigned to a particular journal article, it is important to know what keyword system or thesaurus was used in order to effectively use this information for document classification and retrieval ([Lee et al. 1999]).

Thirdly, the issue of properly structuring the bibliographic fields had to be considered. Some of these fields simply contain plain-text words, and as such can easily be represented by unformatted character strings. Others, however, consist of lists of items (e.g. keywords or astronomical objects), or may contain structured information within their contents (e.g. an abstract containing tables or math formulae). The simple tagged format we had adopted did not allow us to easily create hierarchical structures containing subfields within a bibliographic field.

Finally, there was the problem of representing relationships among bibliographic entries (e.g. an erratum referring to the original paper), or among bibliographic fields (e.g. an author corresponding to an affiliation). While we had been using ASCII identifiers to cross-correlate authors and affiliations in our records, the adopted scheme was very limited in its capabilities (e.g. multiple affiliations for an author could not be expressed using the syntax we implemented).

Given the shortcomings of the bibliographic record representation detailed above, we recently started reformatting all our bibliographic records as XML (Extensible Markup Language) documents. XML is a markup language which is receiving widespread endorsement as a standard for data representation and exchange. Using this format, a single XML document was created for each bibliographic entry in our system. Each bibliographic field is represented as an XML element, and may in turn consist of sub-elements (see DATA for an example of such a file). Ancillary information about the record is stored as metadata elements within the document. Information about an individual field within the record is stored as attributes of the element representing it. Relationships among fields are expressed as links between the corresponding XML elements.
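
To make this concrete, a record along these lines might look as follows. The element and attribute names here are hypothetical; the actual document structure is described in DATA.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical record structure; the actual element names
         and document type are described in DATA -->
    <record bibcode="2000A&amp;AS..143...41K">
      <metadata origin="ADS" created="1999-11-01" modified="2000-01-15"/>
      <title>The NASA Astrophysics Data System: Overview</title>
      <author affiliation="aff1">Kurtz, M. J.</author>
      <affiliation id="aff1">
        Harvard-Smithsonian Center for Astrophysics
      </affiliation>
      <keywords system="AAS">
        <keyword>bibliographic databases</keyword>
      </keywords>
    </record>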

While it is beyond the scope of this paper to describe the characteristics that make XML a desirable language for representing structured documents, we will point out the main reasons why XML was selected over other formats in our environment. The reader should note that most of these remarks apply not only to XML, but also to its "parent" language, SGML (Standard Generalized Markup Language).

XML can be used to represent precise, possibly non-textual information organized in data structures, and as such can be used as a formal language for expressing complex data records and their relationships. In our case, this means that bibliographic fields can be described in as much detail as necessary. For instance, the publication information for a conference proceedings volume can be composed of the conference title, the conference series name and number, the names of the editors, the name of the publisher, the place of publication, and the ISBN of the printed book. While all this information has been stored in the past in a single bibliographic field, the obvious representation for it is a structured record where items such as conference title and editors are clearly identified and tagged. This allows us, among other things, to properly identify individual bibliographic items when formatting the record for a particular application (e.g. when citing a work in an article).
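
A structured publication field of this kind might be marked up as in the fragment below; again the element names are hypothetical, and the ISBN value is a dummy placeholder.

    <!-- Hypothetical markup for a proceedings volume; element
         names and the ISBN value are illustrative -->
    <publication type="proceedings">
      <conference>Astronomical Data Analysis Software and Systems VII</conference>
      <series name="ASP Conf. Ser." number="145"/>
      <editor>Albrecht, R.</editor>
      <editor>Hook, R. N.</editor>
      <editor>Bushouse, H. A.</editor>
      <publisher place="San Francisco">Astronomical Society of the Pacific</publisher>
      <isbn>0-000000-00-0</isbn>
    </publication>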

A second important feature which XML offers is the possibility of representing any amount of ancillary information (the "metadata") along with the actual contents of a document. This permits us, among other things, to tag bibliographic records, or even individual fields, with any relevant piece of information. For instance, an attribute recording which keyword system was used can be attached to the bibliographic field listing a set of keywords.

Other important characteristics of XML are: the adoption of Unicode ([Unicode Consortium 1996]) for character data representation, allowing uniform treatment of all international characters and most scientific symbols; and the support for standard mechanisms for managing complex relationships among different documents through hyperlinking.

Some of the practical advantages of adopting XML over other SGML variants simply come from the wide acceptance of the language in the scientific community as well as in the software industry. There is currently great interest among the astronomical data centers in creating interfaces capable of seamlessly exchanging XML data ([Shaya et al. 1999]; [Murtagh & Guillaume 1998]). It is our hope that as our implementation of an XML-based markup language for bibliographic data evolves, it can be integrated into the emerging Astronomical Markup Language ([Murtagh & Guillaume 1998]). As many of the technologies in the field of document management change rapidly, it is important for a project of our scope to adopt the ones which offer the greatest promise of longevity. In this sense, we feel that the level of abstraction and dataset independence that XML imposes on programmers and data specialists justifies the added complexity.

  
2.3 Data harvesting

Of vital importance to the operation of the ADS is the issue of data exchange with collaborators, in particular the capability to efficiently retrieve data produced by publishers and data providers. The process of collecting and entering new bibliographic records in our databases has benefitted from three main developments: the adoption by all publishers of electronic production systems from the earliest stages of their publication process; the almost exclusive use of SGML and LaTeX as the formats for document production; and the pervasive use of the Internet as the medium for data exchange.

An overview of the procedures used to collect bibliographic data in the daily interactions between ADS staff and data providers is presented in DATA. In this section we discuss how the use of automated procedures has benefitted the activities of data retrieval and entry in the operations of the ADS. Two approaches are presented: the "push" paradigm, in which data is sent from the data provider to the ADS, and the "pull" paradigm, in which data is retrieved from the data provider.

  
2.3.1 Data push

The "push'' approach has received much attention since the introduction of web-based broadcasting technologies in 1997 ([Miles 1998]), to the point that many people consider both push and web broadcasting to have the same meaning. Here we refer to the concept of data "push'' in its original meaning, i.e. the activity of electronic data submission to one or more recipients. The primary means used by ADS users and collaborators to send us electronic data are: FTP upload, e-mail, and submission through a web browser (DATA). While these three mechanisms are conceptually similar (data is sent from a user to a computer server using one of several well-established Internet protocols), the one we have found most amenable to receiving "pushed'' data is the e-mail approach. This is primarily due to the fact that modern electronic mail transport and delivery agents offer many of the features necessary to implement reliable data delivery, including content encoding, error handling, data retransmission and acknowledgement. Additional features such as strong authentication and encryption can be implemented at a higher level through the use of proper software agents after data delivery has been completed. In the rest of the section we describe the implementation of an email-based data submission service used by the ADS, although the system operation can be easily adapted to work under other protocols such as FTP or HTTP.

In an attempt to streamline the management of the increasing amount of bibliographic data sent to us, we have put in place procedures to automatically filter and process messages sent to an e-mail address which has been created as a general-purpose submission mechanism. This activity is implemented by using the procmail filter package. Procmail is a very flexible software tool that has been used in the past to automatically process submission of electronic documents by a number of institutes ([Bell 1999]; [Bell et al. 1996]). Our procmail filter has been configured to analyze the input message, verify its origin, identify which dataset it belongs to, and archive the body of the message in the proper dataset-specific directory. Optionally, the filter can be set up so that one or more procedures are executed after archival. Most of the submissions received this way are simply archived and later loaded into the databases by the ADS administrators during a periodic update (DATA). Using this paradigm, the email filter allows us to efficiently manage submissions from different collaborators by enforcing authentication of the submitter's email address and by properly filing the message body. This procedure is currently used to archive the IAU Circulars and the Minor Planet Electronic Circulars.
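
A minimal procmail recipe of the kind described here might look as follows; the addresses, directories, and helper script are invented examples, not the actual ADS configuration.

    # Illustrative recipe: archive IAU Circular submissions coming
    # from a trusted address, then trigger optional post-processing.
    # Addresses, paths and the helper script are invented examples.
    :0
    * ^From:.*@example\.org
    * ^Subject:.*IAU Circular
    {
        # file a copy of the message in the dataset directory
        :0 c:
        /ads/incoming/iauc

        # then hand the message to an update procedure
        :0
        | /ads/bin/process_iauc
    }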

By defining additional actions to be performed after archival of a submitted e-mail message, automated database updates can be implemented. We currently use this procedure to allow automated submission and updating of our institution's preprint database, which is maintained by the ADS project as a local resource for scientists working at the Center for Astrophysics. The person responsible for maintaining the database contents simply sends a properly formatted email message to the ADS manager account, and an update operation on the database is automatically triggered; when the update is completed, the submitter is notified of the success or failure of the procedure. We expect to make increasing use of this capability as electronic publication timelines continue to shrink.

  
2.3.2 Data pull

"Data pull'' is the activity of retrieving data from one or more remote network locations. According to this model, the retrieval is initiated by the receiving side, which simply downloads the data from the remote site and stores it in one or more local files. We have been using this approach for a number of years to retrieve electronic records made available online by many of our collaborators. For instance, the ADS LANL astronomy preprint database (SEARCH) is updated every night by a procedure that retrieves the latest submissions of astronomy preprints from the Los Alamos National Laboratory (LANL) archive, creates a properly formatted copy of them in the ADS database, and then runs an updating procedure that recreates the index files used by the search engine (Sect. 3). This nightly procedure has been running in an unsupervised fashion since the beginning of 1997.

The pull approach is best used to periodically harvest data that may have changed. By using procedures that are capable of saving and comparing the original timestamps generated by web servers we can avoid retrieving a network resource unless it has been updated, making efficient use of the bandwidth and resources available. Section 4.2 discusses the application of these techniques to the management of distributed bibliographic resources.
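
In Perl, for example, this kind of timestamp-aware retrieval can be expressed with LWP's mirror() method, which issues a conditional request based on the local file's modification time; the URL and file names below are illustrative.

    #!/usr/bin/perl
    # Conditional retrieval sketch: mirror() sends an
    # If-Modified-Since header based on the local file's timestamp,
    # so the resource is downloaded only when it has changed.
    # The URL and local path are invented examples.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->mirror('http://example.org/preprints/listing',
                          '/ads/mirror/preprint-listing');

    if ($res->code == 304) {
        print "resource unchanged, nothing to do\n";
    } elsif ($res->is_success) {
        print "local copy updated\n";
    } else {
        die "retrieval failed: ", $res->status_line, "\n";
    }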

