Up: The NASA Astrophysics Data


  
6 Future developments

By all accounts, the ADS project has been very successful in providing bibliographical services to the astronomer and research librarian. Much of the system's strength has been its role as part of a network of services designed to provide advanced search and retrieval capabilities to the scientific community at large. Given the rapid changes in the fields of electronic publishing, resource linking, and digital library research, it is of great importance for our project to adapt its operations to this ever-changing environment and its underlying technologies.

In this last section we analyze some of the promises and challenges that we expect to face over the next several years and we discuss how they may affect the evolution of our system. In Sect. 6.1 we describe the new datasets that are becoming available to our project and the changes necessary for their integration in the existing system architecture. Section 6.2 describes the effect of expected technological changes on the operations of the ADS. Finally, Sect. 6.3 discusses how increased collaboration and inter-operability among data providers can lead to the creation of a more integrated environment making better use of information discovery and electronic publishing technologies.

  
6.1 New data

From the user's perspective, one of the most significant changes in the ADS will be the completion of our full-text coverage and abstracting for the scholarly astronomical literature. Over the next year we expect to complete the digitization of all astronomical journals back to volume 1 (DATA). The availability of such a large body of scanned publications allows us to pursue some important goals through the use of Optical Character Recognition (OCR) technology: the creation of full-text documents and the extraction of abstract and citation information from them.

The full text of an article produced by OCR programs can be used by the indexing and search engine to provide better retrieval capabilities. However, the current indexing model has been developed to work well with a homogeneous set of bibliographic data with little variation in document length and content model; extending the scope of our databases to include the full-text of articles may therefore require a new approach to the entire architecture behind the indexing and search engines. Furthermore, since the output generated by OCR packages is known to contain incorrectly recognized characters and words, new strategies may be required to manage this level of uncertainty during indexing and searching.
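One conceivable way of managing OCR uncertainty at search time is to expand a query term into the index terms that closely resemble it. The following is only a minimal sketch of that idea, not a description of the ADS search engine: the function name `expand_query_term`, the sample vocabulary, and the similarity cutoff are all assumptions made for illustration, and the matching relies on the standard Python `difflib` module.

```python
from difflib import get_close_matches


def expand_query_term(term, vocabulary, cutoff=0.8, limit=5):
    """Return index terms similar enough to `term` to be likely OCR variants.

    The cutoff of 0.8 is an arbitrary illustrative value, not a tuned one.
    """
    return get_close_matches(term.lower(), vocabulary, n=limit, cutoff=cutoff)


# Hypothetical index vocabulary containing an OCR misreading ("1" for "l").
vocabulary = ["galaxy", "ga1axy", "spectrum", "quasar"]
variants = expand_query_term("galaxy", vocabulary)
```

A lower cutoff recovers more damaged words at the cost of more false matches, so any real system would need to tune this trade-off against its own data.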

The extraction and OCRing of important document fragments such as abstracts and references is an ongoing process which holds great promise (DATA). Essentially, the combination of pattern recognition and OCR techniques allows us to identify areas in a scanned document corresponding to the abstract or reference section of a paper. The text extracted from an abstract section is then reformatted and inserted into the bibliographic record for that paper. Periodic analysis of the text index has been necessary to identify and correct misinterpreted characters and words produced by the OCR software. The increased number of human checks on our data set as a quality assurance measure has been the price to pay for integrating these additional abstracts in our bibliographic records.

Text extracted from a reference section is analyzed by programs making use of natural language processing techniques to identify the individual works cited in the article and add them to our citation database. The challenge we are facing in this case is creating a robust system capable of correctly parsing and matching the cited reference strings with bibliographic records in our database ([Accomazzi et al. 1999]), with the additional complication that the input text may contain characters incorrectly recognized by the OCR software.
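The parse-and-match step can be illustrated with a deliberately simplified sketch. It assumes a single citation style, builds a candidate identifier following the general layout of a bibcode (year, journal, volume, page, first-author initial), and looks it up in a set of known records. The regular expression, the field widths, the function names, and the sample reference are assumptions of this example; a production parser would have to handle many styles and tolerate OCR damage rather than simply rejecting it.

```python
import re

# Assumed citation style: "Author, I., ... YEAR, JOURNAL, VOLUME, PAGE".
REFERENCE_PATTERN = re.compile(
    r"^(?P<author>[^,]+),.*?(?P<year>\d{4}),\s*(?P<journal>[^,]+),\s*"
    r"(?P<volume>\d+),\s*(?P<page>\d+)\s*$"
)


def make_bibcode(year, journal, volume, page, author):
    """Assemble a 19-character identifier from the parsed fields.

    Journal is left-justified and volume and page right-justified, padded
    with dots. Abbreviations longer than 5 characters are not handled here.
    """
    return f"{year}{journal:.<5}{volume:.>4}.{page:.>4}{author[0].upper()}"


def match_reference(reference, known_bibcodes):
    """Return the bibcode for a reference string, or None if unresolved."""
    parsed = REFERENCE_PATTERN.match(reference.strip())
    if parsed is None:
        return None  # unparseable, e.g. because of OCR damage
    bibcode = make_bibcode(**parsed.groupdict())
    return bibcode if bibcode in known_bibcodes else None


# Made-up record and reference strings, for illustration only.
known = {"1999ApJ...512..345A"}
clean = match_reference("Author, A. B., & Other, C. 1999, ApJ, 512, 345", known)
damaged = match_reference("Author, A. B. l999, ApJ, 512, 345", known)
```

In this sketch the second reference, whose year was misread as "l999", fails to parse and is left unresolved, which is exactly the kind of case a robust system would need to recover.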

  
6.2 New technologies

The latest developments in Electronic Data Interchange and User Interfaces advocate the adoption of a model of data representation where there is clear separation between content, metadata, and presentation. The widespread endorsement of XML and related proposals such as the XLink language, the Extensible Style Language (XSL), and the Document Object Model (DOM) seems to indicate that we will see pervasive use of XML across platforms and implementations. While this raises hopes that data exchange among different astronomical data centers and institutions can be streamlined, it is not clear at this point that a unique framework describing all resources in astronomy can be defined, nor that such a system is currently necessary. However, the adoption of XML as the "lingua franca" for data interchange can help remove the initial obstacles preventing more widespread creation of peer-to-peer connections between information providers and can help speed up the creation of "federated" services ([Murtagh & Guillaume 1998]).

In this context, we hope to leverage the wide deployment of XML-based applications to generalize and extend the services currently offered to our collaborators and users. This involves modifying the implemented APIs (SEARCH) to allow output of structured XML documents containing both metadata and bibliographic data. We have already started adopting this paradigm while implementing new and experimental services which require the exchange of data and metadata structures between client and server, such as the ADS reference resolver ([Accomazzi et al. 1999]).
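As a rough illustration of what a structured response combining metadata and bibliographic data might look like, the sketch below serializes one record with the standard Python `xml.etree.ElementTree` module. The element and attribute names (`results`, `metadata`, `query`, `record`, `bibcode`, `title`, `author`) and the sample record are hypothetical; they do not describe the actual output format of any ADS service.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record, query):
    """Serialize one bibliographic record plus query metadata as XML text."""
    root = ET.Element("results")
    # Metadata about the request is kept apart from the bibliographic data.
    metadata = ET.SubElement(root, "metadata")
    ET.SubElement(metadata, "query").text = query
    entry = ET.SubElement(root, "record", bibcode=record["bibcode"])
    ET.SubElement(entry, "title").text = record["title"]
    for name in record["authors"]:
        ET.SubElement(entry, "author").text = name
    return ET.tostring(root, encoding="unicode")


# Made-up record, for illustration only.
sample = {
    "bibcode": "1999ApJ...512..345A",
    "title": "An example title",
    "authors": ["Author, A. B.", "Other, C."],
}
document = record_to_xml(sample, query="author:Author")
```

Because content and metadata live in separate elements, a client can extract either one without knowing how the other is laid out.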

Another issue related to data interchange which is currently receiving much attention is the definition of persistent identifiers for bibliographic resources available on the Internet. This issue is a particular instance of a more general problem, which is the need to define common naming schemes for digital objects and distributed locator services allowing their resolution. For a number of years this has been recognized as one of the most important infrastructure components necessary for the large-scale development of digital library systems ([Lynch & Molina-Garcia 1996]). Today most publishers are providing location services which are based on the traditional paradigm of identifying a published work by journal, volume and page. It is becoming increasingly clear that a more general mechanism will have to be adopted in the future since this model does not extend well into the digital era. For instance, a publication may be available only in electronic form (as is already the case for some "e-journals" such as EPJdirect and ZPhys-e from Springer-Verlag), or may correspond to a multimedia object rather than a traditional text document; in these cases, the concept of pagination loses its meaning. The Digital Object Identifier (DOI, [Paskin 1999]), which has been proposed by an international consortium of publishers, holds the promise of becoming the universal identifier suitable for naming digital objects.

The ADS has already extended the use of the bibcode identifier in different ways to account for the existence of electronic-only publications (DATA), but it is becoming increasingly difficult to map new document identifiers into a model that was designed to describe printed material only. It is likely that over the next few years our project will need to adopt new notations for identifying bibliographic records, while still maintaining backward compatibility with the existing bibcodes for printed work. In this sense, it is likely that ADS will be able to help the astronomical community in the transition from print-based to electronic publishing by providing resolving services for astronomical bibliographies and related resources.

  
6.3 New services

The adoption of common technologies and protocols by data providers has helped create a low level of inter-operability among different data services (in the sense that users can simply browse across different web sites by following links between them). However, with the exponential increase of documents and services available on the web, the problem of providing an integrated tool for locating information of interest to a researcher has remained unsolved. While well-organized repositories and archives with good search interfaces exist for a variety of data sets, a scientist who needs to consult several such archives is left with having to query each one separately and then organize the results collected from all of them. It is fortunate that the creation of the ADS and its ongoing collaboration with other data providers has reduced (if not completely eliminated) this problem for astronomers, but this is not the case for scientists in other disciplines or for those researchers whose work spans the conventional boundaries of scientific research fields.

The problem of providing a unified search mechanism across datasets is being tackled both within the individual disciplines ([Heikkila et al. 1999]; [Fernique et al. 1998]; [Murtagh & Guillaume 1998]) and at the architectural level ([Schatz 1997]). A proposed solution to this problem is the creation of federated services composed by "clustering" the combined assets and search capabilities of several independent data centers. A common set of metadata elements describing the local search domain and interface can be used to translate generic queries into site-specific ones, and then merge and present the results to the users. While this type of approach is known to work within well-restricted research domains, the broader problem of querying databases belonging to different research fields is far more complex and requires the creation of systems capable of implementing semantic inter-operability ([Schatz 1997]; [Lynch & Molina-Garcia 1996]). While the ADS has been offering direct access to its search engine since 1996 (SEARCH), in order for the ADS to become part of such a federated system, we will need to provide an increased level of abstraction and access to the capabilities of our search interfaces. Additionally, the emerging standards for site- and database-specific resource descriptions will require the creation and maintenance of a body of metadata defining both the extent of our databases and the supported query interfaces. [Hanisch (2000)] has recently proposed the creation of such a distributed system for Astronomy and the Space Sciences.
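The translate-and-merge pattern behind such federated services can be sketched in a few lines. In this illustrative example each data center is described by a small metadata record giving its name, a mapping from generic field names to its local ones, and a search function; all of these names, and the two in-memory "sites", are hypothetical stand-ins rather than descriptions of real services.

```python
def translate(query, field_map):
    """Rewrite a generic query using one site's own field names."""
    return {field_map[key]: value for key, value in query.items() if key in field_map}


def federated_search(query, sites):
    """Query every site and merge the hits, recording where each was found."""
    merged = {}
    for site in sites:
        local_query = translate(query, site["field_map"])
        for identifier in site["search"](local_query):
            merged.setdefault(identifier, []).append(site["name"])
    return merged


# Two made-up data centers with different names for the same search field.
site_a = {
    "name": "center-a",
    "field_map": {"object": "obj_name"},
    "search": lambda q: ["rec-1", "rec-2"] if q.get("obj_name") == "M31" else [],
}
site_b = {
    "name": "center-b",
    "field_map": {"object": "target"},
    "search": lambda q: ["rec-2", "rec-3"] if q.get("target") == "M31" else [],
}
results = federated_search({"object": "M31"}, [site_a, site_b])
```

The sketch only addresses syntactic translation within one domain; the semantic inter-operability needed across research fields is a much harder problem that a field mapping alone does not solve.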

Another important aspect of services increasing inter-operability between data providers is cross-linking of online resources. While most publishers of scientific journals were able to create electronic versions of their journals soon after the explosion in popularity of the web, only a few of them have taken advantage of the new capabilities that the technology has to offer, namely the possibility to create hyperlinks between online documents and related resources. In this respect, electronic publishing in astronomy was ahead of its time with the publication by the University of Chicago Press in late 1996 of the electronic version of the Astrophysical Journal, which contained hyperlinks from the reference section of articles to bibliographic records in the ADS. The early implementation of this feature became possible thanks to the close collaboration between the publisher, the ADS staff, and the visionary leadership provided by the American Astronomical Society (AAS). Similarly, editors and publishers have now made it their policy to submit electronic versions of data tables appearing in astronomical papers to the CDS and Astronomical Data Center (ADC) archives, allowing ADS to easily maintain links to these datasets in its bibliographic records. This practice was established back in 1990 with an agreement between the CDS and the editors of the journal Astronomy & Astrophysics.

While reference and object linking has today become more commonplace ([Hitchcock et al. 1998]), there are a number of unresolved problems that limit its usefulness. The issue of linking a reference to an instance of the document it refers to can be viewed as a two-step process ([Caplan & Arms 1999]): (1) resolution of a reference string into a document identifier; and (2) resolution of the document identifier into one or more URLs. In the current use of the ADS reference resolver ([Accomazzi et al. 1999]), step (1) is accomplished by the publisher during the last stages of the electronic publication process, and links are created only if a reference string is found to correspond to a valid bibcode in ADS ("static linking"). The step of document resolution (2) is another example of the problem of object resolution mentioned in Sect. 6.2. In this case, a bibcode needs to be mapped into the "best" URL corresponding to it; this mapping is typically implemented as a site-specific resolution activity, so that, for example, the CDS mirror of the University of Chicago journals will link to the CDS mirror of the ADS bibliographic services.
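The two steps can be made concrete with a small sketch that keeps them as separate functions. Everything specific in it is assumed for illustration: the mirror host names are placeholders under `example.org`, the `/abs/` path is invented, and step (1) is reduced to a lookup in a table of normalized reference strings rather than a real reference resolver.

```python
from urllib.parse import quote

# Placeholder mirror addresses; these are not real service URLs.
MIRRORS = {
    "us": "https://ads-us.example.org",
    "fr": "https://ads-fr.example.org",
}


def resolve_reference(reference, reference_index):
    """Step (1): map a reference string to a document identifier, if known."""
    normalized = " ".join(reference.split()).lower()
    return reference_index.get(normalized)


def resolve_identifier(bibcode, site):
    """Step (2): map an identifier to the URL served by the given mirror."""
    return f"{MIRRORS[site]}/abs/{quote(bibcode, safe='')}"


def link_reference(reference, reference_index, site):
    """Create a link only when the reference resolves to a known identifier."""
    bibcode = resolve_reference(reference, reference_index)
    if bibcode is None:
        return None  # no link is emitted for unresolved references
    return resolve_identifier(bibcode, site)


# Made-up index entry, for illustration only.
index = {"author, a. b. 1999, apj, 512, 345": "1999ApJ...512..345A"}
url = link_reference("Author, A. B.  1999, ApJ, 512, 345", index, site="fr")
```

Keeping the steps apart means the identifier can be stored once at publication time while the choice of URL is left to each mirror site.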

While this model has worked well for many astronomical journals, it has some shortcomings. First of all, the computation of static links at publication time does not allow for the possibility that one of the works cited in the reference section may become available at a later date (e.g. if the coverage of the literature has been extended or if a more accurate resolution of the reference is later implemented). From a theoretical point of view, a better approach to the problem would be the use of "dynamic linking", in which links are created when the document is downloaded by the reader ([Van de Sompel & Hochstenbach 1999]). It is likely that most publishers will move towards a mixed model in which on-line documents are periodically reprocessed for the purpose of updating links between them and external resources that may have become available, or to provide options for forward-looking citation queries into bibliographical databases.
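The difference between the two linking models comes down to when the lookup is performed. The toy example below, in which the reference keys and identifiers are invented placeholders and the database is a plain dictionary, computes links once at "publication time" and again at "download time" after the database has grown.

```python
def compute_links(references, database):
    """Map each reference to an identifier, or None if it is not yet covered."""
    return {reference: database.get(reference) for reference in references}


references = ["ref-a", "ref-b"]
database = {"ref-a": "id-a"}  # coverage at publication time

# Static linking: the result is frozen into the published document.
static_links = compute_links(references, database)

# The literature coverage is extended after publication.
database["ref-b"] = "id-b"

# Dynamic linking: the same lookup is repeated when the reader downloads.
dynamic_links = compute_links(references, database)
```

A mixed model would amount to rerunning the static computation at intervals, trading some freshness for the lower cost of not resolving every reference on every download.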

As for the issue of bibcode resolution, it is clear that a better approach than relying on site-specific settings would be to allow real-time resolution of bibcode identifiers based on the preference of the individual users and the current availability of relevant resources. The approach we follow when resolving links to external resources (SEARCH) does account for user preferences, but does not take into account real-time availability of the possible instances of the resource. This is in contrast with the approach followed by Fernique et al. (1998), where the opposite is true. It is clear that in order to create a reliable system for resolving astronomical resources, an integration of both approaches is necessary, so that a global user profile can be used to specify preferences while a global resource database can be used to specify the availability and location of these resources on the network. The implementation of such a system is greatly complicated by the increasingly complex organization of networks, with firewalls and proxy servers acting as intermediary agents in the activity of resource resolution. Hopefully these issues will be solved over the next few years by the adoption of standard practices and software tools.
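A resolver integrating both approaches could, in outline, rank the known instances of a resource by user preference and return the first one that is currently reachable. The sketch below is one hypothetical way to express this; the function name `choose_instance`, the structure of the instance records, the placeholder URLs, and the availability check passed in as a callable are all assumptions of the example.

```python
def choose_instance(instances, preferred_sites, is_available):
    """Return the URL of the most preferred instance that is available.

    `is_available` is a caller-supplied function; in a real system it would
    consult a resource database or probe the network.
    """
    def rank(instance):
        site = instance["site"]
        if site in preferred_sites:
            return preferred_sites.index(site)
        return len(preferred_sites)  # unlisted sites are tried last

    for instance in sorted(instances, key=rank):
        if is_available(instance):
            return instance["url"]
    return None


# Made-up instances of one resource at three mirror sites.
instances = [
    {"site": "us", "url": "https://us.example.org/doc"},
    {"site": "fr", "url": "https://fr.example.org/doc"},
    {"site": "jp", "url": "https://jp.example.org/doc"},
]
reachable = {"us", "jp"}  # stand-in for a real-time availability database
chosen = choose_instance(instances, ["fr", "jp"], lambda i: i["site"] in reachable)
```

Here the user's first choice is unreachable, so the resolver falls back to the second preference instead of failing or ignoring the profile altogether.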



Copyright The European Southern Observatory (ESO)