4 Management of bibliographic properties

By combining bibliographic data and metadata available from several sources in a single database, and by maintaining a list of the properties and resources available for each bibliography, the ADS system allows users to formulate complex queries such as: "show me all the papers that cite any paper ever written about the object M 87 and the subject 'globular clusters' and which are available online as full-text documents". This query is possible thanks to the collection and fusion of data from several sources:

1) The astronomical object databases, which maintain a collection of object names and the bibliographies in which they appear. This search is performed through a peer-to-peer network connection with the SIMBAD (Egret & Wenger 1988) and NED (Helou & Madore 1988) database servers, as described in OVERVIEW and SEARCH. This first step allows us to find the set of bibliographies on M 87.

2) The ADS abstract service indices, which allow a search of all astronomical papers containing the words "globular cluster" or their synonyms. This part of the search is performed by the ADS search engine and makes use of the local files generated by indexing the bibliographic databases as described in Sect. 3. This step allows us to discard any bibliographic entry which does not contain the words "globular cluster" in its text index.

3) The list of citations in the ADS databases, which maintain updated lists of astronomical papers and any paper referenced in them. This allows us to look up the list of papers that have cited the selected bibliographic entries, and then proceed to join the results.

4) The list of papers available electronically from either the astronomical journal publishers or the ADS article service, both of which provide access to full-text articles online.
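
Read as a whole, the four steps above amount to a chain of set operations on bibliographic identifiers. The following Python sketch illustrates that reading only; the dictionaries, the short placeholder identifiers and the function name `citing_fulltext` are assumptions made for illustration and do not reflect the actual data structures of the ADS search engine.

```python
# Hypothetical in-memory stand-ins for the four data sources (placeholder identifiers).
object_bibs = {"M 87": {"A", "B", "C"}}              # object name -> papers on that object
word_bibs = {"globular cluster": {"B", "C", "D"}}    # indexed phrase -> papers containing it
citations = {"B": {"X", "Y"}, "C": {"Y", "Z"}}       # paper -> papers citing it
electronic = {"X", "Z", "D"}                         # papers with full text online


def citing_fulltext(obj, phrase):
    """Return full-text papers citing any paper on `obj` that matches `phrase`."""
    # Steps 1 and 2: papers about the object that also match the text index.
    selected = object_bibs.get(obj, set()) & word_bibs.get(phrase, set())
    # Step 3: union of the citation lists of the selected papers.
    citing = set()
    for bib in selected:
        citing |= citations.get(bib, set())
    # Step 4: keep only citing papers available electronically.
    return citing & electronic


result = citing_fulltext("M 87", "globular cluster")
```

With these placeholder inputs, papers B and C survive the first two steps, their citation lists are merged, and the final intersection keeps only the citing papers that are available as full text.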

The query given above illustrates how knowing whether a particular bibliographic entry possesses a particular property (e.g. whether it has been cited) and what values may be associated with that property (e.g. the list of citing papers) can be used as a method for selection and ranking of query results. Additionally, the availability of remote resources for a particular bibliographic entry can be described as being one of its properties, which in turn allows an additional filtering of the result lists.

As new data regarding a bibliographic entry become available, its record is updated in the ADS database by merging the new information with the existing entry and possibly by updating its relevance within the database and its relation with respect to other internal and external resources. For instance, when a new paper is published which references an existing bibliography, the record for the latter paper needs to be updated by establishing a link between the two papers; at the same time, the "citation relevance measure" for the paper, computed as the number of times the paper was cited in the literature, also needs to be updated.
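
A minimal sketch of this kind of update, assuming the links are held in two in-memory maps, could look as follows. The names `references`, `citations`, `add_paper` and `citation_count`, as well as the placeholder identifiers, are hypothetical and serve only to show how a new reference list changes both the link and the citation count of the cited paper.

```python
from collections import defaultdict

references = defaultdict(set)   # citing paper -> papers it references
citations = defaultdict(set)    # cited paper -> papers citing it


def add_paper(bibcode, reference_list):
    """Record a new paper and link it to every paper it references."""
    for cited in reference_list:
        references[bibcode].add(cited)
        citations[cited].add(bibcode)


def citation_count(bibcode):
    """Citation relevance measure: number of papers citing `bibcode`."""
    return len(citations.get(bibcode, ()))


add_paper("new1", ["old1", "old2"])
add_paper("new2", ["old1"])
count_old1 = citation_count("old1")
```

Using sets makes the update idempotent: re-submitting the same reference list does not inflate the count.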

The procedures used in the creation and management of bibliographic properties (simply called "properties" from here on) in the ADS databases are a result of the need for managing resources related to bibliographies which may or may not be available locally. The main characteristics of the property sets as defined in our system can be summarized in the following list:

1) Some properties simply denote the fact that an entry belongs to a certain dataset (e.g. whether a paper is refereed or not); others may have values associated with them (e.g. "is available online electronically" will have as its value the URL of the full-text paper). In general, the knowledge of whether an entry in the database has a certain property allows the search engine to select it for further consideration when executing a database query, while the value(s) assumed by this property do not need to be taken into account until later.

2) The lists of bibliographic identifiers and their properties may be defined as being either "static" or "dynamic". Static properties are those that once defined do not change in time (e.g. whether a paper is refereed), while dynamic properties may change their value with time (e.g. the list of citations for a paper).

3) Some properties may depend on each other (e.g. references and citations), hence the creation and updating order for these properties is significant.
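
One way to respect such dependencies is to derive the creation order from a dependency graph. The sketch below uses the standard `graphlib` module of Python 3.9 and later; the property names and the direction of the dependency (citations built from references) are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: property -> properties that must be built first.
depends_on = {
    "citation": {"reference"},   # assumed: citation lists are derived from reference lists
    "reference": set(),
    "refereed": set(),
}

# static_order() yields every property after all of its prerequisites.
build_order = list(TopologicalSorter(depends_on).static_order())
```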

Currently the ADS has defined a set of 21 different properties which are applicable to its bibliographies. Some of them are listed in Table 3.

**Table 3:** Examples of bibliographic properties defined in the ADS and their possible values
Name	Explanation	Value(s)
associated	one or more associated bibliographic records exist for this entry (e.g. erratum or papers published as part of a series)	bibcodes of papers associated with bibliographic entry
citation	bibliographic entry has been cited by one or more papers in the ADS	bibcodes of papers citing bibliographic entry
data	bibliographic entry has electronic data tables published with it	URLs of data tables
electronic	a full-text electronic article exists for this bibliographic entry	URL of electronic journal article
ocr	abstract of bibliographic entry was generated by Optical Character Recognition programs	N/A
refereed	bibliographic entry is a refereed paper	N/A

In the rest of this section we will discuss the approach we followed in implementing the database structures allowing query and selection based on properties of bibliographies. In Sect. 4.1 we describe the implementation used to associate properties and attributes to entries in the database and the procedures maintaining relational links among them. In Sect. 4.2 we describe the framework used to automatically update and merge bibliographic data with information submitted to the ADS.

4.1 Representation of properties

The creation and updating of properties in the ADS system is the result of merging entries provided by different data sources and individuals at different times and in different formats. The procedures used to maintain the property database are therefore structured to be as general as possible (so that defining a new property is a simple task) while still allowing as much customization as necessary to deal with a variety of sources and formats. The representation of properties allows the search engine to efficiently filter results based on whether a bibliographic entry possesses a particular property. It also allows fast access to the values associated with a particular bibliographic property, so that the search interface can quickly retrieve the information as required.

Instead of representing these properties as a single relational table where each bibliographic entry is associated with the ordered set of property values, a different approach was chosen where each property is represented by a separate table. The following definition was adopted:

"A bibliographic entry b possesses property p if the unique identifier for b appears in the property table associated with p, $T_{\rm p}$. If p is a property that can have one or more values associated with it, the entry for b in table $T_{\rm p}$ will contain the n-tuple of such values next to it."

As an example, a possible entry in table $T_{{\rm data}}$ for a bibliographic entry which has a data property associated to it could be:

The first column contains the bibliographic identifier for the property, while the second column contains the values of the data property, in this case a list of URLs of electronic data tables published in the paper. (Note that this record has been split on several lines for editorial reasons.)

The file structure most amenable to representing these property tables is again an inverted file, which allows fast binary searches on the bibcode identifiers. As is the case for the inverted files used to perform fielded searches on the contents of the bibliographic entries in our database (see Sect. 3), each property table is decomposed into two parts, an index file and a list file. Since the records in the index file contain only bibcodes, which have a fixed length, we can create a binary index file where each record consists of one bibcode identifier (which is the sort key in the file), a pointer into the list file, and the number of property values associated with the bibcode. Entries in the list file are variable-length, newline-separated records, each record corresponding to a property value.
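
The index/list layout described above can be sketched in Python as follows. The record layout (a 19-character bibcode followed by two unsigned 32-bit integers), the helper names `build`, `lookup` and `placeholder`, and the example URLs are assumptions for illustration; the actual on-disk format used by the ADS is not specified here.

```python
import struct

# Assumed fixed-length record: bibcode, offset into list file, number of values.
RECORD = struct.Struct("<19sII")


def placeholder(n):
    """Build an obviously fake 19-character identifier for the example."""
    return f"0000test.{n:010d}"


def build(table):
    """Turn {bibcode: [values]} into (index_bytes, list_bytes)."""
    index, lst = bytearray(), bytearray()
    for bibcode in sorted(table):            # the bibcode is the sort key
        key = bibcode.encode("ascii")
        if len(key) != 19:
            raise ValueError("identifiers must be exactly 19 characters")
        values = table[bibcode]
        index += RECORD.pack(key, len(lst), len(values))
        for value in values:                 # one newline-terminated record per value
            lst += value.encode("utf-8") + b"\n"
    return bytes(index), bytes(lst)


def lookup(index, lst, bibcode):
    """Binary search on the index; return the list of values or None."""
    key = bibcode.encode("ascii")
    lo, hi = 0, len(index) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        code, offset, count = RECORD.unpack_from(index, mid * RECORD.size)
        if code == key:
            lines = lst[offset:].split(b"\n", count)[:count]
            return [line.decode("utf-8") for line in lines]
        if code < key:
            lo = mid + 1
        else:
            hi = mid
    return None


table = {
    placeholder(2): ["http://example.org/t1", "http://example.org/t2"],
    placeholder(1): ["http://example.org/t0"],
}
index_bytes, list_bytes = build(table)
urls = lookup(index_bytes, list_bytes, placeholder(2))
```

Because every index record has the same size, the n-th record can be located by simple arithmetic, which is what makes the binary search possible without parsing the variable-length list file.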

In addition to the index and list files, a database-specific file is generated for each property containing the list of all bibcodes in that particular database which possess that property. When the data structures used by the search engine are loaded into random access memory, these lists of bibcodes are read and for each bibliographic entry a binary array containing the list of properties which it possesses is created. By storing this information as part of the memory-resident data structures used by the search engine, selection and filtering of bibliographic entries based on their properties becomes a very efficient operation. The current implementation uses a 32-bit integer to represent the binary array of properties, where the n-th bit is set if and only if the bibliographic entry possesses the n-th property.
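
The bit-array representation can be illustrated with a short Python sketch. The assignment of bit positions to property names below is hypothetical (only the six properties of Table 3 are used), as are the function names and placeholder identifiers.

```python
# Hypothetical bit assignment: the n-th name in this list owns the n-th bit.
PROPERTIES = ["associated", "citation", "data", "electronic", "ocr", "refereed"]
BIT = {name: 1 << n for n, name in enumerate(PROPERTIES)}


def build_masks(property_lists):
    """Turn {property: [bibcodes]} into {bibcode: integer bit mask}."""
    masks = {}
    for name, bibcodes in property_lists.items():
        for bib in bibcodes:
            masks[bib] = masks.get(bib, 0) | BIT[name]
    return masks


def select(masks, required):
    """Return the bibcodes possessing every property in `required`."""
    want = 0
    for name in required:
        want |= BIT[name]
    return sorted(bib for bib, mask in masks.items() if (mask & want) == want)


masks = build_masks({
    "refereed": ["A", "B"],
    "electronic": ["B", "C"],
    "citation": ["B"],
})
hits = select(masks, ["refereed", "electronic"])
```

Testing a combination of properties then costs one bitwise AND and one comparison per entry, independently of how many properties are requested.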

4.2 Implementation of the property database management software

To provide the capability of merging properties and values generated from separate sources and in different formats, we devised a framework consisting of a hierarchical set of files and software utilities which are used to implement an efficient processing pipeline (see Fig. 5). The approach we follow may be regarded as being bottom-up, because the property files are always created from smaller, independently updated datasets. Updating of such datasets is typically event-driven, as described below.

Figure 5: Schema used for the creation of bibliographic properties. In this abstract example, four different sources contribute to the creation of bibliographic property files a.bib, b.bib, c.bib, d.bib. The input files used to generate the global list of properties may consist of either static lists of bibcodes (c.bib), tabular data to be reprocessed to create properly formatted entries (b.tab), lists of URLs containing information to be retrieved and processed (a.uri), or "filter" functions acting on the global list of bibliographic entries (d.flt). The system allows for the existence of "exception bibcodes", here represented as the contents of files x.kill and y.kill, that are removed from the global list of bibcodes before the property inverted file all.props is created. The execution and updating of any of these files is controlled by a system of makefiles that trigger updating only if necessary.

A top-level directory is created which contains one subdirectory for each property in the database. Each of these subdirectories in turn contains files representing different datasets which need to be merged together. The nature and content of such files is determined by their extension, according to the following conventions:

.tab: files containing identifiers and properties as provided by different data centers and users; these entries will need to be translated to the standard format used by scripts managed by the ADS staff;

.bib: files containing lists of tab-separated identifier and value pairs; these entries are suitable to be merged into a single property file used by the ADS search engine;

.fmt: executable procedures which generate .bib files from their respective .tab files; these procedures contain format- and domain-specific knowledge about the source of the particular dataset and the mapping of entries from the .tab file into the .bib file;

.uri: file containing the URLs of documents which should be downloaded from the network and merged to create a .tab file; these URLs may correspond to static or dynamic documents generated by other service providers listing the bibliographic properties available on their web site;

.flt: executable procedures which generate .bib files by filtering the complete list of bibliographic identifiers according to some data-specific criteria; one example of such a filter is the one which produces the list of all refereed bibcodes from the list of all bibcodes by checking the journal abbreviation;

.kill: file containing the list of bibcodes which should not be listed as possessing a particular property; these are typically used to implement "exceptions to the rule" cases; for example, we use a kill file to remove bibcodes corresponding to editorial notices from the global list of papers appearing in a refereed journal.
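
The final merging step implied by these conventions can be sketched as follows: all .bib files in a property directory are combined, and any bibcode listed in a .kill file is dropped. The function name `merge_property`, the file contents and the placeholder identifiers are assumptions for illustration, and the sketch returns a dictionary rather than writing an inverted file.

```python
import tempfile
from pathlib import Path


def merge_property(directory):
    """Merge all .bib files in `directory`, skipping bibcodes listed in .kill files."""
    directory = Path(directory)
    killed = set()
    for kill_file in sorted(directory.glob("*.kill")):
        for line in kill_file.read_text().splitlines():
            if line.strip():
                killed.add(line.strip())
    merged = {}
    for bib_file in sorted(directory.glob("*.bib")):
        for line in bib_file.read_text().splitlines():
            if not line.strip():
                continue
            # Each line is "identifier<TAB>value"; the value may be absent.
            bibcode, _, value = line.partition("\t")
            if bibcode in killed:
                continue
            values = merged.setdefault(bibcode, [])
            if value and value not in values:
                values.append(value)
    return merged


# Small demonstration in a temporary directory with placeholder identifiers.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "a.bib").write_text("BIB1\thttp://example.org/1\nBIB2\thttp://example.org/2\n")
    (d / "c.bib").write_text("BIB1\thttp://example.org/1b\nBIB3\n")
    (d / "x.kill").write_text("BIB2\n")
    merged = merge_property(d)
```

In this example BIB2 is excluded by the kill file, while BIB3 is kept with an empty value list, as would be the case for a value-less property such as refereed.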

Data retrieval and formatting scripts designed after the GNU "make" utility limit the creation and processing of data to what is strictly necessary. In particular, data sources that are specified as URLs are downloaded only if their timestamp is more recent than that of their local copy. This applies only to network protocols that support the notion of time-stamping, e.g. HTTP and FTP. Similarly, scripts that are used to format input tables into lists of bibcodes and relative URLs are only executed if the timestamp of the relevant tables indicates that they have been modified more recently than their corresponding target file.
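
The timestamp rule for local files can be sketched in a few lines of Python. This is an illustration of the make-style comparison only, not the ADS scripts themselves; the helper names `needs_update` and `touch` are hypothetical, and the file names are borrowed from Fig. 5.

```python
import os
import tempfile


def needs_update(target, sources):
    """Return True if `target` is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def touch(path, mtime):
    """Create `path` if needed and set its access and modification times."""
    with open(path, "a"):
        pass
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    tab = os.path.join(tmp, "b.tab")
    bib = os.path.join(tmp, "b.bib")
    touch(tab, 1_000)
    missing = needs_update(bib, [tab])   # target does not exist yet
    touch(bib, 2_000)
    fresh = needs_update(bib, [tab])     # target is newer than its source
    touch(tab, 3_000)
    stale = needs_update(bib, [tab])     # source modified after the target
```

A formatting procedure would be run only when `needs_update` returns True, so unchanged input tables incur no reprocessing.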