4 Search engine details

4.1 General

The search engine software is written in C. It accepts as input a structure that contains all the search fields and flags for the user specified settings and filters. For each search field that contains user input a separate POSIX (Portable Operating System Interface) thread is started that searches the database for the terms specified in that field. The results obtained for each term in that field are combined in the thread according to the specified combination logic. The resulting list of references is returned to the main thread. The main thread combines the results from the different field searches and calculates the final score for each reference. The final combined list with the scores is returned to the user interface routines that format the results according to the user specified output format.

4.2 Searching

4.2.1 Database files

The abstracts are indexed in separate fields: Author names, titles, abstract text, and objects. Each of these fields is indexed similarly into an index file and a list file (see ARCHITECTURE). The index file contains a list of all terms in the field together with the frequency of the term in the database, and the position and length of two blocks of data in the list file. One block contains all references that include the exact word as specified. The other block contains all references that contain either the word or any of its synonyms.

A search for a particular word in the index file is done through a binary search in the index. The indexes are resident in memory, loaded during boot time (see Sect. 5). Once the word is found in the index, the position and length of the data block is used to directly access the data block in the list file. This data block contains the identifier for each reference that contains the search word.

If a quick update has occurred (see ARCHITECTURE) since the list file was last built (indicated by a negative in part of the last identifier), an auxiliary block of reference identifiers is read. Its position is contained in the structure of the last reference identifier. This auxiliary block is merged with the original one.

The identifier for each reference is the position of the bibliographic code for that reference in the list of all bibliographic codes. This system saves one lookup of the identifier in a list of identifiers, since the number can be used directly as an index in the array of bibliographic codes.

4.2.2 Synonym searches

As mentioned above, the index files contain information about two blocks of data, the data for the individual word and for the synonym group to which this word belongs. When a search with synonym lookup enabled is requested, the block of data for the whole synonym group is used, otherwise the data for only the individual word is returned. All the processing that enables these two types of searches is done during indexing time, therefore the speeds for both searches are similar.

Even though our synonym list is quite extensive (see ARCHITECTURE) our users will sometimes use words that are not in the database or the synonym list. In these cases the search software uses a stemming algorithm from the Unix utility ispell to find the stem of the search word and then searches for the word stem. The indexing software has indexed the stems of all words in the database together with their original words (see ARCHITECTURE). This word stemming is done as a last resort if no regular match has been found in an attempt to find any relevant references.

4.2.3 Wildcard searches

In order to be able to search for families of words, a limited wildcard capability is available. Two wildcard characters are defined: The question mark "?'' is used to specify a single wildcard character and the asterisk "*'' is used to specify zero or more wildcard characters. The "?'' can be used anywhere in a word. For instance a search for M 1? will find all Messier objects between M 10 and M 19. A search for a?sorb will find references with absorb as well as adsorb.

The asterisk can only be used at the beginning or at the end of a word. For instance 3C* searches for all 3C objects. *sorb searches for words that end in sorb like absorb, desorb, etc. When synonym replacement is on, all their synonyms (e.g. absorption) will be found as well. The "?'' and the "*'' can be combined in the same search string.

4.3 Results combining within a field

4.3.1 Combining results

As mentioned above, the user can select between four types of combination methods: "OR'', "AND'', simple logic, and full boolean logic. For the first three cases, the references for all search terms are retrieved and sorted first. The reference lists are then merged by going through the sorted reference lists sequentially and synchronously and selecting references according to the chosen logic.

The search algorithm for the full boolean logic is different. The boolean query is parsed from left to right. For each search term a function is called that finds the references for this term. A search term is either a search word, a phrase, or an expression enclosed in parentheses. If the search term contains other terms (if it is enclosed in parentheses), the parsing function is called recursively.

The next step is to determine the boolean operator that follows the search term, and then to evaluate the next search term after the operator. Once the reference lists for the two terms and the combining operator are determined, the two lists are combined according to the operator. This new list is then used as the first term of the next expression.

If the boolean operator is "OR'', the combining of the lists is deferred, and the next operator and search term are evaluated. This ensures the correct precedence of "OR'' and "AND'' operators.

The "NOT'' operator is implemented by getting the reference list for the term, making a copy of all references in the database, and then removing the references from the search term from the complete list. This yields a very large list of references. If the first search term in a boolean expression is a "NOT'' term, the search will take very long, because this large list has to be propagated through all the subsequent parsing of the boolean expression. Care should therefore be taken to put a "NOT'' term to the right of at least one other term, since processing is done left to right.

As an example, Fig. 8 shows the processing of the boolean expression mentioned in Sect. 2:

$\begin{figure}\includegraphics[width=9cm]{DS1781F8.eps}\end{figure}$

Figure 8: Pseudo code for parsing full boolean search expression

(pulsar or "neutron star'') and
   ("red shift'' distance) and
   not 1987A

4.3.2 Scoring

In addition to the information about the references for each word, the index file also contains its frequency in the database. The frequency is already pre-calculated as int(10000/(log(frequency))) during indexing (see ARCHITECTURE). This saves considerable time during execution of the search engine since all server calculations can be done as integer computations, no floating point operations are necessary. During the first part of the search, this frequency is attached to each retrieved reference. In the next step, the retrieved references are combined according to the selected combination logic for that field.

For "OR'' combination logic, the lists retrieved for each word are merged and uniqued. As described in Sect. 5, this is done by going through the sorted reference lists synchronously and adding each new reference to the output list. The score for that reference is determined by adding up the frequencies from each of the lists for weighted scoring, or by setting a score equal to the number of matched words for proportional scoring.

For "AND'' combination logic, only references that appear in every one of the lists are selected. Each of these references receives a score of 1.

For simple logic and full boolean logic, the score for the returned references is determined only from the search terms that were combined with "OR''. All words that have mandatory selection criteria (prefixed by "+'' or "-'' in simple logic, and combined with "AND'' or "NOT'' in full boolean logic) do not affect the final score.

4.4 Combining results among fields

4.4.1 Combining

After the POSIX threads for each search field are started, the main program waits for all threads to complete the search. When all searches are completed the search engine combines the results of the different searches according to the selected settings. If for instance one field was selected as required for selection in the settings section of the query form, only references that were found in the search for that field will be in the final reference list. The combined list is then uniqued and sorted by score. The resulting list of references is passed back to the user interface software.

If the user did not specify any search terms, a date range has to be selected. The software queries the database for all references in the selected date range and uses this list for further processing, e.g. filtering (see Sect. 4.5).

4.4.2 Scoring

The score for each reference in the final results list is determined by adding the scores from each list multiplied by the user specified weight for each field and then normalizing the score such that a reference that matches all search terms from all fields receives a score of 1.

4.5 Selection filters

After the search is completed according to the specified search words and the settings that control the combination logic and the scoring algorithms, the resulting list of references can be filtered according to several criteria. During the design of the software a decision had to be made whether to filter the results while selecting the references or after completing the search. The first approach has the advantage that the combining of the selected references will be faster because fewer references need to be combined. The second approach has the advantage that the first selection is faster. We chose the second approach because, except for selecting by publication date, only a small number of queries use filtering (see Sect. 6). Because of that, filtering by publication date is done during selection of the references, while the other filtering is done after the search is completed.

References can be filtered by five criteria:

1.  Entry date in the data base
2.  Minimum score
3.  Journal
4.  Available links
5.  Group membership

1. + 2. Entry date and minimum score. These two filters can be used to query the database automatically on a regular basis for new information that is relevant to a selected topic. The user can build a query form that returns relevant references, and then save this query form locally. This query form can then be sent to the ADS email interface (see above) on a weekly or monthly basis. By specifying an entry day of -7 for instance, the query will retrieve all references that fit the query and that were entered within the last seven days. The minimum score can be used to limit the returned number of references to only the ones that are really relevant. The references are returned via email as described in Sect. 2.5 about the email interface.

3. Journal filter. This filter allows the user to select references from individual journals or groups of journals. Available options for this filter are:

a.  All journals (default)
b.  Refereed journals only
c.  Non-refereed journals only
d.  Selected journals

If the last option is selected, the user can specify one or more journal abbreviations (e.g. ApJ, AJ (Astronomical Journal)) that should be selected. More than one abbreviation can be specified by separating them with semicolons or blanks. The filter for journals can also include the volume number (but not the page number). The journal abbreviation is compared with the bibliographic codes over the length of the specified abbreviation. For instance if the user specifies ApJ, the system selects all articles published in the ApJ, ApJ Supplement and ApJ Letters. ApJ.. will select only articles from the ApJ and the ApJ Letters. A special abbreviation, ApJL will select only articles from the ApJ Letters. If a journal abbreviation is specified with a prepended "-'', all references that are NOT from that journal are returned. The journal abbreviations (or bibstems) used in the ADS are available at:

http://adsdoc.harvard.edu/abs_doc/
       journal_abbr.html

4. Available links. This filter allows the user to select references that have specific other information available. The returned references can be filtered for instance to include only references that have data links or scanned articles available. As an example, a user needs to find on-line data about a particular object. A search for that object in the object field and a filter for references with on-line data returns all articles about that object that link to on-line data.

5. Groups. We provide the capability to build a reference collection for a specific subset of references. This can be either articles written by members of a particular research institute or about a particular subject. Currently there are 5 groups in the ADS. We encourage larger institutes or groups to compile a list of their references and send it to us to be included in the list of groups.

Up: The NASA Astrophysics Data