NASA ADS Abstract Service Mirroring Information

This document describes the steps to be taken by a system administrator to configure a server to be used as an ADS mirror host. Please read the entire document and contact us if you have any questions.



For the Impatient Admin (on a CentOS 6 system)

If you are installing the ADS mirror site on a Red Hat or CentOS 6 server, the following commands (issued as root) will get you going quickly.

# create canonical ads directory structure (note that /ads could
# be a symbolic link to another directory):
mkdir -p /ads/www/{logs,htdocs,cgi,adstmp}

# install apache if not there yet
yum install httpd
# download and modify ads-specific http configuration
cd /etc/httpd/conf
wget http://ads.harvard.edu/mirror/httpd/ads.conf
vi ads.conf
# Include paths are resolved relative to ServerRoot (/etc/httpd on CentOS)
echo 'Include conf/ads.conf' >> httpd.conf
vi httpd.conf
# make sure httpd starts at boot time
chkconfig httpd on
# and start it
service httpd start
# add the directory where the ads httpd logs are stored
# (typically /ads/www/logs) to the list of directories that logrotate handles
perl -pi -e 's:\ {$: /ads/www/logs/*log {:' /etc/logrotate.d/httpd

# download the ads startup script
cd /etc/init.d
wget http://ads.harvard.edu/mirror/init.d/ads
chmod 744 ads
# configure it to startup ads services at boot time
chkconfig --add ads
chkconfig ads on

# create the ads directory (if not already present) and the ads user
mkdir -p /ads
useradd --home /ads/ads --groups apache --create-home ads
chown -R ads /ads

# disable updatedb crawling of /ads partition
perl -pi -e 's:^(PRUNEPATHS\s*=\s*)":$1"/ads :' /etc/updatedb.conf

# disable fsck checking at boot time on /ads data disk
# (this assumes that the corresponding partition is /dev/sdb1)
tune2fs -c 0 /dev/sdb1

# make sure the at daemon is running
chkconfig atd on
service atd start

# allow public key authentication for ads user
echo 'Match User ads' >> /etc/ssh/sshd_config
echo '    PubkeyAuthentication yes' >> /etc/ssh/sshd_config
# reload sshd so that the new settings take effect
service sshd reload

This covers most of the necessary configuration on your CentOS server. For more information on each step and to configure your firewall, please continue reading.
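
As a quick sanity check after running the commands above, a short script along the following lines can confirm that the directory layout is in place. This is only a minimal sketch and not part of the ADS distribution: the function name check_ads_tree is hypothetical, and the subdirectory names are the ones used in the mkdir command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ADS software): verify that the
# directory layout created by the quick-start commands is present.
# Usage: check_ads_tree [ROOT]   (ROOT defaults to /ads)
check_ads_tree() {
    root=${1:-/ads}
    missing=0
    for sub in logs htdocs cgi adstmp; do
        if [ ! -d "$root/www/$sub" ]; then
            echo "missing: $root/www/$sub"
            missing=1
        fi
    done
    # exit status 0 if every directory exists, 1 otherwise
    return "$missing"
}
```

Calling check_ads_tree with no argument checks /ads; pass a different root if /ads is a symbolic link target you want to test directly.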

Server Hardware

The server used to host an ADS mirror should be an Intel or AMD machine running a recent version of the linux operating system. The particular linux distribution is not critical, as long as it is a recent one. We currently have mirror sites running CentOS 5 and 6 (recommended), Fedora and Debian. In all cases the 64-bit distribution of the OS (x86_64) is required, which is supported on all modern hardware. Memory is very important since we load our index data in shared RAM: we currently use about 4GB of shared memory just to hold the indexes, so 8GB is the minimum needed to avoid a significant loss of performance. Since our search engine is multi-threaded, a multi-core, multi-processor system does increase performance. As a point of reference, the server that ADS currently uses at the CfA is a Dell PowerEdge 2950 with two Intel Xeon X5450 quad-core processors (3.0GHz) and 64GB of RAM running CentOS 6.4. It is connected to the CfA LAN via a Gigabit ethernet card.
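
The two hard requirements above (an x86_64 system and at least 8GB of RAM) can be checked from the command line. The following is a minimal sketch; the function names are hypothetical, and it assumes a linux system where /proc/meminfo reports MemTotal in kB.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the ADS software).

# Print total RAM rounded to the nearest GB, read from a meminfo-style
# file (default /proc/meminfo, which reports MemTotal in kB). Rounding
# is used because the kernel reports slightly less than the installed RAM.
mem_total_gb() {
    awk '/^MemTotal:/ {printf "%d\n", $2 / 1048576 + 0.5}' "${1:-/proc/meminfo}"
}

# Succeed if the architecture string (default: output of uname -m)
# names a 64-bit x86 system.
is_x86_64() {
    [ "${1:-$(uname -m)}" = "x86_64" ]
}
```

A possible use is: is_x86_64 || echo "a 64-bit OS is required", followed by a comparison of the value printed by mem_total_gb against 8.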

The abstract server should provide adequate storage for the full set of abstract files and indexes. Currently (September 2009) this amounts to approximately 150GB of data. Taking into account the extra disk space needed during the frequent database updates, and allowing some room for the logfiles, we suggest setting aside at least 250GB of dedicated space. Since the abstract records are stored in individual files and our system holds a large number of abstracts, the storage set aside for the abstract server must also provide a large number of inodes. We estimate that approximately 16 million inodes give us enough room to grow for the next few years. If necessary, a disk partition should be reformatted so that enough inodes are created.
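
To see whether an existing partition already meets the inode estimate above, something like the following sketch can be used. The function names are hypothetical, and inodes_total assumes the "df -i" option provided by GNU coreutils on linux.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the ADS software).

# Print the total number of inodes on the filesystem holding PATH.
# Relies on "df -i" (GNU coreutils); -P keeps each filesystem on one line.
inodes_total() {
    df -Pi "$1" | awk 'NR == 2 {print $2}'
}

# Succeed if TOTAL meets REQUIRED (default: the 16 million estimate).
# Usage: enough_inodes TOTAL [REQUIRED]
enough_inodes() {
    [ "$1" -ge "${2:-16000000}" ]
}
```

For example: enough_inodes "$(inodes_total /ads)" || echo "too few inodes on /ads". On ext filesystems the inode count is fixed when the filesystem is created; mke2fs accepts the -N option to request a specific number.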

The ADS article service can be co-located on the local ADS mirror or the mirror can be set up to connect to the main ADS site when fulltext articles are requested. Currently (May 2013) the ADS article archive consists of 2TB of scanned article data, which we expect to grow to approximately 3TB over the next two years. In order to allow us to automatically update the archive of articles, it is important that the disk space set up for this purpose be configured as a single (virtual) partition to avoid having to distribute the data across filesystems. If you need help or suggestions concerning creating filesystems and/or metadevices to be used by ADS services please let us know and we'll be happy to advise you.
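
One common way to present several disks as a single (virtual) partition on linux is LVM; this document does not prescribe a particular method, so the following is only an illustration. The sketch prints the commands rather than running them, and the function name, the volume group and volume names, and the device names in the usage example are all hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of the ADS software): print the
# LVM commands that would pool several disks into one logical volume.
# Nothing is executed; review the output before running it as root.
# Usage: plan_single_volume VG LV DEVICE...
plan_single_volume() {
    vg=$1
    lv=$2
    shift 2
    echo "pvcreate $*"                       # mark the disks as LVM physical volumes
    echo "vgcreate $vg $*"                   # group them into one volume group
    echo "lvcreate -l 100%FREE -n $lv $vg"   # one logical volume using all space
    echo "mkfs.ext4 /dev/$vg/$lv"            # create the filesystem on it
}
```

For example, plan_single_volume adsvg articles /dev/sdc /dev/sdd prints four commands for a volume that would appear as /dev/adsvg/articles. These commands destroy any data on the named devices, which is why the sketch only prints them.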

Server Software

Currently we provide all the software to run the search engine as dynamically linked executables, but since other software is required to run the system itself and the HTTP daemon, the mirror site administrator should configure and, if necessary, install the following packages. Please note that modern versions of linux include all the packages listed below, so all that is required is a minimum amount of configuration.

  • HTTP Daemon: We currently use apache 2.2 as our server, although any apache 2.x version should work as well. The important directives to specify in the configuration file httpd.conf are described below (here we assume that all relevant directories are located under the directory /ads; please make sure you modify this to reflect your local setup). It may be easier for you to set up the ads server as a virtual host and then simply include this ads.conf file in your system's httpd.conf.

    
    # disable DNS look-ups for efficiency
    HostnameLookups off
    
    # run as user ads
    User ads
    
    # define new log format so that the ADS cookie ID and 
    # Proxy information is saved
    LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Cookie}i\" \"%{Via}i\"" cookieid
    CustomLog /ads/www/logs/access_log cookieid
    
    # The timeout is high for our server because we serve article images,
    # which may require long transfers.  For a mirror site that only runs
    # the abstract server, setting the timeout to 10 minutes (600 seconds)
    # should be plenty.
    Timeout 600
    
    # define the directory index and access control file names
    DirectoryIndex index.html
    AccessFileName .htaccess
    
    # allow .htaccess files to override options set for the
    # document root directory and the cgi-bin directory
    DocumentRoot /ads/www/htdocs
    <Directory "/ads/www/htdocs">
       AllowOverride All
       Options FollowSymLinks
       Order allow,deny
       Allow from all
    </Directory>
    
    # define the cgi-bin directory
    ScriptAlias /cgi-bin/ /ads/www/cgi/bin/
    <Directory "/ads/www/cgi/bin">
       AllowOverride All
       Options FollowSymLinks
       Order allow,deny
       Allow from all
    </Directory>
    
    

  • Perl Interpreter: Perl must be installed on the mirror site. If the interpreter is not installed in /usr/bin, the administrator will need to create a symbolic link so that /usr/bin/perl points to the perl5 executable. Version 5.005 or higher is required.

  • Secure Shell (ssh) Package: Since updating the ADS databases requires establishing remote access from the ADS server to the mirror machine, please make sure that you have an ssh server running on your machine. For security reasons we recommend that you only enable access via protocol version 2. The daemon should be configured to allow public key access at least for user ads from the server listed below:
        adsduo.cfa.harvard.edu   (131.142.185.23)
    
    Please note: Passwordless public-key authentication is the only way for us to automatically mirror data and software updates to the mirror sites, so it is important that you configure ssh to allow this type of access. For better security, we recommend that you restrict passwordless access to your server to connections from the machine listed above.

  • OS Setup: No additional OS configuration is necessary for the ADS services. However, whatever version of the OS you choose, please make sure you keep it up to date. All modern operating systems provide ways to automatically download and install patches and updates, and these are essential to keeping a system in good health.
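
The recommendation in the ssh item above, to restrict passwordless access to the ADS server, can be implemented with the "from=" option that OpenSSH supports in authorized_keys files. The following is a minimal sketch: the function name is hypothetical, and the public key file in the usage example is a placeholder for whatever key is actually installed for user ads.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ADS software): prefix a public key
# with the OpenSSH authorized_keys "from=" option so that the key is only
# accepted for connections coming from the given address.
# Usage: restricted_key_line ADDRESS PUBKEY_FILE
restricted_key_line() {
    printf 'from="%s" %s\n' "$1" "$(cat "$2")"
}
```

For example, restricted_key_line 131.142.185.23 ads_key.pub prints a line that can be appended to the .ssh/authorized_keys file in the home directory of user ads (ads_key.pub is a placeholder name).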

    Network Setup

    If your institution or your server uses a firewall (and it should), please make sure that the following services are enabled:

    Creation of ads username

    The user ads should be created on the mirror server with normal user privileges (i.e. a working shell, home directory, etc.). In the user's login scripts (.cshrc and/or .profile) you should set the environment variable DOCUMENT_ROOT to point to the HTTP document root tree, as set in the HTTP daemon configuration files.
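
Setting DOCUMENT_ROOT in a Bourne-style login script, as described above, might look like the following sketch. The function name is hypothetical; the value to use is whatever DocumentRoot is set to in your HTTP daemon configuration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ADS software): append a
# DOCUMENT_ROOT setting to a Bourne-style login script unless the file
# already mentions the variable.
# Usage: set_document_root PROFILE_FILE DOCROOT
set_document_root() {
    profile=$1
    docroot=$2
    if ! grep -q 'DOCUMENT_ROOT' "$profile" 2>/dev/null; then
        printf 'DOCUMENT_ROOT=%s; export DOCUMENT_ROOT\n' "$docroot" >> "$profile"
    fi
}
```

For example: set_document_root /ads/ads/.profile /ads/www/htdocs (both paths follow the layout assumed earlier in this document). For a csh-style .cshrc the equivalent line is: setenv DOCUMENT_ROOT /ads/www/htdocs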

    We will be responsible for properly mirroring server software and index and text files to and from the mirror sites as well as loading the indexes in shared memory when appropriate; to this end we need to be given the password for user ads on the mirror site.

    Server Startup Procedures

    In order for the HTTP server and the ADS abstract server to start up at boot time, startup scripts need to be installed in the proper directories. The following script can be used under recent versions of linux. It is best kept in /etc/init.d/ads:
    #!/bin/sh
    #
    # Startup script for the ADS Services
    # For more information, please see:
    # http://ads.harvard.edu/mirror
    # 
    # To install this script under RedHat linux, do the following (as root):
    #    cp init.d_ads /etc/init.d/ads
    #    chmod 744 /etc/init.d/ads
    #    chkconfig --add ads
    #    chkconfig ads on
    #
    # To install the startup procedure under solaris, do the following (as root):
    #    cp init.d_ads /etc/init.d/ads
    #    chmod 744 /etc/init.d/ads
    #    ln -s /etc/init.d/ads /etc/rc3.d/S93ads
    #
    # chkconfig: 5 86 16
    # description: NASA Astrophysics Data System services
    #
    
    case "$1" in
    'start')
        # load shared memory segments under ads's uid
        su - ads -c '$HOME/mirror/local/ads_startup'
        ;;
    'stop')
        # remove shared memory segments
        su - ads -c '$HOME/mirror/local/ads_shutdown'
        ;;
    *)
        echo "Usage: $0 start|stop" 1>&2
        ;;
    esac
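
Since the start action of the script above loads shared memory segments under the ads uid, one way to confirm that it worked is to look at the segments owned by that user. The following is a minimal sketch with a hypothetical function name; it assumes the column layout printed by "ipcs -m" on linux (key, shmid, owner, perms, bytes, ...).

```shell
#!/bin/sh
# Hypothetical helper (not part of the ADS software): sum the sizes of
# the shared memory segments owned by USER, reading "ipcs -m" style
# output on standard input (columns: key shmid owner perms bytes ...).
# Usage: ipcs -m | shm_bytes_for_user ads
shm_bytes_for_user() {
    awk -v user="$1" '$3 == user {sum += $5} END {printf "%.0f\n", sum}'
}
```

A result of 0 after "service ads start" would suggest that no segments were loaded for that user.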
    

    HTTP logs

    You can expect large HTTP logfiles based on access by ADS users. You should therefore enable log file rotation and compression to avoid the creation of huge logs. If your system uses logrotate (see "man logrotate") you can create a file like the following, which performs a weekly rotation:
    [root@adstrio etc]# cat /etc/logrotate.d/httpd 
    /var/log/httpd/*log /ads/www/logs/*log {
        weekly
        rotate 5200
        compress
        dateext
        missingok
        notifempty
        sharedscripts
        postrotate
            /sbin/service httpd reload > /dev/null 2>/dev/null || true
        endscript
    }
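
To confirm that the logrotate configuration really covers the directory where the ADS logs are written, a check along these lines can be used. This is a minimal sketch with a hypothetical function name; it only looks for the literal glob in the file and does not run logrotate itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ADS software): succeed if the
# logrotate configuration FILE mentions the literal glob PATTERN.
# Usage: logrotate_covers FILE PATTERN
logrotate_covers() {
    # -F treats PATTERN as a fixed string, so "*" is not interpreted
    grep -qF -- "$2" "$1"
}
```

For example: logrotate_covers /etc/logrotate.d/httpd "$logdir/*log" || echo "ADS logs are not rotated", where logdir is a placeholder for your ADS log directory. Running "logrotate -d /etc/logrotate.d/httpd" additionally shows what logrotate would do without changing any files.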
    

    Miscellanea

    Other notes that may be useful to the administrator:


    Additional steps required to get the abstract files transferred to the mirror site and to get the ADS database search interface up and running will be taken by ADS staff (needs ADS admin authentication).


    Last revision of this page: 4 June 2013 by aaccomazzi@cfa.harvard.edu