This document describes the steps a system administrator must take to configure a server as an ADS mirror host. Please read the entire document and contact us if you have any questions.
The server used to host an ADS mirror should be an Intel or AMD machine running a recent version of the Linux operating system. The particular Linux distribution is not critical, as long as it is a recent one. We currently have mirror sites running CentOS 4 and 5, Fedora Core 6 and 8, and Debian Etch. In all cases the 64-bit distribution of the OS (x86_64) is required, which is supported on all modern hardware. Memory is critical because we load our index data into shared RAM: we currently use about 4GB of shared memory just to hold the indexes, so 8GB is the minimum amount of RAM below which performance suffers significantly. Since our search engine is multi-threaded, a multi-core, multi-processor system does increase performance. As a point of reference, the server that ADS currently uses at the CfA is a Dell PowerEdge 2950 server with two Intel Xeon X5450 quad-core processors (3.0GHz) and 64GB of RAM running CentOS 5.4. It is currently connected to the CfA LAN via a Gigabit Ethernet card.
The abstract server should provide adequate storage for the full set of abstract files and indexes. Currently (September 2009) this amounts to approximately 150GB of data. Taking into account the extra disk space needed during the frequent database updates, and allowing some extra room for the logfiles, we suggest setting aside at least 250GB of dedicated space. Since the abstract records are stored in individual files, and given the large number of abstracts currently present in our system, one important requirement on the storage space set aside for the abstract server is that a large number of inodes must be available. We estimate that having approximately 16 million inodes gives us enough room to grow for the next few years. If necessary, a disk partition should be reformatted so that enough inodes are created.
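The inode requirement can be checked before any data is transferred. The following is a minimal sketch, assuming GNU df (for its -i option) and using /ads only as an example location; the function names and the REQUIRED_INODES variable are illustrative, not part of the ADS software.

```shell
#!/bin/sh
# Illustrative threshold: the ~16 million inodes estimated above.
REQUIRED_INODES=16000000

# total_inodes PATH
# Prints the total number of inodes on the filesystem holding PATH.
# Assumes GNU df: -i reports inodes, -P keeps each entry on one line.
total_inodes() {
    df -Pi "$1" | awk 'NR == 2 { print $2 }'
}

# has_enough_inodes COUNT
# Succeeds (exit status 0) when COUNT is at least REQUIRED_INODES.
has_enough_inodes() {
    [ "$1" -ge "$REQUIRED_INODES" ]
}
```

For example, `has_enough_inodes "$(total_inodes /ads)" || echo "too few inodes"` would flag a partition that needs to be reformatted with a higher inode count.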
The ADS article service can be co-located on the local ADS mirror, or the mirror can be set up to connect to the main ADS site when fulltext articles are requested. Currently (May 2010) the ADS article archive consists of 1.3TB of scanned article data, which we expect to grow to approximately 2TB over the next two years. In order to allow us to automatically update the archive of articles, it is important that the disk space set up for this purpose be configured as a single (virtual) partition, so that the data does not have to be distributed across filesystems. If you need help or suggestions concerning creating filesystems and/or metadevices to be used by ADS services, please let us know and we'll be happy to advise you.
Currently we provide all the software to run the search engine as dynamically linked executables, but since other software is required to run the system itself and the HTTP daemon, the mirror site administrator should configure and, if necessary, install the following packages. Please note that modern versions of Linux include all the packages listed below, so all that is required is a minimum amount of configuration.
We will be responsible for properly mirroring server software and
index and text files to and from the mirror sites as well as
loading the indexes in shared memory when appropriate; to this end
we need to be given the password for user ads on the mirror site.
Additional steps required to get the
abstract files transferred to the mirror site and to get the
ADS database search interface up and running will be taken by
ADS staff (needs ADS admin authentication).
The required settings in httpd.conf are described below (here we assume that
all relevant directories are located under the directory
/ads; please make sure you modify this to reflect your
local setup):
# make sure to have at least the following modules configured
# in your http daemon. If these are provided as dynamic shared
# modules then you will need to load them using the LoadModule
# directives below, although the path to the shared object may
# vary between installations
LoadModule env_module modules/mod_env.so
LoadModule config_log_module modules/mod_log_config.so
LoadModule cgi_module modules/mod_cgi.so
LoadModule action_module modules/mod_actions.so
LoadModule alias_module modules/mod_alias.so
LoadModule access_module modules/mod_access.so
LoadModule auth_module modules/mod_auth.so
AddModule mod_env.c
AddModule mod_log_config.c
AddModule mod_cgi.c
AddModule mod_actions.c
AddModule mod_alias.c
AddModule mod_access.c
AddModule mod_auth.c
AddModule mod_so.c
# disable DNS look-ups for efficiency
HostnameLookups off
# run as user ads
User ads
# define new log format so that the ADS cookie ID and
# Proxy information is saved
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Cookie}i\" \"%{Via}i\"" cookieid
CustomLog /ads/logs/access_log cookieid
# The timeout is high for our server because we serve article images,
# which may require long transfers. For a mirror site that only runs
# the abstract server, setting the timeout to 10 minutes (600 seconds)
# should be plenty.
Timeout 600
# define the directory index and access control file names
DirectoryIndex index.html
AccessFileName .htaccess
# allow .htaccess files to override options set for the
# document root directory and the cgi-bin directory
<Directory "/ads/htdocs">
AllowOverride All
Options FollowSymLinks
Order allow,deny
Allow from all
</Directory>
# define the cgi-bin directory
ScriptAlias /cgi-bin/ /ads/cgi/bin/
<Directory "/ads/cgi/bin">
AllowOverride All
Options FollowSymLinks
Order allow,deny
Allow from all
</Directory>
If perl 5 is not installed in /usr/bin, the administrator will need to create a
symbolic link so that /usr/bin/perl points to the perl5
executable. Version 5.005 or higher is required.
adsduo.cfa.harvard.edu (131.142.185.23)
Please note:
Passwordless public-key authentication is the only way for us to
automatically mirror data and software updates to the mirror sites, so
it is important that you configure ssh to allow this type of access.
For better security, we recommend that you allow passwordless access to
your server only from these machines.
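One way to apply this restriction with OpenSSH is the from= option in the authorized_keys file of user ads. The sketch below only formats such an entry; the helper name restricted_key_line and the file name ads_key.pub are hypothetical, and the actual public key is whatever key is agreed upon with ADS staff.

```shell
#!/bin/sh
# Hypothetical helper: build an authorized_keys entry that OpenSSH
# accepts only for connections coming from the given source address.
# Usage: restricted_key_line SOURCE_ADDRESS "PUBLIC KEY TEXT"
restricted_key_line() {
    printf 'from="%s" %s\n' "$1" "$2"
}

# Example (run as user ads; ads_key.pub is a placeholder file name):
#   restricted_key_line 131.142.185.23 "$(cat ads_key.pub)" \
#       >> "$HOME/.ssh/authorized_keys"
```

With an entry of this form, the key is honored only when the connection originates from the listed address, while other accounts and keys on the server are unaffected.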
Network Setup
adsduo.cfa.harvard.edu (131.142.185.23)
Please let us know once you have enabled ssh access so we can test the
setup.
adsduo.cfa.harvard.edu (131.142.185.23)
To test this connection, you should be able to simply type:
rsync rsync://adsduo.cfa.harvard.edu
and verify that our server responds to the query with a listing of
modules available for updating. (If rsync isn't installed on your
system, don't panic: we will take care of installing a private copy
under user ads's home directory.)
simbad.u-strasbg.fr (130.79.128.4)
simbad.harvard.edu (131.142.185.22)
To test this connection, you should be able to simply type:
telnet simbad.u-strasbg.fr 1674
and verify that you get a prompt back that looks like this:
Trying 130.79.128.4...
Connected to simbad.u-strasbg.fr (130.79.128.4).
Escape character is '^]'.
simbad/smbservc=1674:5077.6
nedsrv.ipac.caltech.edu (134.4.36.101)
To test this connection, you should be able to simply type:
telnet nedsrv.ipac.caltech.edu 10011
and verify that you get a prompt back that looks like this:
Trying 134.4.36.101...
Connected to abell-z1.ipac.caltech.edu (134.4.36.101).
Escape character is '^]'.
/usr/lib/sendmail, and that outgoing email must
be allowed to go through.
Creation of ads username
A user named ads should be created on the mirror server, with
normal user privileges (i.e. a working shell, home directory, etc.).
In the user's login scripts (.cshrc and/or
.profile) you should set the environment variable
DOCUMENT_ROOT to point to the HTTP Document root tree, as
set in the HTTP daemon configuration files.
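As an illustration, the login script entries might look like the following. This is a sketch that assumes the document root is /ads/htdocs, the example location used in the httpd.conf settings in this document; substitute the path used on your server.

```shell
# Lines for the .profile of user ads (sh, bash, ksh).
# /ads/htdocs is an assumed example; use your own document root.
DOCUMENT_ROOT=/ads/htdocs
export DOCUMENT_ROOT

# The csh/tcsh equivalent for the .cshrc of user ads would be:
#   setenv DOCUMENT_ROOT /ads/htdocs
```

The value should match the document root configured for the HTTP daemon, so that scripts run by user ads and by the web server agree on the same tree.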
Server Startup Procedures
/etc/init.d/ads:
#!/bin/sh
#
# Startup script for the ADS Services
# For more information, please see:
# http://ads.harvard.edu/mirror
#
# To install this script under RedHat linux, do the following (as root):
# cp init.d_ads /etc/init.d/ads
# chmod 744 /etc/init.d/ads
# chkconfig --add ads
# chkconfig ads on
#
# To install the startup procedure under solaris, do the following (as root):
# cp init.d_ads /etc/init.d/ads
# chmod 744 /etc/init.d/ads
# ln -s /etc/init.d/ads /etc/rc3.d/S93ads
#
# chkconfig: 5 86 16
# description: NASA Astrophysics Data System services
#
case "$1" in
  'start')
    # load shared memory segments under ads's uid
    su - ads -c '$HOME/mirror/local/ads_startup'
    ;;
  'stop')
    # remove shared memory segments
    su - ads -c '$HOME/mirror/local/ads_shutdown'
    ;;
  *)
    echo "Usage: $0 start|stop" 1>&2
    ;;
esac
HTTP logs
[root@adstrio etc]# cat /etc/logrotate.d/httpd
/var/log/httpd/*log /ads/logs/*log {
weekly
rotate 5200
compress
dateext
missingok
notifempty
sharedscripts
postrotate
/sbin/service httpd reload > /dev/null 2>/dev/null || true
endscript
}
Miscellanea
Consider mounting the filesystem hosting the ADS data with the option
noatime, which prevents
the OS from updating inodes in the filesystem every time a single file
is accessed. Since the ADS search engine accesses these files a
lot, this can be a big win for a busy mirror site.
tune2fs (tune2fs -c 0 /dev/sda1).
Please see the man page for tune2fs for more info.
Do not let updatedb crawl through the
filesystem hosting the ADS data, since it is just a big waste of
I/O resources. To disable this, simply edit
/etc/updatedb.conf and add the top-level directory
containing ADS data to the PRUNEPATHS variable.
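The noatime and updatedb settings can be verified from a script. The following is a minimal sketch, assuming a Linux /proc/mounts file and an /etc/updatedb.conf that defines PRUNEPATHS on a single line; the function names are hypothetical and /ads is only the example location used in this document.

```shell
#!/bin/sh
# mount_has_noatime MOUNTPOINT [MOUNTS_FILE]
# Succeeds when MOUNTPOINT is listed with the noatime option.
# MOUNTS_FILE defaults to /proc/mounts (fields: device, mount point,
# filesystem type, options, ...).
mount_has_noatime() {
    awk -v mp="$1" '
        $2 == mp { found = ($4 ~ /(^|,)noatime(,|$)/) }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# path_is_pruned DIR [CONF_FILE]
# Succeeds when DIR appears as an entry in the PRUNEPATHS setting.
# CONF_FILE defaults to /etc/updatedb.conf.
path_is_pruned() {
    awk -v dir="$1" '
        /^[ \t]*PRUNEPATHS[ \t]*=/ {
            line = $0
            sub(/^[^=]*=[ \t]*/, "", line)   # keep only the value
            gsub(/"/, "", line)              # drop the quotes
            n = split(line, paths, /[ \t]+/)
            for (i = 1; i <= n; i++)
                if (paths[i] == dir) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/updatedb.conf}"
}
```

For example, `mount_has_noatime /ads && path_is_pruned /ads` succeeds only when both settings are in place for a filesystem mounted at /ads.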
Last revision of this page:
11 August 2011
by aaccomazzi@cfa.harvard.edu