This document describes the steps to be taken by a system administrator to configure a server to be used as an ADS mirror host. Please read the entire document and contact us if you have any questions.
If you are installing the ADS mirror site on a Red Hat or CentOS 6 server, the following commands (issued as root) will get you going quickly.
# create ads partition and user name (note that /ads could be
# a symbolic link to some other directory)
mkdir /ads
useradd --home /ads/ads --groups apache --create-home ads
chown -R ads /ads

# create subdirectories used by http daemon
mkdir -p /ads/www/{logs,htdocs,cgi,adstmp}

# install apache if not there yet
yum install httpd

# download and modify ads-specific http configuration
cd /etc/httpd/conf
wget http://ads.harvard.edu/mirror/httpd/ads.conf
vi ads.conf
echo 'Include ads.conf' >> httpd.conf
vi httpd.conf

# make sure httpd starts at boot time
chkconfig httpd on
# and start it
service httpd start

# add the directory where the ads httpd logs are stored
# (typically /ads/www/logs) to the list of directories that logrotate handles
perl -pi -e 's:\ {$: /ads/www/logs/*log {:' /etc/logrotate.d/httpd

# download the ads startup script
cd /etc/init.d
wget http://ads.harvard.edu/mirror/init.d/ads
chmod 744 ads

# configure it to startup ads services at boot time
chkconfig --add ads
chkconfig ads on

# disable updatedb crawling of /ads partition
perl -pi -e 's:^(PRUNEPATHS\s*=\s*)":$1"/ads :' /etc/updatedb.conf

# disable fsck checking at boot time on /ads data disk
# (this assumes that the corresponding partition is /dev/sdb1)
tune2fs -c 0 /dev/sdb1

# make sure the at daemon is running
chkconfig atd on
service atd start

# allow public key authentication for ads user
echo 'Match User ads' >> /etc/ssh/sshd_config
echo ' PubkeyAuthentication yes' >> /etc/ssh/sshd_config

This covers most of the necessary configuration on your CentOS server. For more information on each step and to configure your firewall, please continue reading.
The server used to host an ADS mirror should be an Intel or AMD machine running a recent version of the Linux operating system. The particular Linux distribution is not critical, as long as it is a recent one. We currently have mirror sites running CentOS 5 and 6 (recommended), Fedora and Debian. In all cases the 64-bit distribution of the OS (x86_64) is required, which is supported on all modern hardware. Memory is very important since we load our index data in shared RAM: we currently use about 10GB of shared memory just to hold the indexes, and performance suffers significantly with less than 32GB of RAM. Since our search engine is multi-threaded, a multi-core, multi-processor system does increase performance. As a point of reference, the server that ADS currently uses at the CfA is a Dell PowerEdge 2950 with two Intel Xeon X5450 quad-core processors (3.0GHz) and 64GB of RAM running CentOS 6.4. It is connected to the CfA LAN via a Gigabit Ethernet card.
The abstract server should provide adequate storage for the full set of abstract files and indexes. Currently (September 2014) this amounts to approximately 250GB of data. Taking into account some extra disk space needed during the frequent database updates, and allowing some extra room for the logfiles, we suggest setting aside at least 500GB of fast I/O disk space. Since the abstract records are stored in individual files, and given the large number of abstracts currently present in our system, one important requirement is that a large number of inodes be available on the storage set aside for the abstract server. We estimate that approximately 32 million inodes give us enough room to grow for the next few years. If necessary, a disk partition should be reformatted so that enough inodes are created.
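The inode requirement above can be checked before any data is transferred. The following is a minimal sketch, assuming the data will live under /ads and that `df -Pi` reports inode totals for the filesystem in question (some filesystems do not). The function names and the REQUIRED_INODES variable are illustrative and are not part of the ADS software.

```shell
# Number of inodes suggested in the text above (approximately 32 million).
REQUIRED_INODES=32000000

# total_inodes DIR: print the total inode count of the filesystem holding DIR.
total_inodes() {
    df -Pi "$1" | awk 'NR == 2 { print $2 }'
}

# has_enough_inodes COUNT: succeed if COUNT meets the requirement.
has_enough_inodes() {
    [ "$1" -ge "$REQUIRED_INODES" ]
}

# Example use (as root or as user ads):
#   has_enough_inodes "$(total_inodes /ads)" || echo "/ads needs more inodes"
#
# When reformatting an ext filesystem, the inode count can be set explicitly,
# for instance: mkfs.ext4 -N 32000000 /dev/sdb1
# (the device name is an assumption, and this destroys all data on it)
```

If the check fails on an existing partition, the inode count generally cannot be raised in place on ext filesystems, so the partition has to be reformatted as described above.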
The ADS article service can be co-located on the local ADS mirror or the mirror can be set up to connect to the main ADS site when fulltext articles are requested. Currently (September 2014) the ADS article archive consists of 2TB of scanned article data, which we expect to grow to approximately 3TB over the next two years. In order to allow us to automatically update the archive of articles, it is important that the disk space set up for this purpose be configured as a single (virtual) partition to avoid having to distribute the data across filesystems. If you need help or suggestions concerning creating filesystems and/or metadevices to be used by ADS services please let us know and we'll be happy to advise you.
Currently we provide all the software to run the search engine as dynamically linked executables, but since other software is required to run the system itself and the HTTP daemon, the mirror site administrator should configure and, if necessary, install the following packages. Please note that modern versions of Linux include all the packages listed below, so all that is required is a minimum amount of configuration.
We will be responsible for properly mirroring server software and
index and text files to and from the mirror sites as well as
loading the indexes in shared memory when appropriate; to this end
we need to be given the password for user ads on the mirror site.
Additional steps required to get the
abstract files transferred to the mirror site and to get the
ADS database search interface up and running will be taken by
ADS staff (needs ADS admin authentication).
The httpd.conf settings needed to run the ADS server
are described below (here we assume that
all relevant directories are located under the directory
/ads; please make sure you modify this to reflect your
local setup). It may be easier for you to set up the ads server
as a virtual host and then simply include the ads.conf file
in your system's httpd.conf.
# disable DNS look-ups for efficiency
HostnameLookups off
# run as user ads
User ads
# define new log format so that the ADS cookie ID and
# Proxy information is saved
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Cookie}i\" \"%{Via}i\"" cookieid
CustomLog /ads/www/logs/access_log cookieid
# The timeout is high for our server because we serve article images,
# which may require long transfers. For a mirror site that only runs
# the abstract server, setting the timeout to 10 minutes (600 seconds)
# should be plenty.
Timeout 600
# define the directory index and access control file names
DirectoryIndex index.html
AccessFileName .htaccess
# allow .htaccess files to override options set for the
# document root directory and the cgi-bin directory
DocumentRoot /ads/www/htdocs
<Directory "/ads/www/htdocs">
AllowOverride All
Options FollowSymLinks
# the following two lines for apache httpd <= 2.2
Order allow,deny
Allow from all
# the following line for apache httpd >= 2.4
# Require all granted
</Directory>
# define the cgi-bin directory
ScriptAlias /cgi-bin/ /ads/www/cgi/bin/
<Directory "/ads/www/cgi/bin">
AllowOverride All
Options FollowSymLinks
# the following two lines for apache httpd <= 2.2
Order allow,deny
Allow from all
# the following line for apache httpd >= 2.4
# Require all granted
</Directory>
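The two Directory blocks above carry access-control lines for both Apache httpd <= 2.2 and >= 2.4, and only one set should be active. As a rough sketch, a small helper can print the lines that apply to a given version string. The function name access_directives is hypothetical and is not part of the ADS distribution.

```shell
# access_directives VERSION: print the access-control lines matching an
# Apache httpd version string such as "2.2.15" or "2.4.6".
access_directives() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second version component
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 4 ]; }; then
        printf '%s\n' 'Require all granted'
    else
        printf '%s\n' 'Order allow,deny' 'Allow from all'
    fi
}

# Example use, taking the version from the installed daemon:
#   v=$(httpd -v | sed -n 's:^Server version.*Apache/\([0-9.]*\).*:\1:p')
#   access_directives "$v"
```

After editing ads.conf it is worth running the daemon's own syntax check (httpd -t) before restarting the service.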
Perl: if perl is not installed in /usr/bin, the administrator will need
to create a symbolic link so that /usr/bin/perl points to the perl5
executable. Version 5.005 or higher is required.
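The perl requirement can be verified with a short check. This is only a sketch: the function name perl_ok is illustrative, and the path in the commented ln command is a placeholder for wherever perl5 lives on your system.

```shell
# perl_ok PATH: succeed if PATH is an executable perl of version 5.005 or higher.
perl_ok() {
    [ -x "$1" ] && "$1" -e 'require 5.005;' 2>/dev/null
}

# Example use:
#   perl_ok /usr/bin/perl || echo "/usr/bin/perl is missing or too old"
#
# If perl5 is installed elsewhere, create the link as root, for instance:
#   ln -s /usr/local/bin/perl5 /usr/bin/perl   # source path is an assumption
```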
ssh: the ssh daemon should allow public-key logins for user ads from the following host:
adsx.cfa.harvard.edu (131.142.184.210)
Please note:
Passwordless public-key authentication is the only way for us to
automatically mirror data and software updates to the mirror sites, so
it is important that you configure ssh to allow this type of access.
We recommend that you restrict passwordless access to your server from
these machines for better security.
Network Setup
Incoming ssh connections should be allowed from:
adsx.cfa.harvard.edu (131.142.184.210)
Please let us know once you have enabled ssh access so we can test the
setup.
Outgoing rsync connections should be allowed to:
adsx.cfa.harvard.edu (131.142.184.210)
To test this connection, you should be able to simply type:
rsync rsync://adsx.cfa.harvard.edu
and verify that our server responds to the query with a listing of
modules available for updating. (If rsync isn't installed on your
system, don't panic, we will take care of installing a private copy
under user ads's home directory.)
Outgoing connections on port 1674 should be allowed to the SIMBAD servers:
simbad.u-strasbg.fr (130.79.128.4)
simbad.harvard.edu (131.142.185.22)
To test this connection, you should be able to simply type:
telnet simbad.u-strasbg.fr 1674
and verify that you get a prompt back that looks like this:
Trying 130.79.128.4...
Connected to simbad.u-strasbg.fr (130.79.128.4).
Escape character is '^]'.
simbad/smbservc=1674:5077.6
Outgoing connections on port 10011 should be allowed to the NED server:
nedsrv.ipac.caltech.edu
To test this connection, you should be able to simply type:
telnet nedsrv.ipac.caltech.edu 10011
and verify that you get a prompt back that looks like this:
Trying 134.4.36.101...
Connected to abell-z1.ipac.caltech.edu (134.4.36.101).
Escape character is '^]'.
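The telnet tests above can also be run non-interactively, which is handy after a firewall change. The sketch below assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available. The function name check_port is illustrative.

```shell
# check_port HOST PORT: succeed if a TCP connection to HOST:PORT can be
# opened within 5 seconds. The connection is closed when the helper exits.
check_port() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example use, with the hosts and ports listed above:
#   check_port simbad.u-strasbg.fr 1674 && echo "SIMBAD reachable"
#   check_port nedsrv.ipac.caltech.edu 10011 && echo "NED reachable"
```

This only shows that the port is reachable. It does not check for the service banner that the telnet session displays.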
Email: ADS software expects a mail agent to be available as
/usr/lib/sendmail, and outgoing email must
be allowed to go through.
Creation of ads username
A user named ads should be created on the mirror server, with
normal user privileges (i.e. a working shell, home directory, etc.).
In the user's login scripts (.cshrc and/or .profile) you should set
the environment variable DOCUMENT_ROOT to point to the HTTP Document
root tree, as set in the HTTP daemon configuration files.
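As an illustration, the lines below could go in the ads user's .profile. The value shown is the DocumentRoot from the sample httpd configuration earlier in this document. Adjust it if your local setup differs.

```shell
# ~ads/.profile (Bourne-style shells)
# Must match the DocumentRoot directive in the httpd configuration.
DOCUMENT_ROOT=/ads/www/htdocs
export DOCUMENT_ROOT

# For csh or tcsh, the equivalent line in ~ads/.cshrc would be:
#   setenv DOCUMENT_ROOT /ads/www/htdocs
```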
Server Startup Procedures
The following startup script should be installed as /etc/init.d/ads:
#!/bin/sh
#
# Startup script for the ADS Services
# For more information, please see:
# http://ads.harvard.edu/mirror
#
# To install this script under RedHat linux, do the following (as root):
# cp init.d_ads /etc/init.d/ads
# chmod 744 /etc/init.d/ads
# chkconfig --add ads
# chkconfig ads on
#
# To install the startup procedure under solaris, do the following (as root):
# cp init.d_ads /etc/init.d/ads
# chmod 744 /etc/init.d/ads
# ln -s /etc/init.d/ads /etc/rc3.d/S93ads
#
# chkconfig: 5 86 16
# description: NASA Astrophysics Data System services
#
case "$1" in
    'start')
        # load shared memory segments under ads's uid
        su - ads -c '$HOME/mirror/local/ads_startup'
        ;;
    'stop')
        # remove shared memory segments
        su - ads -c '$HOME/mirror/local/ads_shutdown'
        ;;
    *)
        echo "Usage: $0 start|stop" 1>&2
        ;;
esac
HTTP logs
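The directory where the ADS httpd logs are stored should be added to the list of directories that logrotate handles, for example:
[root@adstrio etc]# cat /etc/logrotate.d/httpd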
/var/log/httpd/*log /ads/logs/*log {
    weekly
    rotate 5200
    compress
    dateext
    missingok
    notifempty
    sharedscripts
    postrotate
        /sbin/service httpd reload > /dev/null 2>/dev/null || true
    endscript
}
Miscellanea
The filesystem hosting the ADS data should be mounted with the option
noatime, which prevents the OS from updating inodes in the filesystem
every time a single file is accessed. Since the ADS search engine
accesses these files a lot, this can be a big win for a busy mirror site.
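Whether a filesystem is currently mounted with noatime can be read from /proc/mounts. The sketch below assumes the ADS data is mounted at /ads. The function name has_noatime is illustrative.

```shell
# has_noatime OPTIONS: succeed if the comma-separated mount option list
# contains noatime as a whole word.
has_noatime() {
    case ",$1," in
        *,noatime,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (the mount point /ads is an assumption):
#   opts=$(awk -v m=/ads '$2 == m { print $4 }' /proc/mounts)
#   has_noatime "$opts" || echo "/ads is not mounted with noatime"
#
# To switch the option on without a reboot (as root):
#   mount -o remount,noatime /ads
# and add noatime to the options field of the /ads entry in /etc/fstab
# so that the setting survives a reboot.
```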
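Periodic fsck checking at boot time on the ADS data disk can be disabled
with tune2fs (e.g. tune2fs -c 0 /dev/sda1, substituting the device that
holds your ADS partition).
Please see the man page for tune2fs for more info.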
There is no need to have updatedb crawl through the
filesystem hosting the ADS data since it is just a big waste of
I/O resources. To disable this, simply edit /etc/updatedb.conf
and add the top-level directory containing ADS data to the
PRUNEPATHS variable.
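A quick way to confirm the edit took effect is to look for the directory in the PRUNEPATHS line. This is a sketch only: the function name prunes_path is illustrative, and it assumes PRUNEPATHS is defined on a single line as a quoted, space-separated list.

```shell
# prunes_path FILE DIR: succeed if DIR is listed in the PRUNEPATHS
# setting of the updatedb configuration file FILE.
prunes_path() {
    grep -E '^[[:space:]]*PRUNEPATHS[[:space:]]*=' "$1" \
        | tr -d '"' \
        | tr ' =' '\n\n' \
        | grep -qxF "$2"
}

# Example use:
#   prunes_path /etc/updatedb.conf /ads || echo "/ads is still being indexed"
```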
Last revision of this page: 4 June 2013 by aaccomazzi@cfa.harvard.edu