Sensors description
The Nagios service that monitors the availability of Core Services in the Central European region performs the following tests:
CA Distribution
- fetches latest version information from www.eugridpma.org
- uses GridFTP to get CA info files
- compares version
- if the site uses a previous version, sends warnings for the first 7 days and
criticals after that
- compares version of all CAs with policy-igtf-classic.info
- period: 24 hours
- TODO: check if the distribution is published in LCG repository before sending alerts
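The warning-then-critical escalation above could be sketched as follows. This is a minimal illustration, not the sensor's actual code: the function name ca_status, its parameters, and the assumption that the 7-day window is counted from the release date of the latest version are all hypothetical. The return values follow the usual Nagios plugin exit-code convention.

```python
from datetime import datetime, timedelta

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# From the description: warnings are sent for the first 7 days.
GRACE_PERIOD = timedelta(days=7)


def ca_status(site_version, latest_version, latest_released, now):
    """Map the CA distribution version installed on a site to a Nagios status.

    Assumption (not stated in the description): the grace period starts
    at the moment the latest version was released.
    """
    if site_version == latest_version:
        return OK
    if now - latest_released <= GRACE_PERIOD:
        return WARNING
    return CRITICAL
```

Counting from the release date keeps the check stateless; if the window should instead start when the mismatch is first seen, the sensor would need to remember that timestamp between runs.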
Certificate lifetime
- uses GridFTP (CE nodes & RBs) or HTTPS (R-GMA Tomcat server on MON nodes) to
fetch server certificate
- check uses command "globus-url-copy" and original Nagios plugin check_http
- period: 24 hours
- TODO: how to get certificate from MyProxy & LFC & other nodes?
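Once the server certificate has been fetched, the lifetime check reduces to comparing its expiry date with the current time. The sketch below shows one way to do that in Python. The thresholds warn_days and crit_days and the function name lifetime_status are assumptions for illustration; the description does not say which thresholds the sensor uses.

```python
import ssl

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
SECONDS_PER_DAY = 86400


def lifetime_status(not_after, now, warn_days=30, crit_days=7):
    """Classify a certificate by its remaining lifetime.

    not_after: expiry date in the text form used in certificates,
               e.g. "Jan  5 09:34:43 2018 GMT".
    now:       current time as seconds since the epoch.
    warn_days, crit_days: hypothetical thresholds, not taken from the sensor.
    Returns (status, days_left); an expired certificate has negative days_left.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    days_left = (expiry - now) / SECONDS_PER_DAY
    if days_left < crit_days:
        return CRITICAL, days_left
    if days_left < warn_days:
        return WARNING, days_left
    return OK, days_left
```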
Globus Gatekeeper
- performs user authentication & authorization check on remote Gatekeeper
- check uses command "globusrun -a"
- period: 15 min
Globus Gatekeeper hostname
- executes hostname command on remote host and verifies the output
- check uses command "globus-job-run"
- dependency: Globus Gatekeeper
- period: 1 hour
Globus GridFTP
- checks if there is an FTP server listening on the GridFTP port
- check uses original Nagios check_ftp plugin
- period: 15 min
Globus GridFTP transfer
- transfers file to remote computer, back to local, checks the file
and removes the file on remote computer
- check uses commands "globus-url-copy" and "edg-gridftp-rm"
- dependency: Globus GridFTP
- period: 1 hour
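The transfer test above (upload, download, compare, remove) could be sketched roughly as below. The function name gridftp_roundtrip, its parameters, and the choice to report every failure as CRITICAL are assumptions; only the two command names come from the description. The run parameter is injectable so the flow can be exercised without a grid environment.

```python
import filecmp
import subprocess

# Usual Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def gridftp_roundtrip(host, local_src, local_copy, remote_path,
                      run=subprocess.call):
    """Copy a file to the remote host and back, compare, then clean up.

    local_src, local_copy, remote_path: absolute paths.
    run: callable that takes an argument list and returns an exit status.
    """
    remote_url = "gsiftp://%s%s" % (host, remote_path)
    uploaded = False
    try:
        # Local file -> remote host.
        if run(["globus-url-copy", "file://" + local_src, remote_url]) != 0:
            return CRITICAL
        uploaded = True
        # Remote host -> second local file.
        if run(["globus-url-copy", remote_url, "file://" + local_copy]) != 0:
            return CRITICAL
        # Byte-by-byte comparison of the original and the returned copy.
        if not filecmp.cmp(local_src, local_copy, shallow=False):
            return CRITICAL
        return OK
    finally:
        # Remove the remote file whenever the upload succeeded.
        if uploaded:
            run(["edg-gridftp-rm", remote_url])
```

Putting the removal in a finally clause means a failed download or comparison does not leave the test file behind on the remote computer.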
BDII
- connects to BDII server and checks if specific base dn exists
- site BDII:
GlueClusterUniqueID=<hostname>,Mds-Vo-Name=<sitename>,O=Grid
- central BDII:
GlueSEUniqueID=lxn1183.cern.ch,Mds-Vo-Name=CERN-PROD,Mds-Vo-Name=local,O=Grid
- check uses original Nagios check_ldap plugin
- period: 15 min
- TODO: check if the services which are supposed to be running on
site are reported to site BDII (eqv. to GSTAT service check)
MDS
- connects to Globus GRIS and checks if specific base dn exists
- CE: GlueClusterUniqueID=<hostname>,Mds-Vo-Name=local,O=Grid
- SE: GlueSEUniqueID=<hostname>,Mds-Vo-Name=local,O=Grid
- RB: GlueServiceUniqueID=<hostname>:7772,Mds-Vo-Name=local,O=Grid
- check uses original Nagios check_ldap plugin
- period: 15 min
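The per-node-type base DNs listed above lend themselves to a small lookup table. The sketch below builds the check_ldap argument list for a given node type. The function name check_ldap_args is hypothetical, and port 2135 is an assumption (the port a Globus GRIS conventionally listens on); the description does not state which port the sensor queries.

```python
# Base DN templates taken from the list above; %s is the node's hostname.
BASE_DN_TEMPLATES = {
    "CE": "GlueClusterUniqueID=%s,Mds-Vo-Name=local,O=Grid",
    "SE": "GlueSEUniqueID=%s,Mds-Vo-Name=local,O=Grid",
    "RB": "GlueServiceUniqueID=%s:7772,Mds-Vo-Name=local,O=Grid",
}


def check_ldap_args(node_type, hostname, port=2135):
    """Build a check_ldap command line for the given node.

    port: assumed GRIS port, adjust to the site's configuration.
    """
    base_dn = BASE_DN_TEMPLATES[node_type] % hostname
    return ["check_ldap", "-H", hostname, "-p", str(port), "-b", base_dn]
```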
GridICE MDS
- connects to GridICE LDAP collector and checks if specific base dn exists
- check uses original Nagios check_ldap plugin
- period: 15 min
LCG Broker
- submits a simple test job "/bin/echo", waits for the job to finish, fetches and verifies the output
- if the job is not finished in 30 minutes status is set to UNKNOWN and the job is cancelled
- check uses commands: "edg-job-submit", "edg-job-status", "edg-job-get-output" and "edg-job-cancel"
- period: 1 hour
- TODO: add dependencies (e.g. Globus GridFTP, job-list-match)
- TODO: make eqv. check for WMS (replace grid-proxy-init with voms-proxy-init)
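The submit, wait and cancel-on-timeout logic above could look roughly like the following. This is a sketch under stated assumptions: submit, status, get_output and cancel are hypothetical wrappers around "edg-job-submit", "edg-job-status", "edg-job-get-output" and "edg-job-cancel"; the state strings "Done" and "Aborted" are simplified stand-ins for whatever the status command reports; the 60-second polling interval is invented. Only the 30-minute limit and the UNKNOWN result come from the description.

```python
import time

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

TIMEOUT = 30 * 60  # seconds, from the description


def run_test_job(submit, status, get_output, cancel, expected,
                 timeout=TIMEOUT, poll=60,
                 clock=time.time, sleep=time.sleep):
    """Submit a test job, wait for it, and verify its output.

    clock and sleep are injectable so the loop can be tested without waiting.
    """
    job_id = submit()
    deadline = clock() + timeout
    while True:
        state = status(job_id)
        if state == "Done":
            # Job finished: fetch the output and compare it.
            return OK if get_output(job_id) == expected else CRITICAL
        if state == "Aborted":
            return CRITICAL
        if clock() >= deadline:
            # Not finished in time: cancel the job and report UNKNOWN.
            cancel(job_id)
            return UNKNOWN
        sleep(poll)
```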
MyProxy
- creates proxy certificate on MyProxy server, gets the proxy info and destroys it
- check uses MyProxy commands: "myproxy-init", "myproxy-info" and "myproxy-destroy"
- period: 15 min
RGMA Tomcat
- connects to remote server (port 8443), authenticates and gets
/R-GMA web page
- check uses contrib Nagios plugin check_http-with-client-certificate
- period: 15 min
- TODO: create a more advanced test based on the rgma-check-client script, dependent on this check and with a longer period (similar to other tests)
SRMv1 ping
- performs ping of SRM service
- takes SRM port as argument, but also checks Globus MDS to fetch SRM URI - attribute GlueServiceURI
- check uses original commands "ldapsearch" for querying BDII and "glite-srm-ping" for pinging SRM
- period: 15 min
SRMv1 transfer
- transfers file to remote computer, back to local, checks the file
and removes the file on remote computer
- remote Globus MDS is used to fetch SRM URI and remote path (object GlueSAPolicy, attribute GlueSAPath)
- in case of MDS failure static configuration is used
- check uses commands "srmcp" and "srm-advisory-delete"
- dependency: SRMv1 ping
- period: 1 hour
- TODO: currently default srmcp protocol is used, should we test all possible protocols?
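The fallback from MDS to static configuration could be sketched as follows. The names resolve_srm_endpoint, query_mds and static_config are hypothetical: query_mds stands for whatever code performs the LDAP lookup of GlueServiceURI and GlueSAPath, and static_config for a per-host table maintained by hand.

```python
def resolve_srm_endpoint(hostname, query_mds, static_config):
    """Return ((srm_uri, remote_path), source), preferring MDS.

    query_mds:     hypothetical callable returning (uri, path) for the host;
                   it may raise or return incomplete data when MDS fails.
    static_config: dict mapping hostname to (uri, path).
    source:        "mds" or "static", useful for the plugin's status message.
    """
    try:
        result = query_mds(hostname)
    except Exception:
        result = None
    # Accept the MDS answer only if both values are present.
    if result and all(result):
        return tuple(result), "mds"
    return static_config[hostname], "static"
```

Returning the source alongside the values lets the check mention in its output that it ran on static configuration, which may itself be worth a warning.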
DPNS ping
- checks if anything is listening on port 5010
- check uses original Nagios plugin check_tcp
- period: 15 min
- TODO: where can I check if the site is running DPNS on different port? (GOCDB & BDII don't have this information)
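What the check_tcp plugin does here amounts to a plain TCP connect. A minimal Python equivalent is shown below for illustration; the function name tcp_ping and the 10-second timeout are assumptions, and the port is kept as a parameter because, as the TODO notes, a site may run DPNS on a different one.

```python
import socket

# Usual Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def tcp_ping(host, port=5010, timeout=10.0):
    """Return (status, message) depending on whether the port accepts connections."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, "connect to %s:%d failed: %s" % (host, port, exc)
    sock.close()
    return OK, "port %d on %s accepts connections" % (port, host)
```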
DPNS
- executes dpns-ls /dpm and checks if the remote server's domain is
in the list
- check uses command "dpns-ls"
- dependency: DPNS ping
- period: 1 hour
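The "domain is in the list" step could be expressed as below. The sketch assumes that the domain is everything after the first dot of the server's hostname and that the dpns-ls output has one entry per line; both are assumptions, as is the function name domain_listed.

```python
def domain_listed(hostname, dpns_ls_output):
    """Check whether the server's domain appears in the output of dpns-ls /dpm.

    Assumption: the domain is the hostname with its first label removed,
    e.g. "se01.example.org" -> "example.org".
    """
    _, _, domain = hostname.partition(".")
    if not domain:
        return False
    entries = [line.strip() for line in dpns_ls_output.splitlines()]
    return domain in entries
```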
VOMS proxy
- creates voms proxy for given VO
- check uses LCG command "voms-proxy-init"
- period: 15 min
VOMS Admin Tomcat
- connects to remote Tomcat server (port 8443) and authenticates
- check uses contrib Nagios plugin check_http-with-client-certificate
- period: 15 min
VOMS Admin gridmap
- creates gridmap file for given VO and checks reported number of users
- check uses LCG command edg-mkgridmap.pl
- dependency: VOMS Admin Tomcat
- period: 1 hour
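Checking the number of users in a generated gridmap file could be done along these lines. The sketch assumes the common grid-mapfile layout of one quoted DN followed by a local account per line; the function names and the min_users threshold are hypothetical, since the description does not say what user count the sensor treats as a failure.

```python
import shlex

# Usual Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def count_gridmap_users(gridmap_text):
    """Count distinct DNs in grid-mapfile text, skipping blanks and comments."""
    dns = set()
    for line in gridmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex handles the quoted DN, which usually contains spaces.
        dns.add(shlex.split(line)[0])
    return len(dns)


def gridmap_status(gridmap_text, min_users=1):
    """Return (status, user_count); min_users is a hypothetical threshold."""
    users = count_gridmap_users(gridmap_text)
    return (OK if users >= min_users else CRITICAL), users
```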
WMProxy delegation
- delegates proxy to WMProxy
- check uses commands: "glite-wms-job-delegate-proxy"
- period: 15 min
WMProxy
- this test is equivalent to the LCG Broker test
- submits a simple test job "/bin/echo", waits for the job to finish, fetches and verifies the output
- if the job is not finished in 30 minutes status is set to UNKNOWN and the job is cancelled
- check uses commands: "glite-wms-job-submit", "glite-wms-job-status", "glite-wms-job-output" and "glite-wms-job-cancel"
- period: 1 hour
WMS
- this test is equivalent to the LCG Broker test
- submits a simple test job "/bin/echo", waits for the job to finish, fetches and verifies the output
- if the job is not finished in 30 minutes status is set to UNKNOWN and the job is cancelled
- check uses commands: "glite-job-submit", "glite-job-status", "glite-job-output" and "glite-job-cancel"
- period: 1 hour
LFC
- executes lfc-ls /home and reports the list
- check uses command "lfc-ls"
- period: 15 min