Sensors description

by Marcin Radecki last modified 2007-02-06 15:40

The Nagios service that monitors the availability of Core Services in the Central European region performs the following tests:


CA Distribution
    - fetches latest version information from www.eugridpma.org
    - uses GridFTP to get CA info files
    - compares version
    - if a site still uses the previous version, sends warnings for the first 7 days and criticals after that
    - compares version of all CAs with policy-igtf-classic.info
    - period: 24 hours
    - TODO: check if the distribution is published in LCG repository before sending alerts
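
The warning-then-critical escalation described above can be sketched as follows. This is only an illustration: the function name, the way versions and dates are passed in, and the exact boundary (days 0 to 6 give WARNING, day 7 onwards gives CRITICAL) are assumptions, not details taken from the actual sensor.

```python
from datetime import date

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Length of the warning window, taken from the description above.
GRACE_DAYS = 7


def ca_version_state(site_version, latest_version, release_date, today):
    """Return a Nagios state for a site's CA distribution version.

    Hypothetical helper: any version that differs from the latest one is
    treated as outdated, and the grace period is counted from the release
    date of the latest version.
    """
    if site_version == latest_version:
        return OK
    days_outdated = (today - release_date).days
    return WARNING if days_outdated < GRACE_DAYS else CRITICAL
```

For example, a site still on the previous version three days after a release would be reported as WARNING, and eight days after the release as CRITICAL.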

Certificate lifetime
    - uses GridFTP (CE nodes & RBs) or HTTPS (R-GMA Tomcat server on MON nodes) to fetch server certificate
    - check uses command "globus-url-copy" and original Nagios plugin check_http
    - period: 24 hours
    - TODO: how to get certificate from MyProxy & LFC & other nodes?
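
Once the server certificate has been fetched, its remaining lifetime has to be mapped to a Nagios state. The description does not state the thresholds, so the sketch below uses hypothetical values (warning below 30 days, critical below 7 days) and a hypothetical function name. It assumes the expiry date is available as an OpenSSL-style notAfter string.

```python
import ssl
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def cert_lifetime_state(not_after, now=None, warn_days=30, crit_days=7):
    """Map a certificate notAfter string to a Nagios state.

    not_after uses the OpenSSL text format, e.g. "Mar  1 00:00:00 2007 GMT".
    warn_days and crit_days are illustrative defaults, not the sensor's
    real thresholds.
    """
    now = time.time() if now is None else now
    expires = ssl.cert_time_to_seconds(not_after)
    days_left = (expires - now) / 86400
    if days_left < crit_days:
        return CRITICAL
    if days_left < warn_days:
        return WARNING
    return OK
```

An already expired certificate yields a negative number of days and is therefore reported as CRITICAL.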

Globus Gatekeeper
    - performs user authentication & authorization check on remote Gatekeeper
    - check uses command "globusrun -a"
    - period: 15 min

Globus Gatekeeper hostname
    - executes hostname command on remote host and verifies the output
    - check uses command "globus-job-run"
    - dependency: Globus Gatekeeper
    - period: 1 hour

Globus GridFTP
    - checks if there is an FTP server listening on GridFTP port
    - check uses original Nagios check_ftp plugin
    - period: 15 min

Globus GridFTP transfer
    - transfers file to remote computer, back to local, checks the file and removes the file on remote computer
    - check uses commands "globus-url-copy" and "edg-gridftp-rm"
    - dependency: Globus GridFTP
    - period: 1 hour
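
The round trip described above (copy out, copy back, compare, clean up) could look roughly like this. The sketch is an assumption about the control flow only: copy and remove are hypothetical callables standing in for wrappers around "globus-url-copy" and "edg-gridftp-rm", and the payload is made up.

```python
import os
import tempfile

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def round_trip_state(copy, remove, remote_url, payload=b"nagios gridftp test\n"):
    """Send a file out and back, compare it, and remove the remote copy.

    copy(src, dst) and remove(url) are supplied by the caller; any
    exception they raise is reported as CRITICAL.
    """
    with tempfile.TemporaryDirectory() as workdir:
        sent = os.path.join(workdir, "sent")
        received = os.path.join(workdir, "received")
        with open(sent, "wb") as handle:
            handle.write(payload)
        try:
            copy(sent, remote_url)  # local -> remote
            try:
                copy(remote_url, received)  # remote -> local
                with open(received, "rb") as handle:
                    intact = handle.read() == payload
            finally:
                remove(remote_url)  # clean up once the upload succeeded
        except Exception:
            return CRITICAL
        return OK if intact else CRITICAL
```

Passing the transfer commands in as callables keeps the comparison and clean-up logic testable without a grid environment.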

BDII
    - connects to BDII server and checks if specific base dn exists
    - site BDII: GlueClusterUniqueID=<hostname>,Mds-Vo-Name=<sitename>,O=Grid
    - central BDII: GlueSEUniqueID=lxn1183.cern.ch,Mds-Vo-Name=CERN-PROD,Mds-Vo-Name=local,O=Grid
    - check uses original Nagios check_ldap plugin
    - period: 15 min
    - TODO: check if the services which are supposed to be running on site are reported to site BDII (eqv. to GSTAT service check)

MDS
    - connects to Globus GRIS and checks if specific base dn exists
    - CE: GlueClusterUniqueID=<hostname>,Mds-Vo-Name=local,O=Grid
    - SE: GlueSEUniqueID=<hostname>,Mds-Vo-Name=local,O=Grid
    - RB: GlueServiceUniqueID=<hostname>:7772,Mds-Vo-Name=local,O=Grid
    - check uses original Nagios check_ldap plugin
    - period: 15 min
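
The base DNs listed above differ only in the Glue object and, for the RB, the port suffix, so they can be generated from the node type. The sketch below builds a check_ldap command line from them. The helper names are hypothetical, and the default port 2135 (the conventional Globus MDS port) is an assumption, since the description does not name the port.

```python
# Base DN templates copied from the list above.
BASE_DN_TEMPLATES = {
    "CE": "GlueClusterUniqueID={host},Mds-Vo-Name=local,O=Grid",
    "SE": "GlueSEUniqueID={host},Mds-Vo-Name=local,O=Grid",
    "RB": "GlueServiceUniqueID={host}:7772,Mds-Vo-Name=local,O=Grid",
}


def gris_base_dn(node_type, hostname):
    """Return the base DN that the MDS check looks up for a node."""
    return BASE_DN_TEMPLATES[node_type].format(host=hostname)


def check_ldap_argv(node_type, hostname, port=2135):
    """Build an argument list for the check_ldap plugin.

    Uses the plugin's -H (host), -p (port) and -b (base DN) options;
    the port default is an assumption.
    """
    return [
        "check_ldap",
        "-H", hostname,
        "-p", str(port),
        "-b", gris_base_dn(node_type, hostname),
    ]
```

The resulting list can be handed to subprocess.run, whose return code then carries the plugin's Nagios state.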

GridICE MDS
    - connects to GridICE LDAP collector and checks if specific base dn exists
    - check uses original Nagios check_ldap plugin
    - period: 15 min

LCG Broker
    - submits a simple test job "/bin/echo", waits for the job to finish, fetches and verifies the output
    - if the job is not finished in 30 minutes status is set to UNKNOWN and the job is cancelled
    - check uses commands: "edg-job-submit", "edg-job-status", "edg-job-get-output" and "edg-job-cancel"
    - period: 1 hour
    - TODO: add dependencies (e.g. Globus GridFTP, job-list-match)
    - TODO: make eqv. check for WMS (replace grid-proxy-init with voms-proxy-init)
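
The submit, wait, verify and cancel-on-timeout cycle described above can be sketched as a polling loop. Everything here is illustrative: submit, status, get_output and cancel are hypothetical callables standing in for wrappers around the edg-job-* commands, the "done" status string and the 60-second poll interval are assumptions, and treating wrong output as CRITICAL is a guess at the sensor's behaviour.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# 30-minute limit taken from the description above.
TIMEOUT_SECONDS = 30 * 60


def run_job_check(submit, status, get_output, cancel, expected_output,
                  timeout=TIMEOUT_SECONDS, poll_interval=60,
                  clock=time.monotonic, sleep=time.sleep):
    """Submit a test job and poll it until it finishes or times out.

    Returns OK when the fetched output matches, CRITICAL when it does
    not, and UNKNOWN (after cancelling the job) when the job has not
    finished within the timeout.
    """
    job_id = submit()
    deadline = clock() + timeout
    while clock() < deadline:
        if status(job_id) == "done":
            output = get_output(job_id)
            return OK if output == expected_output else CRITICAL
        sleep(poll_interval)
    cancel(job_id)
    return UNKNOWN
```

The clock and sleep parameters exist only so the loop can be exercised with a fake clock instead of waiting 30 minutes. A real check would also need to handle aborted jobs, which this sketch leaves out.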

MyProxy
    - creates proxy certificate on MyProxy server, gets the proxy info and destroys it
    - check uses MyProxy commands: "myproxy-init", "myproxy-info" and "myproxy-destroy"
    - period: 15 min

RGMA Tomcat
    - connects to remote server (port 8443), authenticates and gets /R-GMA web page
    - check uses contrib Nagios plugin check_http-with-client-certificate
    - period: 15 min
    - TODO: create a more advanced test based on the rgma-check-client script, dependent on this check and with a longer period (similar to other tests)

SRMv1 ping
    - performs ping of SRM service
    - takes SRM port as argument, but also checks Globus MDS to fetch SRM URI - attribute GlueServiceURI
    - check uses original commands "ldapsearch" for querying BDII and "glite-srm-ping" for pinging SRM
    - period: 15 min

SRMv1 transfer
    - transfers file to remote computer, back to local, checks the file and removes the file on remote computer
    - remote Globus MDS is used to fetch SRM URI and remote path (object GlueSAPolicy, attribute GlueSAPath)
    - in case of MDS failure static configuration is used
    - check uses commands "srmcp" and "srm-advisory-delete"
    - dependency: SRMv1 ping
    - period: 1 hour
    - TODO: currently default srmcp protocol is used, should we test all possible protocols?

DPNS ping
    - checks if anything is listening on port 5010
    - check uses original Nagios plugin check_tcp
    - period: 15 min
    - TODO: where can I check if the site is running DPNS on different port? (GOCDB & BDII don't have this information)
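
The port test above amounts to a plain TCP connect. The sensor itself uses the check_tcp plugin; the sketch below only illustrates the same idea in Python, with a hypothetical function name and an assumed 10-second timeout.

```python
import socket

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def tcp_port_state(host, port=5010, timeout=10.0):
    """Return OK if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return CRITICAL
```

Making the port a parameter would cover sites that run DPNS on a non-default port, once that information can be obtained.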

DPNS
    - executes dpns-ls /dpm and checks if the remote server's domain is in the list
    - check uses command "dpns-ls"
    - dependency: DPNS ping
    - period: 1 hour

VOMS proxy
    - creates voms proxy for given VO
    - check uses LCG command "voms-proxy-init"
    - period: 15 min

VOMS Admin Tomcat
    - connects to remote Tomcat server (port 8443) and authenticates
    - check uses contrib Nagios plugin check_http-with-client-certificate
    - period: 15 min

VOMS Admin gridmap
    - creates gridmap file for given VO and checks reported number of users
    - check uses LCG command edg-mkgridmap.pl
    - dependency: VOMS Admin Tomcat
    - period: 1 hour

WMProxy delegation
    - delegates proxy to WMProxy
    - check uses commands: "glite-wms-job-delegate-proxy"
    - period: 15 min

WMProxy
    - this test is equivalent to EDG Broker test
    - submits a simple test job "/bin/echo", waits for the job to finish, fetches and verifies the output
    - if the job is not finished in 30 minutes status is set to UNKNOWN and the job is cancelled
    - check uses commands: "glite-wms-job-submit", "glite-wms-job-status", "glite-wms-job-output" and "glite-wms-job-cancel"
    - period: 1 hour

WMS
    - this test is equivalent to EDG Broker test
    - submits a simple test job "/bin/echo", waits for the job to finish, fetches and verifies the output
    - if the job is not finished in 30 minutes status is set to UNKNOWN and the job is cancelled
    - check uses commands: "glite-job-submit", "glite-job-status", "glite-job-output" and "glite-job-cancel"
    - period: 1 hour

LFC
    - executes lfc-ls /home and reports the list
    - check uses command "lfc-ls"
    - period: 15 min