SGE Cluster qsub email notifications not working

I am working on an SGE cluster and am having problems with the qsub email notification system. All my jobs run fine, but I cannot change the default behavior of notifying only when a job is aborted. The -M flag works correctly and I receive an email when a job is aborted, but I would like to receive an email when a job begins, ends, is aborted, or is suspended. I am using the following flags (and more) in my scripts. Is there something stupid that I am missing?

#!/bin/bash
#$ -S /bin/bash
#$ -M email@server
#$ -m beas

program

      

It also fails when I try the following:

qsub -M email@server -m baes script.sh

      

Is this a problem that I should take up with the cluster sysadmins, or am I doing something wrong?

Thank you for your help.



1 answer


The key to working out this issue is that your job status mail is sent from the node where the job runs. For example, I have a test job with the following script:

#!/bin/bash
#
#$ -N MAIL
#$ -j y
#$ -m easb
#$ -M pkenyon

hostname

      

Now run the job and see where it ran.

[pkenyon@head ~]$ qsub mail.sh
Your job 346 ("MAIL") has been submitted
[pkenyon@head ~]$ cat MAIL.o346
node03.cluster

      



If you look at the mail logs on that node, you will see the delivery attempts that were made. You will have to diagnose from there. Here are some examples of failures (or even successes that did not work out the way you want them to):

  • Delivered to a local mailbox on the compute node when using -M pkenyon

    ...
    Jun  5 13:56:00 node04 postfix/local[13141]: 14A3E143320: to=<pkenyon@node04.cluster>, orig_to=<pkenyon>, relay=local, delay=0.05, delays=0.03/0/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
    ...
    
          

  • Mail routing (MX/DNS) for the head node is not configured correctly when using -M pkenyon@head.cluster

    ...
    Jun  5 14:00:30 node04 postfix/smtp[13283]: 35CC4143320: to=<pkenyon@head.cluster>, relay=none, delay=0.36, delays=0.17/0/0.19/0, dsn=5.4.4, status=bounced (Host or domain name not found. Name service error for name=head.cluster type=AAAA: Host not found)
    ...
    
          

  • The system needs to be configured to use a local mail relay when using -M someone@gmail.com

    ...
    Jun  5 12:20:47 node04 postfix/smtp[12798]: 1EEA5143320: to=<someone@gmail.com>, relay=ASPMX.L.GOOGLE.com[64.233.168.27]:25, delay=0.64, delays=0.04/0/0.59/0.02, dsn=5.0.0, status=bounced (host ASPMX.L.GOOGLE.com[64.233.168.27] said: 550 Relay not permitted (in reply to RCPT TO command))
    ...
    
          

So, yes, you need to speak with the cluster sysadmins, but these are the first steps for finding out where your SGE mails are getting stuck. With a little more information, the administrators can fix the configuration issue and help you get more out of your cluster environment.
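
To make the log lines above easier to scan before you hand them to the admins, here is a minimal sketch that reduces each postfix delivery line to recipient, DSN code, and status. The function name summarize_mail_log is made up for illustration, and the pattern assumes the log format shown in the examples above; other mail servers or postfix versions may log differently.

```shell
#!/bin/bash
# Hypothetical helper: read postfix log lines on stdin and print
# "recipient dsn status" for every line that records a delivery attempt.
# The leading space in " to=<" skips the orig_to=<...> field.
summarize_mail_log() {
    sed -n 's/.* to=<\([^>]*\)>.*dsn=\([0-9.]*\),.*status=\([a-z]*\).*/\1 \2 \3/p'
}

# Demo on one of the sample lines from this answer.
summarize_mail_log <<'EOF'
Jun  5 14:00:30 node04 postfix/smtp[13283]: 35CC4143320: to=<pkenyon@head.cluster>, relay=none, delay=0.36, delays=0.17/0/0.19/0, dsn=5.4.4, status=bounced (Host or domain name not found. Name service error for name=head.cluster type=AAAA: Host not found)
EOF
# prints: pkenyon@head.cluster 5.4.4 bounced
```

On a real cluster you would pipe the mail log of the execution node into the function. The log location is an assumption and depends on the distribution (often /var/log/maillog or /var/log/mail.log), and reading it may require admin rights. A DSN starting with 2 means the mail was accepted, while one starting with 5 is a permanent failure.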
