Oct 212008
 

The Spec

The Perl system() function is very powerful and elegant. It forks a child process, suspends the parent process, executes the provided command (potentially calling your shell to help parse the arguments), while seamlessly redirecting all IO between the parent and child process! The usage is simple, but what happens behind the scenes is amazing!

Unfortunately, there is no way to interrupt the system() function in Perl. Sure, you can kill the main Perl program, but that’s not what I want. I want to call system() with 2 additional arguments: timeout and maxattempts. This subroutine would operate just like the traditional system() function, unless operation time exceeded the timeout value, in which case, the command would be killed and restarted, until the maximum number of attempts was exceeded.

A Dead End

You can find many resources that detail how to timeout a long Perl operation, like so:

eval {
    local $SIG{ALRM} = sub { die "alarm clock restart" };
    alarm 10;
    flock(FH, 2);   # blocking write lock
    alarm 0;
};
if ($@ and $@ !~ /alarm clock restart/) { die }

Unfortunately, there is a little footnote that says you should not try this with system calls; otherwise, you get zombies. Sure enough, if you substitute a system() function for the above flock, the parent Perl script is alarmed by the timeout and exits the eval. Normally, this would kill the flock or any other function. But the system function persits. The parent may even complete the remainder of its program and exit, but the child will keep on ticking – not what I wanted. The second problem is that there is no way to get, or access the process id of the command executed by the system() function; therefore, there is no way to kill a system function call by the parent Perl process – at least, no way that I have found.

The above link suggests using fork and exec to create your own function, which is ultimately what I did. So, let’s jump straight to the chase scene, shall we? Here’s my final solution.

Preferred Solution

#!/usr/bin/perl -w
use strict 'refs';
use POSIX "sys_wait_h";

sub timedSystemCall {

  my ($cmd, $timeout, $maxattempts, $attempt, $origmax) = @_;

  # degenerate into system() call - infinite timeout, if timeout is undefined or negative
  $timeout = 0 unless defined($timeout) && ($timeout > 0);
  # degenerate into system() call - 1 attempt, if max attempts is undefined or negative
  $maxattempts = 1 unless defined($maxattempts) && ($maxattempts > 0);
  $attempt = 1 unless defined($attempt) && ($attempt > 0);
  $origmax = $maxattempts unless defined $origmax;

  local ($rc, $pid);

  eval {
    local $SIG{ALRM} = sub { die "TIMEOUT" };

    # Fork child, system process
  FORK: {
      if ($pid = fork) {
        # parent picks up here, with $pid = process id of child; however...
        # NO-OP - Parent does nothing in this case, except avoid branches below...
      } elsif (defined $pid) {  # $pid is zero if here defined
        # child process picks up here - parent process available with getppid()
        # execute provided command, or die if fails
        exec($cmd) || die("(E) timedSystemCall: Couldn't run $cmd: $!\n");
        # child never progresses past here, because of (exec-or-die) combo
      } elsif ($! =~ /No more processes/) {
        # Still in parent:  EAGAIN, supposedly recoverable fork error
        print STDERR "(W) timedSystemCall: fork failed.  Retrying in 5-seconds. ($!)\n";
        sleep 5;
        redo FORK;
      } else {
        # unknown fork error
        die "(E) timedSystemCall:  Cannot fork: $!\n";
      }
    }

    # set alarm to go off in "timeout" seconds, and call $SIG{ALRM} at that time
    alarm($timeout);
    # hang (block) until program is finished
    waitpid($pid, 0);

    # program is finished - disable alarm
    alarm(0);
    # grab output of waitpid
    $rc = $?;
  };                            # end of eval

  # Did eval exit from an alarm timeout?
  if (($@ =~ "^TIMEOUT") || !defined($rc)) {
    # Yes - kill process
    kill(KILL => $pid) || die "Unable to kill $pid - $!";
    # Collect child's remains
    ($ret = waitpid($pid,0)) || die "Unable to reap child $pid (ret=$ret) - $!";
    # grab exit output of child process
    if ($rc = $?) {
      # exit code is lower byte: shift out exit code, leave top byte in $rc
      my $exit_value = $rc >> 8;
      # killing signal is lower 7-bits of top byte, which was shifted to lower byte
      my $signal_num = $rc & 127;
      # core-dump flag is top bit
      my $dumped_core = $rc & 128;
      # Notify grandparent of obituary
      print STDERR "(I) timedSystemCall:  Child $pid obituary: exit=$exit_value, kill_signal=$signal_num, dumped_core=$dumped_core\n";
    }
    # Can we try again?
    if ($maxattempts > 1) {
      # Yes! Increment counter, for print messages
      $attempt++;
      print STDERR "(W) timedSystemCall:  Command timed-out after $timeout seconds.  Restarting ($attempt of $origmax)...\n";
      # Recurse into self, while decrementing number of attempts. Return value from deepest recursion
      return timedSystemCall($cmd, $timeout, $maxattempts-1, $attempt, $origmax);
    } else {
      # No!  Out of attempts...
      print STDERR "(E) timedSystemCall:  Exhausted maximum attempts ($origmax) for command: $cmd\nExiting!\n";
      # Return error code of killed process - will require interpretation by parent
      return $rc;
    }
  } else {
    # No - process completed successfully!  Hooray!!!  Return success code (should be zero).
    return $rc;
  }
}

exit timedSystemCall("inf.pl", 5, 3);

The reason this solution is preferred is because it does not consume CPU while waiting for the child to complete or timeout. Furthermore, it’s the simplest and most elegant solution I have found.

This solution works because the child inherits the exact same environment as the parent, including its standard IO handles (STDOUT, STDIN, STDERR), just as does the command issued by the system() function. Therefore, when the child prints to its STDOUT, it is printing directly to the parent’s STDOUT. And, when the child requests input from its STDIN, it is querying its parent’s STDIN. Therefore, we are not required to perform any fancy polling to copy the child’s output to the parent’s output, or otherwise shuttle communication between the child and the parent’s environment. Moreover, if the parent is killed for some reason, our child process is also killed, so we don’t have to worry about zombies – as much.

The hints for this solution came from an example on pg. 167 of O’Reilly’s Programming Perl, under the fork function description, and from pg. 554-555 of O’Reilly’s Perl Cookbook, under the discussion, “16.1. Gathering Output from a Program”.

Unfortunately, this was not the first solution I created. If you are interested, a few other solutions I found are provided following a few usage examples. Both of these other solutions mostly work; however, they have drawbacks, when compared to the above, preferred solution.

Usage Examples

If the above script is called using the infinite output script as a child, you get output like so:

perl: timedSystemCall("inf.pl", 5, 3);
1
2
3
4
5
(I) timedSystemCall:  Child 14672 obituary: exit=0, kill_signal=9, dumped_core=0
(W) timedSystemCall:  Command timed-out after 5 seconds.  Restarting (2 of 3)...
1
2
3
4
5
(I) timedSystemCall:  Child 14683 obituary: exit=0, kill_signal=9, dumped_core=0
(W) timedSystemCall:  Command timed-out after 5 seconds.  Restarting (3 of 3)...
1
2
3
4
5
(I) timedSystemCall:  Child 14685 obituary: exit=0, kill_signal=9, dumped_core=0
(E) timedSystemCall:  Exhausted maximum attempts (3) for command: inf.pl
$ echo $?
9

This particular child produces output every second for infinity, except it is limited by our new function for a 5-second timeout with a maximum of 3 attempts. The function politely reports all restarts on standard error, so not to comingle with the standard output.

If the system() call does not exceed the timeout, or if the last two arguments are omitted, then the perl script ends as would be expected of a normal system() call, like so:

# complete in 3 seconds - before 5 sec timeout
perl: timedSystemCall("inf.pl 3", 5, 3);
1
2
3
$ echo $?
# degenerate into system() behavior
perl: timedSystemCall("inf.pl");
0
1
2
3
4
5
6
7
8
9
^C
Captured SIGINT.  Exiting after 9 seconds.
$ echo $?
130

Hopefully, you will find this function useful. If you are intested in better understanding a few alternatives, although lesser they may be, then read on! Otherwise, enjoy this new function!

Solution #1

My first solution hinges around the open3 function, which launches the input command and returns the essential process id, so we can kill it, if it runs too long. Output is synchronized by polling non-blocking versions of the child’s output handles, and dumping them to the parent’s output. This waiting loop is CPU bound, so it consumes 100% of one CPU, trying to keep the outputs synchronized – bad! Furthermore, the child’s input is not synchronized – very bad!

use IPC::Open3;
use Fcntl;
use POSIX "sys_wait_h";

sub timedSystemCall {

  local ($cmd, $timeout, $maxattempts, $retry, $origmax) = @_;

  # degenerate into system() call - infinite timeout, if timeout is undefined or negative
  $timeout = 0 unless defined($timeout) && ($timeout > 0);
  # degenerate into system() call - 1 attempt, if max attempts is undefined or negative
  $maxattempts = 1 unless defined($maxattempts) && ($maxattempts > 0);
  $attempt = 1 unless defined($attempt) && ($attempt > 0);
  $origmax = $maxattempts unless defined $origmax;

  local ($rc, $pid);

  eval {
    local $SIG{ALRM} = sub { die "TIMEOUT" };

    $pid = open3(\*WTR, \*RDR, \*ERR, $cmd) || die("(E) timedSystemCall: Unable to launch command - $cmd\n$!\n");
    # Make reads from RDR to be non-blocking
    my $rflags = 0;
    fcntl(RDR, F_GETFL, $rflags) || die $!;
    $rflags |= O_NONBLOCK;
    fcntl(RDR, F_SETFL, $rflags) || die $!;
    # Make reads from RDR to be non-blocking
    my $eflags = 0;
    fcntl(ERR, F_GETFL, $eflags) || die $!;
    $eflags |= O_NONBLOCK;
    fcntl(ERR, F_SETFL, $eflags) || die $!;
    #$pid = open3(">&STDIN", "<&STDOUT", "<&STDERR", $cmd) || die("(E) timedSystemCall: Unable to launch command - $cmd\n$!\n");

    alarm($timeout);

    # Is program finished?
    until (waitpid($pid, WNOHANG)) {
      # No!
      # NONBLOCKING: Did the program produce any output (STDOUT)?
      while () {
        # Yes - dump output to this program's STDOUT
        print STDOUT;
      }
      #NONBLOCKING: Did the program produce any errors (STDERR)?
      while () {
        # Yes - dump errors to this program's STDERR
        print STDERR;
      }
    } # exit until
    # program is finished - disable alarm
    alarm(0);
    # grab output of waitpid, and separate bytes
    $rc = $?;
    # close associated IO handles
    close(WTR);
    close(RDR);
    close(ERR);
  }; # end of eval

  # Did eval exit from an alarm timeout?
  if (($@ =~ "^TIMEOUT") || !defined($rc)) {
    # Yes - kill process
    kill(KILL => $pid) || die "Unable to kill $pid - $!";
    # Collect child's remains
    ($ret = waitpid($pid,0)) || die "Unable to reap child $pid (ret=$ret) - $!";
    # grab exit output of child process
    if ($rc = $?) {
      # exit code is lower byte: shift out exit code, leave top byte in $rc
      my $exit_value = $rc >> 8;
      # killing signal is lower 7-bits of top byte, which was shifted to lower byte
      my $signal_num = $rc & 127;
      # core-dump flag is top bit
      my $dumped_core = $rc & 128;
      # Notify grandparent of obituary
      print "(I) timedSystemCall:  Child $pid obituary: exit=$exit_value, kill_signal=$signal_num, dumped_core=$dumped_core\n";
    }
    # Can we try again?
    if ($maxattempts > 1) {
      # Yes! Increment counter, for print messages
      $retry++;
      print "(W) timedSystemCall:  Command timed-out after $timeout seconds.  Restarting ($retry of $origmax)...\n";
      # Recurse into self, while decrementing number of attempts. Return value from deepest recursion
      return timedSystemCall($cmd, $timeout, $maxattempts-1, $retry, $origmax);
    } else {
      # No!  Out of attempts...
      print "(E) timedSystemCall:  Exhausted maximum attempts ($origmax) for command: $cmd\nExiting!\n";
      # Return error code of killed process - will require interpretation by parent
      return $rc;
    }
  } else {
    # No - process completed successfully!  Hooray!!!  Return success code (should be zero).
    return $rc;
  }
}

The intense CPU utilization and lack of STDIN synchronization makes this solution undersirable and arguably a failure. It worked in my particular application, but it may not work in others. Between this issue and the unnecessary CPU utilization, this solution is an academic curiosity, but nothing more.

Solution #2

The second solution is similar to the first, because it depends on the open3 function. However, it directly connects the child’s IO handles to the parents, so that it behaves more like the final, preferred soltuion.

#!/usr/bin/perl -w
use strict 'refs';

use FileHandle;
use IPC::Open3;
use POSIX "sys_wait_h";

sub timedSystemCall {

  local ($cmd, $timeout, $maxattempts, $retry, $origmax) = @_;

  # degenerate into system() call - infinite timeout, if timeout is undefined or negative
  $timeout = 0 unless defined($timeout) && ($timeout > 0);
  # degenerate into system() call - 1 attempt, if max attempts is undefined or negative
  $maxattempts = 1 unless defined($maxattempts) && ($maxattempts > 0);
  $attempt = 1 unless defined($attempt) && ($attempt > 0);
  $origmax = $maxattempts unless defined $origmax;

  local ($rc, $pid, *DUPOUT, *DUPERR, *DUPIN);

  eval {
    local $SIG{ALRM} = sub { die "TIMEOUT" };

    # duplicate stdandard IO handles
    open DUPOUT, ">&STDOUT";
    open DUPERR, ">&STDERR";
    open DUPIN,  "<&STDIN";
    # launch child command, attached directly to standard IO handles
    $pid = open3("<&STDIN", ">&STDOUT", ">&STDERR", $cmd) || die("(E) timedSystemCall: Unable to launch command - $cmd\n$!\n");
    # select primary output, and then disable buffering (activate auto-flush)
    select STDERR; $| = 1;
    select STDOUT; $| = 1;

    # set alarm to go off in "timeout" seconds, and call $SIG{ALRM} at that time
    alarm($timeout);
    # hang (block) until program is finished
    waitpid($pid, 0);

    # program is finished - disable alarm
    alarm(0);
    # grab output of waitpid
    $rc = $?;
    # close child's associated IO handles
    close(STDOUT);
    close(STDERR);
    close(STDIN);
    # restore orig handles
    open STDOUT, ">&DUPOUT";
    open STDERR, ">&DUPERR";
    open STDIN, "<&DUPIN";
  }; # end of eval

  # Did eval exit from an alarm timeout?
  if (($@ =~ "^TIMEOUT") || !defined($rc)) {
    # Yes - kill process
    kill(KILL => $pid) || die "Unable to kill $pid - $!";
    # Collect child's remains
    ($ret = waitpid($pid,0)) || die "Unable to reap child $pid (ret=$ret) - $!";
    # close child's associated IO handles
    close(STDOUT);
    close(STDERR);
    close(STDIN);
    # restore orig handles
    open STDOUT, ">&DUPOUT";
    open STDERR, ">&DUPERR";
    open STDIN, "<&DUPIN";
    # grab exit output of child process
    if ($rc = $?) {
      # exit code is lower byte: shift out exit code, leave top byte in $rc
      my $exit_value = $rc >> 8;
      # killing signal is lower 7-bits of top byte, which was shifted to lower byte
      my $signal_num = $rc & 127;
      # core-dump flag is top bit
      my $dumped_core = $rc & 128;
      # Notify grandparent of obituary
      print "(I) timedSystemCall:  Child $pid obituary: exit=$exit_value, kill_signal=$signal_num, dumped_core=$dumped_core\n";
    }
    # Can we try again?
    if ($maxattempts > 1) {
      # Yes! Increment counter, for print messages
      $retry++;
      print "(W) timedSystemCall:  Command timed-out after $timeout seconds.  Restarting ($retry of $origmax)...\n";
      # Recurse into self, while decrementing number of attempts. Return value from deepest recursion
      return timedSystemCall($cmd, $timeout, $maxattempts-1, $retry, $origmax);
    } else {
      # No!  Out of attempts...
      print "(E) timedSystemCall:  Exhausted maximum attempts ($origmax) for command: $cmd\nExiting!\n";
      # Return error code of killed process - will require interpretation by parent
      return $rc;
    }
  } else {
    # No - process completed successfully!  Hooray!!!  Return success code (should be zero).
    return $rc;
  }
}

exit timedSystemCall("inf.pl", 5, 3);

The advantage of this solution is that the standard IO handles (STDOUT, STDERR, STDIN) are directly connected to the child. The parent does not have to poll the child with non-blocking reads, and dump that output to the parent’s IO. So, this solution is somewhat simpler. Plus, it does not consume excess CPU while in the tight polling loop.

The other interesting thing about this solution is that the standard IO handles must be duplicated, or saved before they are fed to the child process. Otherwise, when the child process is killed, the standard IO handles will be automatically closed. Any restarted children will not be able to duplicate them, so the open3 command fails. But, what is worse, the parent cannot communicate to the outside world to communicate the cause of the error. This brings sudden and silent death to the parent. However, if the standard IO handles are first saved, then they can be restored after the child is killed, hence the duplicate (“DUP”) IO handles. The above solution is a good example of this technique.

The hint for this technique came from pg. 193, of O’Reilly’s, Programming Perl, under the open function explanation. Another hint came from the middle of pg. 568 of O’Reilly’s, Perl Cookbook, under discussion on “16.8. Controlling Input and Output of Another Program“.

Further Thoughts

Here are a few links for further reading on the topic of busting out of a long Perl operation:

http://www.mail-archive.com/beginners@perl.org/msg81677.html
http://coding.derkeiler.com/Archive/Perl/comp.lang.perl.misc/2003-10/0422.html

Could there be a better way? Could you improve on my final solution? I am both an optomist and an optomizerist (sic), so if you can improve my solution, let me know! 🙂

Share