I recently had a need to add some error checking to a bash script that runs multiple copies of a Perl script in parallel to better utilize a multi-core server. I wanted a way to run these four processes in the background and gather up their exit values. Then, if any of them failed, I'd prematurely exit the bash script and report the error.
After a bit of reading bash docs, I came across some built-ins that I hadn't previously used or even seen. First, I'll show you the code:
This is the bash script that runs the parallel processes and gathers up the exit values.
And here's the Perl script that I wrote in order to test the functioning of wait.sh. It accepts two arguments. The first is the number of seconds to sleep (to simulate the delay associated with doing work) and the second is the exit value it should use (any non-zero value indicates a failure).
New to me was the use of let to do math on a variable so that I can count up the number of failures. Is there a better way? As far as I could tell, there's no native ++ operator in bash. Similarly, using jobs to get a list of pids to wait on proved to be a very useful idiom.
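For readers who can't see the embedded gist, here is a minimal sketch of the pattern described above. It is not the original wait.sh: the sleeper shell function is a hypothetical stand-in for the Perl test script, and the sleep times and messages are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the Perl test script:
# sleep for $1 seconds, then finish with exit value $2.
sleeper() { sleep "$1"; return "$2"; }

FAIL=0

# Start four jobs in the background; the second one is set up to fail.
sleeper 1 0 &
sleeper 1 1 &
sleeper 2 0 &
sleeper 1 0 &

# jobs -p prints the pid of each background job.
for job in $(jobs -p); do
    # wait returns the exit value of the job it waited on.
    wait "$job" || let "FAIL += 1"
done

if [ "$FAIL" -eq 0 ]; then
    echo "all jobs succeeded"
else
    echo "failed jobs: $FAIL"
fi
```

In a real script the final branch would presumably exit with a non-zero status instead of just printing the count.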
The code is straightforward and works for my purposes. But since 99% of my time is spent in Perl rather than bash, I wonder what I could have done differently and/or better. Feedback welcome.
And, if this is at all useful to you, feel free to take it and run...
Finally, I'm starting to really dig gist.github for showing off bits of code. It's good stuff.
Posted by jzawodn at November 21, 2008 07:21 AM
Yes, the 'let' built-in is the best way to do this in bash. It would be kind of silly to spawn another process just to do i++.
If you're going to use quotes around the let expression you might as well use whitespace around the operator :)
You don't need to use "jobs" to get the pids, since you can save $! after each spawn. That way you won't inadvertently end up waiting on any other background job you may have going at the time (such as updatedb or the like), because you only collect the relevant pids:
./sleeper 2 0 &
./sleeper 2 1 &
./sleeper 3 0 &
./sleeper 2 0 &
for job in $pidlist
I have used that idiom successfully for some years now.
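The snippet above appears to have lost a few lines in transit, so here is a rough sketch of how the $! idiom might look in full. The pidlist variable name comes from the comment; the sleeper function and the FAIL counter are assumptions added for illustration.

```shell
# Hypothetical stand-in for ./sleeper: sleep $1 seconds, exit with $2.
sleeper() { sleep "$1"; return "$2"; }

pidlist=""
FAIL=0

# $! holds the pid of the most recently started background command,
# so it is recorded right after each spawn.
sleeper 1 0 & pidlist="$pidlist $!"
sleeper 1 1 & pidlist="$pidlist $!"
sleeper 2 0 & pidlist="$pidlist $!"
sleeper 1 0 & pidlist="$pidlist $!"

# Only the pids collected above are waited on.
for job in $pidlist; do
    wait "$job" || FAIL=$((FAIL + 1))
done

echo "failed jobs: $FAIL"
```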
At first I didn't know whether the fail count you use would work as expected. By that I mean that if any job ends successfully while you are in a wait state on another job, wait might not find that pid when it comes to look for it, and you wouldn't get its successful exit code.
I tested this on OpenBSD and found that even after a job has finished, wait will continue to report its exit code within the same session.
false & # will return 1
#more stuff happens in the interim
echo $? # returns 1
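A slightly fuller sketch of that experiment, assuming the missing step is a wait on the saved pid (the pid and status variable names are made up here):

```shell
false &              # background job that exits with status 1 almost at once
pid=$!

sleep 1              # other work happens; the job has finished by now

# wait still reports the exit status of the already-finished child.
status=0
wait "$pid" || status=$?
echo "job $pid exited with status $status"
```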
Pfft, bash! Using any type of shell glue around Perl scripts is a huge mistake: Sooner or later you're gonna run out of options because you can't implement arbitrarily complex programs with a shell's limited capabilities.
Use POE::Wheel::Run to spawn your processes and run your jobs in POE in Perl.
The other shell operator you might want to play with is arithmetic expansion:
echo "i: $i"
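As a small sketch of arithmetic expansion used as a counter (the list of status values is invented for the example):

```shell
i=0
for status in 0 1 0 2; do
    if [ "$status" -ne 0 ]; then
        # $(( ... )) expands to the value of the arithmetic expression.
        i=$((i + 1))
    fi
done
echo "i: $i"
```

This form is part of POSIX sh, so it should also work in shells that lack let.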
Also, 'help let' shows that there *is* increment/decrement in bash (GNU bash, version 3.2.39(1)-release (x86_64-pc-linux-gnu)):
The levels are listed in order of decreasing precedence.
id++, id-- variable post-increment, post-decrement
++id, --id variable pre-increment, pre-decrement
-, + unary minus, plus
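A quick sketch of those operators in use, assuming bash. One thing to watch for: let and (( )) return a non-zero status when the expression evaluates to 0, which matters if the script runs under set -e.

```shell
i=0

# Post-increment yields the old value (0), so let returns status 1 here
# even though i is now 1; "|| true" keeps that from tripping set -e.
let "i++" || true

# Pre-increment yields the new value (2), so the status is 0.
(( ++i ))

# Compound assignment also works inside (( )).
(( i += 1 ))

echo "i: $i"
```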
The suggestion to stick to Perl is probably well taken, however ;-)
The code snippets from gist look bad in Google Reader. All the lines are on a single line without any formatting at all.
Don't know if that's important to you or not but I thought I'd let you know. On the upside, it did force me to come to your site to look at them.
Weird. I'll check out the reader rendering. Thanks for the heads-up.
It's pretty cool that you're getting your hands this dirty in public. Tinkering is what gets our juices flowing, eh?
The arguments for sticking with Perl notwithstanding, it's nice to keep your bash teeth sharp, because it has such a ridiculously wide install base. Even on the most minimal or out-of-date *nix setup, you'll know you have at least that tool at your disposal.
Once the 'complex need' arrives, it's never too hard to switch over to a more powerful language.
You say you prematurely left the script, but it looks like you wait for all of the subprocesses to finish/fail before you end (presumably skipping some post-processing or other, so in that sense you are ending early).
I'm guessing a premature exit as soon as one of the subprocesses failed would involve a polling loop.
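One possible shape for such a polling loop, as a rough sketch rather than a tested production script. It assumes bash (which reaps finished children on its own), and the job durations, one-second interval, and variable names are all made up.

```shell
# Three background jobs; the second fails after one second.
sleep 3 &               pids="$!"
( sleep 1; exit 1 ) &   pids="$pids $!"
sleep 3 &               pids="$pids $!"

failed=0
while [ -n "$pids" ] && [ "$failed" -eq 0 ]; do
    sleep 1
    remaining=""
    for pid in $pids; do
        if kill -0 "$pid" 2>/dev/null; then
            remaining="$remaining $pid"    # still running
        else
            wait "$pid" || failed=1        # finished: collect its exit status
        fi
    done
    pids=$remaining
done

if [ "$failed" -ne 0 ]; then
    echo "a job failed; killing the rest:$pids"
    [ -n "$pids" ] && kill $pids 2>/dev/null
fi
```

The trade-off is latency: a failure is noticed at most one polling interval after it happens.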
I know this post is old, but I just stumbled onto it via google and couldn't help but comment.
As the last commenter said, your script shouldn't do what you want, Jeremy. I'm wondering if I'm missing something here about your goal, because your example doesn't make sense to me at all if you want wait.sh to exit prematurely on any failure of a subprocess. If you were to change the numbers for wait.sh to be something like
./sleeper 9 0 &
./sleeper 2 1 &
./sleeper 3 0 &
./sleeper 19 0 &
It would be crystal-clear that your launcher-script isn't exiting when subprocesses die at all. The for loop will be paused on the first iteration (on the first wait command) and won't continue until the first sleeper quits. Then it will get stuck on the last sleeper as well. ....
Here's what I whipped up to do the same kind of thing (used bash subshells to keep it simpler):
( sleep 15; exit )&
( sleep 30; exit )&
( sleep 5; exit )& echo "$! is the 5sec process"
( sleep 30; exit )&
#(while true; do sleep 1d; done)&
echo pids spawned: $pids
while [[ `jobs -p` == $pids ]] ; do
echo killing pids: `jobs -p`
kill `jobs -p`
So. All works well. Five seconds after starting my "launcher" script, the script exits, killing the remaining BG jobs. However, there's something odd here that I don't understand and am curious to get any insight on.
In my script, if you change the fourth subshell to oh.. let's say, "sleep 2" instead of "sleep 30" and then run the script again, it will not exit until after FIVE seconds have passed. What you will see is that using my method, if the LAST bg process dies, it will basically be ignored. I don't quite understand why [ `jobs -p` = $pids ] is still matching, even though the output of `jobs -p` is missing trailing data....
I decided to avoid that trap by simply throwing in one more unnecessary background command that absolutely will not fail. Uncomment that while true loop line I've got up there, and you'll see all will work as expected.
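A different way to sidestep comparing `jobs -p` output altogether is wait -n, which returns as soon as any one background job exits. This is only a sketch and assumes bash 4.3 or newer; it does not explain the behavior described above. The job durations and variable names are invented.

```shell
#!/usr/bin/env bash
sleep 3 &
( sleep 1; exit 1 ) &
sleep 3 &

running=3
status=0
while [ "$running" -gt 0 ]; do
    # wait -n (bash 4.3+) blocks until the next job finishes
    # and returns that job's exit status.
    if wait -n; then
        running=$((running - 1))
    else
        status=$?
        break
    fi
done

if [ "$status" -ne 0 ]; then
    echo "a job exited with status $status; killing the rest"
    kill $(jobs -p) 2>/dev/null || true
fi
```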
GNU Parallel http://www.gnu.org/software/parallel/ is a general tool for running shell scripts in parallel.
The original example can be rewritten as:
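(echo 2 0; echo 2 1; echo 3 0; echo 2 0) | parallel --colsep ' ' ./sleeper {1} {2}  # split each line into two arguments
FAIL=$?  # save the status before the test below overwrites $?
if [ "$FAIL" != "0" ]; then
echo "FAIL! ($FAIL)"
fi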