Category Archives: bash

running parallel bash tasks on OS X

How often have you needed to process a huge number of small files, where each single task uses only a small amount of CPU and memory? Today I needed a script which does exactly this.

I have a MySQL table which contains the filenames of files located on my hard drive.
I created a little script which processes a single file in under 3 seconds. Unfortunately, for 10,000+ files this would take more than 8 hours.
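(For the examples below, processFile.sh is only a stand-in; the real script does the actual work, and all that matters here is that it takes a filename and needs a couple of seconds per file. A minimal sketch:)

#!/bin/bash
# processFile.sh - placeholder for the real per-file work
file="$1"
md5 "$file" >> checksums.txt   # stand-in for whatever the real script does to a file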

So what if I could run them in parallel, with a maximum of 10 tasks executing at once? That would really speed up the computation!

Luckily, back in 2005 Ole Tange (the author of GNU parallel) merged the command line tools xxargs and parallel into a single tool, ‘parallel‘.
With this great tool there is no need to write a complicated script to accomplish such tasks.
First you need to install it using homebrew.

brew install parallel

After that I had to add the path to my .profile:

PATH=$PATH:/usr/local/Cellar/parallel/20110822/bin
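A quick check that the new PATH entry is picked up (this should print the installed version):

 $> parallel --version | head -n 1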

Here’s the basic usage:

 $> echo -ne "1\n2\n3\n" | parallel -j2 "echo the number is {.}"

This echoes the numbers 1, 2 and 3 to stdout, with a maximum of 2 echo processes running in parallel.
Here’s the output:

the number is 1
the number is 3
the number is 2

As you can see, printing a 3 can outspeed printing a 2 ;)
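By the way, if the order of the output matters to you, GNU parallel has a -k (--keep-order) option that buffers the output and prints it in input order:

 $> echo -ne "1\n2\n3\n" | parallel -k -j2 "echo the number is {.}"
the number is 1
the number is 2
the number is 3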

So here is my one-liner to process all my files:

 $> mysql -uroot -p[secretPW] my_database \
      < <(echo "SELECT filename FROM files") \
      | grep -v 'filename' | parallel -j10 "./processFile.sh {.}"

Using this, it took only 37 minutes to process my 10,000+ files :)


iTunes Sharing over ssh

Today I realized that I didn’t have a single song on my notebook’s hard disk. Thanks to Last.FM :)
Unfortunately “Simplify Media” has been acquired by Google Inc. and they don’t offer a similar service yet. So I needed a solution to stream my iTunes library from home to the office. I found a great solution by “Robert Harder” which works like a charm. (source).
This is the bash script:

#!/bin/sh
# advertise a local DAAP (iTunes sharing) service via Bonjour
dns-sd -P "Home iTunes" _daap._tcp local 3689 localhost.local. \
127.0.0.1 "Arbitrary text record" &
PID=$!
# tunnel the iTunes sharing port (3689) to the machine at home
ssh -C -N -L 3689:localhost:3689 username@dyndns_name.dyndns.org
kill $PID
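Saved as e.g. itunes_tunnel.sh (any name works), it just needs to be made executable and run; the library from home then shows up under ‘Shared’ in iTunes:

chmod +x itunes_tunnel.sh
./itunes_tunnel.sh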

test proxy speed with bash and wget

I needed to test the speed of some proxy servers. Here’s a little script showing how I achieved this.
I have a text file ‘proxy.list’ which looks like this (I masked out the last two octets of each IP):

...[lots of IPs]...
193.196.*.*:3124       Germany
143.205.*.*:3127      Austria
64.161.*.*:3128         United States
.....

Here is the script which runs through the whole proxy list and downloads 5 test pages from a specific site through each proxy. It then measures the time needed for the downloads and creates/appends to a file ‘time.list’, which contains the information needed to determine the best proxies. You also need to create a subdirectory called ‘raw_proxy’ where the raw HTML retrieved through the proxies is saved. The files are named ‘raw_proxy/$ip.$port.$i.tmp’, where $i is the i-th test page downloaded. I need to keep those files to check whether the proxy sent me the right file or e.g. a login page.
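Before the first run, create the directory the script expects:

mkdir -p raw_proxy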

#!/bin/bash
size=$(cat proxy.list | wc -l)
while read proxy
do
    #determine the first parameter (IP:Port)
    ad=$(echo $proxy | awk '{print $1}')
    ip=${ad%:*}   #extract ip
    port=${ad#*:} #extract port
    #set and export the proxy settings
    http_proxy=$ip:$port && HTTP_PROXY=$http_proxy && export http_proxy HTTP_PROXY
    #save start timestamp
    start=$(date +%s)
    #download 5 pages (yes, I know 'seq', but I'm on a Mac and needed something quick & dirty)
    for i in $(echo "1 2 3 4 5")
    do
        #use wget to retrieve the page. We try only once, set some specific timeouts and force a Mozilla user agent to hide that we are using wget.
    	wget -O "raw_proxy/$ip.$port.$i.tmp" --tries=1 --dns-timeout=10 --connect-timeout=8 --read-timeout=15 -U "Mozilla/5.0 (compatible; Konqueror/3.2; Linux)" "http://www.yourTestPage.com/$i.txt" &> /dev/null
    done
    #save end timestamp
    end=$(date +%s)
    #calculate the difference
    diff=$(( end - start ))
    #append this info to time.list
    echo -e "$ip:$port\t$diff" >> time.list
    #for a nice and shiny output I use figlet; this is optional: if you don't want it, comment out the next 3 lines or just remove ' | figlet'
    clear
    echo "PC: #"$size" - "$diff"s" | figlet
    sleep 1
    size=$(( size-1 ))
done < proxy.list
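figlet is not installed by default on OS X, but homebrew has it:

brew install figlet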

If you used figlet your output looks like this:

 ____   ____       _  _    __  _____ _           ____   ___      
|  _ \ / ___|_   _| || |_ / /_|___ // |         |___ \ / _ \ ___ 
| |_) | |   (_) |_  ..  _| '_ \ |_ \| |  _____    __) | | | / __|
|  __/| |___ _  |_      _| (_) |__) | | |_____|  / __/| |_| \__ \
|_|    \____(_)   |_||_|  \___/____/|_|         |_____|\___/|___/

It shows how many proxies still need to be checked and the last execution time.

After the script has finished you need a list of which proxies were best.
This is the command line which evaluates everything and gives me back a list of IPs sorted by access time. It also removes all proxies for which the downloaded page had a size of 0B.

#command line to list proxy with lowest time to download
clear && while read line; do ip=${line% *};time=$(echo $line | awk '{print $2}');ip=${ip%:*};echo -e $ip"\t"$time"\t"$(ls -alshr raw_proxy/ | grep 1.tmp | grep $ip | awk '{print $6}'); done < <(tr -s ' ' < time.list | sort -n -r -k2 | cut -d' ' -f3) | grep -v "0B"

This is the output:

201.22.*.*	43	52K
196.213.*.*	43	13K
....
147.102.*.*	1	2,1K
132.227.*.*	1	2,1K
....
130.75.*.*	1	52K

If you know the expected file size, you can append a

 | grep "52K"

to the last command to show only files which have the right size.
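The same evaluation written out a bit more readably (a sketch of the same logic; it assumes the file layout from above and the "0B" size output of BSD ls -h):

#!/bin/bash
# readable take on the evaluation one-liner above
sort -rn -k2 time.list | while read addr secs; do
    ip=${addr%:*}                                   # drop the port
    # size of the first test page this proxy delivered
    size=$(ls -lh raw_proxy/$ip.*.1.tmp 2>/dev/null | awk '{print $5}')
    [ "$size" = "0B" ] && continue                  # skip empty downloads
    echo -e "$ip\t$secs\t$size"
done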
This is it ;)

I know there are better and faster implementations out there, but it was fun.

awk and word frequencies

I have a list that I use for making gnuplots. The structure is the following:

wordId wordWeight

e.g. 101 34.342 = the word with id 101 has a calculated weight of 34.342.

The bad thing is that this list is unordered. Now I want to get the greatest weight and its corresponding word id. Bash doesn’t seem to be the best solution for this, so I made up my first awk script.

Here it is; it prints the word with the greatest weight from the file toPlot.stats.

awk 'BEGIN{ max=0; w=-1 } { if ($2 >= max) { max=$2; w=$1 } } END{ print "id", w, "has greatest weight of", max }' toPlot.stats

and the output:

id 1545 has greatest weight of 28199.40186090438

My file has 1,722,913 lines; execution time:

real    0m1.618s
user    0m1.576s
sys     0m0.032s
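If you want the whole list ordered instead of just the single maximum, a plain sort works too (head is just to keep the output short):

sort -nr -k2,2 toPlot.stats | head -n 5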

obfuscated statistic script

If you read “sql, statistics, bash and some gnuplot”, here is an obfuscated-looking version of that script ;)

cB()
{ s=$2; a=0; while read L; do x=$(awk '{print $1}' \
<(echo $L)); [ $x -gt $s ] 2>/dev/null && [ $x -le \
$(( $s+$3 )) ] 2>/dev/null && a=$(( $a+1 )); [ $x -\
gt $(( $s + $(( 2*$3 )) )) ] 2>/dev/null && t=$(( $\
s+(2*$3) )) && diff=$(( $x-$t ))&& m=$(( 1+($diff/$\
3) )) && echo -e $s"\t"$a && a=1 && s=$(( $s+$3 )) \
&& for i in $(seq 1 $m); do [ $x -gt $s ] 2>/dev/nu\
ll && [ $x -le $(( $s+$3 )) ] 2>/dev/null && a=$(( \
$a+1 )); echo -e $s"\t0" && s=$(( $s+$3 )) && a=1; \
done; [ $x -gt $(( $s+$3 )) ] 2>/dev/null && echo -\
e $s"\t"$a && a=0 && s=$(( $s+$3 )) && [ $x -gt $s \
] 2>/dev/null && [ $x -le $(( $s+$3 )) ] && a=$(( $\
a+1 )); done < $1; echo -e $s"\t"$a; }

bash screen auto reattach

Automatically check at login whether a screen session is already attached, and reattach (or start) one if not.

cb0@home:~/$cat >> ~/.bashrc << EOF
if [ \$SSH_TTY ] && [ ! \$WINDOW ]; then #comment out for local usage
  SCREENLIST=\`screen -ls | grep 'Att'\`
  if (( ! \$? )); then
    echo -e "Screen running and attached:\n \${SCREENLIST}"
  else
    screen -U -R
  fi
fi #comment out for local usage
EOF

[:edit:]

The script above only works when you connect through ssh. If you log in from a real terminal, $SSH_TTY and $WINDOW won’t be set.
Comment out the lines marked ‘#comment out for local usage’ if you like screen as much as I do.
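For reference, this is what is left in .bashrc for purely local use once those two marked lines are gone:

SCREENLIST=`screen -ls | grep 'Att'`
if (( ! $? )); then
  echo -e "Screen running and attached:\n ${SCREENLIST}"
else
  screen -U -R
fi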