sql, statistics, bash and some gnulplot

Did you ever face the challenge of making gnuplots from huge amounts of statistical data that comes from sql ?

No ? Ok then you can stop reading right here or you will face the challenge of learning sth. new.
here’s a shot overview of the my problem:

  • there is a table called ‘sources’ in which I save a unique id for a source.
    (a source is a ‘file’ and it’s location)
  • another table ‘instructions’ which saves a unique id for an instruction
    (think of an instruction as a single word or an sequence of words)
  • and finally another table ‘stats’ which contains a crosslink between ‘sources’ and ‘instructions’
    This means which instruction occures in which source and how often dow it occure in it.my question is:
  • How many sources have more than ‘X’ but less or equal than ‘Y’ instructions ?

Hmm sound’s easy, right. Just a single sql statement and we get

mysql> select count(*) from (select sum(stats.instr_count) as is, stats.so_id
                            from sources inner join stats on
                            (sources.so_id=stats.so_id)
                            group by stats.so_id order by ss desc)
       as p where p.ss > 0 and p.ss <= 100;
+----------+
| count(*) |
+----------+
|      369 |
+----------+
1 row in set (0.00 sec)

Good, so we have 369 Documents that contain between 0 and 100 instructions.
Ok, I now wanted to take this query for the rest of my analyse. I was going to make a gnulot that will show
x = size of my sources
y = count of sources

Now I just have to rerun the query from above and replace 0 and 100 with my desired values. My biggest source i knwo contains about 1.600.000 instructions. So I came up with this:

for i in $(seq 0 16000)
do
start=$(( $i * 100 ))
end=$(( $start + 100 ))
echo $start
echo $start" "$(mysql -ucb0 -p$(echo "~/.cb0.pass") testDB < <(echo "select count(*)
     from (select sum(stats.instr_count) as ss, stats.so_id from sources
     inner join stats on (sources.so_id=stats.so_id) group by stats.so_id
     order by ss desc) as p where p.ss >= $start and p.ss < $end;"))
     >> instructionSize.stats
done

This will call sql with the 16.000 different ranges that are possible for this kind of analyse. Ok now just wait and then start gnulot ....

It would be a fine thing BUT one mysql call need 30 seconds. 16.000*30s=480000s=8000min=133h=5,555555555 days
But I don't want to wait 5,5 days for my finsihed stats.

So maybe rebuild the statment so it won't use nested selects or joins?¿  No chance, after a few tries I thought there must be another way.
The first thing that came up my mind was a tiny little bash script that does the following for me:

It uses a tab seperated input file that comes from sql in the format 'int\tint'. Two integer values seperated by a tabulator. The first value is the count of instructions and the second one the source id. I simply get this kind of list by using:

cb0@home:~/mysql -ucb0 -p$(echo "~/.cb0.pass") testDB < <(echo "
            select sum(stats.instr_count) as ss, stats.doc_id from sources
            inner join stats on (sources.so_id=stats.so_id) group by
            stats.doc_id order by ss asc
         ") > sources.stats;

Now I need a bash script which:

  • counts the amount of sources within a specific range
  • can be parameterited to use specific range width (e.g. count between 0..50..100..150....., or 0..100..200..300......)
  • prints out a file that can be processed by gnuplot

Example: Process InputFile, start at 0 and use 10 as range width

InputFile: cb0@home:~/cat test
3 2
10 3
23 1

Outputfile:

0 2
10 0
20 1

Note that even if there is no source with more than 10 butt less equal than 20 instructions there need to be a value 10 in the output file.

After nearly an hour of work I came up with this bash function: (please excuse me the too less commented code)

countBetween(){
    #$1 = file with stats, #$2 start counting at,
    #$3 step count means end = $start + $3
    start=$2; akCount=0
    while read LINE; do
    #whats in the first column
    x=$(awk '{print $1}' <(echo $LINE))
    #increment the range counter if range is ok
    [ $x -gt $start ] 2>/dev/null && [ $x -le $(( $start + $3 )) ] 2>/dev/null &&
    akCount=$(( $akCount + 1 ))
    #check if the value is bigger than the complete range. if true the echo the
    #missing steps between e.g. value = 230, akRange=0 -> means we look at values
    #between 0 and 100 but there seem to be no values so create the entries "0 0"
    #and "100 0" because in these ranges there are no entries
    [ $x -gt $(( $start + $(( 2* $3 )) )) ] 2>/dev/null &&
    tmp=$(( $start + $(( 2 * $3 )) )) && diff=$(( $x - $tmp ))&&
    mult=$(( 1 + ( $diff / $3 ) )) && echo -e $start"\t"$akCount && akCount=1 &&
    start=$(( $start + $3 )) && for i in $(seq 1 $mult); do [ $x -gt $start ]
    2>/dev/null && [ $x -le $(( $start + $3 )) ] 2>/dev/null && akCount=$((
    $akCount + 1 )); echo -e $start"\t0" && start=$(( $start + $3 )) &&
    akCount=1; done
    [ $x -gt $(( $start + $3 )) ] 2>/dev/null && echo -e $start"\t"$akCount &&
    akCount=0 && start=$(( $start + $3 )) && [ $x -gt $start ] 2>/dev/null &&
    [ $x -le $(( $start + $3 )) ] && akCount=$(( $akCount + 1 ))
    done < $1; echo -e $start"\t"$akCount
}

Now let check the result:

cb0@home:~/countBetween test 0 10
0       2
10      0
20      1

Great, lets see what happens when I change the range width from 10 to 1. I should recieve the original file with those positions inserted that don't exist in input file. Let see...

cb0@home:~/countBetween test 0 1
0       0
1       0
2       0
3       1
4       0
5       0
6       0
7       0
8       0
9       0
10      1
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      1

Great, this is what I wanted to have.

Now I only need the a small gnuplot script:

set terminal png
set output "/home/cb0/public_html/XXX.png"
set autoscale
set logscale x
plot '~/docsize.stat' using 1:2 smooth csplines with lines lt -1 title "Source Statistics"

now just call gnuplot and I'm done.

statistics from sources
statistics from sources

And the execution time for my example is now only:

time countBetween test 0 100
real    0m10.937s
user    0m4.828s
sys     0m5.812s

Sweet.....

If you have any question or think I have a error in my script please let me know. And feel free to use this for your tasks.

Leave a Reply