Building Histograms

June 2, 2015

Goals

Write in a functional style
Use higher order functions
Include math, stats, and finance
Have fun

Overview

Histograms
Factoring the Algorithm
Histogram Implementation
Sample Plots
Further Reading

Histograms

Graphical presentation of a data set’s distribution
More descriptive than summary statistics
Chart granularity critically depends on the number of bins

Since Sturges' rule appeared in 1926, there have been many attempts to compute the optimal number of bins for a given dataset. There are also many different ways to plot the histogram data. If we factor the code properly, we can compose a custom histogram function with exactly the properties we want.

\l qtips.q
q).util.use `.hist
`.
q)myhist:chart[bar"*";30] count each bgroup[sturges]@

Uniform Distribution

q)myhist 100?1f
0.01976295| 11 "***********                   "
0.1414961 | 10 "**********                    "
0.2632293 | 14 "**************                "
0.3849625 | 13 "*************                 "
0.5066957 | 14 "**************                "
0.6284289 | 15 "***************               "
0.7501621 | 16 "****************              "
0.8718953 | 6  "******                        "
0.9936284 | 1  "*                             "

Normal Distribution

q)myhist .stat.bm 100?1f
-2.013341 | 4  "****                          "
-1.443895 | 11 "***********                   "
-0.8744492| 22 "**********************        "
-0.3050036| 26 "**************************    "
0.2644421 | 17 "*****************             "
0.8338878 | 12 "************                  "
1.403333  | 6  "******                        "
1.972779  | 1  "*                             "
2.542225  | 1  "*                             "

Factoring the Algorithm

Histogram Generator

/ create range of n buckets between (s)tart and (e)nd
nrng:{[n;s;e]s+til[1+n]*(e-s)%n}
    
/ group data by a (b)inning (f)unction
bgroup:{[bf;x]
 b:nrng[bf x;min x;max x];
 g:group b bin x;
 g:b!x g til count b;
 g}
    
/ use (p)lotting (f)unction to chart (d)ata with max (w)idth
chart:{[pf;w;d]
 n:"j"$(m&w)*n%m:max n:value d;
 d:d,'enlist each pf[w] each n;
 d}
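
To make the bucketing step concrete, here is a small sketch that reuses nrng as defined above on made-up values (the inputs are illustrative only). It shows that nrng returns n+1 points and that bin maps each value to the last point not greater than it.

```q
/ nrng as defined above: n+1 evenly spaced points from (s)tart to (e)nd
nrng:{[n;s;e]s+til[1+n]*(e-s)%n}
b:nrng[4;0f;1f]          / 0 0.25 0.5 0.75 1
/ bin returns the index of the last point <= each value
b bin 0.1 0.3 0.5 0.99 1 / 0 1 2 3 4
```

Because the maximum of the data equals the last point, it lands in a bucket of its own. This appears to be why the charts show one more row than the number of bins requested, with the final row usually holding a single observation.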

Histogram Implementation

Binning Algorithms

/ square root bucket algorithm
sqrtn:{ceiling sqrt count x}
    
/ sturges' bucket algorithm
sturges:{ceiling 1f+2f xlog count x}
    
/ doane's bucket algorithm
doane:{ceiling 1f+(2f xlog count x)+2f xlog 1f+abs nskew x}
    
/ scott's windowing algorithm
scott:{nw[;x] 3.4908*sdev[x]*count[x] xexp -1f%3f}
    
/ freedman-diaconis windowing algorithm
fd:{nw[;x] 2f*.stat.iqr[x]*count[x] xexp -1f%3f}
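
Both windowing rules call nw, whose definition is not shown in this listing. A minimal sketch follows, assuming nw converts a window width into the number of windows needed to span the data. This is a hypothetical stand-in, and the definition in qtips.q may round differently.

```q
/ hypothetical stand-in for nw (an assumption, not the qtips definition):
/ number of windows of (w)idth w needed to cover the range of x
nw:{[w;x]ceiling (max[x]-min x)%w}
nw[2f;0 10f]   / 5: a range of 10 split into windows of width 2
nw[3f;0 10f]   / 4: a partial window still counts as one
```

Projecting it as nw[;x] fixes the data and leaves the width open, which is how scott and fd hand over the width they compute.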

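sqrtn - simplest algorithm (used by Excel)
sturges - (1926) assumes the data is normally distributed
doane - (1976) modifies sturges for skewed data by adding a skew term
scott - (1979) derives the bin width from the standard deviation (sdev); minimizes integrated mean squared error for normally distributed data
fd - (1981) modifies scott for skewed data by replacing the standard deviation with the interquartile range (iqr)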
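
As a quick worked example, the two count-based rules can be applied to 100 observations, the sample size used in the plots. The sketch below repeats the definitions of sqrtn and sturges from the listing so it runs on its own.

```q
/ definitions repeated from the listing above
sqrtn:{ceiling sqrt count x}
sturges:{ceiling 1f+2f xlog count x}
x:100?1f
sqrtn x     / 10: the square root of 100
sturges x   / 8: 1 plus log base 2 of 100 is about 7.64, rounded up
```

Both rules depend only on the number of observations, so any 100-element list gives the same answers. Eight Sturges bins correspond to the nine rows in the Sturges charts, since each row is one of the n+1 points returned by nrng.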

Plotting Algorithms

/ bar-chart plotting function
/ (c)haracter, (w)indow size, (n)umber of points
bar:{[c;w;n]w$n#c}
    
/ dot-chart plotting function
/ (c)haracter, (w)indow size, (n)umber of points
dot:{[c;w;n]w$neg[n]$1#c}
    
/ use (p)lotting (f)unction to chart (d)ata with max (w)idth
chart:{[pf;w;d]
 n:"j"$(m&w)*n%m:max n:value d;
 d:d,'enlist each pf[w] each n;
 d}

bar - generates a line of characters
dot - generates a single character
chart - generates a line of text for each bin
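
The difference between the two plotting functions is easiest to see on a single call. This sketch repeats bar and dot from the listing and uses an illustrative width of 10 and count of 3.

```q
/ bar and dot as defined above
bar:{[c;w;n]w$n#c}
dot:{[c;w;n]w$neg[n]$1#c}
bar["*";10;3]   / "***       "  three characters, padded on the right to width 10
dot["@";10;3]   / "  @       "  one character in position 3, padded to width 10
```

Taking the character as the first argument lets a projection such as bar"*" fix it up front, leaving chart to supply the width and the scaled count for each bin.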

Sample Plots

Basic Sturges Bar Chart

q)chart[bar"*";30] count each bgroup[sturges] x:exp .stat.bm 100?1f
0.08791095| 66 "******************************"
1.448485  | 19 "*********                     "
2.80906   | 10 "*****                         "
4.169634  | 1  "                              "
5.530209  | 2  "*                             "
6.890783  | 1  "                              "
8.251358  | 0  "                              "
9.611932  | 0  "                              "
10.97251  | 1  "                              "

Robust Freedman-Diaconis Bar Chart

q)chart[bar"*";30] count each bgroup[fd] x
0.08791095| 32 "******************************"
0.7281813 | 29 "***************************   "
1.368452  | 11 "**********                    "
2.008722  | 13 "************                  "
2.648992  | 7  "*******                       "
3.289263  | 2  "**                            "
3.929533  | 1  "*                             "
4.569803  | 0  "                              "
5.210073  | 3  "***                           "
5.850344  | 0  "                              "
6.490614  | 0  "                              "
7.130884  | 1  "*                             "
7.771155  | 0  "                              "
8.411425  | 0  "                              "
9.051695  | 0  "                              "
9.691966  | 0  "                              "
10.33224  | 0  "                              "
10.97251  | 1  "*                             "

Scott Dot Chart with “@”

q)chart[dot"@";30] count each bgroup[scott] x
0.08791095| 59 "                             @"
1.29731   | 26 "            @                 "
2.50671   | 9  "    @                         "
3.716109  | 1  "@                             "
4.925509  | 3  " @                            "
6.134908  | 0  "                              "
7.344308  | 1  "@                             "
8.553707  | 0  "                              "
9.763107  | 0  "                              "
10.97251  | 1  "@                             "

Nick Psaris
