Building Histograms

Goals

  • Write in a functional style
  • Use higher order functions
  • Include math, stats, and finance
  • Have fun

Overview

  • Histograms
  • Factoring the Algorithm
  • Histogram Implementation
  • Sample Plots
  • Further Reading

Histograms

  • Graphical presentation of a data set’s distribution
  • More descriptive than summary statistics
  • Chart granularity critically depends on the number of bins

There have been many attempts over the past 80 years to compute the optimal number of bins given a specific dataset. There are also many different ways to plot the histogram data. If we factor the code properly, we can compose a custom histogram function with exactly the properties we want.

\l qtips.q
q).util.use `.hist
`.
q)myhist:chart[bar"*";30] count each bgroup[sturges]@
  • Uniform Distribution

    q)myhist 100?1f
    0.01976295| 11 "***********                   "
    0.1414961 | 10 "**********                    "
    0.2632293 | 14 "**************                "
    0.3849625 | 13 "*************                 "
    0.5066957 | 14 "**************                "
    0.6284289 | 15 "***************               "
    0.7501621 | 16 "****************              "
    0.8718953 | 6  "******                        "
    0.9936284 | 1  "*                             "
    
  • Normal Distribution

    q)myhist .stat.bm 100?1f
    -2.013341 | 4  "****                          "
    -1.443895 | 11 "***********                   "
    -0.8744492| 22 "**********************        "
    -0.3050036| 26 "**************************    "
    0.2644421 | 17 "*****************             "
    0.8338878 | 12 "************                  "
    1.403333  | 6  "******                        "
    1.972779  | 1  "*                             "
    2.542225  | 1  "*                             "
    

Factoring the Algorithm

  • Histogram Generator

    / create range of n buckets between (s)tart and (e)nd
    nrng:{[n;s;e]s+til[1+n]*(e-s)%n}
        
    / group data by a (b)inning (f)unction
    bgroup:{[bf;x]
     b:nrng[bf x;min x;max x];
     g:group b bin x;
     g:b!x g til count b;
     g}
        
    / use (p)lotting (f)unction to chart (d)ata with max (w)idth
    chart:{[pf;w;d]
     n:"j"$(m&w)*n%m:max n:value d;
     d:d,'enlist each pf[w] each n;
     d}
    

Histogram Implementation

  • Binning Algorithms

    / square root bucket algorithm
    sqrtn:{ceiling sqrt count x}
        
    / sturges' bucket algorithm
    sturges:{ceiling 1f+2f xlog count x}
        
    / doane's bucket algorithm
    doane:{ceiling 1f+(2f xlog count x)+2f xlog 1f+abs nskew x}
        
    / scott's windowing algorithm
    scott:{nw[;x] 3.4908*sdev[x]*count[x] xexp -1f%3f}
        
    /freedman-diaconis windowing algorithm
    fd:{nw[;x] 2f*.stat.iqr[x]*count[x] xexp -1f%3f}
    
  • sqrtn - simplest algorithm (used by Excel)
  • sturges - (1926) assumes data is a normally distributed
  • doane - (1976) modified sturges for skewed data - skew
  • scott - (1979) mathematically rigorous because it uses stdev
  • fd - (1981) modified scott for skewed data - iqr

  • Plotting Algorithms

    / bar-chart plotting function
    / (c)haracter, (w)indow size, (n)umber of points
    bar:{[c;w;n]w$n#c}
        
    / dot-chart plotting function
    / (c)haracter, (w)indow size, (n)umber of points
    dot:{[c;w;n]w$neg[n]$1#c}
        
    / use (p)lotting (f)unction to chart (d)ata with max (w)idth
    chart:{[pf;w;d]
     n:"j"$(m&w)*n%m:max n:value d;
     d:d,'enlist each pf[w] each n;
     d}
    
  • bar - generates a line of characters
  • dot - generates a single character
  • chart - generates a line of text for each bin

Sample Plots

  • Basic Strurges Bar Chart

    q)chart[bar"*";30] count each bgroup[sturges] x:exp .stat.bm 100?1f
    0.08791095| 66 "******************************"
    1.448485  | 19 "*********                     "
    2.80906   | 10 "*****                         "
    4.169634  | 1  "                              "
    5.530209  | 2  "*                             "
    6.890783  | 1  "                              "
    8.251358  | 0  "                              "
    9.611932  | 0  "                              "
    10.97251  | 1  "                              "
    
  • Robust Freedman-Diaconis Dot Chart

    q)chart[bar"*";30] count each bgroup[fd] x
    0.08791095| 32 "******************************"
    0.7281813 | 29 "***************************   "
    1.368452  | 11 "**********                    "
    2.008722  | 13 "************                  "
    2.648992  | 7  "*******                       "
    3.289263  | 2  "**                            "
    3.929533  | 1  "*                             "
    4.569803  | 0  "                              "
    5.210073  | 3  "***                           "
    5.850344  | 0  "                              "
    6.490614  | 0  "                              "
    7.130884  | 1  "*                             "
    7.771155  | 0  "                              "
    8.411425  | 0  "                              "
    9.051695  | 0  "                              "
    9.691966  | 0  "                              "
    10.33224  | 0  "                              "
    10.97251  | 1  "*                             "
    
  • Scott Dot Chart with “@”

    q)chart[dot"@";30] count each bgroup[scott] x
    0.08791095| 59 "                             @"
    1.29731   | 26 "            @                 "
    2.50671   | 9  "    @                         "
    3.716109  | 1  "@                             "
    4.925509  | 3  " @                            "
    6.134908  | 0  "                              "
    7.344308  | 1  "@                             "
    8.553707  | 0  "                              "
    9.763107  | 0  "                              "
    10.97251  | 1  "@                             "
    

Further Reading

  • Hyndman, R.J. (1995) The problem with Sturges’ rule for constructing histograms, Technical report, Monash University.

  • Wand, M.P. (1995) Data-based choice of histogram bin-width. Technical report, Australian Graduate School of Management, University of NSW.

Hyndman dissects each of the binning functions and explains the deficiencies of each.

Wand proposes what seems like the ‘grand unified theory’ of optimal histogram bin-width computation. The algorithm is much more complex than what we’ve encountered so far (includes the use fft), but matches Scott’s rule under first order approximations.