Building Histograms
Goals
- Write in a functional style
- Use higher order functions
- Include math, stats, and finance
- Have fun
Overview
- Histograms
- Factoring the Algorithm
- Histogram Implementation
- Sample Plots
- Further Reading
Histograms
- Graphical presentation of a data set’s distribution
- More descriptive than summary statistics
- Chart granularity critically depends on the number of bins
There have been many attempts over the past 80 years to compute the optimal number of bins given a specific dataset. There are also many different ways to plot the histogram data. If we factor the code properly, we can compose a custom histogram function with exactly the properties we want.
\l qtips.q
q).util.use `.hist
`.
q)myhist:chart[bar"*";30] count each bgroup[sturges]@
-
Uniform Distribution
q)myhist 100?1f 0.01976295| 11 "*********** " 0.1414961 | 10 "********** " 0.2632293 | 14 "************** " 0.3849625 | 13 "************* " 0.5066957 | 14 "************** " 0.6284289 | 15 "*************** " 0.7501621 | 16 "**************** " 0.8718953 | 6 "****** " 0.9936284 | 1 "* "
-
Normal Distribution
q)myhist .stat.bm 100?1f -2.013341 | 4 "**** " -1.443895 | 11 "*********** " -0.8744492| 22 "********************** " -0.3050036| 26 "************************** " 0.2644421 | 17 "***************** " 0.8338878 | 12 "************ " 1.403333 | 6 "****** " 1.972779 | 1 "* " 2.542225 | 1 "* "
Factoring the Algorithm
-
Histogram Generator
/ create range of n buckets between (s)tart and (e)nd nrng:{[n;s;e]s+til[1+n]*(e-s)%n} / group data by a (b)inning (f)unction bgroup:{[bf;x] b:nrng[bf x;min x;max x]; g:group b bin x; g:b!x g til count b; g} / use (p)lotting (f)unction to chart (d)ata with max (w)idth chart:{[pf;w;d] n:"j"$(m&w)*n%m:max n:value d; d:d,'enlist each pf[w] each n; d}
Histogram Implementation
-
Binning Algorithms
/ square root bucket algorithm sqrtn:{ceiling sqrt count x} / sturges' bucket algorithm sturges:{ceiling 1f+2f xlog count x} / doane's bucket algorithm doane:{ceiling 1f+(2f xlog count x)+2f xlog 1f+abs nskew x} / scott's windowing algorithm scott:{nw[;x] 3.4908*sdev[x]*count[x] xexp -1f%3f} /freedman-diaconis windowing algorithm fd:{nw[;x] 2f*.stat.iqr[x]*count[x] xexp -1f%3f}
sqrtn
- simplest algorithm (used by Excel)sturges
- (1926) assumes data is a normally distributeddoane
- (1976) modifiedsturges
for skewed data -skew
scott
- (1979) mathematically rigorous because it usesstdev
-
fd
- (1981) modifiedscott
for skewed data -iqr
-
Plotting Algorithms
/ bar-chart plotting function / (c)haracter, (w)indow size, (n)umber of points bar:{[c;w;n]w$n#c} / dot-chart plotting function / (c)haracter, (w)indow size, (n)umber of points dot:{[c;w;n]w$neg[n]$1#c} / use (p)lotting (f)unction to chart (d)ata with max (w)idth chart:{[pf;w;d] n:"j"$(m&w)*n%m:max n:value d; d:d,'enlist each pf[w] each n; d}
bar
- generates a line of charactersdot
- generates a single characterchart
- generates a line of text for each bin
Sample Plots
-
Basic Strurges Bar Chart
q)chart[bar"*";30] count each bgroup[sturges] x:exp .stat.bm 100?1f 0.08791095| 66 "******************************" 1.448485 | 19 "********* " 2.80906 | 10 "***** " 4.169634 | 1 " " 5.530209 | 2 "* " 6.890783 | 1 " " 8.251358 | 0 " " 9.611932 | 0 " " 10.97251 | 1 " "
-
Robust Freedman-Diaconis Dot Chart
q)chart[bar"*";30] count each bgroup[fd] x 0.08791095| 32 "******************************" 0.7281813 | 29 "*************************** " 1.368452 | 11 "********** " 2.008722 | 13 "************ " 2.648992 | 7 "******* " 3.289263 | 2 "** " 3.929533 | 1 "* " 4.569803 | 0 " " 5.210073 | 3 "*** " 5.850344 | 0 " " 6.490614 | 0 " " 7.130884 | 1 "* " 7.771155 | 0 " " 8.411425 | 0 " " 9.051695 | 0 " " 9.691966 | 0 " " 10.33224 | 0 " " 10.97251 | 1 "* "
-
Scott Dot Chart with “@”
q)chart[dot"@";30] count each bgroup[scott] x 0.08791095| 59 " @" 1.29731 | 26 " @ " 2.50671 | 9 " @ " 3.716109 | 1 "@ " 4.925509 | 3 " @ " 6.134908 | 0 " " 7.344308 | 1 "@ " 8.553707 | 0 " " 9.763107 | 0 " " 10.97251 | 1 "@ "
Further Reading
-
Hyndman, R.J. (1995) The problem with Sturges’ rule for constructing histograms, Technical report, Monash University.
-
Wand, M.P. (1995) Data-based choice of histogram bin-width. Technical report, Australian Graduate School of Management, University of NSW.
Hyndman dissects each of the binning functions and explains the deficiencies of each.
Wand proposes what seems like the ‘grand unified theory’ of optimal
histogram bin-width computation. The algorithm is much more complex
than what we’ve encountered so far (includes the use fft
), but
matches Scott’s rule under first order approximations.