Goals

  • Introduce you to Q and KDB+
  • Amaze you with 12 machine-learning algorithms
  • Impress you with efficient code
  • Dazzle you with ASCII plots
  • Motivate you to learn more about ML and Q
  • Cheer you up with a fun quote

Q Fundamentals

What is Q?

  • The language behind the world’s fastest time series database
  • Language Paradigms:
    • Interpreted
    • Dynamically Typed
    • Vectorized
    • Functional
    • Event-driven
  • Think Python but with native support for:
    • SQL
    • Parallelization
    • Large Tables (> 1 Billion Rows)
    • Distributed Computing

Why do I like Q?

  • Personally manage extremely large datasets with a single machine
    • Databases are directories
    • Partitions are sub-directories
    • Tables are sub-directories within partitions
    • Columns are files within table directories
  • Efficiently joins disparate datasets, making complex analysis easy
  • Vectorized computations reduce the interpreter overhead
  • Vectors, matrices, dictionaries and tables are all lists
  • Notation works intuitively across types
  • Code-injection allows iterative development
  • Q-SQL permits all q operators and even user-defined functions (see the sketch below)
  • Null and infinite values are available for integer and temporal types
  • Functional paradigm promotes parallelism by separating functions from data
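  • A minimal sketch of the q-SQL and null-handling points above (table, column and function names are illustrative):

    q)t:([]sym:`a`b`a`b`a;qty:1 2 0N 4 5)              / 0N is an integer null
    q)rng:{max[x]-min x}                                / user-defined aggregation
    q)select total:sum qty,range:rng qty by sym from t  / aggregates ignore nulls; rng runs inside q-SQL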

Fun Q Overview

Where does Fun Q Fit in?

  • The original Coursera machine-learning course was taught in Matlab/Octave by Andrew Ng
  • Why can’t q be used for the same problems?
  • It turns out that many machine-learning algorithms have beautifully vectorized solutions in q
  • 5 years of continuous refactoring revealed common functional paradigms in machine learning
  • Small 1–5 line functions are the perfect tool for teaching
  • 12 distinct machine-learning algorithm families are taught using nothing but the 32-bit ‘personal’ edition or 64-bit ‘on-demand’ edition of the q language
  • Limiting all graphics to the terminal adds a bit of fun to the journey

Algorithms

  • K-Nearest Neighbors (KNN)
  • K-Means/Medians/Medoids Clustering
  • Hierarchical Agglomerative Clustering (HAC)
  • Expectation Maximization (EM)
  • Naive Bayes
  • Decision Tree (ID3, C4.5, CART)
  • Discrete Adaptive Boosting (AdaBoost)
  • Boosted Aggregating (BAG) and Random Forest
  • Linear Regression
  • Logistic Regression and One-vs.-All
  • Neural Network Classification/Regression
  • Content-Based/Collaborative Filtering (Recommender Systems)
  • Google PageRank

K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) Algorithm

  • Predictions are made by aggregating the K nearest neighbors
  • ‘Nearness’ requires the definition of distance metrics

    mnorm:sum abs::                           / Manhattan (taxicab) norm
    enorm2:{x wsum x}                         / Euclidean norm squared
    enorm:sqrt enorm2::                       / Euclidean norm
        
    hdist:sum (<>)::                / Hamming distance
    mdist:mnorm (-)::               / Manhattan (taxicab) distance
    edist2:enorm2 (-)::             / Euclidean distance squared
    edist:enorm (-)::               / Euclidean distance
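
  • For example, applying these metrics to a pair of vectors (expected values shown as comments):

    q)x:1 2 3f; y:4 6 8f
    q).ml.mdist[x;y]                / sum abs x-y = 12
    q).ml.edist2[x;y]               / (x-y) wsum x-y = 50
    q).ml.edist[x;y]                / sqrt 50 = 7.071068
    q).ml.hdist[x;y]                / number of differing components = 3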
    
  • Use data type to differentiate between classification and regression

    / weighted average or mode
    isord:{type[x] in 0 8 9h}               / is ordered
    aom:{$[isord x;avg;mode]x}              / average or mode
    waom:{[w;x]$[isord x;nwavg;wmode][w;x]} / weighted average or mode
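
  • For example, aom dispatches on the data type (assuming .ml.mode returns the most frequent value):

    q).ml.aom 1 2 2 7f                      / float list, so avg: 3f
    q).ml.aom `virginica`setosa`virginica   / symbol list, so mode: `virginica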
    
/ find (k) smallest values from (d)istance vector (or matrix) and use
/ (w)eighting (f)unction to return the best estimate of y
knn:{[wf;k;y;d]
 if[not type d;:.z.s[wf;k;y] peach d];    / recurse for matrix d
 if[any n:null d;d@:i:where not n; y@:i]; / filter null distances
 p:(waom . (wf d::;y)@\:#[;iasc d]::) peach k&count d; / make predictions
 p}

K-Nearest Neighbors (KNN) Example

  • Load Fun Q library and famous Iris data set

    q)\l funq.q
    q)\l iris.q
    [down]loading iris data set
    q)iris.t
    species     slength swidth plength pwidth
    -----------------------------------------
    Iris-setosa 5.1     3.5    1.4     0.2   
    Iris-setosa 4.9     3      1.4     0.2   
    Iris-setosa 4.7     3.2    1.3     0.2   
    Iris-setosa 4.6     3.1    1.5     0.2   
    ..
    
  • Partition between train/test (and even validation)

    q)d:.ut.part[`test`train!4 1;0N?] iris.t
    q)y:first get first `Y`X set' 0 1 cut value flip d.train
    q)yt:first get first `Yt`Xt set' 0 1 cut value flip d.test
    
  • Train and test

    q)k:5
    q)avg yt=.ml.knn[0n<;k;y] .ml.f2nd[.ml.edist X] Xt
    0.975
    q)avg yt=.ml.knn[sqrt 1f%;k;y] .ml.pedist2[X;Xt]
    0.9583333
    

K-Means Clustering

K-Means Clustering Algorithm

  • Find K centroids which minimize the intra-cluster distances
  • Lloyd’s algorithm iteratively alternates between:
    • Assigning each data point to its nearest centroid
    • Updating centroid definitions by summarizing each cluster
  • Fun Q algorithm implements a single round of Lloyd’s algorithm
  • Iteration is controlled by the user using q adverbs (sketched after the code below)
    • Fixed number of iterations
    • Run until convergence
    • Run until custom test returns false
/ using the (d)istance (f)unction, group matri(X) based on the closest
/ (C)entroid and return the cluster indices
cgroup:{[df;X;C]value group imin f2nd[df X] C}

/ Stuart Lloyd's algorithm. uses (d)istance (f)unction to assign the
/ matri(X) to the nearest (C)entroid and then uses the (c)entroid (f)unction
/ to update the centroid location.
lloyd:{[df;cf;X;C]cf X@\: cgroup[df;X;C]}

kmeans:lloyd[edist2;avg'']      / k-means
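
  • For example, given initial centroids C0 (seeded with .ml.kmeanspp in the example below), the iteration styles above look like:

    q)C:10 .ml.kmeans[X]/ C0          / fixed number of iterations
    q)C:.ml.kmeans[X] over C0         / run until the centroids converge
    q)C:.ml.kmeans[X]/[{1e-4<max abs raze x-.ml.kmeans[X] x};C0] / run while the centroids keep moving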

K-Means Clustering Example

  • Plot actual clusters of petal length/width

    q).ut.plt (t.plength;t.pwidth;{distinct[x]?x} t.species)
    3| "                   "
     | "             @@@@  "
     | "             @@@@ @"
    2| "            @@@@@@@"
     | "          ++x# @   "
     | "        +++++ @    "
    1| "      ++++         "
     | " .                 "
     | "...                "
    0| "...                "
    
  • Plot computed centroids within each cluster

    q)C:.ml.kmeans[X] over last 3 .ml.kmeanspp[X]// 2#()
    q).ut.plt .ml.append[0N;X 2 3],'.ml.append[1] C 2 3
    3| "                   "
     | "             ....  "
     | "             .... ."
    2| "            ...@..."
     | "          .... .   "
     | "        ..@.. .    "
    1| "      ....         "
     | " .                 "
     | "...                "
    0| ".@.                "
    

Hierarchical Agglomerative Clustering (HAC)

Hierarchical Agglomerative Clustering (HAC) Algorithm

  • Iteratively merges the closest pair of clusters, one at a time
  • Only the distances from the previous round are needed
  • Implements a single round of the Lance-Williams algorithm

    lancewilliams:{[lf;D;a;L]
     n:count D;
     d:D@'di:imin peach D;                        / find closest distances
     if[null d@:i:imin d;:(D;a;L)]; j:di i;       / find closest clusters
     c:$[9h=type lf;lf;lf(freq a)@/:(i;j;til n)]; / determine coefficients
     nd:sum c*nd,(d;abs(-/)nd:D (i;j));           / calc new distances
     D[;i]:D[i]:nd;                               / update distances
     D[;j]:D[j]:n#0n;                             / erase j
     a[where j=a]:i;                / all elements in cluster j are now in i
     L:L,'(j;i);                    / append linkage stats
     (D;a;L)}
    
  • The .ml.link function iterates and returns the linkage statistics

    link:{[lf;D]
     D:@'[D;a:til count D;:;0n];    / define cluster assignments and ignore loops
     if[-11h=type lf;lf:get lf];    / dereference lf
     L:last .[lancewilliams[lf]] over (D;a;2#()); / obtain linkage stats
     L}
    

Lance-Williams Formula and Coefficients

\[ d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma \lvert d_{ik} - d_{jk} \rvert \]

Method   | \(\alpha_i\)                    | \(\alpha_j\)                    | \(\beta\)                        | \(\gamma\)
Single   | .5                              | .5                              | 0                                | -.5
Complete | .5                              | .5                              | 0                                | .5
Average  | \(\frac{n_i}{n_i+n_j}\)         | \(\frac{n_j}{n_i+n_j}\)         | 0                                | 0
Weighted | .5                              | .5                              | 0                                | 0
Centroid | \(\frac{n_i}{n_i+n_j}\)         | \(\frac{n_j}{n_i+n_j}\)         | \(-\frac{n_i n_j}{(n_i+n_j)^2}\) | 0
Median   | .5                              | .5                              | -.25                             | 0
Ward     | \(\frac{n_i+n_k}{n_i+n_j+n_k}\) | \(\frac{n_j+n_k}{n_i+n_j+n_k}\) | \(-\frac{n_k}{n_i+n_j+n_k}\)     | 0
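
  • As a hedged illustration, each table row can be expressed as a coefficient function of \((n_i;n_j;n_k)\) returning \((\alpha_i;\alpha_j;\beta;\gamma)\), which is the shape lancewilliams expects; funq’s own .ml.lw definitions may differ in detail:

    single:{[n].5 .5 0 -.5}                                   / single: constant coefficients
    average:{[n]((n 0;n 1)%sum n 0 1),0 0f}                   / average: weights proportional to cluster sizes
    ward:{[n](((n 0)+nk;(n 1)+nk;neg nk:n 2)%sum n 0 1 2),0f} / ward: coefficients depend on each n_k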

Hierarchical Agglomerative Clustering (HAC) Example

  • Build linkage statistics

    q)X:iris.X
    q)L:.ml.link[`.ml.lw.ward] sqrt .ml.pedist2[X;X]
    
  • Visually inspect clusters of different sizes

    q)(,'/) (.ut.plt X[2 3],enlist .ut.ugrp@) each .ml.clust[L] 1+til 3
    3| "                                                         "
     | "             ++++               @@@@               +@@@  "
     | "             ++++ +             @@@@ @             @@@@ @"
    2| "            +++++++            @@@@@@@            +#@@@@@"
     | "          ++++ +             @@@@ @             ++++ @   "
     | "        +++++ +            @@@@@ @            +++++ @    "
    1| "      ++++               @@@@               ++++         "
     | " +                  .                  .                 "
     | "+++                ...                ...                "
    0| "+++                ...                ...                "
    

Expectation Maximization (EM)

Expectation Maximization (EM) Algorithm

  • Iteratively alternates between using the likelihood and MLE to:
    • Assign a probability that each point belongs to each distribution
    • Compute new distribution parameters
  • With Gaussian distributions, EM can be considered a soft K-Means
  • Introduces probability metrics to the toolbox
  • Implements a single round of Expectation Maximization

    / using (l)ikelihood (f)unction, (w)eighted (m)aximum likelihood estimator
    / (f)unction with prior probabilities (p)hi and distribution parameters
    / (THETA), optionally (f)it (p)hi and perform expectation maximization
    em:{[fp;lf;wmf;X;phi;THETA]
     W:prb likelihood[0b;lf;X;phi;THETA]; / weights (responsibilities)
     if[fp;phi:avg each W];               / new phi estimates
     THETA:wmf[;X] peach W;               / new THETA estimates
     (phi;THETA)}
    

Expectation Maximization (EM) Example

  • Initialize multi-variate Gaussian distribution parameters

    q)`X`y set' iris`X`y;
    q)k:count distinct y
    q)phi:k#1f%k
    q)mu:X@\:/:neg[k]?count y
    q)SIGMA:k#enlist X cov\:/: X
    q)lf:.ml.gaussmvl
    q)mf:.ml.wgaussmvmle
    
  • Fit Iris data to the distribution and compare clusters

    q)pT:(.ml.em[1b;lf;mf;X]//) (phi;flip (mu;SIGMA))
    q)p:.ml.imax .ml.likelihood[1b;.ml.gaussmvll;X] . pT
    q)m:.ml.mode each y group p
    q)avg y=m p
    0.8866667
    

Confusion Matrix Example

  • A confusion matrix compares expected vs actual values

    / given true labels y and predicted labels p, return a confusion matrix
    cm:{[y;p]
     n:count u:asc distinct y,p;
     m:./[(n;n)#0;flip (u?p;u?y);1+];
     t:([]y:u)!flip (`$string u)!m;
     t}
    
  • Q tables clearly render the confusion matrix

    q).ut.totals[`TOTAL] .ml.cm[y;m p]
    y              | Iris-setosa Iris-versicolor Iris-virginica TOTAL
    ---------------| ------------------------------------------------
    Iris-setosa    | 50          0               0              50   
    Iris-versicolor| 0           49              1              50   
    Iris-virginica | 0           16              34             50   
                   | 50          65              35             150  
    

Naive Bayes

Naive Bayes Algorithm

  • ‘Naively’ assumes conditional independence between features
  • Does not require iterating
  • Uses weighted maximum likelihood estimator functions along with class labels to build a classifier

    / fit parameters given (w)eighted (m)aximization (f)unction returns a
    / dictionary with prior and conditional likelihoods
    fnb:{[wmf;w;y;X]
     if[(::)~w;w:count[y]#1f];      / handle unassigned weight
     pT:(odds g; (wmf . (w;X@\:) @\:) peach g:group y);
     pT}
    
  • Classifies by using a [log]likelihood function and picking the class with the highest probability density

    / using a [log](l)ikelihood (f)unction and prior probabilities (p)hi and
    / distribution parameters (T)HETA, perform naive Bayes classification
    pnb:{[l;lf;pT;X]
     d:{(x . z) y}[lf]'[X] peach pT[1]; / compute probability densities
     c:imax $[l;log[pT 0]+sum flip d;pT[0]*prd flip d];
     c}
    

Naive Bayes Example

  • Use MLE to associate each class with prior and conditional likelihoods (not probabilities)

    q)flip pT:.ml.fnb[.ml.wgaussmle/:;::;iris.y;iris.X]
    Iris-setosa    | 0.3333333 (5.006 0.121764;3.428 0.140816;1.462 0.029556;0.246 0.0108..
    Iris-versicolor| 0.3333333 (5.936 0.261104;2.77 0.0965;4.26 0.2164;1.326 0.038324)   ..
    Iris-virginica | 0.3333333 (6.588 0.396256;2.974 0.101924;5.552 0.298496;2.026 0.0739..
    
  • Classification picks the class with the maximum likelihood density

    q)avg iris.y=.ml.pnb[0b;.ml.gaussl;pT] iris.X
    .96
    

Decision Tree

Decision Tree Algorithm

  • Recursively splits observations on features, progressively creating branches with more uniform target values
  • Parameters include ‘gain functions’, ‘impurity functions’ and options to control tree depth
dt:{[cgf;ogf;ipf;opt;w;t]
 if[(::)~w;w:n#1f%n:count t];       / compute default weight vector
 if[1=count d:flip t;:(w;first d)]; / no features to test
 opt:(`maxd`minss`minsl`ming`maxff!(0N;2;1;0;::)),opt; / default options
 if[0=opt`maxd;:(w;first d)];    / check if we've reached max depth
 if[identical a:first d;:(w;a)]; / check if all values are equal
 if[opt[`minss]>count a;:(w;a)]; / check if insufficient samples
 d:((neg floor opt[`maxff] count d)?key d)#d:1 _d;   / sub-select features
 d:{.[x isord z;y] z}[(cgf;ogf);(ipf;w;a)] peach d;  / compute gains
 d:(where (any opt[`minsl]>count each last::) each d) _ d; / filter on minsl
 if[0=count d;:(w;a)];          / check if all leaves have < minsl samples
 if[opt[`ming]>=first b:d bf:imax d[;0];:(w;a)]; / check gain of best feature
 c:count k:key g:last b;        / grab subtrees, feature names and count
 / distribute nulls down each branch with reduced weight
 if[c>ni:null[k]?1b;w:@[w;n:g nk:k ni;%;c-1];g:(nk _g),\:n];
 if[(::)~b 1;t:(1#bf)_t];       / don't reuse exhausted features
 b[2]:.z.s[cgf;ogf;ipf;@[opt;`maxd;-;1]]'[w g;t g]; / split sub-trees
 bf,1_b}

Decision Tree Impurity Functions

  • Impurity functions distill the impurity of branches into a single value
  • Classification impurity functions are used for categorical values

    misc:{1f-avg x=mode x}                  / misclassification
    gini:{1f-enorm2 odds group x}           / Gini
    entropy:{neg sum x*log x:odds group x}  / entropy
    
  • The Gini impurity function is used in the CART decision tree algorithm (arguably because it is more efficient than the entropy function which uses the log operator)
  • Regression impurity functions are used for ordered values

    mse:{enorm2[x-avg x]%count x}          / mean squared error
    mae:{avg abs x-avg x}                  / mean absolute error
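
  • For example (values follow from the definitions above, assuming odds returns relative frequencies):

    q).ml.gini "aabb"          / 1 - .25 - .25 = 0.5
    q).ml.entropy "aabb"       / neg sum .5 .5*log .5 .5 = 0.6931472
    q).ml.mse 1 2 3 4f         / population variance of the branch = 1.25
    q).ml.mae 1 2 3 4f         / mean absolute deviation = 1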
    

Decision Tree Projections

  • Many famous decision tree implementations are just parameterizations of .ml.dt
  • Scikit-learn’s CART algorithm treats all features as ordered, requiring users to first use one-hot encoding
  • Ross Quinlan’s C4.5 algorithm differentiates between categorical and ordered features
  • Regression trees use the same algorithm, but use a mean squared error impurity function and predict values with the weighted average of the branch values instead of the weighted mode
aid:dt[sig;oig;wmse]            / automatic interaction detection
thaid:dt[sig;oig;wmisc]         / theta automatic interaction detection
id3:dt[ig;ig;wentropy]          / iterative dichotomizer 3
q45:dt[gr;ogr;wentropy]         / like c4.5
ct:dt[oig;oig;wgini]            / classification tree
rt:dt[oig;oig;wmse]             / regression tree

Decision Tree Example (ID3)

  • To play or not to play

    q)\l weather.q
    q)weather.t
    Play Outlook  Temperature Humidity Wind  
    -----------------------------------------
    No   Sunny    Hot         High     Weak  
    No   Sunny    Hot         High     Strong
    Yes  Overcast Hot         High     Weak  
    Yes  Rain     Mild        High     Weak  
    Yes  Rain     Cool        Normal   Weak  
    No   Rain     Cool        Normal   Strong
    Yes  Overcast Cool        Normal   Strong
    No   Sunny    Mild        High     Weak  
    Yes  Sunny    Cool        Normal   Weak  
    ..
    
  • Categorical features cannot be reused

    q)-1 .ml.ptree[0] .ml.id3[();::] weather.t;
    Yes (n = 14, err = 35.7%)
    |  Outlook :: Overcast: Yes (n = 4, err = 0%)
    |  Outlook :: Rain: Yes (n = 5, err = 40%)
    |  |  Wind :: Strong: No (n = 2, err = 0%)
    |  |  Wind :: Weak: Yes (n = 3, err = 0%)
    |  Outlook :: Sunny: No (n = 5, err = 40%)
    |  |  Humidity :: High: No (n = 3, err = 0%)
    |  |  Humidity :: Normal: Yes (n = 2, err = 0%)
    

Decision Tree Example (CART)

  • Each split is binary
  • Features can be reused
q)d:.ut.part[`train`test!3 1;iris.t.species] iris.t
q)-1 .ml.ptree[0] tr:.ml.ct[();::] d`train;
Iris-setosa (n = 111, err = 66.7%)
|  plength >[;2.45] 0: Iris-setosa (n = 37, err = 0%)
|  plength >[;2.45] 1: Iris-versicolor (n = 74, err = 50%)
|  |  pwidth >[;1.7] 0: Iris-versicolor (n = 39, err = 7.7%)
|  |  |  plength >[;4.95] 0: Iris-versicolor (n = 35, err = 0%)
|  |  |  plength >[;4.95] 1: Iris-virginica (n = 4, err = 25%)
|  |  |  |  pwidth >[;1.55] 0: Iris-virginica (n = 3, err = 0%)
|  |  |  |  pwidth >[;1.55] 1: Iris-versicolor (n = 1, err = 0%)
|  |  pwidth >[;1.7] 1: Iris-virginica (n = 35, err = 2.9%)
|  |  |  plength >[;4.85] 0: Iris-virginica (n = 3, err = 33.3%)
|  |  |  |  swidth >[;3.1] 0: Iris-virginica (n = 2, err = 0%)
|  |  |  |  swidth >[;3.1] 1: Iris-versicolor (n = 1, err = 0%)
|  |  |  plength >[;4.85] 1: Iris-virginica (n = 32, err = 0%)
q)avg d.test.species=.ml.pdt[tr] d`test
0.9487179

Discrete Adaptive Boosting (AdaBoost)

Discrete Adaptive Boosting Algorithm

  • An ensemble of weak learners generates a strong learner
  • Weak learners merely need to classify with an accuracy > 50%
  • Decision stumps are commonly used as weak learners
  • Each additional learner adjusts weights to fit the observations incorrectly classified during the previous round
adaboost:{[tf;cf;w;t]
 if[(::)~w;w:n#1f%n:count t];    / initialize weights
 m:tf[w] t;                      / train model
 p:cf[m] t;                      / make predictions
 e:sum w*not p=y:first flip t;   / compute weighted error
 a:.5*log (c:1f-e)%e;            / compute alpha (minimize exponential loss)
 / w*:exp neg a*y*p;               / increase/decrease weights
 / w%:sum w;                       / normalize weights
 w%:2f*?[y=p;c;e];               / increase/decrease and normalize weights
 (m;a;w)}
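
  • The strong classifier is the sign of the alpha-weighted vote of the weak learners; a minimal sketch, assuming (m)odels and (a)lphas collected from successive adaboost rounds (funq’s .ml.pab additionally evaluates a range of ensemble sizes):

    / (c)lassification (f)unction, weak (m)odels, (a)lphas, (t)able of observations
    vote:{[cf;m;a;t]signum sum a*cf[;t] each m}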

Wisconsin Diagnostic Breast Cancer Data Set

  • Includes mean, max and standard deviation of tumor features

    q)\l wdbc.q
    [down]loading wisconsin-diagnostic-breast-cancer data set
    q)10?wdbc.t
    diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactne..
    -------------------------------------------------------------------------------------..
    B         11.89       18.35        77.32          432.2     0.09363         0.1154   ..
    M         15.13       29.81        96.71          719.5     0.0832          0.04605  ..
    B         12.85       21.37        82.63          514.5     0.07551         0.08316  ..
    M         22.27       19.67        152.8          1509      0.1326          0.2768   ..
    B         9.465       21.01        60.11          269.4     0.1044          0.07773  ..
    B         9.683       19.34        61.05          285.7     0.08491         0.0503   ..
    B         11.54       14.44        74.65          402.9     0.09984         0.112    ..
    B         12.7        12.17        80.88          495       0.08785         0.05794  ..
    M         13.61       24.69        87.76          572.6     0.09258         0.07862  ..
    M         18.05       16.15        120.2          1006      0.1065          0.2146   ..
    
  • Malignant vs. Benign diagnosis feature can be used for binomial classification

Discrete Adaptive Boosting Example

  • Training accuracy improves with extra components

    q)d:.ut.part[`train`test!3 1;0N?] update -1 1 "M"=diagnosis from 11#/:wdbc.t
    q)stump:.ml.ct[(1#`maxd)!1#1]; k:50
    q)m:.ml.fab[k;stump;.ml.pdt] d.train
    q).ut.plt avg d.train.diagnosis = .ml.pab[1+til k;.ml.pdt;m] d.train
    1   | "                   "
    0.98| "                  +"
        | "            +++++++"
    0.96| "       ++++++ +    "
        | "  ++++++ +         "
        | "+++                "
    0.94| " +                 "
    0.92| "                   "
        | "+                  "
    0.9 | "                   "
    
  • Testing accuracy also improves, but not as consistently

    q).ut.plt avg d.test.diagnosis = .ml.pab[1+til k;.ml.pdt;m] d.test
    0.98| "         +       ++"
        | "       ++ +++++    "
    0.96| "++++++++    +  ++++"
        | "   ++++ +    +     "
    0.94| " ++                "
        | "                   "
        | "                   "
    0.92| "+                  "
        | "                   "
    0.9 | "                   "     
    

Boosted Aggregating (BAG) and Random Forest

Boosted Aggregating Algorithm

  • Instead of complicated pruning techniques, fit many independent trees and average the predictions
  • Bootstrap aggregating involves random sampling (with replacement) of the data
  • The difference between Boosted Aggregating and a Random Forest comes down to how the tree is grown, not the ensemble algorithm
  • To increase tree diversity, a Random Forest adds artificial limits on the number of observed features at each tree level
/ generate (n) decision trees by applying (f) to a resampled (with
/ replacement) (t)able
bag:{[n;f;t](f ?[;t]::) peach n#count t} / (b)ootstrap (ag)gregating

/ given an atom or list (k), and bootstrap aggregating (m)odel, make
/ prediction on (d)ictionary
pbag:{[k;m;d]
 if[count[m]<max k;'`length];
 if[98h=type d;:.z.s[k;m] peach d]; / iterate on a table
 p:k {aom x#y}\: pdt[;d] peach m;
 p}

Random Forest Example

  • The decision tree algorithm provides the maxff option which controls the maximum number of features visible at each branch
  • Random Forest classification (maxff = sqrt)

    q)k:20
    q)d:.ut.part[`train`test!3 1;0N?] wdbc.t
    q)m:.ml.bag[k;.ml.q45[(1#`maxff)!1#sqrt;::]] d`train
    q)avg d.test.diagnosis=.ml.pbag[k;m] d`test
    0.9370629
    
  • Random Forest regression (maxff = 1%3)

    q)\l winequality.q
    [down]loading wine-quality data set
    q)d:.ut.part[`train`test!3 1;0N?] winequality.red.t
    q)m:.ml.bag[k;.ml.rt[(1#`maxff)!1#%[;3];::]] d`train
    q).ml.rms d.test.quality-.ml.pbag[k;m] d`test
    0.5682814
    

Linear Regression

Linear Regression Algorithm

  • Q has a native multi-variate lsq operator
  • The implementation is similar to this normal equations function

    / given target matrix Y and data matri(X), return the THETA matrix resulting
    / from minimizing sum of squared residuals
    normeq:{[Y;X]mm[mmt[Y;X]] minv mmt[X;X]} / normal equations ols
    
  • Inverting matrices is slow and numerically unstable
  • So the lsq operator uses Cholesky decomposition to solve for the coefficients directly instead of using LU decomposition to first invert the normal equations matrix.
  • Ridge regression is simply the addition of L2 regularization

    / given (l2) regularization parameter, target matrix Y and data matri(X),
    / return the THETA matrix resulting from performing ridge regression
    ridge:{[l2;Y;X]mm[mmt[Y;X]] minv mmt[X;X]+diag count[X]#l2}
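
  • A quick check of the normal-equations and ridge functions against the built-in lsq (illustrative data; assumes the definitions above live in the .ml namespace):

    q)X:.ml.prepend[1f] enlist x:"f"$til 10  / prepend a row of ones for the intercept
    q)Y:enlist 3f+2f*x                       / exact linear relationship y = 3 + 2x
    q).ml.normeq[Y;X]                        / ~ intercept 3, slope 2
    q)Y lsq X                                / agrees up to floating-point error
    q).ml.ridge[10f;Y;X]                     / L2 penalty shrinks the coefficients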
    

Linear Regression Example

  • Use Box-Muller to generate correlated random variables

    q)plt:.ut.plot[30;15;.ut.c10]
    q)X:(.ml.bm 10000?) each 1 1f
    q)rho:.8                          / correlation
    q)X[1]:(rho;sqrt 1f-rho*rho)$X
    q)plt[sum] X
    4 | "                       .   .  "
      | "                   .   . ..  ."
      | "                  ............"
    2 | "               .............  "
      | "             ....---::-.....  "
      | "          ....-:++++:-.....   "
      | "        ....:=+#%%x+--..      "
    0 | "      ....-=x%@@%x:-...       "
      | "     ....:+#%%%+:-...         "
      | "    ....-+++x=:-... .         "
      | "  ....---::--....             "
    -2| "  ..............              "
      | ". ..........                  "
      | " . ......                     "
    -4| " ..                           "
    
  • Prepending a vector of ones allows us to fit the intercept

    q)show THETA:(-1#X) lsq .ml.prepend[1f] 1#X
    -0.008971004 0.8033666
    

Logistic Regression

Logistic Regression Cost Function

  • Logistic regression applies the Sigmoid function to compress values between 0 and 1: \(S(x)=\frac{1}{1+e^{-x}}\)

    sigmoid:1f%1f+exp neg::                       / sigmoid
    
  • Cost function with L1 and L2 regularization \[ J(\theta) = \frac{1}{m} \left[ \sum_{i=1}^m \text{LogLoss}(y^{(i)},h_\theta(x^{(i)})) \right] + \frac{\lambda_1}{m} \sum_{j=1}^n\left|\theta_j\right| + \frac{\lambda_2}{2m} \sum_{j=1}^n\theta_j^2\]
  • Regularized logistic regression cost function

    / logistic regression cost
    logcost:{[rf;Y;X;THETA]
     J:(1f%m:count X 0)*revo[sum] logloss[Y] plog[X;THETA];    / cost
     if[count rf,:();THETA[;0]:0f; J+:sum rf[;m][;0][;THETA]]; / regularization
     J}
    

Logistic Regression Gradient Function

  • Gradient function with L1 and L2 regularization \[G(\theta_j)=\frac{\partial J(\theta)}{\partial \theta_j}= \frac{1}{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)} + \frac{\lambda_1}{m} \text{sign}(\theta_j) + \frac{\lambda_2}{m} \theta_j \]
  • Regularized logistic regression gradient function

    / logistic regression gradient
    loggrad:{[rf;Y;X;THETA]
     G:(1f%m:count X 0)*mmt[sigmoid[mm[THETA;X]]-Y] X:prepend[1f] X; / gradient
     if[count rf,:();THETA[;0]:0f; G+:sum rf[;m][;1][;THETA]]; / regularization
     G}
    

Logistic Regression Example

  • Regularization choices have been factored out

    l1:{[l;m]((l%m)*revo[sum] abs::;(l%m)*signum::)}
    l2:{[l;m]((.5*l%m)*revo[sum] {x*x}::;(l%m)*)}
    
  • Elastic net is simply a combination of the two

    enet:{[a;lr](l1 a*lr;l2 a*1f-lr)}
    
  • Regularized gradient descent prevents over-fitting

    q)t:11#/:update "M"=diagnosis from wdbc.t
    q)d:.ut.part[`train`test!3 1;0N?] "f"$t
    q)y:first get first `Y`X set' 0 1 cut value flip d`train
    q)yt:first get first `Yt`Xt set' 0 1 cut value flip d`test
    q)THETA:10000 .ml.gd[.1;.ml.loggrad[.ml.l2[.1];Y;X]]/ enlist (1+count X)#0f
    q)avg yt="i"$p:first .ml.plog[Xt] THETA
    0.8601399
    q).ut.totals[`TOTAL] .ml.cm . "i"$(yt;p)
    y| 0  1  TOTAL
    -| -----------
    0| 83 7  90   
    1| 13 40 53   
     | 96 47 143
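
  • The .ml.gd step used above is assumed to be plain batch gradient descent; a minimal sketch with the same calling convention (learning rate, gradient function, parameters):

    / move the parameters THETA against the gradient, scaled by the learning rate a
    gdstep:{[a;gf;THETA]THETA-a*gf THETA}
    / e.g. 10000 gdstep[.1;.ml.loggrad[.ml.l2[.1];Y;X]]/ enlist (1+count X)#0f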
    

Neural Network Classification/Regression

Neural Network Algorithm

  • Conceptually similar to applying multiple rounds of linear and/or logistic regression in sequence
  • But our choice of activation function and cost function can vary
  • As can the number (and size) of hidden layers
  • Gradient descent is once again the workhorse
  • Parameter adjustments are computed by applying the chain rule to the final error and intermediate gradients
/ (r)egularization (f)unction, (n)etwork topology dimensions, hgolf:
/ (h)idden (g)radient (o)utput (l)oss functions
nncostgrad:{[rf;n;hgolf;Y;X;theta]
 THETA:nncut[n] theta;          / unroll theta
 ZA:enlist[(X;X)],(X;X) {(z;x z:plin[y 1;z])}[hgolf`h]\ -1_THETA;
 P:hgolf[`o] plin[last[ZA]1;last THETA];      / prediction
 J:(1f%m:count X 0)*revo[sum] hgolf[`l][Y;P]; / cost
 G:hgolf[`g]@'`z`a!/:1_ZA;                    / activation gradient
 D:reverse{[D;THETA;G]G*1_mtm[THETA;D]}\[E:P-Y;reverse 1_THETA;reverse G];
 G:(1f%m)*(D,enlist E) mmt' prepend[1f] each ZA[;1]; / full gradient
 if[count rf,:();THETA[;;0]:0f;JG:rf[;m][;;THETA];J+:sum JG@'0;G+:sum JG@'1];
 (J;2 raze/ G)}
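
  • Prediction is just forward propagation through the fitted layers; a hedged sketch of what .ml.pnn (used in the example below) could look like, following the plin and hgolf conventions above:

    pnnsketch:{[hgolf;X;THETA]
     if[11h=type value hgolf;hgolf:get each hgolf]; / dereference symbols, as link does
     f:{[h;a;T]h plin[a;T]}hgolf`h;                 / one hidden-layer step: activate the linear output
     hgolf[`o] plin[f/[X;-1_THETA];last THETA]}     / fold over the hidden layers, then apply the output layer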

MNIST Hand Written Digits

  • Each digit is represented by a 28 x 28 pixel image with values ranging between 0 and 255

    q)\l mnist.q
    [down]loading handwritten-digit data set
    q)`X`Xt`y`yt set' mnist`X`Xt`y`yt;
    q)X:1000#'X;y:1000#y;
    q)X%:255f;Xt%:255f
    
  • The plotting utility allows us to visualize the images

    q)plt:value .ut.plot[28;14;.ut.c10;avg] .ut.hmap flip 28 cut
    q)-1 (,'/) plt each X@\:/: -3?count X 0;
        
        
        
                                                                                          :-   
                         .=#%@@x.          .    :=:=%@#-.-::x%:                         -%@%   
                       :x@%+--#+.              =@@@@@@@@@@@@@@:              .=:       =@%-    
                    -+%#=. .::               -%@@#=--:---:x@@@-             :%%-     .#@=      
                   x@%- -:x@@%.              .xx:      :%@@@+.             #@@=--...-@@:       
                   =@@%@@@@@x                        .x@@@x.             +@@@@@@%%%@@@=        
                        :@@=                       .x@@@#-              #@#:-     :%@:         
                       :@@+                      .x@@@#-                         =%@-          
                      +@@+                      x@@@%-                          -#x.           
                     x@@:                     +@@@%:                                           
                     -:.                     -:::-                                             
    

Neural Network Example

  • 100 iterations of conjugate gradient minimization (fmincg)

    q)Y:.ml.diag[(1+max y)#1f]@\:y
    q)n:0N!"j"$.ut.nseq[2;count X;count Y]
    784 397 10
    q)theta:2 raze/ THETA:.ml.glorotu'[1+-1_n;1_n];
    q)rf:.ml.l2[1f];
    q)hgolf:`h`g`o`l!`.ml.sigmoid`.ml.dsigmoid`.ml.sigmoid`.ml.logloss
    q)theta:first .fmincg.fmincg[100;.ml.nncostgrad[rf;n;hgolf;Y;X];theta];
    Iteration 100 | cost: 0.5257291
    q)avg yt=p:.ml.imax .ml.pnn[hgolf;Xt] .ml.nncut[n] theta
    0.8722
    
  • The confusion matrix is intuitive

    q).ut.totals[`TOTAL] .ml.cm[yt;"i"$p]
    y| 0    1    2    3   4   5   6   7    8   9    TOTAL
    -| --------------------------------------------------
    0| 942  0    6    1   0   10  13  5    3   0    980  
    1| 0    1096 4    1   0   3   2   1    27  1    1135 
    2| 11   13   899  13  9   3   19  16   40  9    1032 
    3| 3    3    37   785 3   107 4   16   42  10   1010 
    4| 1    4    5    1   842 1   20  1    10  97   982  
    5| 13   5    10   30  20  730 18  12   34  20   892  
    6| 12   4    21   0   9   30  877 1    4   0    958  
    7| 4    17   30   13  8   7   1   915  0   33   1028 
    8| 21   14   6    17  15  43  21  16   793 28   974  
    9| 13   5    5    17  43  13  1   60   9   843  1009 
     | 1020 1161 1023 878 949 947 976 1043 962 1041 10000
    

Content-Based/Collaborative Filtering (Recommender Systems)

Generating Personal Movie Ratings

  • Loading the MovieLens data set

    q)\l mlens.q
    [down]loading latest movielens data set
    "unzip -n ml-latest-small.zip"
    
  • Append personal ratings

    q)r:1!select `mlens.movie$movieId,rating:0n from mlens.movie
    q)r,:([]movieId:173 208 260 435 1197 2005 1968i;rating:.5 .5 4 .5 4 4 4f)
    q)r,:([]movieId:2918 4006 53996 69526 87520 112370i;rating:5 5 4 4 5 5f)
    q)select movieId,rating,movieId.title from r where not null rating
    movieId rating title                                
    ----------------------------------------------------
    173     0.5    "Judge Dredd"                        
    208     0.5    "Waterworld"                         
    260     4      "Star Wars: Episode IV - A New Hope" 
    435     0.5    "Coneheads"                          
    1197    4      "Princess Bride, The"                
    1968    4      "Breakfast Club, The"                
    2005    4      "Goonies, The"                       
    2918    5      "Ferris Bueller's Day Off"           
    4006    5      "Transformers: The Movie"            
    53996   4      "Transformers"                       
    69526   4      "Transformers: Revenge of the Fallen"
    87520   5      "Transformers: Dark of the Moon"     
    112370  5      "Transformers: Age of Extinction"    
    

Content-Based Filtering Fitting

  • Assume that each movie can be described by a list of boolean flags indicating which genres the movie belongs to
  • Building a matrix of these flags allows us to use linear regression

    q)Y:value[r]1#`rating
    q)X:"f"$flip genre in/: value[mlens.movie]`genres
    q)theta:first 0N!THETA:(1;1+count X)#0f
    ,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0f
    q)rf:.ml.l2[.1]
    q)theta:first .fmincg.fmincg[20;.ml.lincostgrad[rf;Y;X];theta]
    Iteration 20 | cost: 8.972662e-05
    
  • Theta coefficients describe my genre preferences

    q){(5#x),-5#x}desc genre!1_theta
    1980    | 1.772478
    2010    | 1.546712
    1970    | 0.6938292
    Comedy  | 0.408703
    Sci-Fi  | 0.4022778
    Children| -0.2086186
    Romance | -0.2254342
    Drama   | -0.4333146
    Fantasy | -0.8020893
    1990    | -2.581026
    

Content-Based Filtering Prediction

  • Scores are generated by multiplying my preferences by the genres

    q)r:update score:first .ml.plin[X;enlist theta] from r
    
  • Highest recommendations actually score higher than 5

    q)select[5;>score] movieId,score,movieId.title from r
    movieId score    title                             
    ---------------------------------------------------
    85261   5.680565 "Mars Needs Moms"                 
    92681   5.521147 "Journey 2: The Mysterious Island"
    81564   5.442022 "Megamind"                        
    4121    5.441443 "Innerspace"                      
    8633    5.441443 "Last Starfighter, The"           
    
  • Highest recommendations ‘check all the boxes’

    q)select[5;>score] movieId.genres from r
    genres                                                             
    -------------------------------------------------------------------
    `genre$`Action`Adventure`Animation`Children`Comedy`Sci-Fi`IMAX`2010
    `genre$`Action`Adventure`Comedy`Sci-Fi`IMAX`2010                   
    `genre$`Action`Animation`Children`Comedy`Sci-Fi`IMAX`2010          
    `genre$`Action`Adventure`Comedy`Sci-Fi`1980                        
    `genre$`Action`Adventure`Comedy`Sci-Fi`1980                        
    

User-User Collaborative Filtering

  • Build a user/movie matrix and demean each user to remove biases

    q)n:20
    q)m:exec distinct movieId from rating where n<(count;i) fby movieId
    q)R:value exec (movieId!rating) m by userId from rating where movieId in m
    q)R,:r[([]movieId:m);`rating]
    q)U:R-au:avg each R
    
  • Recommendations are generated using KNN to average the ratings of the k ‘closest’ users

    q)k:30
    q)p:last[au]+.ml.fknn[1f-;.ml.cordist\:;k;U;0f^U] 0f^last U
    q)`score xdesc update score:p,movieId.title from ([]movieId:m)#r
    movieId| rating score    title                      
    -------| -------------------------------------------
    1967   |        5.117033 "Labyrinth"                
    475    |        4.761649 "In the Name of the Father"
    1276   |        4.470915 "Cool Hand Luke"           
    2067   |        4.470915 "Doctor Zhivago"           
    2968   |        4.421184 "Time Bandits"             
    2918   | 5      4.403138 "Ferris Bueller's Day Off" 
    908    |        4.346668 "North by Northwest"       
    71535  |        4.279887 "Zombieland"               
    8874   |        4.279887 "Shaun of the Dead"        
    

Item-Item Collaborative Filtering

  • User ratings are sparse and subject to frequent changes
  • Item ratings have more support and are more stable
  • Item (instead of user) biases are removed

    q)I-:ai:avg each I:flip R
    
  • Recommendations are generated using KNN to average our own ratings of each movie’s k most similar movies

    q)D:((0^I) .ml.cosdist\:) peach 0^I
    q)p:ai+.ml.knn[1f-;k;last each I] D
    q)`score xdesc update score:p,movieId.title from ([]movieId:m)#r
    movieId| rating score    title                                            
    -------| -----------------------------------------------------------------
    912    |        5.39     "Casablanca"                                     
    926    |        5.379167 "All About Eve"                                  
    908    |        5.334211 "North by Northwest"                             
    356    |        5.314134 "Forrest Gump"                                   
    1262   |        5.277907 "Great Escape, The"                              
    1207   |        5.27069  "To Kill a Mockingbird"                          
    720    |        5.242593 "Wallace & Gromit: The Best of Aardman Animation"
    899    |        5.224468 "Singin' in the Rain"                            
    1247   |        5.213291 "Graduate, The"                                  
    1148   |        5.185714 "Wallace & Gromit: The Wrong Trousers"           
    

Matrix Decomposition Collaborative Filtering Algorithm

  • The null-riddled rating matrix cannot be decomposed with SVD
  • Alternatives include Gradient Descent, Stochastic Gradient Descent and Alternating Least Squares (ALS).
  • ALS elegantly transforms a non-convex problem into an iterative quadratic problem
/ ALS-WR (a)lternating (l)east (s)quares with (w)eighted (r)egularization
alswr:{[l2;Y;XTHETA]
 X:flip f2nd[wridge[l2;;XTHETA 1]] Y; / hold THETA constant, solve for X
 THETA:flip wridge[l2;;X] peach Y;    / hold X constant, solve for THETA
 (X;THETA)}
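
  • Once fitted, predicted ratings are the inner products of the user and item factor columns; .ml.pcf (used in the example below) is assumed to be equivalent to this sketch:

    pcfsketch:{[X;THETA]mtm[THETA;X]}  / users x items matrix of predicted ratings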

Matrix Decomposition Collaborative Filtering Example

  • Perform regularized ALS until the percentage change in cost falls below the supplied threshold

    q)nf:10;
    q)n:(ni:count U 0;nu:count U)
    q)XTHETA:(X:-1+ni?/:nf#1f;THETA:-1+nu?/:nf#2f)
    q)XTHETA:first .ml.iter[1;.0001;.ml.cfcost[();U] .;.ml.alswr[.01;U]] XTHETA
    iter: 144 | cost: 8.349892 | pct: 9.952313e-05
    
  • Recommendations are generated by combining item and user exposure matrices

    q)P:au+.ml.pcf . XTHETA
    q)show t:`score xdesc update score:last P,movieId.title from ([]movieId:m)#r
    movieId| rating score    title                                                
    -------| ---------------------------------------------------------------------
    4232   |        9.907802 "Spy Kids"                                           
    102903 |        9.573078 "Now You See Me"                                     
    3869   |        8.985205 "Naked Gun 2 1/2: The Smell of Fear, The"            
    265    |        8.948548 "Like Water for Chocolate (Como agua para chocolate)"
    84954  |        8.573156 "Adjustment Bureau, The"                             
    116823 |        8.445161 "The Hunger Games: Mockingjay - Part 1"              
    53121  |        8.140992 "Shrek the Third"                                    
    6942   |        7.996043 "Love Actually"                                      
    

Google PageRank

Google PageRank Algorithm

  • The PageRank algorithm relies on page links (not content)
  • Links (like citations) provide evidence of relative importance
  • The PageRank algorithm adds ‘random surfing’

    / given a (d)amping factor (1 - the probability of random surfing) and the
    / (A)djacency matrix, create the Markov Google matrix
    google:{[d;A]
     M:A%1f|s:sum each A;           / convert to Markov matrix
     M+:(0f=s)%n:count M;           / add links to dangling pages
     M:(d*M)+(1f-d)%n;              / dampen
     M}
    
  • The Google Matrix is a transition matrix with random surfing

    q).ml.google[.85] (10110b;11001b;01110b;00111b;00000b)
    0.3133333 0.03      0.3133333 0.3133333 0.03     
    0.3133333 0.3133333 0.03      0.03      0.3133333
    0.03      0.3133333 0.3133333 0.3133333 0.03     
    0.03      0.03      0.3133333 0.3133333 0.3133333
    0.2       0.2       0.2       0.2       0.2      
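
  • The PageRank vector is the fixed point of repeatedly applying this matrix; a minimal dense sketch (the example below uses the sparse .ml.pageranks instead):

    q)M:.ml.google[.85] (10110b;11001b;01110b;00111b;00000b)
    q)r:{x$y}[;M] over n#1f%n:count M   / power iteration from a uniform start; r sums to 1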
    

Google PageRank Example

  • Generate page links

    q)i:1 1 2 2 3 3 3 4 6
    q)j:2 6 3 4 4 5 6 1 1
    q)l:(i;j)
    
  • Enumerate links and create a sparse matrix

    q)node:asc distinct raze l
    q)l:node?l
    q)show S:(1 2#1+max over l), .ml.prepend[1f] l
    6 6
    1 1 1 1 1 1 1 1 1f
    0 0 1 1 2 2 2 3 5
    1 5 2 3 3 4 5 0 0
    
  • Iteratively apply transition matrix until convergence

    q)d:.85                        / damping factor, as in the Google matrix example above
    q)node[i]!r i:idesc r:.ml.pageranks[d;S] over r:n#1f%n:S[0;0]
    1| 0.3210169
    6| 0.200744
    2| 0.170543
    4| 0.1367926
    3| 0.1065916
    5| 0.0643118
    

Sparklines

Sparkline Algorithm

A sparkline is a small, intense, simple, word-sized graphic with typographic resolution. – Edward Tufte

  • Bucket all values into 8 bins
  • Map bins to 8 Unicode characters
/ allocate x into n bins
nbin:{[n;x](n-1)&floor n*.5^x%max x-:min x}

/ generate unicode sparkline
spark:raze("c"$226 150,/:129+til 8)nbin[8]::

Sparkline Example

  • Load sample DJIA data set

    q)\l dji.q
    [down]loading dji data set
    q)dji.t
    quarter stock date       open  high  low   close volume    percent_change_price percent_change_vo..
    -------------------------------------------------------------------------------------------------..
    1       AA    2011.01.07 15.82 16.72 15.78 16.42 239655616 3.79267                               ..
    1       AA    2011.01.14 16.71 16.71 15.64 15.97 242963398 -4.42849             1.380223         ..
    1       AA    2011.01.21 16.19 16.38 15.6  15.79 138428495 -2.47066             -43.02496        ..
    1       AA    2011.01.28 15.87 16.63 15.82 16.13 151379173 1.63831              9.3555           ..
    1       AA    2011.02.04 16.18 17.39 16.18 17.14 154387761 5.93325              1.987452         ..
    ..
    
  • Use q-sql to conveniently access close and stock columns

    q)-1@'10#exec ((4$string stock 0),": ",.ut.spark close) by stock from dji.t;
    AA  : ▅▄▃▄▇▇▇▅▅▄▄▆▇█▅▆▆▇▆▄▅▃▂▁▂
    AXP : ▁▃▃▁▁▄▃▁▁▁▁▃▂▃▃▄▆▇▇██▇▅▆▆
    BA  : ▁▁▂▁▂▃▃▃▃▂▁▄▄▄▃▅███▇▆▅▃▄▂
    BAC : ▇█▇▆▇██▇▇▇▆▅▅▆▄▄▃▄▃▂▂▂▁▁▁
    CAT : ▁▁▁▂▃▄▅▄▄▃▅▆█▇▆▆█▇▅▅▅▃▂▂▃
    CSCO: ▇█▇▇█▅▅▅▄▄▃▃▃▄▃▃▃▃▃▂▂▂▁▁▁
    CVX : ▁▁▂▁▃▃▄▅▆▄▆▇██▇██▆▅▅▆▅▄▄▃
    DD  : ▂▂▁▂▄▆█▆▆▅▅▆▇▇▇██▆▅▅▄▂▂▂▄
    DIS : ▃▃▃▂▅█████▅██▆▆▇██▆▆▆▃▂▁▁
    GE  : ▂▂▅▆▆██▇▆▆▃▅▆▆▅▅▆▅▅▄▄▂▁▂▁
    

Thank You

Never, ever underestimate the importance of having fun – Randy Pausch