Goals

  • Introduce you to Q and KDB+
  • Amaze you with 12 machine-learning algorithms
  • Impress you with efficient code
  • Dazzle you with ASCII plots
  • Motivate you to learn more about ML and Q
  • Cheer you up with a fun quote

Q Fundamentals

What is Q?

  • The language behind the world’s fastest time series database
  • Language Paradigms:
    • Interpreted
    • Dynamically Typed
    • Vectorized
    • Functional
    • Event-driven
  • Think Python but with native support for:
    • SQL
    • Parallelization
    • Large Tables (> 1 Billion Rows)
    • Distributed Computing

Why do I like Q?

  • Personally manage extremely large datasets with a single machine
    • Databases are directories
    • Partitions are sub-directories
    • Tables are sub-directories within partitions
    • Columns are files within table directories
  • Efficiently joins disparate datasets, making complex analysis easy
  • Vectorized computations reduce the interpreter overhead
  • Vectors, matrices, dictionaries and tables are all lists
  • Notation works intuitively across types
  • Code-injection allows iterative development
  • Q-SQL permits all q operators and even user-defined functions (see the sketch below)
  • Null and infinite values are available for integer and temporal types
  • Functional paradigm promotes parallelism by separating functions from data
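  • A minimal sketch of the q-SQL and null-handling points above (table, column and function names are illustrative):

    q)t:([]sym:`a`b`a`b`a;qty:1 2 0N 4 5)              / 0N is an integer null
    q)rng:{max[x]-min x}                                / user-defined aggregation
    q)select total:sum qty,range:rng qty by sym from t  / aggregates ignore nulls; rng runs inside q-SQL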

Fun Q Overview

Where does Fun Q Fit in?

  • The original Coursera machine-learning course was taught in Matlab/Octave by Andrew Ng
  • Why can’t q be used for the same problems?
  • It turns out that many machine-learning algorithms have beautifully vectorized solutions in q
  • 5 years of continuous refactoring revealed common functional paradigms in machine learning
  • Small 1–5 line functions are the perfect tool for teaching
  • 12 distinct machine-learning algorithm families are taught using nothing but the 32-bit ‘personal’ edition or 64-bit ‘on-demand’ edition of the q language
  • Limiting all graphics to the terminal adds a bit of fun to the journey

Algorithms

  • K-Nearest Neighbors (KNN)
  • K-Means/Medians/Medoids Clustering
  • Hierarchical Agglomerative Clustering (HAC)
  • Expectation Maximization (EM)
  • Naive Bayes
  • Decision Tree (ID3, C4.5, CART)
  • Discrete Adaptive Boosting (AdaBoost)
  • Boosted Aggregating (BAG) and Random Forest
  • Linear Regression
  • Logistic Regression and One-vs.-All
  • Neural Network Classification/Regression
  • Content-Based/Collaborative Filtering (Recommender Systems)
  • Google PageRank

K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) Algorithm

  • Predictions are made by aggregating the K nearest neighbors
  • ‘Nearness’ requires the definition of distance metrics

    mnorm:sum abs::                           / Manhattan (taxicab) norm
    enorm2:{x wsum x}                         / Euclidean norm squared
    enorm:sqrt enorm2::                       / Euclidean norm
        
    hdist:sum (<>)::                / Hamming distance
    mdist:mnorm (-)::               / Manhattan (taxicab) distance
    edist2:enorm2 (-)::             / Euclidean distance squared
    edist:enorm (-)::               / Euclidean distance
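
  • For example, applying these metrics to a pair of vectors (expected values shown as comments):

    q)x:1 2 3f; y:4 6 8f
    q).ml.mdist[x;y]                / sum abs x-y = 12
    q).ml.edist2[x;y]               / (x-y) wsum x-y = 50
    q).ml.edist[x;y]                / sqrt 50 = 7.071068
    q).ml.hdist[x;y]                / number of differing components = 3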
    
  • Use data type to differentiate between classification and regression

    / weighted average or mode
    isord:{type[x] in 0 8 9h}               / is ordered
    aom:{$[isord x;avg;mode]x}              / average or mode
    waom:{[w;x]$[isord x;nwavg;wmode][w;x]} / weighted average or mode
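
  • For example, aom dispatches on the data type (assuming .ml.mode returns the most frequent value):

    q).ml.aom 1 2 2 7f                      / float list, so avg: 3f
    q).ml.aom `virginica`setosa`virginica   / symbol list, so mode: `virginica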
    
/ find (k) smallest values from (d)istance vector (or matrix) and use
/ (w)eighting (f)unction to return the best estimate of y
knn:{[wf;k;y;d]
 if[not type d;:.z.s[wf;k;y] peach d];    / recurse for matrix d
 if[any n:null d;d@:i:where not n; y@:i]; / filter null distances
 p:(waom . (wf d::;y)@\:#[;iasc d]::) peach k&count d; / make predictions
 p}

K-Nearest Neighbors (KNN) Example

  • Load Fun Q library and famous Iris data set

    q)\l funq.q
    q)\l iris.q
    [down]loading iris data set
    q)iris.t
    species     slength swidth plength pwidth
    -----------------------------------------
    Iris-setosa 5.1     3.5    1.4     0.2   
    Iris-setosa 4.9     3      1.4     0.2   
    Iris-setosa 4.7     3.2    1.3     0.2   
    Iris-setosa 4.6     3.1    1.5     0.2   
    ..
    
  • Partition between train/test (and even validation)

    q)d:.ut.part[`test`train!4 1;0N?] iris.t
    q)y:first get first `Y`X set' 0 1 cut value flip d.train
    q)yt:first get first `Yt`Xt set' 0 1 cut value flip d.test
    
  • Train and test

    q)k:5
    q)avg yt=.ml.knn[0n<;k;y] .ml.f2nd[.ml.edist X] Xt
    0.975
    q)avg yt=.ml.knn[sqrt 1f%;k;y] .ml.pedist2[X;Xt]
    0.9583333
    

K-Means Clustering

K-Means Clustering Algorithm

  • Find K centroids which minimize the intra-cluster distances
  • Lloyd’s algorithm iteratively alternates between:
    • Assigning each data point to its nearest centroid
    • Updating centroid definitions by summarizing each cluster
  • Fun Q algorithm implements a single round of Lloyd’s algorithm
  • Iteration is controlled by the user using q adverbs (sketched after the code below)
    • Fixed number of iterations
    • Run until convergence
    • Run until custom test returns false
/ using the (d)istance (f)unction, group matri(X) based on the closest
/ (C)entroid and return the cluster indices
cgroup:{[df;X;C]value group imin f2nd[df X] C}

/ Stuart Lloyd's algorithm. uses (d)istance (f)unction to assign the
/ matri(X) to the nearest (C)entroid and then uses the (c)entroid (f)unction
/ to update the centroid location.
lloyd:{[df;cf;X;C]cf X@\: cgroup[df;X;C]}

kmeans:lloyd[edist2;avg'']      / k-means
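
  • For example, given initial centroids C0 (seeded with .ml.kmeanspp in the example below), the iteration styles above look like:

    q)C:10 .ml.kmeans[X]/ C0          / fixed number of iterations
    q)C:.ml.kmeans[X] over C0         / run until the centroids converge
    q)C:.ml.kmeans[X]/[{1e-4<max abs raze x-.ml.kmeans[X] x};C0] / run while the centroids keep moving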

K-Means Clustering Example

  • Plot actual clusters of petal length/width

    q).ut.plt (t.plength;t.pwidth;{distinct[x]?x} t.species)
    3| "                   "
     | "             @@@@  "
     | "             @@@@ @"
    2| "            @@@@@@@"
     | "          ++x# @   "
     | "        +++++ @    "
    1| "      ++++         "
     | " .                 "
     | "...                "
    0| "...                "
    
  • Plot computed centroids within each cluster

    q)C:.ml.kmeans[X] over last 3 .ml.kmeanspp[X]// 2#()
    q).ut.plt .ml.append[0N;X 2 3],'.ml.append[1] C 2 3
    3| "                   "
     | "             ....  "
     | "             .... ."
    2| "            ...@..."
     | "          .... .   "
     | "        ..@.. .    "
    1| "      ....         "
     | " .                 "
     | "...                "
    0| ".@.                "
    

Hierarchical Agglomerative Clustering (HAC)

Hierarchical Agglomerative Clustering (HAC) Algorithm

  • Iteratively merges the closest pair of clusters, one at a time
  • Only the distances from the previous round are needed
  • Implements a single round of the Lance-Williams algorithm

    lancewilliams:{[lf;D;a;L]
     n:count D;
     d:D@'di:imin peach D;                        / find closest distances
     if[null d@:i:imin d;:(D;a;L)]; j:di i;       / find closest clusters
     c:$[9h=type lf;lf;lf(freq a)@/:(i;j;til n)]; / determine coefficients
     nd:sum c*nd,(d;abs(-/)nd:D (i;j));           / calc new distances
     D[;i]:D[i]:nd;                               / update distances
     D[;j]:D[j]:n#0n;                             / erase j
     a[where j=a]:i;                / all elements in cluster j are now in i
     L:L,'(j;i);                    / append linkage stats
     (D;a;L)}
    
  • The .ml.link function iterates and returns the linkage statistics

    link:{[lf;D]
     D:@'[D;a:til count D;:;0n];    / define cluster assignments and ignore loops
     if[-11h=type lf;lf:get lf];    / dereference lf
     L:last .[lancewilliams[lf]] over (D;a;2#()); / obtain linkage stats
     L}
    

Lance-Williams Formula and Coefficients

\[ d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma \lvert d_{ik} - d_{jk} \rvert \]

Method   | \(\alpha_i\)                    | \(\alpha_j\)                    | \(\beta\)                        | \(\gamma\)
Single   | .5                              | .5                              | 0                                | -.5
Complete | .5                              | .5                              | 0                                | .5
Average  | \(\frac{n_i}{n_i+n_j}\)         | \(\frac{n_j}{n_i+n_j}\)         | 0                                | 0
Weighted | .5                              | .5                              | 0                                | 0
Centroid | \(\frac{n_i}{n_i+n_j}\)         | \(\frac{n_j}{n_i+n_j}\)         | \(-\frac{n_i n_j}{(n_i+n_j)^2}\) | 0
Median   | .5                              | .5                              | -.25                             | 0
Ward     | \(\frac{n_i+n_k}{n_i+n_j+n_k}\) | \(\frac{n_j+n_k}{n_i+n_j+n_k}\) | \(-\frac{n_k}{n_i+n_j+n_k}\)     | 0
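
  • As a hedged illustration, each table row can be expressed as a coefficient function of \((n_i;n_j;n_k)\) returning \((\alpha_i;\alpha_j;\beta;\gamma)\), which is the shape lancewilliams expects; funq’s own .ml.lw definitions may differ in detail:

    single:{[n].5 .5 0 -.5}                                   / single: constant coefficients
    average:{[n]((n 0;n 1)%sum n 0 1),0 0f}                   / average: weights proportional to cluster sizes
    ward:{[n](((n 0)+nk;(n 1)+nk;neg nk:n 2)%sum n 0 1 2),0f} / ward: coefficients depend on each n_k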

Hierarchical Agglomerative Clustering (HAC) Example

  • Build linkage statistics

    q)X:iris.X
    q)L:.ml.link[`.ml.lw.ward] sqrt .ml.pedist2[X;X]
    
  • Visually inspect clusters of different sizes

    q)(,'/) (.ut.plt X[2 3],enlist .ut.ugrp@) each .ml.clust[L] 1+til 3
    3| "                                                         "
     | "             ++++               @@@@               +@@@  "
     | "             ++++ +             @@@@ @             @@@@ @"
    2| "            +++++++            @@@@@@@            +#@@@@@"
     | "          ++++ +             @@@@ @             ++++ @   "
     | "        +++++ +            @@@@@ @            +++++ @    "
    1| "      ++++               @@@@               ++++         "
     | " +                  .                  .                 "
     | "+++                ...                ...                "
    0| "+++                ...                ...                "
    

Expectation Maximization (EM)

Expectation Maximization (EM) Algorithm

  • Iteratively alternates between using the likelihood and MLE to:
    • Assign a probability that each point belongs to each distribution
    • Compute new distribution parameters
  • With Gaussian distributions, EM can be considered a soft K-Means
  • Introduces probability metrics to the toolbox
  • Implements a single round of Expectation Maximization

    / using (l)ikelihood (f)unction, (w)eighted (m)aximum likelihood estimator
    / (f)unction with prior probabilities (p)hi and distribution parameters
    / (THETA), optionally (f)it (p)hi and perform expectation maximization
    em:{[fp;lf;wmf;X;phi;THETA]
     W:prb likelihood[0b;lf;X;phi;THETA]; / weights (responsibilities)
     if[fp;phi:avg each W];               / new phi estimates
     THETA:wmf[;X] peach W;               / new THETA estimates
     (phi;THETA)}
    

Expectation Maximization (EM) Example

  • Initialize multi-variate Gaussian distribution parameters

    q)`X`y set' iris`X`y;
    q)k:count distinct y
    q)phi:k#1f%k
    q)mu:X@\:/:neg[k]?count y
    q)SIGMA:k#enlist X cov\:/: X
    q)lf:.ml.gaussmvl
    q)mf:.ml.wgaussmvmle
    
  • Fit Iris data to the distribution and compare clusters

    q)pT:(.ml.em[1b;lf;mf;X]//) (phi;flip (mu;SIGMA))
    q)p:.ml.imax .ml.likelihood[1b;.ml.gaussmvll;X] . pT
    q)m:.ml.mode each y group p
    q)avg y=m p
    0.8866667
    

Confusion Matrix Example

  • A confusion matrix compares expected vs actual values

    / given true labels y and predicted labels p, return a confusion matrix
    cm:{[y;p]
     n:count u:asc distinct y,p;
     m:./[(n;n)#0;flip (u?p;u?y);1+];
     t:([]y:u)!flip (`$string u)!m;
     t}
    
  • Q tables clearly render the confusion matrix

    q).ut.totals[`TOTAL] .ml.cm[y;m p]
    y              | Iris-setosa Iris-versicolor Iris-virginica TOTAL
    ---------------| ------------------------------------------------
    Iris-setosa    | 50          0               0              50   
    Iris-versicolor| 0           49              1              50   
    Iris-virginica | 0           16              34             50   
                   | 50          65              35             150  
    

Naive Bayes

Naive Bayes Algorithm

  • ‘Naively’ assumes conditional independence between features
  • Does not require iterating
  • Uses weighted maximum likelihood estimator functions along with class labels to build a classifier

    / fit parameters given (w)eighted (m)aximization (f)unction returns a
    / dictionary with prior and conditional likelihoods
    fnb:{[wmf;w;y;X]
     if[(::)~w;w:count[y]#1f];      / handle unassigned weight
     pT:(odds g; (wmf . (w;X@\:) @\:) peach g:group y);
     pT}
    
  • Classifies by using a [log]likelihood function and picking the class with the highest probability density

    / using a [log](l)ikelihood (f)unction and prior probabilities (p)hi and
    / distribution parameters (T)HETA, perform naive Bayes classification
    pnb:{[l;lf;pT;X]
     d:{(x . z) y}[lf]'[X] peach pT[1]; / compute probability densities
     c:imax $[l;log[pT 0]+sum flip d;pT[0]*prd flip d];
     c}
    

Naive Bayes Example

  • Use MLE to associate each class with prior and conditional likelihoods (not probabilities)

    q)flip pT:.ml.fnb[.ml.wgaussmle/:;::;iris.y;iris.X]
    Iris-setosa    | 0.3333333 (5.006 0.121764;3.428 0.140816;1.462 0.029556;0.246 0.0108..
    Iris-versicolor| 0.3333333 (5.936 0.261104;2.77 0.0965;4.26 0.2164;1.326 0.038324)   ..
    Iris-virginica | 0.3333333 (6.588 0.396256;2.974 0.101924;5.552 0.298496;2.026 0.0739..
    
  • Classification picks the class with the maximum likelihood density

    q)avg iris.y=.ml.pnb[0b;.ml.gaussl;pT] iris.X
    .96
    

Decision Tree

Decision Tree Algorithm

  • Recursively splits observations on features, progressively creating branches with more uniform target values
  • Parameters include ‘gain functions’, ‘impurity functions’ and options to control tree depth
dt:{[cgf;ogf;ipf;opt;w;t]
 if[(::)~w;w:n#1f%n:count t];       / compute default weight vector
 if[1=count d:flip t;:(w;first d)]; / no features to test
 opt:(`maxd`minss`minsl`ming`maxff!(0N;2;1;0;::)),opt; / default options
 if[0=opt`maxd;:(w;first d)];    / check if we've reached max depth
 if[identical a:first d;:(w;a)]; / check if all values are equal
 if[opt[`minss]>count a;:(w;a)]; / check if insufficient samples
 d:((neg floor opt[`maxff] count d)?key d)#d:1 _d;   / sub-select features
 d:{.[x isord z;y] z}[(cgf;ogf);(ipf;w;a)] peach d;  / compute gains
 d:(where (any opt[`minsl]>count each last::) each d) _ d; / filter on minsl
 if[0=count d;:(w;a)];          / check if all leaves have < minsl samples
 if[opt[`ming]>=first b:d bf:imax d[;0];:(w;a)]; / check gain of best feature
 c:count k:key g:last b;        / grab subtrees, feature names and count
 / distribute nulls down each branch with reduced weight
 if[c>ni:null[k]?1b;w:@[w;n:g nk:k ni;%;c-1];g:(nk _g),\:n];
 if[(::)~b 1;t:(1#bf)_t];       / don't reuse exhausted features
 b[2]:.z.s[cgf;ogf;ipf;@[opt;`maxd;-;1]]'[w g;t g]; / split sub-trees
 bf,1_b}

Decision Tree Impurity Functions

  • Impurity functions distill the impurity of branches into a single value
  • Classification impurity functions are used for categorical values

    misc:{1f-avg x=mode x}                  / misclassification
    gini:{1f-enorm2 odds group x}           / Gini
    entropy:{neg sum x*log x:odds group x}  / entropy
    
  • The Gini impurity function is used in the CART decision tree algorithm (arguably because it is more efficient than the entropy function which uses the log operator)
  • Regression impurity functions are used for ordered values

    mse:{enorm2[x-avg x]%count x}          / mean squared error
    mae:{avg abs x-avg x}                  / mean absolute error
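
  • For example (values follow from the definitions above, assuming odds returns relative frequencies):

    q).ml.gini "aabb"          / 1 - .25 - .25 = 0.5
    q).ml.entropy "aabb"       / neg sum .5 .5*log .5 .5 = 0.6931472
    q).ml.mse 1 2 3 4f         / population variance of the branch = 1.25
    q).ml.mae 1 2 3 4f         / mean absolute deviation = 1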
    

Decision Tree Projections

  • Many famous decision tree implementations are just parameterizations of .ml.dt
  • Scikit-learn’s CART algorithm treats all features as ordered, requiring users to first use one-hot encoding
  • Ross Quinlan’s C4.5 algorithm differentiates between categorical and ordered features
  • Regression trees use the same algorithm, but use a mean squared error impurity function and predict values with the weighted average of the branch values instead of the weighted mode
aid:dt[sig;oig;wmse]            / automatic interaction detection
thaid:dt[sig;oig;wmisc]         / theta automatic interaction detection
id3:dt[ig;ig;wentropy]          / iterative dichotomizer 3
q45:dt[gr;ogr;wentropy]         / like c4.5
ct:dt[oig;oig;wgini]            / classification tree
rt:dt[oig;oig;wmse]             / regression tree

Decision Tree Example (ID3)

  • To play or not to play

    q)\l weather.q
    q)weather.t
    Play Outlook  Temperature Humidity Wind  
    -----------------------------------------
    No   Sunny    Hot         High     Weak  
    No   Sunny    Hot         High     Strong
    Yes  Overcast Hot         High     Weak  
    Yes  Rain     Mild        High     Weak  
    Yes  Rain     Cool        Normal   Weak  
    No   Rain     Cool        Normal   Strong
    Yes  Overcast Cool        Normal   Strong
    No   Sunny    Mild        High     Weak  
    Yes  Sunny    Cool        Normal   Weak  
    ..
    
  • Categorical features cannot be reused

    q)-1 .ml.ptree[0] .ml.id3[();::] weather.t;
    Yes (n = 14, err = 35.7%)
    |  Outlook :: Overcast: Yes (n = 4, err = 0%)
    |  Outlook :: Rain: Yes (n = 5, err = 40%)
    |  |  Wind :: Strong: No (n = 2, err = 0%)
    |  |  Wind :: Weak: Yes (n = 3, err = 0%)
    |  Outlook :: Sunny: No (n = 5, err = 40%)
    |  |  Humidity :: High: No (n = 3, err = 0%)
    |  |  Humidity :: Normal: Yes (n = 2, err = 0%)
    

Decision Tree Example (CART)

  • Each split is binary
  • Features can be reused
q)d:.ut.part[`train`test!3 1;iris.t.species] iris.t
q)-1 .ml.ptree[0] tr:.ml.ct[();::] d`train;
Iris-setosa (n = 111, err = 66.7%)
|  plength >[;2.45] 0: Iris-setosa (n = 37, err = 0%)
|  plength >[;2.45] 1: Iris-versicolor (n = 74, err = 50%)
|  |  pwidth >[;1.7] 0: Iris-versicolor (n = 39, err = 7.7%)
|  |  |  plength >[;4.95] 0: Iris-versicolor (n = 35, err = 0%)
|  |  |  plength >[;4.95] 1: Iris-virginica (n = 4, err = 25%)
|  |  |  |  pwidth >[;1.55] 0: Iris-virginica (n = 3, err = 0%)
|  |  |  |  pwidth >[;1.55] 1: Iris-versicolor (n = 1, err = 0%)
|  |  pwidth >[;1.7] 1: Iris-virginica (n = 35, err = 2.9%)
|  |  |  plength >[;4.85] 0: Iris-virginica (n = 3, err = 33.3%)
|  |  |  |  swidth >[;3.1] 0: Iris-virginica (n = 2, err = 0%)
|  |  |  |  swidth >[;3.1] 1: Iris-versicolor (n = 1, err = 0%)
|  |  |  plength >[;4.85] 1: Iris-virginica (n = 32, err = 0%)
q)avg d.test.species=.ml.pdt[tr] d`test
0.9487179

Discrete Adaptive Boosting (AdaBoost)

Discrete Adaptive Boosting Algorithm

  • An ensemble of weak learners generates a strong learner
  • Weak learners merely need to classify with an accuracy > 50%
  • Decision stumps are commonly used as weak learners
  • Each additional learner adjusts weights to fit the observations incorrectly classified during the previous round
adaboost:{[tf;cf;w;t]
 if[(::)~w;w:n#1f%n:count t];    / initialize weights
 m:tf[w] t;                      / train model
 p:cf[m] t;                      / make predictions
 e:sum w*not p=y:first flip t;   / compute weighted error
 a:.5*log (c:1f-e)%e;            / compute alpha (minimize exponential loss)
 / w*:exp neg a*y*p;               / increase/decrease weights
 / w%:sum w;                       / normalize weights
 w%:2f*?[y=p;c;e];               / increase/decrease and normalize weights
 (m;a;w)}
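
  • The strong classifier is the sign of the alpha-weighted vote of the weak learners; a minimal sketch, assuming (m)odels and (a)lphas collected from successive adaboost rounds (funq’s .ml.pab additionally evaluates a range of ensemble sizes):

    / (c)lassification (f)unction, weak (m)odels, (a)lphas, (t)able of observations
    vote:{[cf;m;a;t]signum sum a*cf[;t] each m}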

Wisconsin Diagnostic Breast Cancer Data Set

  • Includes mean, max and standard deviation of tumor features

    q)\l wdbc.q
    [down]loading wisconsin-diagnostic-breast-cancer data set
    q)10?wdbc.t
    diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactne..
    -------------------------------------------------------------------------------------..
    B         11.89       18.35        77.32          432.2     0.09363         0.1154   ..
    M         15.13       29.81        96.71          719.5     0.0832          0.04605  ..
    B         12.85       21.37        82.63          514.5     0.07551         0.08316  ..
    M         22.27       19.67        152.8          1509      0.1326          0.2768   ..
    B         9.465       21.01        60.11          269.4     0.1044          0.07773  ..
    B         9.683       19.34        61.05          285.7     0.08491         0.0503   ..
    B         11.54       14.44        74.65          402.9     0.09984         0.112    ..
    B         12.7        12.17        80.88          495       0.08785         0.05794  ..
    M         13.61       24.69        87.76          572.6     0.09258         0.07862  ..
    M         18.05       16.15        120.2          1006      0.1065          0.2146   ..
    
  • Malignant vs. Benign diagnosis feature can be used for binomial classification

Discrete Adaptive Boosting Example

  • Training accuracy improves with extra components

    q)d:.ut.part[`train`test!3 1;0N?] update -1 1 "M"=diagnosis from 11#/:wdbc.t
    q)stump:.ml.ct[(1#`maxd)!1#1]; k:50
    q)m:.ml.fab[k;stump;.ml.pdt] d.train
    q).ut.plt avg d.train.diagnosis = .ml.pab[1+til k;.ml.pdt;m] d.train
    1   | "                   "
    0.98| "                  +"
        | "            +++++++"
    0.96| "       ++++++ +    "
        | "  ++++++ +         "
        | "+++                "
    0.94| " +                 "
    0.92| "                   "
        | "+                  "
    0.9 | "                   "
    
  • Testing accuracy also improves, but not as consistently

    q).ut.plt avg d.test.diagnosis = .ml.pab[1+til k;.ml.pdt;m] d.test
    0.98| "         +       ++"
        | "       ++ +++++    "
    0.96| "++++++++    +  ++++"
        | "   ++++ +    +     "
    0.94| " ++                "
        | "                   "
        | "                   "
    0.92| "+                  "
        | "                   "
    0.9 | "                   "     
    

Boosted Aggregating (BAG) and Random Forest

Boosted Aggregating Algorithm

  • Instead of complicated pruning techniques, fit many independent trees and average the predictions
  • Bootstrap aggregating involves random sampling (with replacement) of the data
  • The difference between Boosted Aggregating and a Random Forest comes down to how the tree is grown, not the ensemble algorithm
  • To increase tree diversity, a Random Forest adds artificial limits on the number of observed features at each tree level
/ generate (n) decision trees by applying (f) to a resampled (with
/ replacement) (t)able
bag:{[n;f;t](f ?[;t]::) peach n#count t} / (b)ootstrap (ag)gregating

/ given an atom or list (k), and bootstrap aggregating (m)odel, make
/ prediction on (d)ictionary
pbag:{[k;m;d]
 if[count[m]<max k;'`length];
 if[98h=type d;:.z.s[k;m] peach d]; / iterate on a table
 p:k {aom x#y}\: pdt[;d] peach m;
 p}

Random Forest Example

  • The decision tree algorithm provides the maxff option which controls the maximum number of features visible at each branch
  • Random Forest classification (maxff = sqrt)

    q)k:20
    q)d:.ut.part[`train`test!3 1;0N?] wdbc.t
    q)m:.ml.bag[k;.ml.q45[(1#`maxff)!1#sqrt;::]] d`train
    q)avg d.test.diagnosis=.ml.pbag[k;m] d`test
    0.9370629
    
  • Random Forest regression (maxff = 1%3)

    q)\l winequality.q
    [down]loading wine-quality data set
    q)d:.ut.part[`train`test!3 1;0N?] winequality.red.t
    q)m:.ml.bag[k;.ml.rt[(1#`maxff)!1#%[;3];::]] d`train
    q).ml.rms d.test.quality-.ml.pbag[k;m] d`test
    0.5682814
    

Linear Regression

Linear Regression Algorithm

  • Q has a native multi-variate lsq operator
  • The implementation is similar to this normal equations function

    / given target matrix Y and data matri(X), return the THETA matrix resulting
    / from minimizing sum of squared residuals
    normeq:{[Y;X]mm[mmt[Y;X]] minv mmt[X;X]} / normal equations ols
    
  • Inverting matrices is slow and numerically unstable
  • So the lsq operator uses Cholesky decomposition to solve for the coefficients directly instead of using LU decomposition to first invert the normal equations matrix.
  • Ridge regression is simply the addition of L2 regularization

    / given (l2) regularization parameter, target matrix Y and data matri(X),
    / return the THETA matrix resulting from performing ridge regression
    ridge:{[l2;Y;X]mm[mmt[Y;X]] minv mmt[X;X]+diag count[X]#l2}
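
  • A quick check of the normal-equations and ridge functions against the built-in lsq (illustrative data; assumes the definitions above live in the .ml namespace):

    q)X:.ml.prepend[1f] enlist x:"f"$til 10  / prepend a row of ones for the intercept
    q)Y:enlist 3f+2f*x                       / exact linear relationship y = 3 + 2x
    q).ml.normeq[Y;X]                        / ~ intercept 3, slope 2
    q)Y lsq X                                / agrees up to floating-point error
    q).ml.ridge[10f;Y;X]                     / L2 penalty shrinks the coefficients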
    

Linear Regression Example

  • Use Box-Muller to generate correlated random variables

    q)plt:.ut.plot[30;15;.ut.c10]
    q)X:(.ml.bm 10000?) each 1 1f
    q)rho:.8                          / correlation
    q)X[1]:(rho;sqrt 1f-rho*rho)$X
    q)plt[sum] X
    4 | "                       .   .  "
      | "                   .   . ..  ."
      | "                  ............"
    2 | "               .............  "
      | "             ....---::-.....  "
      | "          ....-:++++:-.....   "
      | "        ....:=+#%%x+--..      "
    0 | "      ....-=x%@@%x:-...       "
      | "     ....:+#%%%+:-...         "
      | "    ....-+++x=:-... .         "
      | "  ....---::--....             "
    -2| "  ..............              "
      | ". ..........                  "
      | " . ......                     "
    -4| " ..                           "
    
  • Prepending a vector of ones allows us to fit the intercept

    q)show THETA:(-1#X) lsq .ml.prepend[1f] 1#X
    -0.008971004 0.8033666
    

Logistic Regression

Logistic Regression Cost Function

  • Logistic regression applies the Sigmoid function to compress values between 0 and 1: \(S(x)=\frac{1}{1+e^{-x}}\)

    sigmoid:1f%1f+exp neg::                       / sigmoid
    
  • Cost function with L1 and L2 regularization \[ J(\theta) = \frac{1}{m} \left[ \sum_{i=1}^m \text{LogLoss}(y^{(i)},h_\theta(x^{(i)})) \right] + \frac{\lambda_1}{m} \sum_{j=1}^n\left|\theta_j\right| + \frac{\lambda_2}{2m} \sum_{j=1}^n\theta_j^2\]
  • Regularized logistic regression cost function

    / logistic regression cost
    logcost:{[rf;Y;X;THETA]
     J:(1f%m:count X 0)*revo[sum] logloss[Y] plog[X;THETA];    / cost
     if[count rf,:();THETA[;0]:0f; J+:sum rf[;m][;0][;THETA]]; / regularization
     J}
    

Logistic Regression Gradient Function

  • Gradient function with L1 and L2 regularization \[G(\theta_j)=\frac{\partial J(\theta)}{\partial \theta_j}= \frac{1}{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)} + \frac{\lambda_1}{m} \text{sign}(\theta_j) + \frac{\lambda_2}{m} \theta_j \]
  • Regularized logistic regression gradient function

    / logistic regression gradient
    loggrad:{[rf;Y;X;THETA]
     G:(1f%m:count X 0)*mmt[sigmoid[mm[THETA;X]]-Y] X:prepend[1f] X; / gradient
     if[count rf,:();THETA[;0]:0f; G+:sum rf[;m][;1][;THETA]]; / regularization
     G}
    

Logistic Regression Example

  • Regularization choices have been factored out

    l1:{[l;m]((l%m)*revo[sum] abs::;(l%m)*signum::)}
    l2:{[l;m]((.5*l%m)*revo[sum] {x*x}::;(l%m)*)}
    
  • Elastic net is simply a combination of the two

    enet:{[a;lr](l1 a*lr;l2 a*1f-lr)}
    
  • Regularized gradient descent prevents over-fitting

    q)t:11#/:update "M"=diagnosis from wdbc.t
    q)d:.ut.part[`train`test!3 1;0N?] "f"$t
    q)y:first get first `Y`X set' 0 1 cut value flip d`train
    q)yt:first get first `Yt`Xt set' 0 1 cut value flip d`test
    q)THETA:10000 .ml.gd[.1;.ml.loggrad[.ml.l2[.1];Y;X]]/ enlist (1+count X)#0f
    q)avg yt="i"$p:first .ml.plog[Xt] THETA
    0.8601399
    q).ut.totals[`TOTAL] .ml.cm . "i"$(yt;p)
    y| 0  1  TOTAL
    -| -----------
    0| 83 7  90   
    1| 13 40 53   
     | 96 47 143
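
  • The .ml.gd step used above is assumed to be plain batch gradient descent; a minimal sketch with the same calling convention (learning rate, gradient function, parameters):

    / move the parameters THETA against the gradient, scaled by the learning rate a
    gdstep:{[a;gf;THETA]THETA-a*gf THETA}
    / e.g. 10000 gdstep[.1;.ml.loggrad[.ml.l2[.1];Y;X]]/ enlist (1+count X)#0f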
    

Neural Network Classification/Regression

Neural Network Algorithm

  • Conceptually similar to applying multiple rounds of linear and/or logistic regression in sequence
  • But our choice of activation function and cost function can vary
  • As can the number (and size) of hidden layers
  • Gradient descent is once again the workhorse
  • Parameter adjustments are computed by applying the chain rule to the final error and intermediate gradients
/ (r)egularization (f)unction, (n)etwork topology dimensions, hgolf:
/ (h)idden (g)radient (o)utput (l)oss functions
nncostgrad:{[rf;n;hgolf;Y;X;theta]
 THETA:nncut[n] theta;          / unroll theta
 ZA:enlist[(X;X)],(X;X) {(z;x z:plin[y 1;z])}[hgolf`h]\ -1_THETA;
 P:hgolf[`o] plin[last[ZA]1;last THETA];      / prediction
 J:(1f%m:count X 0)*revo[sum] hgolf[`l][Y;P]; / cost
 G:hgolf[`g]@'`z`a!/:1_ZA;                    / activation gradient
 D:reverse{[D;THETA;G]G*1_mtm[THETA;D]}\[E:P-Y;reverse 1_THETA;reverse G];
 G:(1f%m)*(D,enlist E) mmt' prepend[1f] each ZA[;1]; / full gradient
 if[count rf,:();THETA[;;0]:0f;JG:rf[;m][;;THETA];J+:sum JG@'0;G+:sum JG@'1];
 (J;2 raze/ G)}
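
  • Prediction is just forward propagation through the fitted layers; a hedged sketch of what .ml.pnn (used in the example below) could look like, following the plin and hgolf conventions above:

    pnnsketch:{[hgolf;X;THETA]
     if[11h=type value hgolf;hgolf:get each hgolf]; / dereference symbols, as link does
     f:{[h;a;T]h plin[a;T]}hgolf`h;                 / one hidden-layer step: activate the linear output
     hgolf[`o] plin[f/[X;-1_THETA];last THETA]}     / fold over the hidden layers, then apply the output layer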

MNIST Hand Written Digits

  • Each digit is represented by a 28 x 28 pixel image with values ranging between 0 and 255

    q)\l mnist.q
    [down]loading handwritten-digit data set
    q)`X`Xt`y`yt set' mnist`X`Xt`y`yt;
    q)X:1000#'X;y:1000#y;
    q)X%:255f;Xt%:255f
    
  • The plotting utility allows us to visualize the images

    q)plt:value .ut.plot[28;14;.ut.c10;avg] .ut.hmap flip 28 cut
    q)-1 (,'/) plt each X@\:/: -3?count X 0;
        
        
        
                                                                                          :-   
                         .=#%@@x.          .    :=:=%@#-.-::x%:                         -%@%   
                       :x@%+--#+.              =@@@@@@@@@@@@@@:              .=:       =@%-    
                    -+%#=. .::               -%@@#=--:---:x@@@-             :%%-     .#@=      
                   x@%- -:x@@%.              .xx:      :%@@@+.             #@@=--...-@@:       
                   =@@%@@@@@x                        .x@@@x.             +@@@@@@%%%@@@=        
                        :@@=                       .x@@@#-              #@#:-     :%@:         
                       :@@+                      .x@@@#-                         =%@-          
                      +@@+                      x@@@%-                          -#x.           
                     x@@:                     +@@@%:                                           
                     -:.                     -:::-                                             
    

Neural Network Example

  • 100 iterations of conjugate gradient minimization (fmincg)

    q)Y:.ml.diag[(1+max y)#1f]@\:y
    q)n:0N!"j"$.ut.nseq[2;count X;count Y]
    784 397 10
    q)theta:2 raze/ THETA:.ml.glorotu'[1+-1_n;1_n];
    q)rf:.ml.l2[1f];
    q)hgolf:`h`g`o`l!`.ml.sigmoid`.ml.dsigmoid`.ml.sigmoid`.ml.logloss
    q)theta:first .fmincg.fmincg[100;.ml.nncostgrad[rf;n;hgolf;Y;X];theta];
    Iteration 100 | cost: 0.5257291
    q)avg yt=p:.ml.imax .ml.pnn[hgolf;Xt] .ml.nncut[n] theta
    0.8722
    
  • The confusion matrix is intuitive

    q).ut.totals[`TOTAL] .ml.cm[yt;"i"$p]
    y| 0    1    2    3   4   5   6   7    8   9    TOTAL
    -| --------------------------------------------------
    0| 942  0    6    1   0   10  13  5    3   0    980  
    1| 0    1096 4    1   0   3   2   1    27  1    1135 
    2| 11   13   899  13  9   3   19  16   40  9    1032 
    3| 3    3    37   785 3   107 4   16   42  10   1010 
    4| 1    4    5    1   842 1   20  1    10  97   982  
    5| 13   5    10   30  20  730 18  12   34  20   892  
    6| 12   4    21   0   9   30  877 1    4   0    958  
    7| 4    17   30   13  8   7   1   915  0   33   1028 
    8| 21   14   6    17  15  43  21  16   793 28   974  
    9| 13   5    5    17  43  13  1   60   9   843  1009 
     | 1020 1161 1023 878 949 947 976 1043 962 1041 10000
    

Content-Based/Collaborative Filtering (Recommender Systems)

Generating Personal Movie Ratings

  • Loading the MovieLens data set

    q)\l mlens.q
    [down]loading latest movielens data set
    "unzip -n ml-latest-small.zip"
    
  • Append personal ratings

    q)r:1!select `mlens.movie$movieId,rating:0n from mlens.movie
    q)r,:([]movieId:173 208 260 435 1197 2005 1968i;rating:.5 .5 4 .5 4 4 4f)
    q)r,:([]movieId:2918 4006 53996 69526 87520 112370i;rating:5 5 4 4 5 5f)
    q)select movieId,rating,movieId.title from r where not null rating
    movieId rating title                                
    ----------------------------------------------------
    173     0.5    "Judge Dredd"                        
    208     0.5    "Waterworld"                         
    260     4      "Star Wars: Episode IV - A New Hope" 
    435     0.5    "Coneheads"                          
    1197    4      "Princess Bride, The"                
    1968    4      "Breakfast Club, The"                
    2005    4      "Goonies, The"                       
    2918    5      "Ferris Bueller's Day Off"           
    4006    5      "Transformers: The Movie"            
    53996   4      "Transformers"                       
    69526   4      "Transformers: Revenge of the Fallen"
    87520   5      "Transformers: Dark of the Moon"     
    112370  5      "Transformers: Age of Extinction"    
    

Content-Based Filtering Fitting

  • Assume that each movie can be described by a list of boolean flags indicating which genres the movie belongs to
  • Building a matrix of these flags allows us to use linear regression

    q)Y:value[r]1#`rating
    q)X:"f"$flip genre in/: value[mlens.movie]`genres
    q)theta:first 0N!THETA:(1;1+count X)#0f
    ,0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0f
    q)rf:.ml.l2[.1]
    q)theta:first .fmincg.fmincg[20;.ml.lincostgrad[rf;Y;X];theta]
    Iteration 20 | cost: 8.972662e-05
    
  • Theta coefficients describe my genre preferences

    q){(5#x),-5#x}desc genre!1_theta
    1980    | 1.772478
    2010    | 1.546712
    1970    | 0.6938292
    Comedy  | 0.408703
    Sci-Fi  | 0.4022778
    Children| -0.2086186
    Romance | -0.2254342
    Drama   | -0.4333146
    Fantasy | -0.8020893
    1990    | -2.581026
    

Content-Based Filtering Prediction

  • Scores are generated by multiplying my preferences by the genres

    q)r:update score:first .ml.plin[X;enlist theta] from r
    
  • Highest recommendations actually score higher than 5

    q)select[5;>score] movieId,score,movieId.title from r
    movieId score    title                             
    ---------------------------------------------------
    85261   5.680565 "Mars Needs Moms"                 
    92681   5.521147 "Journey 2: The Mysterious Island"
    81564   5.442022 "Megamind"                        
    4121    5.441443 "Innerspace"                      
    8633    5.441443 "Last Starfighter, The"           
    
  • Highest recommendations ‘check all the boxes’

    q)select[5;>score] movieId.genres from r
    genres                                                             
    -------------------------------------------------------------------
    `genre$`Action`Adventure`Animation`Children`Comedy`Sci-Fi`IMAX`2010
    `genre$`Action`Adventure`Comedy`Sci-Fi`IMAX`2010                   
    `genre$`Action`Animation`Children`Comedy`Sci-Fi`IMAX`2010          
    `genre$`Action`Adventure`Comedy`Sci-Fi`1980                        
    `genre$`Action`Adventure`Comedy`Sci-Fi`1980                        
    

User-User Collaborative Filtering

  • Build a user/movie matrix and demean each user to remove biases

    q)n:20
    q)m:exec distinct movieId from rating where n<(count;i) fby movieId
    q)R:value exec (movieId!rating) m by userId from rating where movieId in m
    q)R,:r[([]movieId:m);`rating]
    q)U:R-au:avg each R
    
  • Recommendations are generated using KNN to average the ratings of the k ‘closest’ users

    q)k:30
    q)p:last[au]+.ml.fknn[1f-;.ml.cordist\:;k;U;0f^U] 0f^last U
    q)`score xdesc update score:p,movieId.title from ([]movieId:m)#r
    movieId| rating score    title                      
    -------| -------------------------------------------
    1967   |        5.117033 "Labyrinth"                
    475    |        4.761649 "In the Name of the Father"
    1276   |        4.470915 "Cool Hand Luke"           
    2067   |        4.470915 "Doctor Zhivago"           
    2968   |        4.421184 "Time Bandits"             
    2918   | 5      4.403138 "Ferris Bueller's Day Off" 
    908    |        4.346668 "North by Northwest"       
    71535  |        4.279887 "Zombieland"               
    8874   |        4.279887 "Shaun of the Dead"        
    

Item-Item Collaborative Filtering

  • User ratings are sparse and subject to frequent changes
  • Item ratings have more support and are more stable
  • Item (instead of user) biases are removed

    q)I-:ai:avg each I:flip R
    
  • Recommendations are generated using KNN to average our own ratings of each movie’s k most similar movies

    q)D:((0^I) .ml.cosdist\:) peach 0^I
    q)p:ai+.ml.knn[1f-;k;last each I] D
    q)`score xdesc update score:p,movieId.title from ([]movieId:m)#r
    movieId| rating score    title                                            
    -------| -----------------------------------------------------------------
    912    |        5.39     "Casablanca"                                     
    926    |        5.379167 "All About Eve"                                  
    908    |        5.334211 "North by Northwest"                             
    356    |        5.314134 "Forrest Gump"                                   
    1262   |        5.277907 "Great Escape, The"                              
    1207   |        5.27069  "To Kill a Mockingbird"                          
    720    |        5.242593 "Wallace & Gromit: The Best of Aardman Animation"
    899    |        5.224468 "Singin' in the Rain"                            
    1247   |        5.213291 "Graduate, The"                                  
    1148   |        5.185714 "Wallace & Gromit: The Wrong Trousers"           
    

Matrix Decomposition Collaborative Filtering Algorithm

  • The null-riddled rating matrix cannot be decomposed with SVD
  • Alternatives include Gradient Descent, Stochastic Gradient Descent and Alternating Least Squares (ALS).
  • ALS elegantly transforms a non-convex problem into an iterative quadratic problem
/ ALS-WR (a)lternating (l)east (s)quares with (w)eighted (r)egularization
alswr:{[l2;Y;XTHETA]
 X:flip f2nd[wridge[l2;;XTHETA 1]] Y; / hold THETA constant, solve for X
 THETA:flip wridge[l2;;X] peach Y;    / hold X constant, solve for THETA
 (X;THETA)}
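
  • Once fitted, predicted ratings are the inner products of the user and item factor columns; .ml.pcf (used in the example below) is assumed to be equivalent to this sketch:

    pcfsketch:{[X;THETA]mtm[THETA;X]}  / users x items matrix of predicted ratings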

Matrix Decomposition Collaborative Filtering Example

  • Perform regularized ALS until the percentage change in cost falls below the supplied threshold

    q)nf:10;
    q)n:(ni:count U 0;nu:count U)
    q)XTHETA:(X:-1+ni?/:nf#1f;THETA:-1+nu?/:nf#2f)
    q)XTHETA:first .ml.iter[1;.0001;.ml.cfcost[();U] .;.ml.alswr[.01;U]] XTHETA
    iter: 144 | cost: 8.349892 | pct: 9.952313e-05
    
  • Recommendations are generated by combining item and user exposure matrices

    q)P:au+.ml.pcf . XTHETA
    q)show t:`score xdesc update score:last P,movieId.title from ([]movieId:m)#r
    movieId| rating score    title                                                
    -------| ---------------------------------------------------------------------
    4232   |        9.907802 "Spy Kids"                                           
    102903 |        9.573078 "Now You See Me"                                     
    3869   |        8.985205 "Naked Gun 2 1/2: The Smell of Fear, The"            
    265    |        8.948548 "Like Water for Chocolate (Como agua para chocolate)"
    84954  |        8.573156 "Adjustment Bureau, The"                             
    116823 |        8.445161 "The Hunger Games: Mockingjay - Part 1"              
    53121  |        8.140992 "Shrek the Third"                                    
    6942   |        7.996043 "Love Actually"                                      
    

Google PageRank

Google PageRank Algorithm

  • The PageRank algorithm relies on page links (not content)
  • Links (like citations) provide evidence of relative importance
  • The PageRank algorithm adds ‘random surfing’

    / given a (d)amping factor (1 - the probability of random surfing) and the
    / (A)djacency matrix, create the Markov Google matrix
    google:{[d;A]
     M:A%1f|s:sum each A;           / convert to Markov matrix
     M+:(0f=s)%n:count M;           / add links to dangling pages
     M:(d*M)+(1f-d)%n;              / dampen
     M}
    
  • The Google Matrix is a transition matrix with random surfing

    q).ml.google[.85] (10110b;11001b;01110b;00111b;00000b)
    0.3133333 0.03      0.3133333 0.3133333 0.03     
    0.3133333 0.3133333 0.03      0.03      0.3133333
    0.03      0.3133333 0.3133333 0.3133333 0.03     
    0.03      0.03      0.3133333 0.3133333 0.3133333
    0.2       0.2       0.2       0.2       0.2      
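
  • The PageRank vector is the fixed point of repeatedly applying this matrix; a minimal dense sketch (the example below uses the sparse .ml.pageranks instead):

    q)M:.ml.google[.85] (10110b;11001b;01110b;00111b;00000b)
    q)r:{x$y}[;M] over n#1f%n:count M   / power iteration from a uniform start; r sums to 1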
    

Google PageRank Example

  • Generate page links

    q)i:1 1 2 2 3 3 3 4 6
    q)j:2 6 3 4 4 5 6 1 1
    q)l:(i;j)
    
  • Enumerate links and create a sparse matrix

    q)node:asc distinct raze l
    q)l:node?l
    q)show S:(1 2#1+max over l), .ml.prepend[1f] l
    6 6
    1 1 1 1 1 1 1 1 1f
    0 0 1 1 2 2 2 3 5
    1 5 2 3 3 4 5 0 0
    
  • Iteratively apply transition matrix until convergence

    q)d:.85                        / damping factor, as in the Google matrix example above
    q)node[i]!r i:idesc r:.ml.pageranks[d;S] over r:n#1f%n:S[0;0]
    1| 0.3210169
    6| 0.200744
    2| 0.170543
    4| 0.1367926
    3| 0.1065916
    5| 0.0643118
    

Sparklines

Sparkline Algorithm

A sparkline is a small, intense, simple, word-sized graphic with typographic resolution. – Edward Tufte

  • Bucket all values into 8 bins
  • Map bins to 8 Unicode characters
/ allocate x into n bins
nbin:{[n;x](n-1)&floor n*.5^x%max x-:min x}

/ generate unicode sparkline
spark:raze("c"$226 150,/:129+til 8)nbin[8]::

Sparkline Example

  • Load sample DJIA data set

    q)\l dji.q
    [down]loading dji data set
    q)dji.t
    quarter stock date       open  high  low   close volume    percent_change_price percent_change_vo..
    -------------------------------------------------------------------------------------------------..
    1       AA    2011.01.07 15.82 16.72 15.78 16.42 239655616 3.79267                               ..
    1       AA    2011.01.14 16.71 16.71 15.64 15.97 242963398 -4.42849             1.380223         ..
    1       AA    2011.01.21 16.19 16.38 15.6  15.79 138428495 -2.47066             -43.02496        ..
    1       AA    2011.01.28 15.87 16.63 15.82 16.13 151379173 1.63831              9.3555           ..
    1       AA    2011.02.04 16.18 17.39 16.18 17.14 154387761 5.93325              1.987452         ..
    ..
    
  • Use q-sql to conveniently access close and stock columns

    q)-1@'10#exec ((4$string stock 0),": ",.ut.spark close) by stock from dji.t;
    AA  : ▅▄▃▄▇▇▇▅▅▄▄▆▇█▅▆▆▇▆▄▅▃▂▁▂
    AXP : ▁▃▃▁▁▄▃▁▁▁▁▃▂▃▃▄▆▇▇██▇▅▆▆
    BA  : ▁▁▂▁▂▃▃▃▃▂▁▄▄▄▃▅███▇▆▅▃▄▂
    BAC : ▇█▇▆▇██▇▇▇▆▅▅▆▄▄▃▄▃▂▂▂▁▁▁
    CAT : ▁▁▁▂▃▄▅▄▄▃▅▆█▇▆▆█▇▅▅▅▃▂▂▃
    CSCO: ▇█▇▇█▅▅▅▄▄▃▃▃▄▃▃▃▃▃▂▂▂▁▁▁
    CVX : ▁▁▂▁▃▃▄▅▆▄▆▇██▇██▆▅▅▆▅▄▄▃
    DD  : ▂▂▁▂▄▆█▆▆▅▅▆▇▇▇██▆▅▅▄▂▂▂▄
    DIS : ▃▃▃▂▅█████▅██▆▆▇██▆▆▆▃▂▁▁
    GE  : ▂▂▅▆▆██▇▆▆▃▅▆▆▅▅▆▅▅▄▄▂▁▂▁
    

Thank You

Never, ever underestimate the importance of having fun – Randy Pausch