Parsing the MNIST Data Set
Recap of Competition Rules
- Convert a self-describing vector of bytes into an n-dimensional array of the specified type
- Points given for:
- Fastest submission
- Smallest submission
- Shortest submission
- Tie-breaker based on timing of first submission
Complete competition rules can be found here.
Parsing Data
text | binary | |
---|---|---|
read file | read0 |
read1 |
parse file or vector | 0: |
1: |
cast single element | TYPE $ cvec |
0x0 sv bvec |
fixed width | (width;type) 0: cvec |
(width;type) 1: bvec (big endian) |
(type;width) 1: bvec (little endian) |
||
delimited | (type;delim) 0: cvec |
Understanding the Header
- Bytes 1 and 2 are empty (presumably reserved for versioning)
- Byte 3 indicates the data type ([un]signed byte, short, int, real, float)
- Byte 4 specifies the dimensionality of th n-array
- The next n integers (n*4 bytes) list the sizes of each dimension
- Remaining bytes contain the data
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
00 | 00 | 09 | 02 | 00 | 00 | 00 | 02 | 00 | 00 | 00 | 10 |
Parsing the Header
ldidx:{
d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x;
h
is the length of the headerd
is the vector of dimensions
Parsing the Data
ldidx:{
d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x;
x:first ((1 1 0N 2 4 4 8;"xx hief")@\:(),x[2]-0x08) 1: h _x;
x
now has a single vector of data cast to the type specified in byte 3- There is no parse-time syntax for specifying a single element
literal list so we need an extra operation:
(),
Reshaping the Data
- For 1 dimension we can just return the data
-
For 2 dimensions we can use
cut
or#
q)10 cut til 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 q)2 10 # til 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
-
What about 3+ dimensions?1
shape:{y {y cut x}/reverse 1_x,()}
q)shape[2 10] til 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 q)shape[2 2 5] til 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
- Permit atomic argument by promoting
x
to list -
But what about the edge conditions? What would q do?
q)2 5 # til 20 0 1 2 3 4 5 6 7 8 9 q)shape[2 5] til 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
- We need to extend/truncate the initial vector before reshaping it.
shape:{(prd[x]#y){y cut x}/reverse 1_x,()}
k)shape:{((*/x)#y){y#x}/0N,'|1_x,()}
q)0N!shape[2 2 5] til 10;
((0 1 2 3 4;5 6 7 8 9);(0 1 2 3 4;5 6 7 8 9))
Fast Solutions
-
Basic Solution
ldidx:{ d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x; x:first ((1 1 0N 2 4 4 8;"xx hief")@\:(),x[2]-0x08) 1: h _x; x:((prd[d])#x){y cut x}/reverse 1_d; x}
But why call
1:
to convert bytes to bytes? Just return the vector! -
Fastest (and Smallest) Solution
ldidx:{ d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x; x:$[0>i:x[2]-0x0b;::;first ((2 4 4 8;"hief")@\:i,()) 1:] h _x; x:((prd[d])#x){y cut x}/reverse 1_d; x}
Short Solutions
-
Function calls (including lambdas) only add two bytes (no matter what is inside the function/lambda)
ldidx:{ {d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x; x:$[0>i:x[2]-0x0b;::;first ((2 4 4 8;"hief")@\:i,()) 1:] h _x; x:((prd[d])#x){y cut x}/reverse 1_d; x}x}
q)count first get ldidx 5
Shortest Solution
-
Given a composition,
get
returns the list of composed functionsq)f:{}@ q)get f @ {} q)count first get f 1
ldidx:{ d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x; x:$[0>i:x[2]-0x0b;::;first ((2 4 4 8;"hief")@\:i,()) 1:] h _x; x:((prd[d])#x){y cut x}/reverse 1_d; x}@
Runner-Up Solution
-
Skipped calling 1: when dimension was greater than 2
k)ldidx:{ r:,/(2 1 1 4;" xxi")1:8#x; f:([f:0x08090B0C0D0E]t:"xxhief";l:1 1 2 4 4 8)@r 0; x:8_x; $[0i~I:r 2;:(f`t)$();0x1~r 1; :((,f`l;,f`t)1:(I*f`l)#x)0;0x2~r 1; [M:((,4;,"i")1:4#x)[0;0];R:((,f`l;,f`t)1:(M*I*f`l)#x:4_x)0;:(M,I)#R]]; O:,/(4 4;"ii")1:8#x; (O#)'[I#((+[prd O])\[7h$I;0])_x:8_x]}
Winning Solution
-
1 byte-code and least bytes allocated
ldidx:{ {(prd[x]#y){y#x}/0Ni,'reverse 1_x}[0x0 sv'0N 4#x 4+til 4*x 3;first(1 2 4 4 8;"xhief")[;enlist 0|-10+x 2]1:(4*1+x 3)_ x]}@
-
Combining the Best Ideas and kdb+ 3.4’s native multi-dimensional
#
ldidx:{ d:first (1#4;1#"i") 1: 4_(h:4*1+x 3)#x; x:d#$[0>i:x[2]-0x0b;::;first ((2 4 4 8;"hief")@\:i,()) 1:] h _x; x}@
-
kdb+ 3.4 was updated to include a multi-dimensional
#
operator (specifically for this competition) ↩