Example: Best Subset Selection

By way of introduction consider the best subset selection problem. By this I mean minimizing the residual sum of squares subject to a cardinality constraint on the support of the regression coefficients. Lets say that the zero norm of the regression coefficients must be less than an integer k, for which I will assign 5 for this problem.

Generate Some Data

Lets generate some data with which to illustrate the application of sss to the best subset selection problem.

(ns best-subset.core
  (:require [clojure.core.matrix :refer :all]
            [distributions.core :refer :all]
            [sss4clj.core :refer :all]
            [clojure.core.matrix.linear :as la]))


(def n 200)
(def p 20)
(def X (reshape (sample (normal 0 1) (* n p)) [n p]))
(def B (matrix (for [i (range 0 p)] (if (< i 4) i 0))))
(def Y (add (mmul X B) (sample (normal 0 1) n)))

Note that sss requires that we sample a new state, for which the probability of being sampled is some function of the objective function. In this context we want models with a low residual sum of squares to have a high probability of being sampled. For this reason we take the objective function to the exponential of the negative RSS with some scaling. We also set the probability to zero if the cardinality constraint is broken.

(defn RSS [g]
  (if (empty? g)
    (dot Y Y)
    (let [p-g (count g)
          X-g (transpose (reshape (flatten (map #(mget (columns X) %) g)) [p-g n]))
          proj (mmul X-g (la/solve (mmul (transpose X-g) X-g)) (transpose X-g))
          orth-proj (sub (identity-matrix n) proj)]
      (dot Y (mmul orth-proj Y)))))

(def k 5)

(defn objective [g]
  (if (> (count g) k) 0 (exp (negate (/ (RSS g) n)))))


(def Omega (into #{} (range 0 p)))
(def top-subsets (run-sss objective #{} Omega 10 5))

Note Omega is the the set containing integers from 0 to p-1 corresponding to all possible values of the active subset of regression coefficients. In order to run sss we specify the previously defined objective, an initial set (the emtpy set in this example), Omega, the number of iterations to run and the top number of models discovered. The latter is only really an issue for long runs with memory constraints. If you want to see all the models visited, then just set this to be a really large interger.

{#{1 4 3 2 16} 0.36936833972859484, #{1 3 12 2 9} 0.36940132623186667, #{1 6 3 12 2} 0.37143277613460907, #{1 3 12 2 5} 0.3772822611159165, #{1 3 12 2 16} 0.37729521913614006}

Copyright © 2017 Michael Lindon

Distributed under the Eclipse Public License either version 1.0