Before the rise of (deep) neural networks, the Kernel trick was an important approach in feature engineering: it lets us ‘define similarity’ directly and is therefore powerful for extracting features. Built on the Reproducing Kernel Hilbert Space (RKHS), the reproducing property and the representer theorem make the Kernel trick widely applicable to various models.
Theory of RKHS
A Hilbert space $\mathcal{H}$, as introduced in quantum mechanics, is a complete space equipped with an inner product ($\mathcal{H}$ could be infinite-dimensional). Examples include
- Vector space $\mathbb{R}^n$ with $\langle x,y\rangle= x^\top y$
- $L_2$ space of functions $\langle f,g\rangle =\int f(x)g(x)\,\mathrm{d}x$
Reproducing Kernel
A reproducing kernel $K_\mathcal{H}(\, \cdot \, ,\, \cdot \,):\mathcal{X}\times\mathcal{X}\to \mathbb{R} $ on a Hilbert space $\mathcal{H}$ satisfies:
- Symmetry: \(\begin{align} K(x,y)=K(y,x) \end{align}\)
- Positive semi-definiteness: \(\begin{align} \sum_{i,j=1}^nK(x_i,x_j)a_ia_j\geq 0,\quad \forall \{x_i\}_{i=1}^n,\{a_i\}_{i=1}^n,\quad \forall n\in\mathbb{Z}^+ \end{align}\)
- Reproducing property: \(\begin{align} f(x)=\langle K(x,y ), f(y ) \rangle \end{align}\) we will see that a direct consequence is \(\begin{align} K(x,y)=\langle K(x,\xi ),K(\xi ,y)\rangle \end{align}\) A numerical check of the first two properties is sketched right after this list.
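A minimal numerical sanity check of symmetry and positive semi-definiteness, assuming an RBF kernel $K(x,y)=\exp(-\Vert x-y\Vert^2/\sigma^2)$ evaluated on random points (this is only an illustration, not part of the theory):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 sample points in R^3
sigma = 1.0

# Gram matrix K_ij = K(x_i, x_j) for the RBF kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma**2)

# Symmetry: K(x_i, x_j) = K(x_j, x_i)
assert np.allclose(K, K.T)

# Positive semi-definiteness: a^T K a >= 0 for every coefficient vector a,
# equivalently all eigenvalues of the Gram matrix are >= 0 (up to round-off)
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10
print("min eigenvalue:", eigvals.min())
```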
Construct RKHS
With a Kernel $K(x,y)$ given, we could construct a corresponding RKHS $\mathcal{H}$ (and its inner product) as follows. Here we temporarily consider the finite/countable case; the continuous case is similar with $\sum \to\int$
Note that for each fixed $y\in\mathcal{X}$, $K(x,y)$ is a function of $x$, denoted $\phi_y(x)=K(x,y)$ (write $\phi_i(x):=\phi_{y_i}(x)$ for short). For a function in $\mathrm{span}\{\phi_i(x)\}$ \(\begin{align} \mu (x)=\sum_{i:y_i\in\mathcal{X}} \mu _i\phi_i(x) \end{align}\)
an inner product that makes $K(\, \cdot \, ,\, \cdot \, )$ a reproducing kernel must satisfy \(\begin{align} \mu (y)=&\langle K(x,y ), \mu (x) \rangle\\ =&\left\langle \phi _{y}(x), \sum_{i} \mu _i\phi_i(x) \right\rangle\\ =&\sum_{i}\mu _i\langle \phi _{y}(x), \phi_i(x) \rangle =\sum_{i} \mu _i\phi_i(y) \end{align}\)
i.e. the inner product should be set so that \(\begin{align} K(y,y_i)=\phi_i(y) = \langle \phi _{y}(x), \phi_i(x) \rangle=\langle K(y,x),K(x,y_i)\rangle \end{align}\)
We can also see that the RKHS induced by $K(x,y)$ consists of expansions in the eigenfunctions of $K(x,y)$ with finite RKHS norm, i.e. \(\begin{align} \mathcal{H}=\left\{ f(x)=\sum_{i=1}^\infty f_i\phi_i(x): \sum_{i=1}^\infty \dfrac{ f_i^2 }{ \lambda _i } <\infty \right\},\quad \int K(x,y)\phi_i(y)\,\mathrm{d}y=\lambda _i\phi_i(x) \end{align}\)
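A discrete analogue can be seen on a finite Gram matrix: its spectral decomposition $K=V\Lambda V^\top$ mirrors the eigen-expansion above, and for $f=K\alpha$ (a function in the span of the $K(\,\cdot\,,x_i)$) the RKHS norm satisfies $\alpha^\top K\alpha=\sum_i c_i^2/\lambda_i$ with $c=V^\top f$. A minimal sketch, assuming an RBF kernel on well-spread random points so the eigenvalues stay away from machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
X = 2.0 * rng.normal(size=(20, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)                     # RBF Gram matrix with sigma = 1

lam, V = np.linalg.eigh(K)                # K = V @ diag(lam) @ V.T
assert np.allclose(K, (V * lam) @ V.T)

alpha = rng.normal(size=20)
f = K @ alpha                             # f(x_j) = sum_i alpha_i K(x_j, x_i)
c = V.T @ f                               # coefficients in the eigenbasis

norm_sq_direct = alpha @ K @ alpha        # ||f||^2 in the RKHS
norm_sq_eigen = np.sum(c**2 / lam)        # sum_i f_i^2 / lambda_i
print(norm_sq_direct, norm_sq_eigen)      # the two agree up to round-off
```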
Representer Theorem for RKHS
With a Kernel $K(\, \cdot \, ,\, \cdot \, )$ and its corresponding RKHS $\mathcal{H}$ defined, we could write a loss + $\ell_2$ penalty optimization problem in $\mathcal{H}$: \(\begin{align} \mathop{\arg\min}\limits_{f\in\mathcal{H}}\sum_{i=1}^N\mathcal{L}\left( y_i,f(x_i) \right)+\dfrac{\lambda }{2}\left\Vert f \right\Vert ^2,\qquad \Vert\, \cdot \, \Vert^2:=\langle\, \cdot \, ,\, \cdot \, \rangle \end{align}\)
using the basis generated by the Kernel, $\phi_y(x)=K(x,y)$, the problem could be expressed using $f(\xi )=f_0+\sum_{j=1}^\infty f_j\phi_j(\xi )$ \(\begin{align} \sum_{i=1}^N\mathcal{L}\left( y_i,f(x_i) \right)+\dfrac{\lambda }{2}\left\Vert f \right\Vert ^2=&\sum_{i=1}^N\mathcal{L}\left( y_i,f_0+\sum_{j=1}^\infty f_j\phi_j(x_i) \right)+\dfrac{\lambda }{2}\left\langle \sum_{j=1}^\infty f_j\phi_j(\xi ), \sum_{j=1}^\infty f_j\phi_j(\xi )\right\rangle\\ =&\sum_{i=1}^N\mathcal{L}\left( y_i,f_0+\left\langle\sum_{j=1}^\infty f_j \phi _j(\xi ),\phi _{x_i}(\xi ) \right\rangle \right)+\dfrac{\lambda }{2}\left\langle \sum_{j=1}^\infty f_j\phi_j(\xi ), \sum_{j=1}^\infty f_j\phi_j(\xi )\right\rangle \end{align}\)
We could decompose $f(\xi )-f_0=\sum_{j=1}^\infty f_j\phi_j(\xi )$ into two parts: a part in the subspace \(\mathrm{span}\left\{ \phi _{x_i}(\xi ) \right\}_{i=1}^N\) (denoted $\parallel$) and a part in the complementary space (denoted $\perp$), i.e. the component whose inner product with every $\phi_{x_i}$ vanishes \(\begin{align} f(\xi )=f_0+\sum_{j=1}^\infty f_j\phi_j(\xi )=&f_0+\sum_{j:\parallel}f_j\phi _{j}(\xi )+\sum_{j:\perp } f_j\phi_j(\xi )\\ =&f_0+f_{\parallel}(\xi )+f_{\perp}(\xi ) \end{align}\)
Naturally we could obtain that
- Loss part: \(\begin{align} \left\langle\sum_{j=1}^\infty f_j \phi _j(\xi ),\phi _{x_i}(\xi ) \right\rangle=\left\langle f_{\parallel}(\xi ),\phi _{x_i}(\xi ) \right\rangle \end{align}\)
- Penalty part: \(\begin{align} \left\langle \sum_{j=1}^\infty f_j\phi_j(\xi ), \sum_{j=1}^\infty f_j\phi_j(\xi )\right\rangle =&\left\langle f_\parallel(\xi ),f_\parallel(\xi ) \right\rangle + \left\langle f_\perp(\xi ),f_\perp(\xi ) \right\rangle\\ \geq& \left\langle f_\parallel(\xi ),f_\parallel(\xi ) \right\rangle \end{align}\)
i.e. to minimize the objective, we could just use $f_\parallel(\xi )$ as the estimator. Further, since it lies in \(\mathrm{span}\left\{ \phi _{x_i}(\xi ) \right\}_{i=1}^N\), it could be represented as a linear combination of \(\phi _{x_i}\) as \(\begin{align} \hat{f}(\xi )=f_0+\sum_{i=1}^N\hat{\alpha }_i\phi _{x_i}(\xi )=f_0+\sum_{i=1}^N\hat{\alpha }_iK(x_i,\xi ) \end{align}\)
And the optimization could then be written as \(\begin{align} &\mathop{\arg\min}\limits_{f\in\mathcal{H}}\sum_{i=1}^N\mathcal{L}\left( y_i,f(x_i) \right)+\dfrac{\lambda }{2}\left\Vert f \right\Vert ^2\\ \Rightarrow &\mathop{\arg\min}\limits_{\hat{\alpha }}\sum_{i=1}^N\mathcal{L}\left(y_i,f_0+ \sum_{j=1}^N\hat{\alpha }_jK(x_j,\xi )\right) +\dfrac{\lambda }{2}\sum_{j=1}^N\sum_{k=1}^N\hat{\alpha }_j\hat{\alpha }_kK(x_j,x_k) \end{align}\)
Comment: With the Kernel given, we only need to solve the optimization problem w.r.t. the $N$ variables $\hat{\alpha }_i$, and the Kernel does the job of feature extraction.
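As a concrete instance of this reduction, here is a minimal sketch assuming squared-error loss $\mathcal{L}(y,f)=(y-f)^2$, no intercept ($f_0=0$), and an RBF kernel; this is just kernel ridge regression, and stationarity of the reduced objective gives the linear system $(2K+\lambda I)\hat{\alpha}=2y$ when $K$ is non-degenerate:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / sigma^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma**2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

lam = 1.0
K = rbf_kernel(X, X)

# Minimizing sum_i (y_i - (K alpha)_i)^2 + (lam/2) alpha^T K alpha
# leads to the linear system (2K + lam I) alpha = 2y.
alpha = np.linalg.solve(2 * K + lam * np.eye(len(y)), 2 * y)

# The gradient of the reduced objective should vanish at the solution.
grad = -2 * K @ (y - K @ alpha) + lam * K @ alpha
print("max |gradient|:", np.abs(grad).max())

# Predict at new points via f(x) = sum_i alpha_i K(x_i, x)
X_new = np.linspace(-3, 3, 5)[:, None]
print(rbf_kernel(X_new, X) @ alpha)
```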
Some Useful Kernels
- Linear Kernel: \(\begin{align} K(x,y):=\left\langle x,y \right\rangle \end{align}\)
- $d^\mathrm{th}$ Degree Polynomial Kernel: \(\begin{align} K(x,y):=\left(1+\left\langle x,y \right\rangle \right)^d \end{align}\)
- Radial Basis Function Kernel (RBF Kernel): \(\begin{align} K(x,y):=\exp\left[ -\dfrac{\left\Vert x-y \right\Vert ^2}{\sigma ^2} \right] \end{align}\)
- Sigmoid Kernel: \(\begin{align} K(x,y):=\tanh (1+\left\langle x,y \right\rangle ) \end{align}\) Simple implementations of these kernels are sketched right after this list.
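As a sketch (not the only possible parametrization), the kernels above written as NumPy functions on single vectors $x,y$, with $\sigma$ and $d$ as assumed hyperparameters:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=3):
    return (1 + x @ y) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma**2)

def sigmoid_kernel(x, y):
    return np.tanh(1 + x @ y)
```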
Thought Bubble: Kernel in Primal and Dual
Historically, the Kernel method was introduced by replacing the inner product term in the dual problem of the soft-margin SVM with a kernel, and optimizing the dual problem w.r.t. $\alpha $ \(\begin{align} &\mathop{\arg\max}\limits_{\alpha } \sum_{i=1}^N\alpha _i-\dfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha _i\alpha _jy_iy_j K(x_i,x_j) \\ &s.t.\, 0\leq \alpha _i\leq C,\qquad \sum_{i=1}^N\alpha _iy_i=0 \end{align}\)
Here we go in the opposite direction: first use the representer theorem (i.e. substitute the Kernel into the primal problem) to introduce the Kernel, then derive the dual problem. We will see that the $\alpha $ here is not $\hat{\alpha }$, but the primal problem is still equivalent to the dual problem. Here’s the proof:
Kernel SVM
The loss+penalty form of (soft margin) SVM (Primal): \(\begin{align} \mathop{\arg\min}\limits_{\beta _0,\beta }\sum_{i=1}^N[1-y_if(x_i)]_++\dfrac{\lambda }{2}\left\Vert \beta \right\Vert ^2 \end{align}\)
using representer theorem: \(\begin{align} \mathop{\arg\min}\limits_{\hat{\alpha }}\sum_{i=1}^N\left[ 1-y_i\left(f_0+\sum_{j=1}^N\hat{\alpha }_jK(x_i,x_j)\right) \right]_{+} + \dfrac{\lambda }{2}\sum_{j=1}^N\sum_{k=1}^N\hat{\alpha }_j\hat{\alpha }_kK(x_j,x_k) \end{align}\)
which could be written back to the form of SVM Primal: \(\begin{align} \mathop{\arg\min}\limits_{\hat{\alpha }}&\dfrac{1 }{2} \sum_{j=1}^N\sum_{k=1}^N\hat{\alpha }_j\hat{\alpha }_kK(x_j,x_k)+C\sum_{i=1}^N\xi _i,\quad \lambda =1\big/ C\\ s.t.\,&\xi _i\geq 0,& \forall i=1,2,\ldots,N\\ &\xi _i\geq 1-y_i\left(f_0+\sum_{j=1}^N\hat{\alpha }_jK(x_i,x_j)\right) ,&\forall i=1,2,\ldots,N \end{align}\)
Generalized Lagrange Function: \(\begin{align} \mathcal{L}(\xi ,\hat{\alpha },f_0; \alpha ,\mu )=&\dfrac{1 }{2}\sum_{i=1}^N\sum_{j=1}^N\hat{\alpha }_i\hat{\alpha }_jK(x_i,x_j)+C\sum_{i=1}^N\xi _i-\sum_{i=1}^N\mu _i\xi _i\\ &+\sum_{i=1}^N\alpha _i\left( 1-\xi_i-y_i\left(f_0+\sum_{j=1}^N\hat{\alpha }_jK(x_i,x_j)\right) \right)\\ s.t.\,\,&\mu _i\geq 0,\quad \alpha _i\geq 0,\quad i=1,2,\ldots,N \end{align}\)
Setting $\dfrac{\partial \mathcal{L}}{\partial \xi }=\dfrac{\partial \mathcal{L}}{\partial \hat{\alpha }}=\dfrac{\partial \mathcal{L}}{\partial f_0}=0$ to obtain the dual problem: \(\begin{align} 0=\begin{cases} \dfrac{\partial \mathcal{L}}{\partial \xi _i}=C-\mu _i-\alpha _i \\ \dfrac{\partial \mathcal{L}}{\partial \hat{\alpha }_i}=\sum_{j=1}^N\hat{\alpha }_jK(x_i,x_j)-\sum_{j=1}^N\alpha _jy_jK(x_i,x_j)\quad (*)\\ \dfrac{\partial \mathcal{L}}{\partial f_0}=\sum_{i=1}^N\alpha _iy_i \end{cases} \end{align}\)
Equation $(*)$ could be written in matrix form as \(\begin{align} \begin{bmatrix} K({x_1,x_1})&K({x_1,x_2})&\ldots&K({x_1,x_N})\\ K({x_2,x_1})&K({x_2,x_2})&\ldots&K({x_2,x_N})\\ \vdots&\vdots&\ddots&\vdots\\ K({x_N,x_1})&K({x_N,x_2})&\ldots&K({x_N,x_N})\\ \end{bmatrix} \begin{bmatrix} \hat{\alpha }_1-\alpha _1y_1\\ \hat{\alpha }_2-\alpha _2y_2\\ \vdots\\ \hat{\alpha }_N-\alpha _Ny_N \end{bmatrix}=0 \end{align}\)
if the Gram matrix $[K(x_i,x_j)]$ is non-degenerate (invertible), then $ \hat{\alpha }_i=\alpha _iy_i,\, \forall i$
Substituting back into the generalized Lagrangian, we obtain the dual problem: \(\begin{align} \mathop{\arg\max}\limits_{\alpha }&-\dfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha _i\alpha _jy_iy_jK(x_i,x_j)+\sum_{i=1}^N\alpha _i\\ s.t.\,\,&0\leq \alpha _i\leq C\\ &\sum_{i=1}^N\alpha _iy_i=0 \end{align}\)
which is exactly the classical dual problem with the replacement $\langle x,y\rangle\to K(x,y)$.
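A quick numerical cross-check of this equivalence, assuming scikit-learn is available: fitting `SVC` on a precomputed RBF Gram matrix and fitting it with the built-in `'rbf'` kernel (which uses $K(x,y)=\exp(-\gamma\Vert x-y\Vert^2)$) solve the same dual, so their decision functions should agree up to solver tolerance.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1])             # a nonlinearly separable labeling

gamma = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)              # Gram matrix K_ij = K(x_i, x_j)

svm_precomputed = SVC(kernel="precomputed", C=1.0).fit(K, y)
svm_builtin = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Both fits solve the same dual, so the decision functions should coincide
# up to the solver's numerical tolerance.
diff = svm_precomputed.decision_function(K) - svm_builtin.decision_function(X)
print("max |decision function difference|:", np.abs(diff).max())
```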
Comment: Note that the constraint $\sum_{i=1}^N\alpha _iy_i=0$ comes from the extra intercept term $f_0$. You could also set $f_0=0$ to avoid this constraint.