\chapter{Problem Definition}\label{ch: problem definition}
Henceforth, we denote the 2-norm $\norm{ - }_2$ by the generic norm symbol $\norm{ - }$.\newline
Given $\hat{X} \in \mathbb{R}^{(m + n) \times m},\ \hat{y} \in \mathbb{R}^{m + n}$, we want to find
\[\min_{w}\ \norm{\hat{X}w-\hat{y}}\]

\section{QR}
By performing a QR factorization of $\hat{X}$ we can reformulate the problem as follows:
\[
\min_{w}\ \norm{\hat{X}w - \hat{y}} = \min_{w}\ \norm{\vphantom{\hat{X}}QRw - \hat{y}}
\]
with $Q \in \mathbb{R}^{(m + n) \times (m + n)}$ an orthogonal matrix and $R \in \mathbb{R}^{(m + n) \times m}$ an upper triangular matrix. Knowing that $R_{ij} = 0,\ \forall i > j,\ i = 1, \ldots, m + n,\ j = 1, \ldots, m$, we can write
\begin{equation*}
\begin{aligned}
&R = \begin{bmatrix} R_0 \\ 0 \end{bmatrix},\ &R_0 \in \mathbb{R}^{m \times m} \\
&Q = \begin{bmatrix} Q_0\ Q_c \end{bmatrix},\ &Q_0 \in \mathbb{R}^{(m+n) \times m},\ &Q_c \in \mathbb{R}^{(m+n) \times n}
\end{aligned}
\end{equation*}
Since orthogonal matrices preserve the 2-norm, we have:
\begin{equation*}
\begin{aligned}
\min_{w}\ \norm{QRw - \hat{y}} &= \min_{w}\ \norm{Q^T(QRw - \hat{y})} \\
&= \min_{w}\ \norm{Q^{T}QRw - Q^T\hat{y}} \\
&= \min_{w}\ \norm{Rw - Q^T\hat{y}} \\
&= \min_{w}\ \norm{ \begin{bmatrix} R_0 \\ 0 \end{bmatrix} w - \begin{bmatrix} Q^T_0 \\ Q^T_c \end{bmatrix}\hat{y}} \\
&= \min_{w}\ \norm{ \begin{bmatrix} R_0w - Q^T_0\hat{y} \\ - Q^T_c\hat{y} \end{bmatrix}}
\end{aligned}
\end{equation*}
The entries of the second block $- Q^T_c\hat{y}$ do not depend on $w$: they contribute a constant to the norm regardless of $w$. Thus, we can simplify the problem and solve the triangular system
\begin{equation*}
R_0w - Q^T_0\hat{y} = 0 \iff R_0w = Q^T_0\hat{y}
\end{equation*}
provided that $R_0$ is invertible.
\begin{center}
$R_0$ is invertible $\iff \hat{X}$ has full column rank $\iff \hat{X}^T\hat{X} \succ 0$.
\end{center}
Since $\hat{X}$ has full column rank, as proved in \hyperref[proofs: fullcolumn]{the last section}, $R_0$ is invertible and the triangular system can be solved via backsubstitution.
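To make the procedure concrete, the following is a minimal NumPy sketch of the reduced-QR solve described above. The data \texttt{X\_hat} and \texttt{y\_hat} are hypothetical placeholders, and \texttt{scipy.linalg.solve\_triangular} carries out the backsubstitution step.
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_triangular

def qr_solve(X_hat, y_hat):
    # Reduced ("economy") QR: Q0 is (m+n) x m, R0 is m x m upper triangular,
    # matching the block decomposition R = [R0; 0], Q = [Q0 Qc].
    Q0, R0 = np.linalg.qr(X_hat, mode='reduced')
    # Backsubstitution on the triangular system R0 w = Q0^T y_hat.
    return solve_triangular(R0, Q0.T @ y_hat, lower=False)

# Hypothetical data; a random tall matrix has full column rank
# with probability one.
rng = np.random.default_rng(0)
m, n = 5, 3
X_hat = rng.standard_normal((m + n, m))
y_hat = rng.standard_normal(m + n)
w = qr_solve(X_hat, y_hat)
# At the least-squares solution, X_hat^T (X_hat w - y_hat) must vanish.
print(np.allclose(X_hat.T @ (X_hat @ w - y_hat), 0.0))
\end{verbatim}
Note that only the thin factors $Q_0, R_0$ are ever needed: the complement block $Q_c$ enters the residual but not the solution.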
\section{L-BFGS}\label{ch: L-BFGS}
We can define
\begin{equation}
g(w) = {f(w)}^2 = \norm{\hat{X}w-\hat{y}}^2
\end{equation}
and reformulate the problem equivalently in terms of $g(w)$, since squaring is monotonic on the non-negative values taken by $f$.
\begin{equation*}
\min_{w}\ g(w) = \min_{w}\ \norm{\hat{X}w-\hat{y}}^2 = \min_{w}\ {\bigl(\hat{X}w - \hat{y}\bigr)}^T\bigl(\hat{X}w - \hat{y}\bigr)
\end{equation*}
The gradient of $g$ with respect to $w$ is
\begin{equation*}
\nabla g(w) = 2\hat{X}^T\bigl(\hat{X}w - \hat{y}\bigr)
\end{equation*}
Likewise, the gradient of $f(w)$ is
\begin{equation*}
\nabla f(w) = \frac{1}{\norm{\hat{X} w - \hat{y}}} \hat{X}^T\bigl(\hat{X}w - \hat{y}\bigr)
\end{equation*}
but optimizing $f$ directly gives much worse performance, since the objective is no longer quadratic.

The function $g$ is L-smooth since $\forall w, w' \in \mathbb{R}^m,\ \text{with } w \neq w'$:
\vspace{6pt}
\begin{tblr}{colspec={crl}, colsep={0pt}}
& \(\norm{\nabla g(w) - \nabla g(w')}\) & \(\ \leq L \norm{w - w'}\) \\
\(\iff\) & \(\norm{2\hat{X}^T(\hat{X}w - \hat{y}) - 2\hat{X}^T (\hat{X} w' -\hat{y})}\) & \(\ \leq L \norm{w - w'}\) \\
\(\iff\) & \(2\norm{\hat{X}^T \hat{X} (w-w')}\) & \(\ \leq L \norm{w - w'}\) \\
\(\Longleftarrow\) & \(2\norm{\hat{X}^T \hat{X}} \norm{w-w'}\) & \(\ \leq L \norm{w - w'}\) \\
\(\iff\) & \(2\norm{\hat{X}^T \hat{X}}\) & \(\ \leq L\ \)
\end{tblr}
\vspace{6pt}
The function $g$ is also strongly convex, since its Hessian \( \nabla^2g(w) = 2\hat{X}^T \hat{X} \succ 0\) is constant. The tomography of $g(w)$ with respect to the direction $p$ is:
\begin{align}
\phi(\alpha)&={(\hat{X}(w+\alpha p) - \hat{y})}^T (\hat{X}(w+\alpha p) - \hat{y}) \notag\\
\frac{d \phi(\alpha)}{d \alpha} &= 2 w^T \hat{X}^T \hat{X} p - 2 \hat{y}^T \hat{X} p + 2 \alpha p^T \hat{X}^T \hat{X} p \notag\\
\frac{d^2 \phi(\alpha)}{d \alpha^2} &= 2 p^T \hat{X}^T \hat{X} p \label{definitions: hessian tomography}
\end{align}
Since $\frac{d^2 \phi(\alpha)}{d \alpha^2}$ is constant, the tomography is simply a parabola; and since $\hat{X}^T \hat{X}$ is positive definite, the dot product $\langle p, p \rangle_{\hat{X}^T \hat{X}}$ is always positive and the parabola always has a minimum. The minimum is found by solving $\frac{d \phi(\alpha)}{d \alpha} = 0$ for $\alpha$:
\[
\alpha_{\min} = \frac{\hat{y}^T \hat{X} p - w^T \hat{X}^T \hat{X} p}{p^T \hat{X}^T \hat{X} p}
\]
This closed-form step is checked numerically in a short sketch at the end of the chapter.

\section{Conditioning}\label{subsec:conditioning}
We check the condition number $\kappa(\hat{X})$ as the regularization term $\lambda > 0$ varies.
\[
\kappa(\hat{X}) = \norm{\hat{X}} \norm{\hat{X}^{\dagger}} = \frac{\sigma_1}{\sigma_m} = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}
\]
with $\hat{X}^{\dagger}$ the Moore--Penrose pseudoinverse of $\hat{X}$, $\sigma_1, \sigma_m$ respectively the largest and smallest singular values of $\hat{X}$, and $\lambda_{\max}, \lambda_{\min}$ the largest and smallest eigenvalues of $\hat{X}^T\hat{X}$.\\
Knowing that $\hat{X}^T\hat{X} = XX^T + \lambda^2I_m$, we have that
\begin{center}
\begin{tblr}{colspec={c}, colsep={0pt}, column{1} = {mode = math}}
\lambda_{\max} = \lambda_1 + \lambda^2 \\
\lambda_{\min} = \lambda_m + \lambda^2 \\
\end{tblr}
\end{center}
with $\lambda_1, \lambda_m$ being the largest and smallest eigenvalues of $XX^T$, which are translated by $\lambda^2$ as a result of adding $\lambda^2I_m$ (\autoref{proof:eigenvalues_translation}).\\
In \autoref{proofs: eigenvalues} we show that $\lambda_m = 0$, hence
\[
\kappa(\hat{X}) = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} = \sqrt{\frac{\lambda_{1} + \lambda^2}{\lambda_{m} + \lambda^2}} = \frac{\sqrt{\lambda_{1} + \lambda^2}}{\sqrt{\lambda^2}} = \frac{\sqrt{\lambda_{1} + \lambda^2}}{\lambda}
\]
For $\lambda$ close to zero (and $\lambda_1 > 0$) this behaves like $\frac{\sqrt{\lambda_1}}{\lambda}$, i.e.\ $\kappa(\hat{X})$ scales linearly with $\frac{1}{\lambda}$. This behaviour is shown in \autoref{fig:condition}, plotted in logarithmic scale:
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\linewidth]{(2) - problem definition/images/conditioning.png} % chktex 8
\caption{$\kappa(\hat{X})$ \textit{for different values of} $\lambda$}\label{fig:condition}
\end{figure}
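As a quick empirical check of this growth, the sketch below builds $\hat{X}$ by stacking $X^T$ on top of $\lambda I_m$, which is one construction consistent with $\hat{X}^T\hat{X} = XX^T + \lambda^2 I_m$. The matrix \texttt{X} is a random placeholder with $m > n$, so that $XX^T$ is rank deficient and $\lambda_m = 0$ as in our setting.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5                            # m > n, so X X^T is rank deficient
X = rng.standard_normal((m, n))
lam1 = np.linalg.eigvalsh(X @ X.T)[-1]  # largest eigenvalue of X X^T

for lam in [1e0, 1e-1, 1e-2, 1e-3]:
    # Stack X^T on top of lam * I_m: X_hat^T X_hat = X X^T + lam^2 I_m.
    X_hat = np.vstack([X.T, lam * np.eye(m)])
    kappa = np.linalg.cond(X_hat)       # sigma_1 / sigma_m (2-norm)
    print(lam, kappa, np.sqrt(lam1 + lam**2) / lam)
\end{verbatim}
The two printed values coincide, and each tenfold decrease in $\lambda$ increases $\kappa(\hat{X})$ roughly tenfold, matching \autoref{fig:condition}.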
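Finally, here is the sketch, announced in the L-BFGS section, that verifies the closed-form step $\alpha_{\min}$ of the tomography. All data are hypothetical placeholders; the check confirms that $\phi'(\alpha_{\min}) = 0$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
X_hat = rng.standard_normal((m + n, m))
y_hat = rng.standard_normal(m + n)
w = rng.standard_normal(m)   # arbitrary starting point
p = rng.standard_normal(m)   # arbitrary search direction

# Closed-form minimizer of the parabola phi(alpha) = g(w + alpha p).
num = y_hat @ X_hat @ p - w @ X_hat.T @ X_hat @ p
den = p @ X_hat.T @ X_hat @ p
alpha_min = num / den

# The directional derivative of g along p at w + alpha_min p,
# i.e. phi'(alpha_min), must vanish.
grad_g = 2 * X_hat.T @ (X_hat @ (w + alpha_min * p) - y_hat)
print(np.isclose(p @ grad_g, 0.0))
\end{verbatim}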