\chapter{Experiments}\label{ch: experiments}
In this chapter we present the results of the experiments run on both algorithms, together with a comparison of the two methods in terms of accuracy and time scalability. To test the behaviour of the two methods we consider both the case in which the matrix $\hat{X}$ is well-conditioned, with $\kappa(\hat{X}) \approx 5$, and the case in which it is ill-conditioned, with $\kappa(\hat{X}) \approx 5 \times 10^5$. To this end, we randomly generated the matrix $X$ with entries in the range $[-1, 1]$ and dimensions $m = 1000$ and $n = 20$ (except for the time and memory scalability tests); since, as seen in \autoref{subsec:conditioning}, the conditioning can be controlled directly through the \textit{hyperparameter} $\lambda$, we chose $\lambda = 10^{-4}$ for the former case and $\lambda = 10^{-12}$ for the latter. For the QR factorization we check how the relative error and residual change for different values of $\lambda$. We then confirm the backward stability of the decomposition over different values of $\lambda$ and check its forward stability as well. As for L-BFGS, we fix the relative tolerance $\epsilon = 10^{-14}$, the memory size $k = 7$ and the maximum number of function evaluations to $200$, since the function to be optimized is easily handled by the method. The last kind of test we present concerns the scalability of the methods in terms of time and memory, which we compared by generating random matrices $\hat{X} \in \mathbb{R}^{(m+n) \times m}$ with increasing values of $m$ and $n$, varied separately. As mentioned before, we ran this experiment both when $\hat{X}$ is ill-conditioned and when it is well-conditioned. For the thin-QR factorization we expect the time needed to compute the solution to grow linearly with the number of rows, for a fixed number of columns; if instead we vary the number of columns, we expect a quadratic dependency. In \autoref{sec:other_experiments} we first explore in more detail the effect of the memory size on L-BFGS and then provide a deeper comparison with other Quasi-Newton methods that we implemented ourselves (even though not strictly required by the project instructions). All tests have been executed with the benchmark library \texttt{BenchmarkTools.jl}\cite{BenchmarkTools}, which ignores startup and compilation time, and each test has been repeated 10 times in order to obtain accurate estimates.
\section{QR}
Since we know from theory that the QR decomposition is backward stable, we expect that $\frac{\norm{\hat{X} - Q R}}{\norm{\hat{X}}} \approx u$, or, more explicitly, that for $QR = \hat{X} + \delta \hat{X}$ we have $\frac{\norm{\delta \hat{X}}}{\norm{\hat{X}}} = O(u)$. The results in \autoref{fig:QR-error-lambda} show a decreasing trend of the relative error and residual as $\lambda$ increases and the condition number $\kappa(\hat{X})$ correspondingly decreases. The errors are acceptable even for the smallest value, $\lambda = 10^{-16}$, for which $\kappa(\hat{X}) \approx 5 \times 10^{15}$. The algorithm is backward stable as well, as can be seen from the green part of the plot; a sketch of this setup and of the backward-stability check is shown below.
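As a reference, the following Julia sketch reproduces the setup just described and the backward-stability check. The stacked construction of $\hat{X}$ and the use of Julia's built-in \texttt{qr} in place of our own thin-QR routine are illustrative assumptions rather than the actual project code.
\begin{verbatim}
using LinearAlgebra, Random, BenchmarkTools

# Setup (illustrative sketch): X has entries in [-1, 1], m = 1000, n = 20,
# and lambda controls the conditioning of Xhat.
Random.seed!(42)
m, n = 1000, 20
X = 2 .* rand(m, n) .- 1        # uniform entries in [-1, 1]
lambda = 1e-4                   # 1e-4: well-conditioned, 1e-12: ill-conditioned

# Assumed stacked form, consistent with Xhat having size (m+n) x m.
Xhat = [X'; lambda * Matrix(I, m, m)]
@show cond(Xhat)

# Backward-stability check, with Julia's built-in thin QR standing in
# for our implementation.
F = qr(Xhat)
Q, R = Matrix(F.Q), F.R
backward_error = norm(Xhat - Q * R) / norm(Xhat)   # expected to be O(u)
@show backward_error

# Timing with BenchmarkTools.jl, which discards compilation overhead.
@btime qr($Xhat);
\end{verbatim}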
To check the forward error, we first QR-decomposed the original matrix $\hat{X}$ to obtain $Q$ and $R$, and then perturbed $\hat{X}$ with a random matrix scaled by a factor $\delta = 10^{-10}$. We then ran another QR decomposition on the perturbed version of $\hat{X}$ to obtain the factors $\tilde{Q}$ and $\tilde{R}$.
Finally, we evaluated $\norm{Q - \tilde{Q}}$ and $\frac{\norm{R - \tilde{R}}}{\norm{R}}$, which are both considerably larger than the backward error, as reported in \autoref{fig:QR-forward}. For a fixed $\kappa(\hat{X})$, the forward error on $Q$ is slightly worse than on $R$, because the factorization must also preserve the orthogonality of $Q$. However, the forward error shows a generally decreasing trend as $\lambda$ increases and the condition number of $\hat{X}$ decreases.
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/QR-lambda-error.png} % chktex 8
\caption{\textit{QR decomposition errors and backward stability for different} $\lambda$}\label{fig:QR-error-lambda}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/QR-forward_error.png} % chktex 8
\caption{\textit{QR factorization forward stability on Q and R for different} $\lambda$}\label{fig:QR-forward}
\end{subfigure}
\caption{\textit{Errors and stability of the QR decomposition for different values of} $\lambda$}\label{fig:qrtests}
\end{figure}
\newpage
\section{L-BFGS}
For the first experiment regarding this algorithm we compute the relative gap, the residual and the number of iterations needed by the algorithm to converge. The relative gap is defined as
\[ \frac{\norm{w - w^*}}{\norm{w^*}} \]
where $w$ is the solution found by our algorithm and $w^*$ is Julia's \textit{ground truth}, obtained with its standard linear system solver.\\
The residual, instead, is defined as
\[ \frac{\norm{\hat{X}w - \hat{y}}}{\norm{\hat{y}}} \]
The results are shown in \autoref{fig:gradnorm-res-rel} for both conditioning regimes imposed on $\kappa(\hat{X})$. It is evident from the plots that the convergence of the method is linear and that it computes a relatively good solution in a small number of iterations. A sketch of how these metrics are computed is given after the figure.
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/LBFGS-iterations-gradient-ill.png} % chktex 8
\caption{\textit{Ill-conditioned matrix}}\label{fig:gradnorm-res-rel-ill}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/LBFGS-iterations-gradient-well.png} % chktex 8
\caption{\textit{Well-conditioned matrix}}\label{fig:gradnorm-res-rel-wellll}
\end{subfigure}
\caption{$\norm{\nabla f}$\textit{, Residual, Relative Error of L-BFGS execution on ill and well-conditioned matrices}}\label{fig:gradnorm-res-rel}
\end{figure}
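The sketch below shows how the relative gap and the residual are computed against Julia's built-in least-squares solution; the construction of $\hat{X}$ and $\hat{y}$ and the $\tfrac{1}{2}$-scaling of the objective are illustrative assumptions, not the exact project code.
\begin{verbatim}
using LinearAlgebra, Random

# Metrics used in this section (illustrative sketch).
Random.seed!(0)
m, n, lambda = 1000, 20, 1e-4
X = 2 .* rand(m, n) .- 1
Xhat = [X'; lambda * Matrix(I, m, m)]
yhat = randn(m + n)

wstar = Xhat \ yhat              # Julia's "ground truth" least-squares solution

# For any candidate solution w returned by L-BFGS:
relative_gap(w) = norm(w - wstar) / norm(wstar)
residual(w)     = norm(Xhat * w - yhat) / norm(yhat)

# Objective and gradient handed to the solver (the 1/2 scaling is assumed):
f(w)     = 0.5 * norm(Xhat * w - yhat)^2
gradf(w) = Xhat' * (Xhat * w - yhat)
\end{verbatim}
In the actual experiments, $w$ is the iterate returned by our L-BFGS implementation at each iteration.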
The next test concerns the convergence of the method when using different line search algorithms: we compared how the gradient norm evolves with the Exact Line Search and with the Armijo-Wolfe Line Search, restricting the test to the well-conditioned matrix.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.75\linewidth]{(4) - experiments/images/LBFGS-LS-gradient-comparison.png} % chktex 8
\caption{\textit{Line Search algorithms comparison}}\label{fig:LS-comparison}
\end{figure}
From \autoref{fig:LS-comparison} we can see that the exact line search behaves better than the inexact one, owing to the nature of the function we are optimizing. The Armijo-Wolfe line search (AWLS) computes a step size that may introduce some instability, but it still converges.
\section{Comparison between QR and L-BFGS}
The tests have been performed by fixing one dimension ($m = 200$ or $n = 50$) and varying the other from $500$ to $5500$ in steps of $500$ (a sketch of the benchmarking loop is given at the end of this section). The results obtained by fixing $m$ and varying $n$ are summarized in \autoref{fig:QRvsLBFGS-time-comparison-n}, which shows a linear growth of the running time with increasing $n$ for the QR decomposition and a better performance for L-BFGS, in both the ill and well-conditioned case. As expected, the allocated memory follows the same trend as the running time.
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/QRvsLBFGS-scalability-time-illcond-n.png} % chktex 8
\caption{\textit{Ill-conditioned matrix}}\label{fig:time-comparison-illcond-n}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/QRvsLBFGS-scalability-time-wellcond-n.png} % chktex 8
\caption{\textit{Well-conditioned matrix}}\label{fig:time-comparison-wellcond-n}
\end{subfigure}
\caption{\textit{Time and Memory scalability comparison of QR and L-BFGS on ill and well-conditioned matrices, varying \textbf{n}}}\label{fig:QRvsLBFGS-time-comparison-n}
\end{figure}
If we instead fix $n$ and let $m$ vary, we obtain the curves shown in \autoref{fig:QRvsLBFGS-time-comparison-m}. Both the running time and the allocated memory of QR grow more or less quadratically with the number of columns, confirming what the theory suggests.
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/QRvsLBFGS-scalability-time-illcond-m.png} % chktex 8
\caption{\textit{Ill-conditioned matrix}}\label{fig:time-comparison-illcond-m}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/QRvsLBFGS-scalability-time-wellcond-m.png} % chktex 8
\caption{\textit{Well-conditioned matrix}}\label{fig:time-comparison-wellcond-m}
\end{subfigure}
\caption{\textit{Time and Memory scalability comparison of QR and L-BFGS on ill and well-conditioned matrices, varying \textbf{m}}}\label{fig:QRvsLBFGS-time-comparison-m}
\end{figure}
For QR the allocated memory is in the order of MiB even in the worst case, while L-BFGS allocates much less memory, in the order of KiB. The conditioning of the matrix has no impact on the running time of either algorithm; for L-BFGS, however, it affects the quality of the solution when the function is very flat (small $\lambda$). A flat function has low curvature and slowly changing gradients, so the algorithm struggles to descend rapidly towards the minimum while attaining a reasonable relative error.
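For reference, the benchmarking loop used for these scalability tests can be sketched as follows; the matrix construction and the use of Julia's built-in QR solver as a stand-in for our implementations are assumptions made for illustration only.
\begin{verbatim}
using LinearAlgebra, Random, BenchmarkTools

# Fix one dimension (m = 200 or n = 50) and vary the other from 500 to 5500
# in steps of 500, recording median running time and allocated memory.
build_Xhat(m, n, lambda) = [(2 .* rand(m, n) .- 1)'; lambda * Matrix(I, m, m)]

function qr_scalability(; m_fixed = 200, lambda = 1e-4)  # 1e-12: ill-conditioned
    for n in 500:500:5500
        Xhat = build_Xhat(m_fixed, n, lambda)
        yhat = randn(size(Xhat, 1))
        trial = @benchmark qr($Xhat) \ $yhat   # built-in QR as a stand-in
        # For L-BFGS, replace the benchmarked expression with our solver call.
        println((n, median(trial).time / 1e6, trial.memory / 2^20))  # ms, MiB
    end
end

qr_scalability()
\end{verbatim}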
\section{Other Experiments}
\label{sec:other_experiments}
\subsection{The Effect of the Memory Size}
It is interesting to check the behaviour of L-BFGS when the memory size changes. We compare how the relative error decreases at each iteration for memory sizes varying from $1$ to $11$, as shown in \autoref{fig:error-memory-size}:
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/LBFGS-iterations-memory-ill.png} % chktex 8
\caption{\textit{Ill-conditioned matrix}}\label{fig:error-memory-size-illcond}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/LBFGS-iterations-memory-well.png} % chktex 8
\caption{\textit{Well-conditioned matrix}}\label{fig:error-memory-size-wellcond}
\end{subfigure}
\caption{\textit{The effect of the memory size}}\label{fig:error-memory-size}
\end{figure}
In accordance with the suggestions provided by \cite{Numerical-Optimization-2006}, the memory size $k$ should be chosen such that $3 \leq k \leq 20$, as this is empirically a good trade-off between the number of function evaluations and the additional operations required to reconstruct the Hessian approximation with the two-loop recursion (\autoref{algo: L-BFGS Two-Loop Recursion}). However, since the function to be optimized is quadratic, the algorithm finds the optimal solution quickly, more or less independently of the memory size, while its convergence still depends on the conditioning $\kappa(\hat{X})$. When the memory size is $1$ the algorithm essentially reduces to a (scaled) gradient descent, and it still reaches a convergence rate similar to that obtained with larger memory sizes. For larger memory sizes the convergence rate is almost indistinguishable in the well-conditioned case (\autoref{fig:error-memory-size-wellcond}). In the ill-conditioned case, shown in \autoref{fig:error-memory-size-illcond}, the algorithm still converges in $16$ iterations regardless of the memory size $k$, but the relative error stagnates at the same level in every setting. This is a consequence of the algorithm terminating in a flat region, where the curvature is so low that the stopping criterion on the gradient is satisfied even though the approximation of the optimum is poor.
\subsection{A Comparison of Quasi-Newton Methods}
To further check the behaviour of our implementation of L-BFGS, we implemented and tested a version of BFGS. At the beginning of this subsection we present two additional tests, performed only on well-conditioned matrices, in which we compare the two solvers. As far as the setup is concerned, we keep the default configuration stated in \autoref{ch: experiments}, and for BFGS we use the same tolerance as for L-BFGS. The first test, shown in \autoref{fig:BFGSvsLBFGS-rate}, shows that for the least squares problem the two algorithms are almost identical in terms of convergence rate: relative error, residual and gradient norm are nearly equal for the two methods. To better understand the differences between the implementations, we also measured their time and memory scalability. It is not surprising that, as \autoref{fig:BFGSvsLBFGS-scalability} suggests, BFGS is much slower than its limited-memory version in finding the optimum, even for this small optimization problem. This reflects the theory and confirms that BFGS is more expensive than L-BFGS in terms of both time and memory.
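The gap in cost is explained by the different representations: BFGS updates a dense approximation of the inverse Hessian, while L-BFGS reconstructs the product between that approximation and the gradient from the last $k$ pairs via the two-loop recursion. A sketch of the standard recursion is given below; details such as the initial scaling and the interface are assumptions and may differ from our implementation.
\begin{verbatim}
using LinearAlgebra

# Standard L-BFGS two-loop recursion. S and Y hold the last k pairs
# s_i = w_{i+1} - w_i and y_i = grad_{i+1} - grad_i, oldest to newest.
function two_loop(grad::Vector{Float64},
                  S::Vector{Vector{Float64}}, Y::Vector{Vector{Float64}})
    q = copy(grad)
    k = length(S)
    alpha = zeros(k)
    rho   = [1 / dot(Y[i], S[i]) for i in 1:k]
    for i in k:-1:1                     # first loop: newest to oldest
        alpha[i] = rho[i] * dot(S[i], q)
        q .-= alpha[i] .* Y[i]
    end
    gamma = k > 0 ? dot(S[end], Y[end]) / dot(Y[end], Y[end]) : 1.0
    r = gamma .* q                      # implicit initial Hessian H0 = gamma * I
    for i in 1:k                        # second loop: oldest to newest
        beta = rho[i] * dot(Y[i], r)
        r .+= (alpha[i] - beta) .* S[i]
    end
    return r                            # approximates H_k * grad
end
\end{verbatim}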
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/BFGS-LBFGS-gradient-comparison.png} % chktex 8
\caption{\textit{Convergence rate}}\label{fig:BFGSvsLBFGS-rate}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.46\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/BFGSvsLBFGS-time-m.png} % chktex 8
\caption{\textit{Time and memory scalability}}\label{fig:BFGSvsLBFGS-scalability}
\end{subfigure}
\caption{\textit{BFGS vs L-BFGS}}\label{fig:BFGSvsLBFGS-comparison}
\end{figure}
To further highlight the effectiveness of the implementation, we analyze other Quasi-Newton methods and provide a final comparison among a relevant subset, namely L-BFGS, BFGS, DFP and SR1. BFGS and DFP are also tested in their variants that use the Dogleg (trust region) method as an alternative to line search \cite{Dogleg}. These algorithms have been implemented and then optimized to reduce memory allocations, to which they are particularly prone. The plot in \autoref{fig:Quasi-newton-time} shows the running time of the algorithms on well-conditioned matrices of growing size. In \autoref{fig:Quasi-newton-dogleg-time} we can see that combining the update formulas of BFGS and DFP with the Dogleg (trust region) strategy introduces considerable inefficiency in finding the minimizer within the region, whereas with line search an exact step can be used. The plot in \autoref{fig:Quasi-newton-no-dogleg-time} gives a clearer view of the difference in efficiency among the remaining methods, since the much higher running times of the Dogleg variants would otherwise dominate the plot. In particular, the results agree with the theory: we expect exact line search to be better than the Dogleg method at finding appropriate steps, and we expect L-BFGS to be the fastest method, followed by BFGS and SR1, which are almost equally efficient on average. The slowest is DFP, which is more than twice as slow as the two previously mentioned algorithms.
\begin{figure}[H]
\centering
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/Quasi-Newton-Comparison-time-wellcond-Dogleg.png}
\caption{\textit{With Dogleg}}
\label{fig:Quasi-newton-dogleg-time}
\end{subfigure}
\hspace{0cm}
\begin{subfigure}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{(4) - experiments/images/Quasi-Newton-Comparison-time-wellcond.png}
\caption{\textit{Without Dogleg}}
\label{fig:Quasi-newton-no-dogleg-time}
\end{subfigure}
\caption{\textit{Quasi-Newton methods running time comparison}}\label{fig:Quasi-newton-time}
\end{figure}
The last test concerns the memory allocated by the algorithms. It is the analogue of the time scalability test, but the metric is the average number of bytes allocated by each algorithm. Our implementation has been optimized as much as possible, for instance by using Julia's in-place operators to minimize the number of allocations; an example of this kind of optimization is sketched below. The last plot, in \autoref{fig:Quasi newton comparison memory}, shows how the allocated memory grows with the problem size: methods that converge more slowly, or that allocate more memory per iteration due to more complex update rules, perform worse.
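The following sketch (not the actual project code) illustrates the kind of in-place update used to keep allocations low: the matrix-vector product is written into a preallocated buffer and negated in place, so no temporary arrays are created at each iteration.
\begin{verbatim}
using LinearAlgebra

# In-place computation of a dense quasi-Newton direction d = -H * g.
function descent_direction!(d::Vector{Float64}, H::Matrix{Float64},
                            g::Vector{Float64})
    mul!(d, H, g)    # d = H * g, written into the existing buffer d
    d .*= -1.0       # in-place negation
    return d
end

H = Matrix{Float64}(I, 4, 4)   # toy inverse-Hessian approximation
g = randn(4)
d = similar(g)
descent_direction!(d, H, g)
\end{verbatim}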
\begin{figure}[htbp] \centering \includegraphics[width=0.65\linewidth]{(4) - experiments/images/Quasi-Newton-Comparison-memory-wellcond.png} \caption{\textit{Memory allocation of the different Quasi-Newton methods}} \label{fig:Quasi newton comparison memory} \end{figure}