Completed written content of report

elvis
2023-08-26 18:37:05 +02:00
parent ace44a4848
commit 3d06944c05

\newcommand{\newgroupplot}[1]{
\nextgroupplot[
title = #1,
xmin = 1, xmax = 64,
ymin = 0.005, ymode = log,
log basis y={2},
xtick distance = 8,
log ticks with fixed point,
grid = both,
minor tick num = 1,
major grid style = {lightgray},
\addlegendentryexpanded{\colname}
}
\addplot[mark=none, black, samples=2, domain=0:64] {1};
\addplot[domain=1:64,samples=200,color=gray!70,] {x};
}
\graphicspath{ {./import/} }
\begin{document}
\section{Building and Executing the Project}
The project uses \texttt{cmake} to create the native makefiles. The flag \texttt{CMAKE\_BUILD\_TYPE} can be used to specify the type of build; two options are supported: \texttt{Debug} and \texttt{Release}.
The main file creates a \texttt{.csv} file with the execution times of different test cases, with input files located in \texttt{./tests}. On macOS, thread pinning for the Fastflow library is disabled since it is not supported by the operating system.
To compile and run the project:
\begin{minted}{bash}
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/
cd build/
make
./main
\end{minted}
\section{Implementation Design}
%% - - - - - - - - - - - - - - - - - - %%
\subsection{Design Choices}
The class \texttt{Reader} reads a binary file composed of 4 bytes representing the number of rows, 4 bytes representing the number of columns and then the raw matrix data. Each element is a \texttt{char} in all the test cases. The result is stored in the class \texttt{Task}, which is passed to the next node. If instead the operator \texttt{()} is called, only the data is returned as a pointer.
The \texttt{Task} class can support matrices with element types other than \texttt{char}.
The \texttt{Writer} instead writes the task back to disk in the same folder, overwriting existing files if present.
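For illustration only, a minimal sketch of how the binary layout read by \texttt{Reader} can be parsed is shown below; the function name and the assumption of native byte order are illustrative and do not necessarily match the actual class interface.
\begin{minted}{cpp}
// Illustrative sketch, not the actual Reader class: reads 4 bytes for the
// number of rows, 4 bytes for the number of columns, then the raw char data.
// Assumes the integers are stored in the machine's native byte order.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::vector<std::vector<char>> readMatrix(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    std::int32_t rows = 0, cols = 0;
    in.read(reinterpret_cast<char *>(&rows), sizeof(rows));
    in.read(reinterpret_cast<char *>(&cols), sizeof(cols));
    std::vector<std::vector<char>> data(rows, std::vector<char>(cols));
    for (auto &row : data)
        in.read(row.data(), cols);  // one row of raw char elements at a time
    return data;
}
\end{minted}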
The structure of the implementation with native C++ threads is as follows:
\begin{algorithm}[H]
\begin{algorithmic}[1]
\Procedure{stdthread}{$Input,Output$}
\For{$result \in Input$}
\While{$iter>0$}
\State submit the stencil computation of each chunk of rows to the threadpool
\State wait for all chunks of the current iteration to finish
\State swap the two matrix buffers
\State $iter = iter - 1$
\EndWhile
\State wait for the threadpool to finish
\State append $result$ to $Output$
\EndFor
\EndProcedure
\end{algorithmic}
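For illustration only, a single stencil pass of the native-thread version could be sketched as follows; the block partitioning, the placeholder kernel and the per-pass thread spawning are simplifications, since the actual implementation reuses a thread pool.
\begin{minted}{cpp}
// Illustrative sketch only: one pass of the stencil with native threads.
// The real implementation reuses a thread pool; here threads are spawned
// per pass and the kernel is a placeholder copy.
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<char>>;

void stencilPass(const Matrix &in, Matrix &out, unsigned nworkers) {
    std::vector<std::thread> workers;
    const std::size_t rows = in.size();
    for (unsigned w = 0; w < nworkers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker processes a contiguous block of rows.
            for (std::size_t i = w * rows / nworkers;
                 i < (w + 1) * rows / nworkers; ++i)
                for (std::size_t j = 0; j < in[i].size(); ++j)
                    out[i][j] = in[i][j];  // placeholder for the stencil kernel
        });
    }
    for (auto &t : workers) t.join();  // barrier before swapping the buffers
}
\end{minted}
The caller would run such a pass \texttt{iter} times, swapping the two buffers between passes, as in the pseudocode above.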
Since the \texttt{Stencil} class is a subclass of \texttt{ff\_Map}, the method used for the parallel computation is the \texttt{parallel\_for} inherited from it.
A custom emitter and collector would not have been faster and so the simpler approach of inheriting the methods from \texttt{ff\_Map} was chosen.
\begin{algorithm}[H]
\begin{algorithmic}[1]
\Procedure{fastflow}{$Task$}
\State $arena = Task$
\While{$iter>0$}
\State \texttt{parallel\_for} with LAMBDA as the function to execute
\State swap $arena$ with $Task$
\State $iter = iter - 1$
\EndWhile
\State return $Task$
\EndProcedure
\end{algorithmic}
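The same loop can be sketched with Fastflow as follows; the standalone \texttt{ff::ParallelFor} object and the placeholder kernel are used only to keep the example self-contained, whereas the actual \texttt{Stencil} class inherits the equivalent \texttt{parallel\_for} method from \texttt{ff\_Map}.
\begin{minted}{cpp}
// Illustrative sketch only: the iterative stencil expressed with Fastflow's
// ParallelFor. The real class inherits parallel_for from ff_Map instead of
// creating a standalone object, and the kernel here is a placeholder copy.
#include <ff/parallel_for.hpp>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<char>>;

void stencilFF(Matrix &task, Matrix &arena, int iter, int nworkers) {
    ff::ParallelFor pf(nworkers);
    while (iter > 0) {
        pf.parallel_for(0, static_cast<long>(task.size()), [&](const long i) {
            for (std::size_t j = 0; j < task[i].size(); ++j)
                arena[i][j] = task[i][j];  // placeholder for the stencil kernel
        }, nworkers);
        std::swap(arena, task);  // the freshly computed matrix becomes the input
        --iter;
    }
}
\end{minted}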
%% - - - - - - - - - - - - - - - - - - %%
\section{Performance Analysis}
The matrix data inside the class \texttt{Task} was tested for performance both as a vector of vectors and as a single contiguous arena. The performance was exactly the same, so the simpler vector-of-vectors implementation was preferred.
In the file \texttt{main.cpp} a \texttt{.csv} file is created from various tests on files from the \texttt{tests/} directory.
The measured time covers reading the file from disk, computing the stencil with different parameters and finally writing the result back to disk.
Instead of averaging the times of different runs, the minimum of the runs is taken, since outliers skew the mean greatly.
Reading and writing to disk are much faster than the computation except for the largest examples; in those cases the minimum time spent reading and writing is subtracted.
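As an illustration of this measurement strategy (the helper name and the number of repetitions below are arbitrary):
\begin{minted}{cpp}
// Illustrative sketch: report the minimum over repeated runs instead of the
// mean, so that a single slow outlier does not skew the measurement.
#include <algorithm>
#include <chrono>
#include <limits>

template <typename F>
long long minTimeMicroseconds(F &&run, int repetitions = 5) {
    long long best = std::numeric_limits<long long>::max();
    for (int r = 0; r < repetitions; ++r) {
        const auto start = std::chrono::steady_clock::now();
        run();  // e.g. read the file, compute the stencil, write the result
        const auto stop = std::chrono::steady_clock::now();
        const long long elapsed =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count();
        best = std::min(best, elapsed);
    }
    return best;
}
\end{minted}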
Since
\[ T_{\text{total}} = T_{\texttt{Reader}} + T_{\texttt{Stencil}} + T_{\texttt{Writer}} \]
and the value of $T_{\texttt{Reader}} + T_{\texttt{Writer}}$ is known on average, the speedup, scalability and efficiency are calculated as follows:
\begin{align*}
\text{Speedup}(n) &= \frac{T_{\text{seq}}}{T_{\text{par}}(n) - (T_{\texttt{Reader}} + T_{\texttt{Writer}})} \\
\text{Scalability}(n) &= \frac{T_{\text{par}}(1) - (T_{\texttt{Reader}} + T_{\texttt{Writer}})}{T_{\text{par}}(n) - (T_{\texttt{Reader}} + T_{\texttt{Writer}})} \\
\text{Efficiency}(n) &= \frac{\text{Speedup}(n)}{n}
\end{align*}
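As a purely illustrative example with hypothetical timings (not measured values): taking $T_{\text{seq}} = 900$, $T_{\text{par}}(1) = 1000$, $T_{\text{par}}(4) = 400$ and $T_{\texttt{Reader}} + T_{\texttt{Writer}} = 100$, all in \si{\microsecond}, the definitions above give
\begin{align*}
\text{Speedup}(4) &= \frac{900}{400 - 100} = 3 \\
\text{Scalability}(4) &= \frac{1000 - 100}{400 - 100} = 3 \\
\text{Efficiency}(4) &= \frac{3}{4} = 0.75
\end{align*}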
For very small matrices the efficiency, the speedup and the scalability are very poor for both versions.
For larger examples instead a significant speedup is seen, with the implementation using native threads being slightly faster.
\begin{tblr}{
colspec = {Q[l,m]|Q[r,m]|Q[r,m]},
}
Image & $T_{\texttt{Reader}} + T_{\texttt{Writer}}$ in \si{\microsecond} & Size in \si{\byte} \\
\hline % chktex 44
empty2x2 & 2218 & 12 \\ % chktex 29
increasing4x6 & 2054 & 32 \\ % chktex 29
\begin{center}
\begin{tikzpicture}
\begin{groupplot}[group style={group size=1 by 2, vertical sep = 1.5cm}]
\plotfile{data/increasing300x200ff.dat}{Fastflow}
\plotfile{data/increasing300x200std.dat}{Native Threads}
\end{groupplot}
\node (title) at ($(group c1r1.center)+(0,4.5cm)$) {\color{red}{increasing300x200}};
\end{tikzpicture}
\end{center}
For the file \texttt{increasing300x200} % chktex 29
the Fastflow version has a peak in speedup and scalability when using 4 workers in the stencil stage, but quickly loses performance due to the small size of the input. For the native thread version instead the speedup and the scalability always stay above $1$ and peak at 32 workers.
\begin{center}
\begin{tikzpicture}
\begin{groupplot}[group style={group size=1 by 2, vertical sep = 1.5cm}]
\plotfile{data/random400x2500ff.dat}{Fastflow}
\plotfile{data/random400x2500std.dat}{Native Threads}
\end{groupplot}
\node (title) at ($(group c1r1.center)+(0,4.5cm)$) {\color{red}{random400x2500}};
\end{tikzpicture}
\end{center}
The file \texttt{random400x2500} % chktex 29
performs best with 16 workers in the Fastflow implementation; at 64 workers the speedup and scalability are slightly better than at 32 workers, but the efficiency drops significantly from $0.361$ to $0.184$. The relationship between the number of workers and the speedup is close to linear up to 8 workers.
\begin{center}
\begin{tikzpicture}
\begin{groupplot}[group style={group size=1 by 2, vertical sep = 1.5cm}]
\plotfile{data/equationff.dat}{Fastflow}
\plotfile{data/equationstd.dat}{Native Threads}
\end{groupplot}
\node (title) at ($(group c1r1.center)+(0,4.5cm)$) {\color{red}{equation}};
\end{tikzpicture}
\end{center}
For the file \texttt{equation} the relationship between speedup (or scalability) and the number of workers is closer to linear for both versions.
\begin{center}
\begin{tikzpicture}
\begin{groupplot}[group style={group size=1 by 2, vertical sep = 1.5cm}]
\plotfile{data/equation2ff.dat}{Fastflow}
\plotfile{data/equation2std.dat}{Native Threads}
\end{groupplot}
\node (title) at ($(group c1r1.center)+(0,4.5cm)$) {\color{red}{equation2}};
\end{tikzpicture}
As the size of the input increases, the speedup and the scalability both follow a linear trend more closely.
The scalability for both test files \texttt{equation} and \texttt{equation2} never goes below $0.37$, and is slightly better for the implementation with native C++ threads.
The difference in the three quantities between the test with file \texttt{equation} and the test with file \texttt{equation2} is much smaller for the Fastflow version. In the native thread version instead there is a small improvement, especially with a higher number of workers.
\end{document}