Triple Shot - Linear Methods Based Pruning¶
The Idea:
- Linear algebra has matrix approximation methods, i.e., select a subset of values in a matrix that best approximates the original.
- Neural network pruning selects subsets of neurons... or in other words, selects subsets of matrices.
- Let's use these approximation methods to prune!
We will use ideas from Leverage Sampling and Volume Sampling.
Mathematical Foundations¶
Linear algebra has a handful of techniques for finding an approximate matrix. For example, if there is a matrix M with one million rows and columns (\(M \in \mathbb{R}^{1,000,000 \times 1,000,000}\)), we may want to find an approximate matrix \(\widetilde{A}\) so that downstream computations run much, much faster.
Therefore, in the following section I go over what a "good" approximation is, and a few algorithms used to find the approximate matrices.
Qualifications of a "Good" Approximation¶
In the most general sense, given a large matrix \(A \in \mathbb{R}^{n \times d}\), we would like an approximate matrix \(\widetilde{A} \in \mathbb{R}^{s \times d}\), where \(s \ll n\), that preserves the geometry of A. More specifically, a good approximation retains the column span of A and includes the dominant directions, i.e., the directions that carry a lot of information.
The column span of A should be equal to the column span of \(\widetilde{A}\), where the column span is formally
$$ \text{span}(A) = \{\, c_1 a_{*,1} + c_2 a_{*,2} + \dots + c_d a_{*,d} \mid c_1, \dots, c_d \in \mathbb{R} \,\} = \{\, Ax \mid x \in \mathbb{R}^d \,\} $$
This means the set of every linear combination of the columns of A is equal to the set of every linear combination of the columns of \(\widetilde{A}\). Thus, every direction included in A is retained in \(\widetilde{A}\). Note that in this report I focus on retaining the column span to keep the discussion contained, since when sampling rows the goal is to preserve as much information about the column space as possible. Therefore, later in this report, "column span of A" is shortened to "span of A". A byproduct of retaining the span of A is that \(\widetilde{A}\) will also have the same rank as A. In other words, by retaining the original span of A, the approximate matrix cannot be rank deficient. This is a beneficial property for \(\widetilde{A}\) to have and may result in better conditioning, unique solutions for the least-squares problem, and more. Generally, we can therefore think of retaining the span as a good indicator that the overall geometry of the original matrix is expressed by the approximate matrix.
The magnitude of a vector is, informally, its 'size'. It is typically calculated with a chosen vector norm, most commonly the Euclidean norm. Of interest to sampling, however, is the importance of the magnitude of the vectors chosen for the approximate matrix. That is, rows with greater magnitude are generally considered to carry more information, or energy. This is related to the Frobenius norm of A: $$ ||A||^2_F = \sum_{i=1}^n ||a_{i,*}||_2^2 , \quad A\in {\mathbb{R}^{n \times d}} $$ and a good approximate matrix should have a similar total energy, meaning a similar Frobenius norm (this requires rescaling of samples). Therefore, a goal of the sampling algorithm is to include elements of A that carry high amounts of information, and better yet, elements that carry high amounts of information along some principal direction of matrix A.
The ideal approximation \(\widetilde{A}\) of a large matrix A is usually considered to be the best-k approximation found using the Singular Value Decomposition (SVD), \(A_k = U\Sigma V^T\), where \(A_k \in \mathbb{R}^{n \times d}\) has rank k. In this decomposition, the columns of \(U \in \mathbb{R}^{n \times k}\) form an orthonormal basis for the column space of \(A \in \mathbb{R}^{n \times d}\); these are the left singular vectors. The matrix \(\Sigma \in \mathbb{R}^{k \times k}\) contains the singular values on its diagonal, and \(V^T \in \mathbb{R}^{k \times d}\) holds the right singular vectors. Note that the best-k approximation is a rank-k projection of A, meaning that, while it represents the subspace of A well, it is not a subset of the rows of A and does not reduce the size of the matrix at all. This provides some additional motivation for sampling techniques.
The ideal approximation can be used to reason about how good a given sampled matrix \(\widetilde{A}\) is. In other words, we can fairly easily compute error bounds using the best-k approximation. That is, we can use the squared Frobenius norm of the difference between the original matrix A and A projected onto either the sampled matrix or the best-k approximation, which lets us compare the information loss of the sampled matrix against that of the best-k approximation. These bounds are relatively loose, but they are easy to understand. They follow the form
$$ \mathbb{E}\,\|A - \pi_S(A)\|_F^2 \;\le\; \alpha\,\|A - A_k\|_F^2 \;+\; \beta $$
where \(A_k\) is the best-k approximation, \(\pi_S(A)\) is the original matrix projected onto the sampled matrix, \(\alpha\) is some multiplicative factor, and \(\beta\) is some additive factor. Both \(\alpha\) and \(\beta\) depend on the sampling technique used, and the means by which the bound was calculated (which is out of scope of this report). We can therefore understand this equation as comparing the expected information loss from the approximate matrix to the expected information loss from the best-k approximation. A great sampled subset will have expected information loss nearly equal to that of the information loss from the best-k approximation.
L2-Norm Sampling¶
L2-Norm Sampling uses the vector 2-norm (the Euclidean norm) to compute an importance score \(l_i\) for a given row \(A_{i,*}\): \(l_i = ||A_{i,*}||^2_2\). The intuition is that rows with larger 2-norms carry more signal, more important information, or more energy than rows with smaller 2-norms. Therefore, the 2-norm lets us differentiate rows based on how dominant they are relative to one another. As discussed earlier, one of the criteria for an approximate matrix is that it includes elements that carry a large amount of signal. Thus, we can bias the sampling probability so that rows considered more important by their 2-norm are picked more frequently. By using a sampling algorithm that is 2-norm aware, the approximate matrix may be better than one generated by uniform random sampling.
Specifically, given the importance score of each row \(l_i\), we sample \(s\) rows from the original matrix A, with each row receiving a probability proportional to its importance score over the sum of all importance scores, that is, \(p_i =\tfrac{l_i}{\sum_j l_j}\). The output of the algorithm is an approximate matrix \(\widetilde{A} \in \mathbb{R}^{s \times d}\). The steps of this sampling process are shown in the L2-Norm Algorithm below. In this algorithm, there is an optional final step to rescale the sampled matrix \(\widetilde{A}\).
Rescaling the sampled set of rows transforms the sample into an unbiased sketch of A. This means that the expected squared Frobenius norm of \(\widetilde{A}\) is equal to the squared Frobenius norm of the original matrix A:
$$ \mathbb{E}\big[\,\|\widetilde{A}\|_F^2\,\big] = \|A\|_F^2 $$
Furthermore, the expected Gram matrix is retained:
$$ \mathbb{E}\big[\,\widetilde{A}^T \widetilde{A}\,\big] = A^T A $$
These two properties are "nice-to-haves" for other linear algebra-related algorithms; however, it is interesting to take note of them here [^drineas2017].
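To make the unbiasedness claim concrete, here is a small, purely illustrative NumPy check (the matrix size, sample size, and trial count are arbitrary choices for this sketch, not values from the report): averaging \(\widetilde{A}^T\widetilde{A}\) over many rescaled samples should approach \(A^T A\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 8))

# L2-norm sampling probabilities, as defined above.
scores = np.sum(A**2, axis=1)
probs = scores / scores.sum()

s, trials = 50, 2000
gram_avg = np.zeros((A.shape[1], A.shape[1]))
for _ in range(trials):
    idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
    A_tilde = A[idx, :] / np.sqrt(s * probs[idx])[:, None]   # rescaled sample
    gram_avg += (A_tilde.T @ A_tilde) / trials

# Relative error should be small, illustrating E[Ã^T Ã] = A^T A.
print(np.linalg.norm(gram_avg - A.T @ A) / np.linalg.norm(A.T @ A))
```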
L2-Norm Algorithm¶
L₂-Norm Sampling of Rows

Procedure L2NormSample(A ∈ ℝ^{n×d}, s)

- For each row i = 1 … n: compute \( l_i \gets \lVert A_{i,*} \rVert_2^2 \).
- Let \( P \gets \sum_{j=1}^n l_j \).
- For each row i = 1 … n: set \( p_i \gets l_i / P \).
- Initialize \( \tilde A \gets [\,] \) (empty list / \(0 \times d\) matrix).
- For t = 1 … s:
    - Draw index \( i_t \) with probability \( \Pr[i_t = i] = p_i \).
    - Let \( \mathbf r \gets A_{i_t,*} \).
    - (Optional rescale) \( \mathbf r \gets \frac{\mathbf r}{\sqrt{s\,p_{i_t}}} \).
    - Stack: \( \tilde A \gets \begin{bmatrix} \tilde A \\ \mathbf r \end{bmatrix} \).
- Return \( \tilde A \).
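For reference, a minimal NumPy sketch of the procedure above might look like this (the function name and defaults are illustrative, not taken from the report's codebase):

```python
import numpy as np

def l2_norm_sample(A: np.ndarray, s: int, rescale: bool = True, seed: int = 0) -> np.ndarray:
    """Sample s rows of A with probability proportional to their squared 2-norms."""
    rng = np.random.default_rng(seed)
    scores = np.sum(A**2, axis=1)            # l_i = ||A_{i,*}||_2^2
    probs = scores / scores.sum()            # p_i = l_i / P
    idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
    A_tilde = A[idx, :].astype(float)
    if rescale:
        # Dividing by sqrt(s * p_i) makes A_tilde an unbiased sketch of A.
        A_tilde = A_tilde / np.sqrt(s * probs[idx])[:, None]
    return A_tilde
```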
The runtime of L2-Norm sampling for a matrix \(A \in \mathbb{R}^{n \times d}\) is O(nd). The relatively fast runtime and understandability of L2-norm sampling are two of its greatest strengths.
Error Bounds. The expected squared Frobenius norm of the difference between A and \(\pi_S(A)\) is bounded by:
$$ \mathbb{E}\,\|A - \pi_S(A)\|_F^2 \;\le\; \|A - A_k\|_F^2 \;+\; \frac{\alpha}{k}\,\|A\|_F^2 $$
where \(\alpha = \frac{||A||^2_F}{||A||^2_2}\) is the stable rank of A, k is the target rank, and \(A_k \in \mathbb{R}^{n \times d}\) [^drineas2017]. Stable rank can be thought of as the number of "large enough" singular values. Therefore, we can interpret this bound as saying that the sampled matrix has an accuracy penalty of \(\frac{\alpha}{k}||A||^2_F\) when compared to the best-k approximation. The key takeaway is that the expected error of the approximate matrix will not exceed the error of the optimal matrix plus an additive factor.
Leverage Sampling¶
The leverage score of a given row i captures the row's importance for defining the column space, in other words, how much leverage the row has along the principal components of the column space. To gain an intuitive grasp of this, the Plot of Leverage Scores shows an example set of points with the first principal component graphed as a line. Points whose projections onto the principal-component line have greater magnitude receive larger leverage scores, indicated by larger marker sizes.
The procedure for using leverage importance scores for sampling is shown in the Leverage Sampling Algorithm. As in L2-norm sampling, the probability of a row being selected is proportional to that row's leverage score over the total sum of leverage scores: \(p_i =\frac{l_i}{\sum_j l_j}\).
Leverage score sampling is closely connected to the idea that the SVD yields the optimal rank-k approximation. Specifically, leverage score sampling uses the left singular vectors in the U matrix of an economy (or thin) SVD. The left singular vectors form an orthonormal basis for the column space of matrix A, and can be thought of as describing the degree to which A stretches an input vector x along A's column space. Therefore, by defining the importance score as \(l_i = (UU^T)_{i,i} = ||U_{i,*}||_2^2\), the importance score captures how much the current row contributes to the stretching of a vector x in \(xA\). Lastly, there is an alternative algorithm that, instead of the U matrix from the SVD, uses the Q matrix from a QR decomposition; specifically, \(l_i = (QQ^T)_{i,i} = ||Q_{i,*}||_2^2\).
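A small sketch of the two score definitions above (the helper name is illustrative; both variants assume A has at least as many rows as columns and full column rank, in which case they agree):

```python
import numpy as np

def leverage_scores(A: np.ndarray, use_qr: bool = False) -> np.ndarray:
    """Row leverage scores l_i = ||U_{i,*}||_2^2 = (QQ^T)_{i,i}."""
    if use_qr:
        # Q from a reduced QR spans the same column space as U for full-rank A.
        Q, _ = np.linalg.qr(A, mode="reduced")
        return np.sum(Q**2, axis=1)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U**2, axis=1)
```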
Plot of Leverage Scores¶
Leverage Sampling Algorithm¶
Leverage Sampling Algorithm

Procedure LeverageSample(A ∈ ℝ^{n×d}, s)

- Compute the thin SVD: \( A = U \Sigma V^\top \).
- For each row \( i = 1 \ldots n \): \( \ell_i \gets \lVert U_{i,*} \rVert_2^2 \).
- Let \( P \gets \sum_{j=1}^n \ell_j \).
- For each \( i = 1 \ldots n \): \( p_i \gets \ell_i / P \).
- Initialize \( \tilde{A} \gets [\,] \) (empty list / \(0 \times d\) matrix).
- For \( t = 1 \ldots s \):
    - Draw index \( i_t \) with probability \( \Pr[i_t = i] = p_i \).
    - Set \( \mathbf{r} \gets A_{i_t,*} \).
    - (Optional rescale) \( \mathbf{r} \gets \dfrac{\mathbf{r}}{\sqrt{s\,p_{i_t}}} \).
    - Stack: \( \tilde{A} \gets \begin{bmatrix} \tilde{A} \\ \mathbf{r} \end{bmatrix} \).
- Return \( \tilde{A} \).
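A minimal NumPy sketch of the sampling procedure above (again, the function name and defaults are illustrative, not taken from the report's codebase):

```python
import numpy as np

def leverage_sample(A: np.ndarray, s: int, rescale: bool = True, seed: int = 0) -> np.ndarray:
    """Sample s rows of A with probability proportional to their leverage scores."""
    rng = np.random.default_rng(seed)
    # Thin SVD: rows of U give each row's weight in the column space of A.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(U**2, axis=1)            # ell_i = ||U_{i,*}||_2^2
    probs = scores / scores.sum()
    idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
    A_tilde = A[idx, :].astype(float)
    if rescale:
        # Optional rescale, as in the pseudocode above.
        A_tilde = A_tilde / np.sqrt(s * probs[idx])[:, None]
    return A_tilde
```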
Runtime of Leverage Score Sampling. The runtime of leverage score sampling for a matrix \(A \in \mathbb{R}^{n \times d}\) is \(O(nd \times \min(n,d))\), which in our case is always \(O(nd^2)\), when using a thin or economy SVD [^drineas2017]. The QR decomposition variant of the algorithm shares the same runtime, but may have better memory performance depending on its implementation [^drineas2017].
Error bounds for leverage score sampling are very similar to those of L2-norm sampling: $$ \mathbb{E}\,\bigl\|A - \pi_S(A)\bigr\|_F^2 \;\le\; \bigl\|A - A_k\bigr\|_F^2 \;+\;\frac{r}{s}\,\bigl\|A\bigr\|_F^2 $$
where r is the rank of matrix A, k is the target rank, \(A_k \in \mathbb{R}^{n \times d}\) is the best-k approximation, s is the sample size, and \(\pi_S(A)\) is the original matrix projected onto \(\widetilde{A} \in \mathbb{R}^{s \times d}\). Similarly to before, the penalty of using leverage scores over the optimal rank-k matrix is additive, but this time the penalty factor is \(\frac{r}{s}\), where r is the rank and s is the size of the subset. Therefore, again, as long as a sufficient number of rows are sampled, that is, we sample many more rows than the number of directions we hope to keep, the error bound approaches the error given by the optimal rank-k approximation.
Approximation of leverage scores can be achieved by using a random projection to first "downsample" matrix \(A \in \mathbb{R}^{n \times d}\) to \(Y = PA\), where \(P \in \mathbb{R}^{l \times n}\) and \(Y \in \mathbb{R}^{l \times d}\). Then, we can follow the same procedure as shown in the Leverage Sampling Algorithm, but with \(A = Y\). As long as \(l \ll n\), the algorithm runs faster, with runtime \(O(ndl + nl^2)\).
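The exact downsampling variant used in this work lives in the linked repository. As a point of comparison only, a common sketch-based approximation from the randomized linear algebra literature scores each original row against a QR factor of the sketch \(PA\). The sketch below assumes \(d \le l \ll n\) and full column rank, and uses a Gaussian P purely for illustration (structured projections are normally used for speed):

```python
import numpy as np

def approx_leverage_scores(A: np.ndarray, l: int, seed: int = 0) -> np.ndarray:
    """Approximate row leverage scores of A via an l x n random projection P."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    P = rng.standard_normal((l, n)) / np.sqrt(l)   # illustrative Gaussian sketch
    Y = P @ A                                      # l x d, with l << n
    _, R = np.linalg.qr(Y)                         # R approximates A's column-space geometry
    # Approximate leverage of row i as ||A_{i,*} R^{-1}||_2^2.
    B = np.linalg.solve(R.T, A.T).T                # rows of B are A_{i,*} R^{-1}
    return np.sum(B**2, axis=1)
```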
Volume Sampling¶
The intuition behind volume sampling is that the technique aims to maximize the hyper-dimensional volume of the sampled matrix \(\widetilde{A}\), constrained by the sample size s. A small example of this, in a two-dimensional space, is shown in the Volume Score Figure.
Volume sampling maximizes the volume of the shape spanned by the sampled subset. Formally, volume sampling sets the probability of selecting a given subset of rows proportional to its squared volume over the sum of squared volumes. The probability of selecting a given subset S is $$ \Pr(S) = \frac{\det(A_{S,*} A_{S,*}^T)}{\sum_{T:|T|=s} \det(A_{T,*} A_{T,*}^T) } = \frac{\text{Vol}(A_{S,*})^2}{\sum_{T:|T|=s} \text{Vol}(A_{T,*})^2} $$ This formulation has two significant consequences. Because the score is the volume of the hyper-dimensional figure, subsets whose rows have both (1) large pairwise angles and (2) large magnitudes score higher. This means the subset will contain the important or dominant rows of the original matrix as well as a diverse set of directions, which encourages the sampled set to have a greater rank. Interestingly, any subset whose vectors all point in a single direction gets an importance score of exactly 0, and more generally, subsets that cover only a few directions do not score well. Thus, volume sampling encourages both important rows to be selected and the sampled set to be geometrically diverse, resulting in a greater span and rank.
Volume sampling runtime. A naive implementation of volume sampling, such as that shown in the Naive Volume Sampling Algorithm, requires enumerating all n choose s subsets of rows, giving a loose upper bound of \(O(n^s)\) and a tight \(\Theta(\tfrac{n^s}{s!})\) on the number of subsets considered. This implementation is not typically used; it is shown only to demonstrate the most basic form of the algorithm. Today there are far more efficient versions. At the time of writing, the fastest algorithm for computing exact volume sampling probabilities has a time complexity of \(O(nd^2)\) for \(A \in \mathbb{R}^{n \times d}\) [^derezinski2018], [^drineas2006l2]. A relatively simple version of one such algorithm uses the idea of iteratively 'knocking out' undesirable rows. That is, a list of row indices is initialized to contain every row of the original matrix A; then rows are iteratively knocked out of the list when their contribution to the current hyper-dimensional volume is small. The full procedure is shown in the Iterative Volume Sampling Algorithm [^derezinski2018].
Volume Score Figure¶
Naive Volume Sampling¶
Naïve Volume Sampling
Procedure VolumeSample(A ∈ ℝ^{n×d}, s)
- Let \( \mathcal{T} = \{\, S \subseteq \{1,\dots,n\} \mid |S| = s \,\} \).
- For each \( S \in \mathcal{T} \):
- \( w_S \gets \det\!\big( A_{S,*}\,A_{S,*}^\top \big) \).
- Let \( W \gets \sum_{S \in \mathcal{T}} w_S \).
- For each \( S \in \mathcal{T} \):
- \( p_S \gets w_S / W \).
- Sample subset \( S \) from \( \mathcal{T} \) with probability \( \Pr(S) = p_S \).
- Return \( \tilde{A} = A_{S,*} \).
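A direct (and deliberately naive) NumPy sketch of this procedure, only practical for very small n and s because it enumerates every size-s subset (the function name is illustrative):

```python
import itertools
import numpy as np

def naive_volume_sample(A: np.ndarray, s: int, seed: int = 0) -> np.ndarray:
    """Sample a size-s row subset with probability proportional to det(A_S A_S^T)."""
    rng = np.random.default_rng(seed)
    subsets = list(itertools.combinations(range(A.shape[0]), s))
    # Squared volume of each candidate subset; collinear subsets score exactly 0.
    weights = np.array([
        np.linalg.det(A[list(S), :] @ A[list(S), :].T) for S in subsets
    ])
    weights = np.clip(weights, 0.0, None)        # guard against negative round-off
    probs = weights / weights.sum()
    chosen = subsets[rng.choice(len(subsets), p=probs)]
    return A[list(chosen), :]
```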
Iterative Volume Sampling¶
Iterative Volume Sampling

Procedure IterativeVolumeSample(A ∈ ℝ^{n×d}, s)

- Initialize \( S \gets \{1,2,\dots,n\} \).
- While \( |S| > s \):
    - For each \( i \in S \): \( q_i \gets \dfrac{\det\!\big( A_{\,S\setminus\{i\},*}\,A_{\,S\setminus\{i\},*}^\top \big)}{\det\!\big( A_{\,S,*}\,A_{\,S,*}^\top \big)} \).
    - Sample index \( i \in S \) with probability \( \Pr(i) \propto q_i \) (i.e., \( \Pr(i) = q_i / \sum_{j \in S} q_j \)).
    - Update \( S \gets S \setminus \{ i \} \).
- Return \( S \) (equivalently, \( \tilde{A} = A_{S,*} \)).
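A naive NumPy sketch of the knock-out idea above. One caveat: it uses the d × d Gram form \(\det(A_{S,*}^\top A_{S,*})\) rather than \(\det(A_{S,*}A_{S,*}^\top)\), since the latter is identically zero whenever \(|S| > d\); the sketch therefore assumes A has full column rank and \(s \ge d\). The \(O(nd^2)\) versions cited above are far more efficient than this repeated-determinant sketch.

```python
import numpy as np

def iterative_volume_sample(A: np.ndarray, s: int, seed: int = 0) -> np.ndarray:
    """Knock rows out one at a time until only s remain (reverse iterative volume sampling)."""
    rng = np.random.default_rng(seed)
    S = list(range(A.shape[0]))
    while len(S) > s:
        # Squared volume of the subset after knocking out each candidate row i.
        q = np.array([
            np.linalg.det(A[[j for j in S if j != i], :].T @ A[[j for j in S if j != i], :])
            for i in S
        ])
        q = np.clip(q, 0.0, None)
        removed = S[rng.choice(len(S), p=q / q.sum())]
        S.remove(removed)
    return A[S, :]
```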
The error bounds of volume sampling are loosely given by
$$ \mathbb{E}\,\|A - \pi_S(A)\|_F^2 \;\le\; (1+s)\,\|A - A_k\|_F^2 $$
or, written in terms of the target rank when the sample size equals the target rank (\(s = k\)),
$$ \mathbb{E}\,\|A - \pi_S(A)\|_F^2 \;\le\; (k+1)\,\|A - A_k\|_F^2 $$
where \(A_k \in \mathbb{R}^{n \times d}\) is the best-k approximation, \(\pi_S(A)\) is the projection of A onto the sampled set, and k is the target rank of the sampled set. Again, \(||A-A_k||^2_F\) is the optimal error, and \(||A-\pi_S(A)||^2_F\) is the error, or information loss, of the sampled subset. Initially, these bounds are not as intuitive as the bounds for L2-norm sampling and leverage score sampling, since the multiplicative factor on the error increases as the sample size increases. However, as the sample size increases, the optimal error \(||A-A_k||^2_F\) decreases quickly enough that the right-hand side of the equation, the error bound, still goes down. Thus, as the sample size increases the expected error decreases. The key takeaway from this bound is that the approximate matrix error will not exceed (1+s) times the optimal error.
Neural Network Pruning¶
Many applications require small, efficient, yet still accurate neural networks. One such application is the modulation classification of incoming radio transmissions, where the classification occurs in real time on a small receiver device. Typically, the receiver device is not equipped with GPUs that can be leveraged to run models. To overcome this, recent research has studied pruning techniques that reduce the overall parameter count of an already trained neural network. The results show that a model can be heavily pruned while retaining most of its accuracy.
There are two main types of pruning: structured pruning and unstructured pruning. Structured pruning removes entire groups of weights (such as channels or filters) at once, while unstructured pruning removes individual weights. Additionally, there is pre-defined pruning, which fixes an exact pruning rate per layer, and dynamic pruning, where the pruning rate at each layer is determined based on importance sampling [^li2024tinypower]. Our focus is therefore dynamic structured pruning, which, in recent research, leverages the L2-norm to calculate the importance score of filters in the network.
Specifically, given a model M comprised of layers \(l_0, l_1, l_2, \dots, l_n\), L2-norm dynamic structured pruning aims to prune the filters of the convolutional layers in the network. Formally, let \(W_i \in \mathbb{R}^{n \times d}\) be the weights at layer \(l_i\), where n is the number of filters and d is the number of filter parameters; filter j is assigned an importance score using the L2-norm:
$$
S(W_{i,j}) = {||W_{i,{j,*}}||_2^2}
$$
\(S(W_{i,j})\) is the score for the j'th filter of the i'th convolutional layer in the model M. Using the filter-wise importance scores, we dynamically calculate a pruning rate for the whole layer. Once the importance scores are calculated for \(l_i\), the filters are sorted in descending order by importance score, and the filters that come after the largest drop in importance score are pruned. This is illustrated in the accompanying figure.
A key idea of this work is the theory that filters with greater individual importance scores have a greater effect on other filters with relatively high importance scores. Conversely, once there is a large drop in importance values, the filters following that drop are less important to the overall ability of the network. However, the only importance metric used in previous work is the L2-norm, and thus only the magnitude of the filter is considered. Therefore, a natural extension of this work is to incorporate (1) leverage scores and (2) volume sampling.
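To make the scoring and cut-point rule concrete, here is a minimal NumPy sketch assuming a convolutional weight tensor reshaped to (number of filters, filter parameters); the function names and the largest-drop rule as written illustrate the description above, not necessarily the exact implementation in the linked repository:

```python
import numpy as np

def l2_filter_scores(W: np.ndarray) -> np.ndarray:
    """W has shape (n_filters, d); score each filter by its squared L2-norm."""
    return np.sum(W.reshape(W.shape[0], -1) ** 2, axis=1)

def filters_to_keep(scores: np.ndarray) -> np.ndarray:
    """Keep the filters ranked before the largest drop in sorted importance scores."""
    order = np.argsort(scores)[::-1]                 # filter indices, most to least important
    sorted_scores = scores[order]
    drops = sorted_scores[:-1] - sorted_scores[1:]   # consecutive score differences
    cut = int(np.argmax(drops)) + 1                  # prune everything after the biggest gap
    return np.sort(order[:cut])                      # indices to keep (assumes >= 2 filters)
```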
To implement leverage score sampling and volume sampling, some challenges must be overcome. First and foremost, the layer matrices very quickly become over-parameterized with respect to the number of filters. That is, for our weight matrix \(W \in \mathbb{R}^{n \times d}\) we quickly have \(d \gg n\), meaning there are many more filter features than there are filters. This is not an issue for L2-norm sampling; however, for leverage and volume sampling it is a requirement that \(n \gg d\). Furthermore, we cannot simply transpose the W matrix and sample filter features to keep, because we are performing structured pruning, which is only capable of pruning whole filters.
To remedy this, in cases where the W matrix is over-parameterized, a modified version of leverage and volume sampling is applied. Specifically, filter-feature importance is sampled and then used in a separate row-wise calculation to determine which rows are most important for the most important features. The exact implementation is omitted from this report; if interested, the code is hosted at: https://github.com/ESPR3SS0/MultiPrune.git.
Results¶
The results of the additional pruning implementations are shown in the table below (Pruning Application Results). From these, we can make two interesting observations.
First, volume pruning was much more aggressive in its pruning rate. This indicates that volume pruning assigns high importance scores to a small subset of filters, while the remaining large portion receives much lower scores. More specifically, because MiniDrop drops the filters following the greatest difference in importance, and volume scoring prunes much more aggressively, the algorithm must be assigning high values to only a small subset.
Second, we find that leverage pruning and L2-norm pruning had very similar results. This may indicate that coincidentally, many of the rows with greater magnitudes also had high leverage.
Table: Pruning Application Results¶
| Model | Num Parameters (H-param) | Test Acc. | Prune Runtime (mins) |
|---|---|---|---|
| Baseline | 717,250 (0) | 99.01% | NA |
| L2-pruned | 223,785 (16) | 97.05% | 1.30 |
| Leverage Pruned | 24,737 (16) | 94.19% | 0.24 |
| Volume Pruned | 5,630 (16) | 67% | 0.19 |
| L2-pruned | 111,942 (8) | 95.05% | 1.45 |
| Leverage Pruned | 103,132 (8) | 97.49% | 2.00 |
| Volume Pruned | 9,087 (8) | 77% | 0.197 |
| L2-pruned | 46,426 (4) | 97.05% | 1.45 |
| Leverage Pruned | 103,132 (4) | 97.49% | 2.00 |
| Volume Pruned | FAILED (4) | 0% | NA |
[^drineas2017]: Petros Drineas and Michael W. Mahoney. Lectures on randomized numerical linear algebra. 2017. URL: https://arxiv.org/abs/1712.08880, arXiv:1712.08880.
[^derezinski2018]: Michał Dereziński and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1–39, 2018. URL: https://jmlr.org/papers/v19/17-781.html.
[^drineas2006l2]: Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for \(\ell_2\) regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1127–1136. SIAM, 2006. URL: https://www.cs.purdue.edu/homes/pdrineas/documents/publications/Drineas_SODA_06.pdf.
[^li2024tinypower]: Haipeng Li, Mabon Ninan, Boyang Wang, and John M. Emmert. TinyPower: side-channel attacks with tiny neural networks. In Proceedings of the 2024 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 320–329. IEEE, 2024. URL: https://mabonmn.github.io/assets/pdf/tiny_power.pdf, doi:10.1109/HOST55342.2024.10545382.