
Double Shot - Linear Methods Based Pruning


The Idea

Pruning neural networks can be thought of as choosing a subset of neurons or weights that preserve the most important behavior of the model. Linear algebra faces a very similar challenge: given a large matrix, how do we find a smaller approximation that still captures its essential structure?

This connection is powerful. If we can borrow approximation ideas from linear algebra — like leverage score sampling or volume sampling — we may design new pruning algorithms for neural networks. The rest of this overview sketches how these matrix approximation methods work, why they matter, and how they translate into pruning.


Mathematical Foundations

Suppose we start with a large matrix \(A \in \mathbb{R}^{n \times d}\). Working with all \(n\) rows may be expensive, so we’d like to find a smaller matrix \(\widetilde{A} \in \mathbb{R}^{s \times d}\) with \(s \ll n\) that still captures the geometry of \(A\).

A “good” approximation preserves the row space of \(A\), the set of all possible linear combinations of its rows:

\[ \text{span}(A) = \left\{ \sum_{i=1}^{n} \alpha_i \mathbf{a}_i \,\middle|\, \alpha_i \in \mathbb{R},\ \mathbf{a}_i \text{ is a row of } A \right\}. \]

If \(\widetilde{A}\) preserves this span, it also preserves the rank of \(A\), meaning it still represents the same directions in space. This makes downstream problems like least-squares more stable.

The “gold standard” approximation is the best rank-\(k\) approximation given by the Singular Value Decomposition (SVD). Writing \(A = U \Sigma V^\top\), where \(U\) and \(V\) have orthonormal columns and \(\Sigma\) holds the singular values, the best rank-\(k\) approximation is \(A_k = U_k \Sigma_k V_k^\top\), obtained by keeping only the \(k\) largest singular values and their singular vectors. \(A_k\) captures the \(k\) most important directions of \(A\). But it does not shrink the matrix in terms of rows: \(A_k\) is still \(n \times d\), and its rows are generally not rows of \(A\). That is where sampling comes in.
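A minimal NumPy sketch of this baseline (the function name and the choice of \(k\) are illustrative, not taken from any particular library):

  import numpy as np

  def best_rank_k(A, k):
      """Best rank-k approximation of A via the truncated SVD."""
      U, S, Vt = np.linalg.svd(A, full_matrices=False)
      # Keep only the k largest singular values and their singular vectors.
      return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]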

The challenge: choose actual rows of \(A\) that approximate this ideal \(A_k\) as closely as possible.


L2-Norm Sampling

The simplest way is to sample rows with probability proportional to their squared Euclidean norm. Intuitively, rows with larger norms contain more “energy” or information.

Formally, compute importance scores:

\[ \ell_i = \lVert A_{i,*} \rVert_2^2 \]

and sample row \(i\) with probability \(p_i = \ell_i / \sum_j \ell_j\). Optionally, rescale each sampled row by \(1/\sqrt{s\,p_i}\) so that \(\widetilde{A}^\top \widetilde{A}\) is an unbiased estimate of \(A^\top A\).

L₂-Norm Sampling of Rows

Procedure L2NormSample(A ∈ ℝ^{n×d}, s)

  1. For each row \(i = 1 \ldots n\):
    \(l_i \gets \lVert A_{i,*} \rVert_2^2\)
  2. Normalize probabilities: \(p_i = l_i / \sum_j l_j\)
  3. Sample \(s\) rows according to \(p_i\), with optional rescaling.
  4. Return \(\widetilde{A} \in \mathbb{R}^{s \times d}\).

This method is easy to implement and runs in \(O(nd)\) time. The classical guarantee is additive: in expectation, the best rank-\(k\) approximation found within the span of the sampled rows is worse than the optimal SVD error by at most roughly \((k/s)\,\lVert A \rVert_F^2\).
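A minimal NumPy sketch of the procedure above, assuming sampling with replacement and rescaling each kept row by \(1/\sqrt{s\,p_i}\) (the names are illustrative):

  import numpy as np

  def l2_norm_sample(A, s, rng=None):
      """Sample s rows of A with probability proportional to their squared L2 norm."""
      rng = np.random.default_rng() if rng is None else rng
      scores = np.sum(A ** 2, axis=1)                 # l_i = ||A_i||_2^2
      probs = scores / scores.sum()                   # p_i
      idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
      # Rescale so that E[A_tilde^T A_tilde] = A^T A.
      return A[idx] / np.sqrt(s * probs[idx])[:, None]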


Leverage Score Sampling

L2-norm sampling works well, but it ignores how rows align with the principal directions of \(A\). Leverage scores fix this.

The leverage score of row \(i\) is:

\[ \ell_i = \lVert U_{i,*} \rVert_2^2 \]

where \(U\) is the matrix of left singular vectors from the thin SVD of \(A\). Rows with high leverage scores contribute strongly to the row space of \(A\); they are hard to reconstruct from the other rows.

This gives a more “informed” sampling procedure. Like L2, we normalize these scores into probabilities and sample rows accordingly.

[Figure: LevScores]

In the figure above, rows that project strongly onto the principal component line have higher leverage and are more likely to be sampled.

Leverage Sampling Algorithm

Procedure LeverageSample(A ∈ ℝ^{n×d}, s)

  1. Compute thin SVD: \(A = U \Sigma V^\top\)
  2. For each row \(i\): \(\ell_i = \lVert U_{i,*} \rVert_2^2\)
  3. Normalize: \(p_i = \ell_i / \sum_j \ell_j\)
  4. Sample \(s\) rows according to \(p_i\) (with optional rescaling).
  5. Return \(\widetilde{A}\).

Leverage score sampling gives tighter, relative-error bounds rather than the additive ones of L2-norm sampling, but it requires computing a thin SVD, so its runtime is \(O(nd^2)\) in practice.
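The same sketch with leverage scores; only the score computation changes (again, the names and the with-replacement choice are mine):

  import numpy as np

  def leverage_sample(A, s, rng=None):
      """Sample s rows of A with probability proportional to their leverage scores."""
      rng = np.random.default_rng() if rng is None else rng
      U, _, _ = np.linalg.svd(A, full_matrices=False)  # thin SVD
      scores = np.sum(U ** 2, axis=1)                  # l_i = ||U_i||_2^2
      probs = scores / scores.sum()                    # leverage scores sum to rank(A)
      idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
      return A[idx] / np.sqrt(s * probs[idx])[:, None]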


Volume Sampling

Volume sampling takes a different perspective. Instead of looking at rows individually, it tries to maximize the volume spanned by the sampled subset.

For a subset \(S\) of rows:

\[ \Pr(S) \propto \det(A_{S,*} A_{S,*}^\top). \]

This favors subsets that are both large in magnitude and diverse in direction. In 2D, this means preferring points that form a large triangle rather than lying on the same line.

[Figure: VolumeScores]

Naïve volume sampling requires enumerating all subsets, which is infeasible. But more efficient iterative algorithms exist, where “bad” rows are gradually knocked out while retaining volume.

Iterative Volume Sampling

Procedure IterativeVolumeSample(A ∈ ℝ^{n×d}, s)

  1. Start with all rows \(S = \{1, \dots, n\}\).
  2. While \(|S| > s\): remove one row with probability proportional to how little it contributes to the volume.
  3. Return the remaining set \(S\).
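A hedged sketch of one such backward-elimination scheme, assuming \(s \ge d\): at each step a row is removed with probability proportional to one minus its leverage within the current subset, so rows that add little volume are the most likely to go. This is one variant from the literature, not necessarily the exact distribution defined above.

  import numpy as np

  def iterative_volume_sample(A, s, rng=None):
      """Shrink the row set of A down to s rows, biased toward keeping volume."""
      rng = np.random.default_rng() if rng is None else rng
      S = list(range(A.shape[0]))
      while len(S) > s:
          A_S = A[S]
          # Leverage of each remaining row within the current subset.
          pinv_gram = np.linalg.pinv(A_S.T @ A_S)
          lev = np.einsum("ij,jk,ik->i", A_S, pinv_gram, A_S)
          probs = np.clip(1.0 - lev, 0.0, None)
          if probs.sum() <= 0:                    # degenerate case: fall back to uniform
              probs = np.full(len(S), 1.0 / len(S))
          else:
              probs /= probs.sum()
          S.pop(rng.choice(len(S), p=probs))
      return S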

Volume sampling often leads to subsets that preserve both energy and rank especially well, though it can be more computationally expensive.


Neural Network Pruning

Now let’s connect these ideas back to pruning.

In a convolutional neural network, each filter can be thought of as a row in a matrix \(W \in \mathbb{R}^{n \times d}\) (with \(n\) filters, each having \(d\) parameters). Structured pruning removes entire filters, not just individual weights.

A common strategy is L2-norm pruning, where filters with small norms are discarded. This is essentially L2-norm row sampling. But we can imagine using leverage scores or volume sampling instead:

  • L2 pruning keeps filters with the largest energy.
  • Leverage pruning keeps filters that contribute most to the network’s “span.”
  • Volume pruning seeks sets of filters that are both strong and diverse.
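As a rough sketch of how the first two scores might be computed for a convolutional layer, assuming the weights are available as a NumPy array of shape (n_filters, in_channels, kH, kW); the function name is illustrative:

  import numpy as np

  def filter_scores(conv_weight):
      """Per-filter L2 and leverage scores for a conv weight tensor."""
      W = conv_weight.reshape(conv_weight.shape[0], -1)  # one row per filter
      l2 = np.sum(W ** 2, axis=1)                        # squared row norms
      U, _, _ = np.linalg.svd(W, full_matrices=False)    # thin SVD
      leverage = np.sum(U ** 2, axis=1)                  # row norms of U
      return l2, leverage

Volume is a property of a set of filters rather than of individual ones, so volume pruning would reuse a subset-selection routine like the iterative sketch above instead of a per-filter score.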

An example is shown below with MiniDrop, where the sorted importance scores are cut at their largest drop and the filters below the cut are pruned:

[Figure: MiniDrop]
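One plausible reading of the “largest drop” rule, sketched under that assumption (this is an interpretation, not necessarily the exact MiniDrop criterion): sort the importance scores in descending order, find the largest gap between consecutive scores, and keep only the filters above that gap.

  import numpy as np

  def largest_drop_keep_mask(scores):
      """Boolean mask of filters whose importance lies above the largest drop."""
      order = np.argsort(scores)[::-1]               # indices, highest score first
      sorted_scores = scores[order]
      gaps = sorted_scores[:-1] - sorted_scores[1:]  # drop between consecutive filters
      cut = np.argmax(gaps) + 1                      # number of filters to keep
      keep = np.zeros(scores.shape[0], dtype=bool)
      keep[order[:cut]] = True
      return keep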


Results

When these methods are applied in practice, interesting patterns emerge.

  • Volume pruning tends to be aggressive, keeping only a small core of filters while discarding many others. Accuracy drops more steeply as a result.
  • L2 and leverage pruning often give similar results, since rows with large norms also tend to have high leverage.

Table: Pruning Application Results

Model             Num Parameters   H-param   Test Acc.   Prune Runtime (mins)
Baseline          717,250          0         99.01%      N/A
L2-pruned         223,785          16        97.05%      1.30
Leverage-pruned   24,737           16        94.19%      0.24
Volume-pruned     5,630            16        67%         0.19
L2-pruned         111,942          8         95.05%      1.45
Leverage-pruned   103,132          8         97.49%      2.00
Volume-pruned     9,087            8         77%         0.197
L2-pruned         46,426           4         97.05%      1.45
Leverage-pruned   103,132          4         97.49%      2.00
Volume-pruned     FAILED           4         0%          N/A

These results highlight both the promise and the trade-offs of bringing linear sampling methods into pruning: some approaches prune aggressively but lose too much accuracy, while others prune more conservatively and retain performance.


tags:
- #model_minimization
- #pruning