
Single Shot - Linear Methods Based Pruning

The Idea

Pruning in neural networks can be thought of as selecting a subset of neurons or filters that still captures the essential structure of the model. In linear algebra, there is a similar task: approximating a large matrix with a smaller one that preserves its important geometry. This connection suggests that techniques like L2-norm sampling, leverage score sampling, and volume sampling might help us prune networks in a principled way.


Matrix Approximation Foundations

Suppose we start with a large matrix \(A \in \mathbb{R}^{n \times d}\). The goal is to replace it with a much smaller matrix \(\tilde A \in \mathbb{R}^{s \times d}\) where \(s \ll n\), while still retaining the “shape” of the original. What matters here is that the smaller matrix preserves the span of \(A\) — in other words, it should represent the same directions in space. Rows with larger magnitudes are especially important because they carry more energy, which is tied to the Frobenius norm of \(A\).
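What “preserving the shape” means can be made precise. One standard way to formalize it (my framing of the generic setup, not notation fixed by the original write-up) is to project \(A\) onto the row span of \(\tilde A\) and measure what is lost:

\[
\operatorname{err}(\tilde A) \;=\; \big\|A - \pi_{\tilde A}(A)\big\|_F^2 ,
\]

where \(\pi_{\tilde A}\) denotes orthogonal projection onto the row span of \(\tilde A\); a good sample keeps this error close to the best achievable by any small matrix.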

The theoretical gold standard is the best rank-\(k\) approximation from the Singular Value Decomposition (SVD), but while that captures the subspace well, it doesn't actually shrink the matrix by selecting rows. Sampling methods are motivated by this gap: they are cheaper to run and produce genuinely smaller matrices built from actual rows of \(A\).
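For reference, the rank-\(k\) truncation of the SVD is optimal in the Frobenius norm (the Eckart–Young theorem), which makes \(\|A - A_k\|_F\) the benchmark the sampling guarantees below are measured against:

\[
A_k \;=\; U_k \Sigma_k V_k^{\top} \;=\; \operatorname*{arg\,min}_{\operatorname{rank}(B)\,\le\,k} \|A - B\|_F .
\]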


L2-Norm Sampling

The simplest approach is to sample rows with probability proportional to their squared L2 norm. Rows with larger norms are more “important,” so they are chosen with higher probability. This method is fast (just \(O(nd)\) time) and easy to implement. If the sampled rows are rescaled, the resulting \(\tilde A\) preserves useful properties, such as matching the Frobenius norm of \(A\) in expectation.
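As a concrete sketch (the function name and the \(1/\sqrt{s\,p_i}\) rescaling convention are my own illustration, not code from the original write-up), length-squared sampling fits in a few lines of NumPy:

```python
import numpy as np

def l2_sample_rows(A, s, seed=None):
    """Sample s rows of A with probability proportional to their squared L2
    norm, rescaling each kept row so that E[Ã^T Ã] = A^T A."""
    rng = np.random.default_rng(seed)
    row_norms_sq = np.einsum("ij,ij->i", A, A)      # ||a_i||^2 for every row
    probs = row_norms_sq / row_norms_sq.sum()       # p_i = ||a_i||^2 / ||A||_F^2
    idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
    scale = 1.0 / np.sqrt(s * probs[idx])           # standard 1/sqrt(s p_i) rescaling
    return A[idx] * scale[:, None], idx

# Example: compress a 10,000 x 64 matrix down to 256 sampled rows.
A = np.random.randn(10_000, 64)
A_tilde, kept = l2_sample_rows(A, s=256, seed=0)
print(A_tilde.shape)  # (256, 64)
```

Sampling is done with replacement, and the rescaling makes \(\tilde A^{\top}\tilde A\) an unbiased estimate of \(A^{\top}A\).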

The error bound guarantees that the sampled matrix won’t be much worse than the optimal low-rank approximation, aside from a small penalty that depends on the stable rank of the original matrix.
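One standard way this guarantee is stated (my paraphrase of the classical length-squared sampling bound, with \(s\) rows drawn using \(p_i = \|a_i\|_2^2 / \|A\|_F^2\)) is

\[
\mathbb{E}\Big[\,\big\|A - \pi_{\tilde A,k}(A)\big\|_F^2\,\Big] \;\le\; \|A - A_k\|_F^2 \;+\; \frac{k}{s}\,\|A\|_F^2 ,
\]

where \(\pi_{\tilde A,k}(A)\) is the best rank-\(k\) approximation of \(A\) whose rows lie in the span of the sampled rows. To push the additive term below \(\varepsilon\,\|A\|_2^2\) it suffices to take \(s \ge (k/\varepsilon)\,\|A\|_F^2/\|A\|_2^2\) samples, which is where the stable rank enters.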


Leverage Score Sampling

Leverage score sampling is more sophisticated. Instead of just looking at row magnitudes, it looks at how much each row contributes to the column space. This is done by computing the thin SVD and using the left singular vectors to assign importance scores. Rows with higher leverage scores are more aligned with the principal directions of \(A\) and therefore matter more for preserving its span.
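A minimal sketch of the score computation (helper names are mine; this assumes the scores are read off the rows of the, optionally truncated, left singular matrix):

```python
import numpy as np

def leverage_scores(A, k=None):
    """Row leverage scores from the thin SVD: the i-th score is the squared
    norm of the i-th row of U (optionally truncated to the top k columns)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # thin SVD, roughly O(n d^2)
    if k is not None:
        U = U[:, :k]                                 # rank-k leverage scores
    scores = np.einsum("ij,ij->i", U, U)
    return scores / scores.sum()                     # normalize into a distribution

A = np.random.randn(5_000, 32)
p = leverage_scores(A, k=8)
print(p.shape, round(p.sum(), 6))  # (5000,) 1.0
```

Rows can then be drawn with these probabilities and rescaled exactly as in the L2 case.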

Although more expensive to compute (about \(O(nd^2)\)), leverage sampling often produces results that are close to optimal in practice.


Volume Sampling

Volume sampling takes a different approach: it favors subsets of rows that maximize geometric diversity. The probability of choosing a subset depends on the determinant of its Gram matrix, which reflects the volume of the shape formed by the rows. This means it avoids selecting rows that all point in the same direction, instead keeping subsets that are both strong and diverse.
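For intuition, here is a brute-force sketch of that distribution: it enumerates every size-\(s\) subset and weights it by the determinant of its Gram matrix. The names are my own, and this is only feasible for toy sizes; it is not the iterative algorithm mentioned below.

```python
import numpy as np
from itertools import combinations

def volume_sample(A, s, seed=None):
    """Draw one subset S of s rows with P(S) proportional to det(A_S A_S^T),
    the squared volume spanned by the chosen rows (brute force, toy sizes only)."""
    rng = np.random.default_rng(seed)
    subsets = list(combinations(range(A.shape[0]), s))
    vols = np.array([np.linalg.det(A[list(S)] @ A[list(S)].T) for S in subsets])
    vols = np.clip(vols, 0.0, None)      # guard against tiny negative round-off
    probs = vols / vols.sum()
    choice = rng.choice(len(subsets), p=probs)
    return list(subsets[choice])

# Toy example: pick 3 geometrically diverse rows out of 8 in R^4.
A = np.random.randn(8, 4)
print(volume_sample(A, s=3, seed=0))
```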

While the naive algorithm is exponential, modern iterative approaches bring the runtime down to about \(O(nd^2)\). The error bounds show that as sample size grows, the approximation becomes closer to optimal.
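A representative guarantee of this kind, stated informally (my paraphrase rather than a quotation), is that volume sampling exactly \(k\) rows gives, in expectation,

\[
\mathbb{E}\Big[\,\big\|A - \pi_S(A)\big\|_F^2\,\Big] \;\le\; (k+1)\,\|A - A_k\|_F^2 ,
\]

where \(\pi_S\) projects onto the span of the chosen rows; sampling more than \(k\) rows tightens the factor toward one, which is the sense in which larger samples approach the optimum.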


Neural Network Pruning

In practice, these sampling techniques can be applied to pruning neural networks. Structured pruning, which removes entire filters, is especially relevant because it reduces model size in ways that are efficient on hardware. Traditionally, L2-norm scores have been used to decide which filters to prune. Extending this to leverage or volume sampling offers a way to capture not just magnitude, but also span and diversity.
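A sketch of how this looks for a convolutional layer (the reshaping convention and the deterministic keep-the-top-scores rule are my assumptions about the setup, not a quotation of the experimental code): flatten each filter into a row, score the rows, and keep the highest-scoring filters.

```python
import numpy as np

def filter_scores(weight, method="l2"):
    """Score convolutional filters for structured pruning.

    weight has shape (out_channels, in_channels, kH, kW); each filter is
    flattened into one row of the matrix A."""
    A = weight.reshape(weight.shape[0], -1)          # (num_filters, params_per_filter)
    if method == "l2":
        return np.einsum("ij,ij->i", A, A)           # squared L2 norm per filter
    if method == "leverage":
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return np.einsum("ij,ij->i", U, U)           # leverage score per filter
    raise ValueError(f"unknown method: {method}")

def keep_top_filters(weight, keep_ratio=0.5, method="l2"):
    """Indices of the filters to keep, highest scores first."""
    scores = filter_scores(weight, method)
    k = max(1, int(round(keep_ratio * weight.shape[0])))
    return np.sort(np.argsort(scores)[::-1][:k])

# Hypothetical 64-filter conv layer; keep half the filters by leverage score.
W = np.random.randn(64, 32, 3, 3)
print(keep_top_filters(W, keep_ratio=0.5, method="leverage"))
```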

One challenge is that, once a convolutional layer is flattened so that each filter becomes a row, the layer often has far more parameters per filter (the row dimension \(d\)) than it has filters (\(n\)), so the matrix is wide rather than tall. That doesn't align with the \(n \gg d\) setting assumed by the leverage and volume methods, and adjustments are needed to make them practical.


Results

Experiments show different behaviors. Volume sampling prunes very aggressively, often keeping only a tiny subset of filters, which harms accuracy. L2-norm and leverage sampling, on the other hand, behave similarly, suggesting that in many cases high-magnitude rows also have high leverage.

| Model | Num. Parameters | Test Acc. | Runtime (min) |
| --- | --- | --- | --- |
| Baseline | 717k | 99.0% | N/A |
| L2-pruned | 223k | 97.1% | 1.3 |
| Leverage-pruned | 25k | 94.2% | 0.24 |
| Volume-pruned | 5.6k | 67% | 0.19 |

Takeaway

Matrix approximation ideas like L2-norm, leverage, and volume sampling provide different perspectives on pruning. L2 is fast and reliable, leverage is more informed but slower, and volume aggressively favors diversity, often at the cost of stability. Together they show how linear algebra can guide the design of smaller, more efficient networks.