
Single Shot - Linear Methods Based Pruning

The Idea

Pruning in neural networks can be thought of as selecting a subset of neurons or filters that still captures the essential structure of the model. In linear algebra, there is a similar task: approximating a large matrix with a smaller one that preserves its important geometry. This connection suggests that techniques like L2-norm sampling, leverage score sampling, and volume sampling might help us prune networks in a principled way.


Matrix Approximation Foundations

Suppose we start with a large matrix \(A \in \mathbb{R}^{n \times d}\). The goal is to replace it with a much smaller matrix \(\tilde A \in \mathbb{R}^{s \times d}\) where \(s \ll n\), while still retaining the “shape” of the original. What matters here is that the smaller matrix preserves the span of \(A\) — in other words, it should represent the same directions in space. Rows with larger magnitudes are especially important because they carry more energy, which is tied to the Frobenius norm of \(A\).
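What “preserving the shape” means can be made precise. One standard way to formalize it (my framing of the generic setup, not notation fixed by the original write-up) is to project \(A\) onto the row span of \(\tilde A\) and measure what is lost:

\[
\operatorname{err}(\tilde A) \;=\; \big\|A - \pi_{\tilde A}(A)\big\|_F^2 ,
\]

where \(\pi_{\tilde A}\) denotes orthogonal projection onto the row span of \(\tilde A\); a good sample keeps this error close to the best achievable by any small matrix.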

The theoretical gold standard is the best rank-\(k\) approximation from the Singular Value Decomposition (SVD), but while that captures the subspace well, it doesn't actually shrink the matrix by selecting rows. Sampling methods are motivated by this gap: they are cheaper to run and produce genuinely smaller matrices built from actual rows of \(A\).
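For reference, the rank-\(k\) truncation of the SVD is optimal in the Frobenius norm (the Eckart–Young theorem), which makes \(\|A - A_k\|_F\) the benchmark the sampling guarantees below are measured against:

\[
A_k \;=\; U_k \Sigma_k V_k^{\top} \;=\; \operatorname*{arg\,min}_{\operatorname{rank}(B)\,\le\,k} \|A - B\|_F .
\]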


L2-Norm Sampling

The simplest approach is to sample rows with probability proportional to their squared L2 norm. Rows with larger norms are more “important,” so they are chosen with higher probability. This method is fast (just \(O(nd)\) time) and easy to implement. If the sampled rows are rescaled, the resulting \(\tilde A\) preserves useful properties, such as matching the Frobenius norm of \(A\) in expectation.
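As a concrete sketch (the function name and the \(1/\sqrt{s\,p_i}\) rescaling convention are my own illustration, not code from the original write-up), length-squared sampling fits in a few lines of NumPy:

```python
import numpy as np

def l2_sample_rows(A, s, seed=None):
    """Sample s rows of A with probability proportional to their squared L2
    norm, rescaling each kept row so that E[Ã^T Ã] = A^T A."""
    rng = np.random.default_rng(seed)
    row_norms_sq = np.einsum("ij,ij->i", A, A)      # ||a_i||^2 for every row
    probs = row_norms_sq / row_norms_sq.sum()       # p_i = ||a_i||^2 / ||A||_F^2
    idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
    scale = 1.0 / np.sqrt(s * probs[idx])           # standard 1/sqrt(s p_i) rescaling
    return A[idx] * scale[:, None], idx

# Example: compress a 10,000 x 64 matrix down to 256 sampled rows.
A = np.random.randn(10_000, 64)
A_tilde, kept = l2_sample_rows(A, s=256, seed=0)
print(A_tilde.shape)  # (256, 64)
```

Sampling is done with replacement, and the rescaling makes \(\tilde A^{\top}\tilde A\) an unbiased estimate of \(A^{\top}A\).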

The error bound guarantees that the sampled matrix won’t be much worse than the optimal low-rank approximation, aside from a small penalty that depends on the stable rank of the original matrix.
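One standard way this guarantee is stated (my paraphrase of the classical length-squared sampling bound, with \(s\) rows drawn using \(p_i = \|a_i\|_2^2 / \|A\|_F^2\)) is

\[
\mathbb{E}\Big[\,\big\|A - \pi_{\tilde A,k}(A)\big\|_F^2\,\Big] \;\le\; \|A - A_k\|_F^2 \;+\; \frac{k}{s}\,\|A\|_F^2 ,
\]

where \(\pi_{\tilde A,k}(A)\) is the best rank-\(k\) approximation of \(A\) whose rows lie in the span of the sampled rows. To push the additive term below \(\varepsilon\,\|A\|_2^2\) it suffices to take \(s \ge (k/\varepsilon)\,\|A\|_F^2/\|A\|_2^2\) samples, which is where the stable rank enters.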


Leverage Score Sampling

Leverage score sampling is more sophisticated. Instead of just looking at row magnitudes, it looks at how much each row contributes to the column space. This is done by computing the thin SVD and using the left singular vectors to assign importance scores. Rows with higher leverage scores are more aligned with the principal directions of \(A\) and therefore matter more for preserving its span.
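A minimal sketch of the score computation (helper names are mine; this assumes the scores are read off the rows of the, optionally truncated, left singular matrix):

```python
import numpy as np

def leverage_scores(A, k=None):
    """Row leverage scores from the thin SVD: the i-th score is the squared
    norm of the i-th row of U (optionally truncated to the top k columns)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # thin SVD, roughly O(n d^2)
    if k is not None:
        U = U[:, :k]                                 # rank-k leverage scores
    scores = np.einsum("ij,ij->i", U, U)
    return scores / scores.sum()                     # normalize into a distribution

A = np.random.randn(5_000, 32)
p = leverage_scores(A, k=8)
print(p.shape, round(p.sum(), 6))  # (5000,) 1.0
```

Rows can then be drawn with these probabilities and rescaled exactly as in the L2 case.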

Although more expensive to compute (about \(O(nd^2)\)), leverage sampling often produces results that are close to optimal in practice.


Volume Sampling

Volume sampling takes a different approach: it favors subsets of rows that maximize geometric diversity. The probability of choosing a subset depends on the determinant of its Gram matrix, which reflects the volume of the shape formed by the rows. This means it avoids selecting rows that all point in the same direction, instead keeping subsets that are both strong and diverse.
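For intuition, here is a brute-force sketch of that distribution: it enumerates every size-\(s\) subset and weights it by the determinant of its Gram matrix. The names are my own, and this is only feasible for toy sizes; it is not the iterative algorithm mentioned below.

```python
import numpy as np
from itertools import combinations

def volume_sample(A, s, seed=None):
    """Draw one subset S of s rows with P(S) proportional to det(A_S A_S^T),
    the squared volume spanned by the chosen rows (brute force, toy sizes only)."""
    rng = np.random.default_rng(seed)
    subsets = list(combinations(range(A.shape[0]), s))
    vols = np.array([np.linalg.det(A[list(S)] @ A[list(S)].T) for S in subsets])
    vols = np.clip(vols, 0.0, None)      # guard against tiny negative round-off
    probs = vols / vols.sum()
    choice = rng.choice(len(subsets), p=probs)
    return list(subsets[choice])

# Toy example: pick 3 geometrically diverse rows out of 8 in R^4.
A = np.random.randn(8, 4)
print(volume_sample(A, s=3, seed=0))
```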

While the naive algorithm is exponential, modern iterative approaches bring the runtime down to about \(O(nd^2)\). The error bounds show that as sample size grows, the approximation becomes closer to optimal.
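A representative guarantee of this kind, stated informally (my paraphrase rather than a quotation), is that volume sampling exactly \(k\) rows gives, in expectation,

\[
\mathbb{E}\Big[\,\big\|A - \pi_S(A)\big\|_F^2\,\Big] \;\le\; (k+1)\,\|A - A_k\|_F^2 ,
\]

where \(\pi_S\) projects onto the span of the chosen rows; sampling more than \(k\) rows tightens the factor toward one, which is the sense in which larger samples approach the optimum.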


Neural Network Pruning

In practice, these sampling techniques can be applied to pruning neural networks. Structured pruning, which removes entire filters, is especially relevant because it reduces model size in ways that are efficient on hardware. Traditionally, L2-norm scores have been used to decide which filters to prune. Extending this to leverage or volume sampling offers a way to capture not just magnitude, but also span and diversity.
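A sketch of how this looks for a convolutional layer (the reshaping convention and the deterministic keep-the-top-scores rule are my assumptions about the setup, not a quotation of the experimental code): flatten each filter into a row, score the rows, and keep the highest-scoring filters.

```python
import numpy as np

def filter_scores(weight, method="l2"):
    """Score convolutional filters for structured pruning.

    weight has shape (out_channels, in_channels, kH, kW); each filter is
    flattened into one row of the matrix A."""
    A = weight.reshape(weight.shape[0], -1)          # (num_filters, params_per_filter)
    if method == "l2":
        return np.einsum("ij,ij->i", A, A)           # squared L2 norm per filter
    if method == "leverage":
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return np.einsum("ij,ij->i", U, U)           # leverage score per filter
    raise ValueError(f"unknown method: {method}")

def keep_top_filters(weight, keep_ratio=0.5, method="l2"):
    """Indices of the filters to keep, highest scores first."""
    scores = filter_scores(weight, method)
    k = max(1, int(round(keep_ratio * weight.shape[0])))
    return np.sort(np.argsort(scores)[::-1][:k])

# Hypothetical 64-filter conv layer; keep half the filters by leverage score.
W = np.random.randn(64, 32, 3, 3)
print(keep_top_filters(W, keep_ratio=0.5, method="leverage"))
```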

One challenge is that, once a convolutional layer is flattened so that each filter becomes a row, the layer often has far more parameters per filter (the row dimension \(d\)) than it has filters (\(n\)), so the matrix is wide rather than tall. That doesn't align with the \(n \gg d\) setting assumed by the leverage and volume methods, and adjustments are needed to make them practical.


Results

Experiments show different behaviors. Volume sampling prunes very aggressively, often keeping only a tiny subset of filters, which harms accuracy. L2-norm and leverage sampling, on the other hand, behave similarly, suggesting that in many cases high-magnitude rows also have high leverage.

| Model | Num. Parameters | Test Acc. | Runtime (min) |
| --- | --- | --- | --- |
| Baseline | 717k | 99.0% | N/A |
| L2-pruned | 223k | 97.1% | 1.3 |
| Leverage-pruned | 25k | 94.2% | 0.24 |
| Volume-pruned | 5.6k | 67% | 0.19 |

Takeaway

Matrix approximation ideas like L2-norm, leverage, and volume sampling provide different perspectives on pruning. L2 is fast and reliable, leverage is more informed but slower, and volume aggressively favors diversity, often at the cost of stability. Together they show how linear algebra can guide the design of smaller, more efficient networks.