Linear Universal Approximation Theorem

machine-learning · approximation-theory · linear-algebra

The Linear Universal Approximation Theorem is a theorem I made up that says you can represent any function as a linear map, as long as you encode the inputs and decode the outputs in a sufficiently high-dimensional space with the appropriate transformations. The standard Universal Approximation Theorem says that certain sufficiently large neural networks with nonlinear activation functions can approximate any function to an arbitrary degree of accuracy. The Linear Universal Approximation Theorem makes an analogous claim, but places all nonlinearities in the encoding of inputs and decoding of outputs, leaving the core transformation entirely linear.

I don't know if this idea has been formally expressed anywhere, but it's related to ideas like Cover's theorem and the kernel trick in Support Vector Machines, which exploit the fact that data becomes easier to linearly separate in higher-dimensional spaces.

Discrete Function

Let's start with a proof in the discrete case. Suppose we have a function $f$ that maps $n$ discrete objects $x_1, \dots, x_n$ to $m$ discrete objects $y_1, \dots, y_m$, where each $x_i$ maps to some specific $y_j = f(x_i)$. There are no restrictions on what these objects represent. For instance, the function could map a word to its color, like $f(\text{"sky"}) = \text{"blue"}$, or the function could map finitely many real numbers to their squares, like $f(3) = 9$.

Each input $x_i$ can be one-hot encoded as an $n$-dimensional vector $e_i$, which is a vector of all zeros except for a $1$ in position $i$. For example, with $n = 3$ inputs, $x_2$ would be encoded as

$$e_2 = (0, 1, 0)$$

Similarly, each output $y_j$ is one-hot encoded as an $m$-dimensional vector $u_j$. We construct an $m \times n$ matrix $W$ initialized to all zeros, and for each mapping $f(x_i) = y_j$, we set $W_{ji} = 1$. Multiplying $W$ by the one-hot encoding of any input $x_i$ then produces the one-hot encoding of the correct output $f(x_i)$. This means we can represent any function between finite sets by one-hot encoding the inputs, applying a linear map, and decoding the resulting one-hot vector.

Here's an explicit example. Suppose $f(1) = 2$, $f(2) = 3$, $f(3) = 1$, $f(4) = 5$, $f(5) = 6$, $f(6) = 4$. Each number $i$ is one-hot encoded as a $6$-dimensional vector $e_i$ with a $1$ in position $i$ and $0$s elsewhere. The constructed weight matrix is

$$W = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

and you can verify that $W e_i$ equals the one-hot encoding of $f(i)$ for every input. This construction works for any finite number of inputs and outputs, and for any choice of mapping.
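The discrete construction is easy to check in code. Here's a minimal sketch in plain Python (no libraries), using two 3-cycles on $\{1, \dots, 6\}$ as an illustrative mapping:

```python
# Represent a finite function exactly as a linear map over one-hot encodings.
# The mapping is an illustrative choice: two 3-cycles on {1, ..., 6}.
f = {1: 2, 2: 3, 3: 1, 4: 5, 5: 6, 6: 4}
n = len(f)

def one_hot(i, n):
    """n-dimensional vector of zeros with a 1 in position i (1-indexed)."""
    return [1.0 if k == i - 1 else 0.0 for k in range(n)]

# W[j][i] = 1 whenever f maps input i+1 to output j+1.
W = [[0.0] * n for _ in range(n)]
for x, y in f.items():
    W[y - 1][x - 1] = 1.0

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

# Multiplying W by each input's encoding yields the correct output's encoding.
for x, y in f.items():
    assert matvec(W, one_hot(x, n)) == one_hot(y, n)
```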

Continuous Function

The discrete case gives an exact representation. For continuous inputs and outputs, the same idea gives an approximation that can be made arbitrarily precise. Suppose we want to approximate $f(x) = x^2$. We can start by choosing six sample points: $f(1) = 1$, $f(2) = 4$, $f(3) = 9$, $f(4) = 16$, $f(5) = 25$, $f(6) = 36$.

We one-hot encode each of the six input values and each of the six output values as $6$-dimensional vectors, keeping the input and output representations separate. Since we assign the same index $i$ to each input-output pair, the linear map between them is just the identity matrix. All of the nonlinearity is captured by using different encoders/decoders for the input and output spaces.

To handle continuous values, we can interpolate between one-hot encodings. For a continuous input like $x = 2.5$, we encode it as halfway between the one-hot encodings of $2$ and $3$:

$$\operatorname{enc}(2.5) = (0, 0.5, 0.5, 0, 0, 0)$$

To recover a number from its encoding, we take the dot product with the list of sample values. The input decoder uses $(1, 2, 3, 4, 5, 6)$ and the output decoder uses $(1, 4, 9, 16, 25, 36)$. Dotting our encoded vector with $(1, 2, 3, 4, 5, 6)$ gives $0.5 \cdot 2 + 0.5 \cdot 3 = 2.5$, confirming a valid encoding. Since the linear map is the identity, the output representation is the same vector, which decodes to $0.5 \cdot 4 + 0.5 \cdot 9 = 6.5$. The true value is $2.5^2 = 6.25$, so there's some error since we're linearly interpolating between $f(2) = 4$ and $f(3) = 9$. By increasing the number of sample points, this piecewise linear approximation can get arbitrarily close to any continuous function.
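The interpolation scheme is also easy to sketch in code. The sketch below assumes sample points $1$ through $6$ for $f(x) = x^2$, with the identity as the linear map:

```python
# Piecewise-linear encode/decode for f(x) = x^2 on sample points 1..6.
xs = [1, 2, 3, 4, 5, 6]       # input decoder vector
ys = [x * x for x in xs]      # output decoder vector: 1, 4, 9, 16, 25, 36

def encode(x, xs):
    """Interpolated one-hot encoding of x between neighboring sample points."""
    v = [0.0] * len(xs)
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            v[i], v[i + 1] = 1.0 - t, t
            return v
    raise ValueError("x outside sample range")

def decode(v, values):
    """Dot product of an encoding with the list of sample values."""
    return sum(a * b for a, b in zip(v, values))

v = encode(2.5, xs)           # [0, 0.5, 0.5, 0, 0, 0]
print(decode(v, xs))          # 2.5  (round-trips the input)
print(decode(v, ys))          # 6.5  (true value is 6.25)
```

Since the linear map is the identity, the output representation is the same vector `v`; only the decoder changes.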


The construction described above is trivial, so why care about this?

Going back to the discrete example, notice that $1 \to 2 \to 3 \to 1$ forms a $3$-cycle, and $4 \to 5 \to 6 \to 4$ forms another. Instead of using $6$-dimensional one-hot encodings and a $6 \times 6$ matrix, we can represent both cycles with a single $3 \times 3$ cyclic permutation matrix:

$$P = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

For this to work, we encode the elements of each cycle so that $P$ rotates them correctly:

  • First cycle: $1 \to (1, 0, 0)$, $2 \to (0, 1, 0)$, $3 \to (0, 0, 1)$
  • Second cycle: $4 \to (2, 0, 0)$, $5 \to (0, 2, 0)$, $6 \to (0, 0, 2)$

You can verify that $P$ maps each encoding to the encoding of the correct output. We've compressed from $6$ dimensions down to $3$ while still representing the function exactly, because the encodings capture the shared cyclic structure.
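A quick sketch of the compressed representation, again assuming two 3-cycles ($f(1)=2$, $f(2)=3$, $f(3)=1$, $f(4)=5$, $f(5)=6$, $f(6)=4$) and an illustrative choice of encodings (one-hot vectors for the first cycle, doubled one-hot vectors for the second, so every element stays distinct):

```python
# One 3x3 cyclic permutation matrix drives both 3-cycles at once.
P = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]                              # rotates (a, b, c) -> (c, a, b)

# Illustrative encodings: second cycle scaled by 2 so decoding stays unambiguous.
enc = {
    1: [1, 0, 0], 2: [0, 1, 0], 3: [0, 0, 1],   # first cycle
    4: [2, 0, 0], 5: [0, 2, 0], 6: [0, 0, 2],   # second cycle
}
f = {1: 2, 2: 3, 3: 1, 4: 5, 5: 6, 6: 4}        # two 3-cycles

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

# P sends each element's encoding to the encoding of its image under f.
for x, y in f.items():
    assert matvec(P, enc[x]) == enc[y]
```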

We can compress further to $2$ dimensions. Since both cycles have length $3$, we need a matrix $R$ satisfying $R^3 = I$: a rotation by $120°$,

$$R = \begin{pmatrix} \cos 120° & -\sin 120° \\ \sin 120° & \cos 120° \end{pmatrix} = \begin{pmatrix} -\tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} \\ \tfrac{\sqrt{3}}{2} & -\tfrac{1}{2} \end{pmatrix}$$

with encodings:

  • First cycle: $1 \to (1, 0)$, $2 \to \left(-\tfrac{1}{2}, \tfrac{\sqrt{3}}{2}\right)$, $3 \to \left(-\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2}\right)$
  • Second cycle: $4 \to (2, 0)$, $5 \to (-1, \sqrt{3})$, $6 \to (-1, -\sqrt{3})$

The encodings can't all be orthogonal to each other in fewer than $6$ dimensions, but that's fine. The function is still represented exactly. This is related to superposition in neural networks, where models represent more features than they have dimensions by allowing representations to be almost orthogonal rather than strictly orthogonal. Since the number of almost-orthogonal vectors in $d$ dimensions grows exponentially with $d$, there's significant potential for compression.
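The 2-dimensional version can be checked numerically as well. Here the encodings are an illustrative choice, with the first cycle on the unit circle and the second at radius 2:

```python
import math

# A 2x2 rotation by 120 degrees satisfies R^3 = I, so it can drive both
# 3-cycles in only 2 dimensions.
c, s = math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3)

def rot(v):
    """Apply the 120-degree rotation R to a 2D vector."""
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def point(radius, angle):
    return [radius * math.cos(angle), radius * math.sin(angle)]

# Illustrative encodings: each cycle is three points R rotates into each other.
enc = {1: point(1, 0), 4: point(2, 0)}
for start in (1, 4):
    enc[start + 1] = rot(enc[start])
    enc[start + 2] = rot(enc[start + 1])

f = {1: 2, 2: 3, 3: 1, 4: 5, 5: 6, 6: 4}    # two 3-cycles
for x, y in f.items():
    got = rot(enc[x])
    assert all(abs(a - b) < 1e-9 for a, b in zip(got, enc[y]))
```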

The core transformation is a single matrix multiplication, with all nonlinearity confined to the encoding and decoding. This connects to the linear representation hypothesis in mechanistic interpretability. Transformers may already work this way, with nonlinear layers (attention, MLPs) building up and reading off features while the residual stream serves as a linear representational space. In theory, those nonlinearities aren't strictly necessary, since the entire transformation on the residual stream can be linear.

In practice, this framework works most naturally with discrete inputs and outputs. A model only needs to learn an embedding for each discrete object such that the embeddings have the right structure for a linear map to transform them correctly. High-dimensional spaces provide plenty of room for this. For continuous or numerical values, the right way to encode them is less obvious, though approaches like RoPE offer some inspiration by mapping numbers to angles at varying frequencies.