Linear Universal Approximation Theorem
The Linear Universal Approximation Theorem is a theorem I made up that says you can represent any function as a linear map, as long as you encode the inputs and decode the outputs in a sufficiently high-dimensional space with the appropriate transformations. The standard Universal Approximation Theorem says that certain sufficiently large neural networks with nonlinear activation functions can approximate any function to an arbitrary degree of accuracy. The Linear Universal Approximation Theorem makes an analogous claim, but places all nonlinearities in the encoding of inputs and decoding of outputs, leaving the core transformation entirely linear.
I don't know if this idea has been formally expressed anywhere, but it's related to ideas like Cover's theorem and the kernel trick in Support Vector Machines, which exploit the fact that data becomes easier to linearly separate in higher-dimensional spaces.
Discrete Function
Let's start with a proof in the discrete case.
Suppose we have a function $f: X \to Y$ between finite sets $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$.
Each input $x_i$ is encoded as the one-hot vector $e_i \in \mathbb{R}^m$.
Similarly, each output $y_j$ is represented by the one-hot vector $e_j \in \mathbb{R}^n$, and decoding reads off the index of the hot entry. The linear map is the $n \times m$ matrix $W$ whose $i$-th column is the one-hot encoding of $f(x_i)$, so that $W e_i$ is exactly the encoding of $f(x_i)$.
Here's an explicit example. Suppose $f$ is the permutation of $\{1, 2, 3, 4, 5, 6\}$ with cycle decomposition $(1\,2\,3)(4\,5\,6)$, i.e. $f(1) = 2$, $f(2) = 3$, $f(3) = 1$, $f(4) = 5$, $f(5) = 6$, $f(6) = 4$. Then

$$W = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix},$$

and you can verify that $W e_i = e_{f(i)}$ for every $i$.
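Here is a minimal sketch of the discrete construction in NumPy, assuming (as an illustration) a permutation made of two 3-cycles, written 0-based as $(0\,1\,2)(3\,4\,5)$:

```python
import numpy as np

# Assumed example permutation on {0, ..., 5}: two 3-cycles, (0 1 2)(3 4 5).
f = [1, 2, 0, 4, 5, 3]

# The linear map: a 6x6 matrix whose i-th column is the
# one-hot encoding of f(i).
W = np.zeros((6, 6))
for i, j in enumerate(f):
    W[j, i] = 1.0

def encode(i):
    return np.eye(6)[i]          # one-hot encoding of input i

def decode(v):
    return int(np.argmax(v))     # index of the hot entry

# The linear map W computes f exactly.
for i in range(6):
    assert decode(W @ encode(i)) == f[i]
```

All the function-specific structure lives in $W$; the encoder and decoder are generic lookups.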
Continuous Function
The discrete case gives an exact representation.
For continuous inputs and outputs, the same idea gives an approximation that can be made arbitrarily precise.
Suppose we want to approximate a continuous function $f: [0, 1] \to \mathbb{R}$. Pick $k$ sample points $s_1 < s_2 < \cdots < s_k$ covering $[0, 1]$.
We one-hot encode each of the $k$ sample points $s_i$ as $e_i \in \mathbb{R}^k$.
To handle continuous values, we can interpolate between one-hot encodings.
For a continuous input $x$ lying between samples $s_i$ and $s_{i+1}$, we use the convex combination $(1 - t)\,e_i + t\,e_{i+1}$, where $t = (x - s_i)/(s_{i+1} - s_i)$.
To recover a number from its encoding, we take the dot product with the list of sample values. The input decoder uses the sample points $(s_1, \ldots, s_k)$, and the output decoder uses the function values $(f(s_1), \ldots, f(s_k))$. With the identity as the linear map, encoding $x$ and decoding with the output decoder computes the piecewise-linear interpolant of $f$ at $x$, and the approximation error shrinks to zero as $k$ grows.
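A minimal sketch of the continuous scheme, taking $f = \sin$ on $[0, \pi]$ as an assumed example:

```python
import numpy as np

f = np.sin                      # function to approximate (assumed example)
k = 101                         # number of sample points
s = np.linspace(0.0, np.pi, k)  # sample points s_1 < ... < s_k

def encode(x):
    """Interpolated one-hot encoding of x between neighboring samples."""
    i = np.searchsorted(s, x, side="right") - 1
    i = min(max(i, 0), k - 2)   # clamp so x in [s[0], s[-1]] is handled
    t = (x - s[i]) / (s[i + 1] - s[i])
    v = np.zeros(k)
    v[i], v[i + 1] = 1.0 - t, t
    return v

W = np.eye(k)                   # the linear map (identity suffices here)

# Decoding with the sample values f(s_i) computes the piecewise-linear
# interpolant of f; decoding with s itself recovers x.
x = 0.25
v = W @ encode(x)
assert abs(v @ s - x) < 1e-12   # input decoder recovers x
assert abs(v @ f(s) - f(x)) < 1e-4  # output decoder approximates f(x)
```

Increasing `k` tightens the approximation, since piecewise-linear interpolation error falls quadratically with the sample spacing.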
The construction described above is trivial, so why care about this?
Going back to the discrete example, notice that $f$ consists of two cycles of length three. Instead of one-hot encoding in $\mathbb{R}^6$, we can represent $f$ in $\mathbb{R}^4$ by taking $W$ to be block-diagonal, with each $2 \times 2$ block a rotation by $120^\circ$.
For this to work, we encode the elements of each cycle so that applying $W$ advances one step around the cycle: place the three elements of each cycle at angles $0^\circ$, $120^\circ$, and $240^\circ$ in that cycle's own plane.
- First cycle: $(1, 0, 0, 0)$, $(\cos 120^\circ, \sin 120^\circ, 0, 0)$, $(\cos 240^\circ, \sin 240^\circ, 0, 0)$
- Second cycle: $(0, 0, 1, 0)$, $(0, 0, \cos 120^\circ, \sin 120^\circ)$, $(0, 0, \cos 240^\circ, \sin 240^\circ)$

You can verify that applying $W$ to the encoding of any element gives the encoding of its image under $f$.
We can compress further to $\mathbb{R}^2$ by letting $W$ be a single rotation by $120^\circ$ and separating the cycles by radius, with encodings:

- First cycle: $(1, 0)$, $(\cos 120^\circ, \sin 120^\circ)$, $(\cos 240^\circ, \sin 240^\circ)$
- Second cycle: $(2, 0)$, $(2\cos 120^\circ, 2\sin 120^\circ)$, $(2\cos 240^\circ, 2\sin 240^\circ)$

The decoder tells the cycles apart by the length of the vector: encodings of norm $1$ belong to the first cycle and encodings of norm $2$ to the second, with the angle determining the element within the cycle.
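A minimal numerical sketch of this compressed two-dimensional scheme, assuming the two 3-cycles from the discrete example (written 0-based, so $f = (0\,1\,2)(3\,4\,5)$):

```python
import numpy as np

theta = 2 * np.pi / 3                           # 120 degrees
W = np.array([[np.cos(theta), -np.sin(theta)],  # single rotation by 120 deg
              [np.sin(theta),  np.cos(theta)]])

def encode(i):
    """Place element i at angle 120*(i mod 3) degrees, radius (i // 3) + 1."""
    c, j = divmod(i, 3)                  # cycle index and position in cycle
    r = c + 1                            # radius distinguishes the cycles
    return r * np.array([np.cos(j * theta), np.sin(j * theta)])

def decode(v):
    r = round(np.linalg.norm(v))                    # which cycle (nonlinear)
    j = round(np.arctan2(v[1], v[0]) / theta) % 3   # position in cycle
    return (r - 1) * 3 + j

# One application of W advances each cycle by one step.
for i in range(6):
    expected = (i // 3) * 3 + (i % 3 + 1) % 3       # f = (0 1 2)(3 4 5)
    assert decode(W @ encode(i)) == expected
```

The rotation is linear; all the cycle-separating nonlinearity sits in `decode`, which reads off the norm and angle.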
The core transformation is a single matrix multiplication, with all nonlinearity confined to the encoding and decoding. This connects to the linear representation hypothesis in mechanistic interpretability. Transformers may already work this way, with nonlinear layers (attention, MLPs) building up and reading off features while the residual stream serves as a linear representational space. In theory, those nonlinearities aren't strictly necessary, since the entire transformation on the residual stream can be linear.
In practice, this framework works most naturally with discrete inputs and outputs. A model only needs to learn an embedding for each discrete object such that the embeddings have the right structure for a linear map to transform them correctly. High-dimensional spaces provide plenty of room for this. For continuous or numerical values, the right way to encode them is less obvious, though approaches like RoPE offer some inspiration by mapping numbers to angles at varying frequencies.
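As a rough illustration of the angle idea (a RoPE-flavored sketch, not the actual RoPE formulation): if a number is encoded as sine/cosine pairs at several frequencies, then adding a constant becomes a purely linear map, namely a block-diagonal rotation. The frequencies below are assumed for the example.

```python
import numpy as np

freqs = [1.0, 0.1, 0.01]   # assumed example frequencies

def encode(x):
    """Encode a number as angles at several frequencies."""
    return np.concatenate([[np.cos(w * x), np.sin(w * x)] for w in freqs])

def shift_matrix(c):
    """Linear map implementing x -> x + c on the encoding:
    block-diagonal, one 2x2 rotation by w*c per frequency."""
    n = 2 * len(freqs)
    W = np.zeros((n, n))
    for i, w in enumerate(freqs):
        a = w * c
        W[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return W

x, c = 3.7, 1.5
assert np.allclose(shift_matrix(c) @ encode(x), encode(x + c))
```

This fits the broader claim: with the right nonlinear encoding, at least some numerical operations reduce to a single matrix multiplication.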