Information Theory Analysis of Sorting


I've learned that the best comparison-based sorting algorithms have $O(n \log n)$ average time complexity without really understanding why that's a fundamental limit, so I looked into this and found a simple explanation. For an array of size $n$, there are $n!$ different permutations, and every time you perform a comparison, you reduce the number of candidate permutations that could represent the sorted array by up to half, thus you need at least $\log_2(n!)$ comparisons. By Stirling's approximation,

$$\log_2(n!) = n \log_2 n - n \log_2 e + O(\log n)$$

and this is what gives the $\Omega(n \log n)$ time complexity lower bound. Framing this another way, the problem of sorting can be turned into obtaining $\log_2(n!)$ bits of information in order to determine which of the $n!$ permutations of the array is sorted, and comparing two elements of the array can give you at most 1 bit of information.
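As a quick sanity check (my own sketch, not part of the original argument), you can compute $\log_2(n!)$ numerically and watch it track $n \log_2 n$:

```python
import math

# log2(n!) computed as a sum, to avoid forming the huge integer n!.
# This is the minimum number of halving comparisons needed to pin down
# one of the n! permutations.
def comparison_lower_bound_bits(n):
    return sum(math.log2(k) for k in range(2, n + 1))

for n in (10, 100, 1000):
    print(n, round(comparison_lower_bound_bits(n), 1), round(n * math.log2(n), 1))
```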

So what if you want a better time complexity? Depending on the context of what you're sorting, you can use the proof above to identify ways to do better. For instance, if you have duplicate values in the array, there'll be multiple permutations instead of just 1 permutation among the $n!$ total permutations that are considered sorted. If there are $k$ possible sorted permutations, then you're looking to obtain $\log_2(n!/k) = \log_2(n!) - \log_2 k$ bits of information, and for sufficiently large $k$, this could end up being $o(n \log n)$ bits of information.
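To make the duplicate-value effect concrete, here's a small illustration of my own: $k$ is the product of the factorials of the value multiplicities, since permuting equal values among themselves still yields a sorted array.

```python
import math
from collections import Counter

# Bits needed to sort when duplicates are allowed: there are n!/k
# distinguishable orderings, where k = prod(m_i!) over the
# multiplicity m_i of each distinct value.
def sorting_bits(values):
    n = len(values)
    k = math.prod(math.factorial(m) for m in Counter(values).values())
    return math.log2(math.factorial(n) // k)

print(sorting_bits(range(8)))    # all distinct: log2(8!) ≈ 15.3 bits
print(sorting_bits([0, 1] * 4))  # heavy duplication: log2(70) ≈ 6.1 bits
```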

What about non-comparison-based sorting algorithms such as radix sort? You still have $\log_2(n!)$ total bits of information to obtain assuming no duplicate values, but you could end up doing operations that give more than 1 bit of information. Suppose you're doing radix sort in base $b$ on integers in the range $[0, b^k)$. The bucketing procedure in radix sort can obtain more than 1 bit of information per bucketing operation, since a given value can be placed in one of $b$ different buckets in a single operation. However, this operation still only gives a constant amount of information, at most $\log_2 b$ bits on average, so this doesn't affect the overall time complexity.
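For reference, a minimal LSD radix sort sketch (the base and inputs are illustrative choices of mine); each bucketing pass places every value into one of $b$ buckets, i.e. reveals at most $\log_2 b$ bits per element per pass:

```python
# Least-significant-digit radix sort over non-negative integers.
def radix_sort(arr, base=10):
    if not arr:
        return arr
    place = 1
    while place <= max(arr):
        # One bucketing pass: at most log2(base) bits per element.
        buckets = [[] for _ in range(base)]
        for x in arr:
            buckets[(x // place) % base].append(x)
        arr = [x for bucket in buckets for x in bucket]
        place *= base
    return arr

print(radix_sort([170, 45, 75, 90, 2, 802, 24, 66]))
# → [2, 24, 45, 66, 75, 90, 170, 802]
```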

You may be wondering, then, why radix sort is known to be a linear time sorting algorithm, or what went wrong with my analysis. The key lies in the assumption that there are no duplicate values, which prevents $n$ from scaling beyond $b^k$; otherwise the Pigeonhole Principle guarantees there must be a duplicate value. When $n$ has an upper bound of $b^k$, then $\log_2 n$ is bounded by $k \log_2 b$, so the $O(n \log n)$ time complexity for comparison-based sorting algorithms is analogous to the $O(nk)$ time complexity for radix sort.

A better way of formulating the amount of information in radix sort is to reframe the sorted array as being the number of 0's, number of 1's, ..., number of $(b^k - 1)$'s in the array, and there are $\log_2 \binom{n + b^k - 1}{b^k - 1}$ bits of information in that reformulation. This perspective handles the total bits of information with consideration for duplicates in the array, and it's easy to check that this is within $O(n)$ bits of information after waving off the large constant, verifying that radix sort is able to achieve linear time complexity when scaling $n$ with a fixed $b$ and $k$.
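This count is easy to evaluate numerically; a sketch (with illustrative numbers of my choosing) comparing it to the comparison-sort bound:

```python
import math

# Bits needed to recover the sorted array when values lie in [0, m):
# the sorted array is determined by the count of each value, and the
# number of possible count vectors is C(n + m - 1, m - 1).
def multiset_bits(n, m):
    return math.log2(math.comb(n + m - 1, m - 1))

n, m = 1_000_000, 256          # e.g. a million byte-sized values
print(round(multiset_bits(n, m)))  # far below n
print(round(n * math.log2(n)))     # comparison bound, ~ n log n
```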


What about approximate sorting? Could you approximately sort an array in $O(n)$ time, where each element is at most $c$ positions off from its correct position for some constant integer $c$? Counting the number of permutations of an array satisfying this constraint directly seems difficult, but it can be upper bounded by $(2c+1)^n$, where each element can have an offset anywhere in the range $[-c, c]$. This is a weak upper bound considering how most combinations of offsets will place some elements outside of the array or two elements in the same array slot, making them invalid. With this upper bound estimate, you still need at least

$$\log_2\!\left(\frac{n!}{(2c+1)^n}\right) = \log_2(n!) - n \log_2(2c+1) = \Theta(n \log n)$$

bits of information, so approximately sorting an array under this constraint won't reduce the time complexity.

What if instead of allowing for an offset error of a constant $c$, you allow an offset error of $cn$ for some constant $0 < c < 1$? Similar to the analysis above, a lower bound for the number of permutations of an array satisfying this constraint can be obtained by first partitioning the sorted array into roughly $\frac{1}{c}$ groups of $cn$ elements and taking arbitrary permutations within each group. It's clear that this scheme generates approximately sorted arrays satisfying the constraint, and the number of these permutations is roughly $((cn)!)^{1/c}$. This lower bound estimate means you need at most $\log_2\!\left(\frac{n!}{((cn)!)^{1/c}}\right)$ bits of information, and using Stirling's approximation this is roughly

$$\left(n \log_2 n - n \log_2 e\right) - \frac{1}{c}\left(cn \log_2(cn) - cn \log_2 e\right) = n \log_2 n - n \log_2(cn)$$

The $n \log_2 n$ terms cancel out (since $\log_2(cn) = \log_2 c + \log_2 n$) and the expression reduces to $n \log_2 \frac{1}{c}$, so you need at most $O(n)$ bits of information to approximately sort an array under this constraint.
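A quick numeric check of this reduction (my own sketch), comparing the exact bit count against $n \log_2 \frac{1}{c}$:

```python
import math

# Exact information content log2(n! / ((cn)!)^(1/c)), which the
# Stirling derivation says should approach n * log2(1/c) bits.
def log2_factorial(n):
    return math.lgamma(n + 1) / math.log(2)  # log2(n!) via log-gamma

def approx_sort_bits(n, c):
    groups = round(1 / c)  # number of groups of cn elements
    return log2_factorial(n) - groups * log2_factorial(round(c * n))

c = 0.01
for n in (1_000, 100_000):
    print(n, round(approx_sort_bits(n, c)), round(n * math.log2(1 / c)))
```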

This only checks that it's plausible to approximately sort an array in linear time under this constraint, so it's important to try to come up with a linear time algorithm to verify it's possible. One possible idea is a modified quicksort algorithm, where instead of recursing down all subarrays to fully sort the whole array, the recursion stops once a subarray has fewer than $cn$ elements. It's easy to verify that this recursion stopping condition ensures that the final array is approximately sorted. The max recursion depth is $f(c)$ for some function $f$, which is a constant, so the time complexity of this modified quicksort is $O(n)$.
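Here's a sketch of that modified quicksort (a functional three-way-partition version of my own for clarity, rather than an in-place one):

```python
import random

# Modified quicksort: stop recursing once a subarray has fewer than
# c*n elements. Each unsorted block then occupies its correct range of
# positions, so every element lands within c*n of its sorted slot.
def approx_sort(arr, c):
    threshold = max(1, int(c * len(arr)))

    def go(sub):
        if len(sub) < threshold:
            return sub  # small enough: leave unsorted
        pivot = random.choice(sub)
        less = [x for x in sub if x < pivot]
        equal = [x for x in sub if x == pivot]
        more = [x for x in sub if x > pivot]
        return go(less) + equal + go(more)

    return go(arr)

out = approx_sort(list(range(100))[::-1], 0.1)
# every element is within c*n = 10 positions of its sorted slot
print(all(abs(v - i) < 10 for i, v in enumerate(out)))  # → True
```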


The takeaway from this analysis is that if you want to do faster than $O(n \log n)$ sorting, you may be able to take advantage of properties of your array, such as a known distribution of its values. Most real-world sorting problems do have properties that allow for much better performance than standard algorithms. Though keep in mind that systems performance considerations likely matter just as much as, if not more than, the algorithm.

Originally published on Substack: https://alyxya.substack.com/p/information-theory-analysis-of-sorting.