dstats.sort

A comprehensive sorting library for statistical functions. Each function takes N arguments, which are arrays or array-like objects, sorts the first and sorts the rest in lockstep. For merge and insertion sort, if the last argument is a ulong*, increments the dereference of this ulong* by the bubble sort distance between the first argument and the sorted version of the first argument. This is useful for some statistical calculations.

All sorting functions have the precondition that all parallel input arrays must have the same length.

Notes:

Comparison functions must be written such that compFun(x, x) == false. For example, "a < b" is good, "a <= b" is not.

These functions are heavily optimized for sorting arrays of ints and floats (by far the most common case when doing statistical calculations). In these cases, they can be several times faster than the equivalent functions in std.algorithm. Since sorting is extremely important for non-parametric statistics, this results in important real-world performance gains. However, it comes at a price in terms of generality:

1. They assume that what they are sorting is cheap to copy via normal assignment.

2. They don't work at all with general ranges, only arrays and maybe ranges very similar to arrays.

3. All tuning and micro-optimization is done with ints and floats, not classes, large structs, strings, etc.

Members

Functions

heapSort
T[0] heapSort(T data)

Heap sort. Unstable, O(N log N) time average and worst case, O(1) space, large constant term in time complexity.

insertionSort
T[0] insertionSort(T data)

Insertion sort. O(N<sup>2</sup>) time worst, average case, O(1) space, VERY small constant, which is why it's useful for sorting small subarrays in divide and conquer algorithms. If last argument is a ulong*, increments the dereference of this argument by the bubble sort distance between the input array and the sorted version of the input.

mergeSort
T[0] mergeSort(T data)

Merge sort. O(N log N) time, O(N) space, small constant. Stable sort. If last argument is a ulong* instead of an array-like type, the dereference of the ulong* will be incremented by the bubble sort distance between the input array and the sorted version. This is useful in some statistics functions such as Kendall's tau.

mergeSortInPlace
T[0] mergeSortInPlace(T data)

In-place merge sort, based on C++ STL's stable_sort(). O(N log<sup>2</sup> N) time complexity, O(1) space complexity, stable. Much slower than plain old mergeSort(), so only use it if you really need the O(1) space.

mergeSortTemp
T[0] mergeSortTemp(T data)

Merge sort, allowing caller to provide a temp variable. This allows recycling instead of repeated allocations. If D is data, T is temp, and U is a ulong* for calculating bubble sort distance, this can be called as mergeSortTemp(D, D, D, T, T, T, U) or mergeSortTemp(D, D, D, T, T, T) where each D has a T of corresponding type.

partitionK
ElementType!(T[0]) partitionK(T data, ptrdiff_t k)

Partitions the input data according to compFun, such that position k contains the kth largest/smallest element according to compFun. For all elements e with indices < k, !compFun(datak, e) is guaranteed to be true. For all elements e with indices > k, !compFun(e, datak) is guaranteed to be true. For example, if compFun is "a < b", all elements with indices < k will be <= datak, and all elements with indices larger than k will be >= k. Reorders any additional input arrays in lockstep.

qsort
T[0] qsort(T data)

Quick sort. Unstable, O(N log N) time average, worst case, O(log N) space, small constant term in time complexity.

quickSelect
T quickSelect(T[] data, sizediff_t k)

Returns the kth largest/smallest element (depending on compFun, 0-indexed) in the input array in O(N) time. Allocates memory, does not modify input array.

Structs

TopN
struct TopN(T, alias compFun = "a > b")

Given a set of data points entered through the put function, this output range maintains the invariant that the top N according to compFun will be contained in the data structure. Uses a heap internally, O(log N) insertion time. Good for finding the largest/smallest N elements of a very large dataset that cannot be sorted quickly in its entirety, and may not even fit in memory. If less than N datapoints have been entered, all are contained in the structure.

Examples

1 auto foo = [3, 1, 2, 4, 5].dup;
2 auto bar = [8, 6, 7, 5, 3].dup;
3 qsort(foo, bar);
4 assert(foo == [1, 2, 3, 4, 5]);
5 assert(bar == [6, 7, 8, 5, 3]);
6 auto baz = [1.0, 0, -1, -2, -3].dup;
7 mergeSort!("a > b")(bar, foo, baz);
8 assert(bar == [8, 7, 6, 5, 3]);
9 assert(foo == [3, 2, 1, 4, 5]);
10 assert(baz == [-1.0, 0, 1, -2, -3]);

Meta

Authors

David Simcha