Sorting listlists of integers

In a previous blog entry I described a novel method, proposed by Robert Israel, for sorting a list of lists of small positive integers, specifically those less than 256. It is significantly faster than the usual method. Roman Pearce responded with a method for rapidly sorting listlists of integers (a listlist is a list of sublists with identical number of elements), regardless of size. While not as fast as Robert's technique, it, too, is significantly faster than the usual method and has no restriction on integer size.

Roman's Method

Roman's method uses a scalar product to compute a key for each list. Here is a modified version of his proposal:
sortLL1 := proc(L::listlist(nonnegint))
local B,X,n,x,j;
    B := max(map(op,L)[])+1;
    n := nops(L[1]);
    X := [seq(B^(n-j), j=1..n)];
    SORT_W_ATTR(Float(inner(X,x)), x, x=L, `<`);
end proc:

The variable B is assigned one more than the largest integer in L, the listlist of integers. The variable X is assigned the list [Bn-1,Bn-2,...,B,1], which is dot multiplied (using the undocumented builtin procedure, inner) with each sublist in L to compute the key. Because B is larger than the largest integer in L, the computed key is guaranteed to be injective and monotonic. SORT_W_ATTR does the actual sorting. It uses the technique of sorting with attributes, which is fast because it does not need to compute the inverse of the scalar.

Macros

SORT_W_ATTR is a macro, not a Maple procedure. As such, it can only be used in code that is processed by tty maple (cmaple). Maple macros are limited, but occasionally useful for simplifying the source code. See the help page ?preprocessor for details. Here is the definition of SORT_W_ATTR.
$define SORT_W_ATTR(k,x,e,F) \
map(attributes, sort([seq(setattribute(k,x),e)],F))
If you do not use tty maple, you can manually expand the macro in the assignment to sortLL1. The equivalent line is
map(attributes, sort([seq(setattribute(Float(inner(X,x)),x),x=L)],`<`));

Optimizing the Base

While this method is impressively fast, it has a practical problem. As B and n (the length of the sublists) increase, the time required to compute the scalar product increase. Theoretically one would expect this to be O(Bn m log m), where m is the number of sublists in L.

Can we do better? My first thought was to improve the vector, X, used in the dot product. Using powers of B for all the components is too conservative. Here is a modified approach that uses the maximum value in each column (think of L as a matrix) to form the base for each component in X.

sortLL2 := proc(L::listlist(nonnegint))
local M,n,B,X,C,k,x;
    M := map(lst -> max(op(lst)), `kernel/transpose`(L));
    n := nops(M);
    B := rtable(1..n);
    C := 0;
    for k from n to 1 by -1 do
        B[k] := C + 1;
        C := M[k]*B[k] + C;
    end do;
    X := [seq(B[k], k=1..n)];
    SORT_W_ATTR(Float(inner(X,x)), x, x=L, `<`);
end proc:

While it takes longer to compute this X, that is only done once, so the resulting speed-up in the scalar products for large lists can be significant. Alas, for uniformly distributed random integers, the effect is miniscule.

Strings

After reflecting on this during my daily walk, I came up with a simpler way to compute an appropriate key, using string catenation rather than arithmetic. The following example shows how a listlist of positive integers is converted to a corresponding list of strings.

[  [  3, 400,   2 ] → "103" "1400" "1002" → "10314001002"
 , [ 20,  31, 203 ] → "120" "1031" "1203" → "12010311203"
 , [  0,  12,  12 ] → "100" "1012" "1012" → "10010121012"
]
Here is the code that implements this conversion.
sortLL3 := proc(L::listlist(nonnegint))
local X,x,n,i;
    X := map(x -> 10^length(max(op(x))), `kernel/transpose`(L));
    n := nops(X);
    SORT_W_ATTR(cat("", seq(X[i]+x[i], i=1..n)), x, x=L, 'lexorder');
end proc:
A minor improvement can be made by using Maple's ability to directly add sequences, term by term:
sortLL4 := proc(L::listlist(nonnegint))
local X,x;
    X := seq(10^length(max(op(x))), x = `kernel/transpose`(L));
    SORT_W_ATTR(cat("", X+x[]), x, x=L, 'lexorder');
end proc:

Negative Integers

Procedure sortLL4 is restricted to nonnegative integers (do you see why?). A slight modification removes this restriction:
sortLL5 := proc(L::listlist(integer))
local X,x;
    X := seq(2*10^length(max((-min,max)(op(x)))), x = `kernel/transpose`(L));
    SORT_W_ATTR(cat("", X+x[]), x, x=L, 'lexorder');
end proc:

Caveat

Using strings with sorting-by-attributes has a subtle but serious flaw. It is not thread-safe. All Maple strings, regardless of any attached attributes, share a common address. This can be seen by doing
sortLL4([[3,4],[1,2]]);
                        [[1,2],[3,4]]
attributes("1112");
                        [1,2]

If two threads called sortLL5 concurrently, a string attributed by one thread could be changed by the other. The call to map(attributes, . ) would then insert a sublist from one thread into the output of the other.

Currently, very few Maple routines use threads, and not all the existing code is thread-safe. However, in the future, as multicore devices become popular, I expect that thread-safety will be important.

The Threads package provides mutexes, which can prevent collisions in data structures however, in this case, because strings are inherently global, there is no robust way to use mutex's to avoid collisions. To avoid the thread-safety issue, attributes must not be applied to strings.

Solution

There is, fortunately, a solution to the thread-safety issue: enclose each string in a list and apply the attribute to the list. The Maple sort procedure can use a builtin comparator, lexorder[n], for comparing lists of strings. The index to lexorder specifies the element in the list to compare. Here is the thread-safe code:
sortLL6 := proc(L::listlist(integer))
local X,x;
    X := seq(2*10^length(max((-min,max)(op(x)))), x = `kernel/transpose`(L));
    SORT_W_ATTR([cat("", X+x[])], x, x=L, 'lexorder'[1]);
end proc:
Alas, this modification reduces the speed of the routine by about a factor of 2, compared to the non thread-safe version. However, it is still faster, and uses less memory, than sortLL1.

Comparison

Here are plots comparing the time and memory usage of each routine. Integers are uniformly distributed from 1 to max.

Vary Number of Sublists

Fix n=5 (number of integers in each list)
    max=10^5
Vary m from 10 to 10^3

Vary Number of Integers in Sublists

Fix m=100 (number of sublists)
    max=10^5
Vary n from 1 to 100

Vary Maximum Integer

Fix m=100 (number of sublists)
    n=5   (number of integers in each list)
Vary max from 10 to 10^4

These plots were created using the Tine package. Here is the Maple Code used to assign and measure the procedures described here.

Comments

sortLL6 is nearly perfect!

Joe,

I'm quite impressed by sortLL6, as it nearly does what I've been needing, namely, a procedure which will sort a list of lists of floats. Can your procedure be adapted to do this? Also, how would one sort on a position other than the first in the nested lists? For instance, given

foo := [[-50001, 1], [61, -99], [303, 4]]

how would one sort on the index 2 position to give

[[61, -99], [-50001, 1], [303, 4]] ?

Once again, thanks for the extremely clever and efficient solution to sorting integer listlists.

Cheers,

Eric.

Partial answer

I'm not sure how you want to sort the list of lists of floats.  Could you give an example?

Sorting a list of lists using a particular element as the key is easy. To make it fast use sorting with attributes. Here is how you can do that with your given example,

foo := [[-50001, 1], [61, -99], [303, 4]]:
map(attributes,sort([seq(setattribute(SFloat(L[2]),L), L = foo)]));
                                         [[61, -99], [-50001, 1], [303, 4]]

soarting floats

Joe,

I have a MonteCarlo model which generates large (>2^17) lists of lists of pairs of floats. Here is a tiny example:

[
[2.224000000, 2.018000000],
[3.184000000, 2.034000000],
[6.240000000, 1.876000000],
[2.824000000, 3.066000000],
[4.088000000, 4.030000000]
]

I would like to do an ascending numerical sort of the listlist, indexing on a position in the list element, [float,float]. Sample output sorting on the first position would be:

[
[2.224000000, 2.018000000],
[2.824000000, 3.066000000],
[3.184000000, 2.034000000],
[4.088000000, 4.030000000],
[6.240000000, 1.876000000]
]

and a sample using the second position would be:

[
[6.240000000, 1.876000000],
[2.224000000, 2.018000000],
[3.184000000, 2.034000000],
[2.824000000, 3.066000000],
[4.088000000, 4.030000000]
]

Ideally, a procedure which could be called as follows would be ideal:

:=sortLL(,),

where elements are lists of floats, and is the index of the element in the float list on which the sort will be done. The float lists need not be confined to just 2 elements, as in my example, but could be of arbitrary length. That length, however, should be the same for all float lists in the particular listlist being sorted.

Thanks for your example of how to select a different index. That answers part of the question, but I still am unclear how to procede with float instead of integer.

Cheers,

Eric.

sorting floats (correction)

Joe,

I used angle braces in my last post, and they got swallowed by the blog formatter. The last part should read as follows:

=============
Ideally, a procedure which could be called as follows would be ideal:

:=sortLL("listlist","position"),

where "listlist" elements are lists of floats, and "position" is the index of the element in the float list on which the sort will be done. The float lists need not be confined to just 2 elements, as in my example, but could be of arbitrary length. That length, however, should be the same for all float lists in the particular listlist being sorted.
...
=============

Thanks,

Eric.

my "sorting floats" solution

Joe,

After reading the manual more thoroughly on "sort", I came up with the following:

sortLLe := proc ( L::( listlist(numeric) ), i )
# This procedure sorts a list of lists of numbers in ascending order;
# i is the sort index within the list element.
sort( L, (a, b) -> evalb(a[i] < b[i] ) );
end proc;

It seems to do the job, although it's probably not as fast as any of the solutions you've given.

Cheers,

Eric.

nice

That will, indeed, work. As you suspect, it is not the fastest.  The principal reason that it is slow is that every comparison, and there are typically n*log(n) of them, requires a call to the user procedure (x,y) -> evalb(x[i]<y[i]).

The searching-with-attributes that I mentioned above avoids that by first extracting the key from the list than using the builtin sort, which doesn't require an external comparator.  The trick is that the computed key for each row (the value in the selected position) is tagged (attributed) with the entire row.  After sorting the keys we merely extract the attributes.

SortLL := proc(LL::listlist(numeric), pos::posint)
local L;
    map(attributes, sort([seq(setattribute(SFloat(L[pos]), L), L = LL)]))
end proc:
foo := [[-50001, 1], [61, -99], [303, 4]]:
SortLL(foo,1);
                      [[-50001, 1], [61, -99], [303, 4]]

SortLL(foo,2);
                      [[61, -99], [-50001, 1], [303, 4]]

 

Conceptually a better way to do this is to use an inplace-sort on an Array of floats; that can be accomplished with external compiled routines and should be quite a bit faster.  I'm looking into doing that.

"slicing" a sorted list

Joe,

This is related to our previous discussion regarding sorted listlists.

Given a sorted list of numeric lists, I would like to slice it into two lists: one comprising the left half of the sorted listlist in which an element of the numeric list element is less than a certain value, and the other, the right half remaining.

I have been using a loop to do this, but, as you might guess, it's exceedingly slow:

L:=NULL:
R:=NULL:
for i from 1 to Sall do
if (S[i][2]>=cutoff) then
R:=R,S[i]:

fi:
od:

I am sure that there are tools in Maple which would allow me to do this more simply and faster. Can you suggest them?

Thanks,

Eric.

acer's picture

Array or Matrix

Eric, I suggest that you abandon using lists of lists for this. I would suggest using either an Array or a Matrix, and perhaps with float[8] datatype for maximal speed and minimal memory footprint.

For one thing, lists of lists are inherently bad for sorting, due to immutability. (Sorting lists of lists via attributes is neat, but the very fact that acrobatics are required makes it suspect.)

Extraction of columns or rows of a Matrix/Array is easy -- there are several nice ways and even nice syntax. If you sort a column, and store the permutations as a side-effect, then you can apply those permutations to all the Array/Matrix at once. There are easy ways to do that too, with nice syntax.

And, if the Array/Matrix is float[8] datatype then you can using Maple's Compiler on procedures that you write to do these tasks. Such a compiled routine could be the fastest approach.

Slicing an Array/Matrix (by sorted values in a named column, or not) is also very easy and relatively fast.

acer

"slicing" a sorted list

Joe,

This is related to our previous discussion regarding sorted listlists.

Given a sorted list of numeric lists, I would like to slice it into two lists: one comprising the left half of the sorted listlist in which an element of the numeric list element is less than a certain value, and the other, the right half remaining.

I have been using a loop to do this, but, as you might guess, it's exceedingly slow:

L:=NULL:
R:=NULL:
for i from 1 to Sall do
if (S[i][2]<=cutoff) then
R:=R,S[i]:
else
L:=L,S[i]:
fi:
od:

I am sure that there are tools in Maple which would allow me to do this more simply and faster. Can you suggest them?

Thanks,

Eric.

Doug Meade's picture

select and remove

This sounds like a job for hree of my favorite time-saving Maple commands: select, remove, or selectremove. One of the benefits of these commands is that they do not assume the original data is sorted.

Here's an example of how they can be used in a problem like yours.

First generate some data

with(RandomTools):
Sall := Generate( list([integer(range=-10..10),integer(range=-10..10),integer(range=-10..10)],10) ); 
[[-3, 3, 3], [-2, 6, -9], [-2, 10, -8], [-9, -6, -8], [-2, -6, 3], [9, 5, 8], 

  [-8, -2, 2], [-8, -3, -4], [-2, -1, 6], [-6, -10, -10]]

Then, separate the list into those elements whose second element is less than 5 or greater than or equal to 5

L,R := selectremove( s->s[2]<5, Sall ):
L;
     [[-3, 3, 3], [-9, -6, -8], [-2, -6, 3], [-8, -2, 2], [-8, -3, -4], 

       [-2, -1, 6], [-6, -10, -10]]
R;
                   [[-2, 6, -9], [-2, 10, -8], [9, 5, 8]]

The key to using select is coming up with the correct test for deciding if each element is selected or removed. In many cases this can be a builtin Maple command, such as isprime or has. For example, to select the real roots of a solution, I do the following:

soln := solve( (x^2-1)^3=1, x );
                                         (1/2)                    (1/2)  
      (1/2)   (1/2)    1 /         (1/2)\       1 /         (1/2)\       
    -2     , 2     , - - \2 + 2 I 3     /     , - \2 + 2 I 3     /     , 
                       2                        2                        

                          (1/2)                    (1/2)
        1 /         (1/2)\       1 /         (1/2)\     
      - - \2 - 2 I 3     /     , - \2 - 2 I 3     /     
        2                        2                      
solnR := remove( has, [soln], I );
                              [  (1/2)   (1/2)]
                              [-2     , 2     ]

I hope this helps with your problem,

Doug

---------------------------------------------------------------------
Douglas B. Meade  <><
Math, USC, Columbia, SC 29208  E-mail: mailto:meade@math.sc.edu       
Phone:  (803) 777-6183         URL:    http://www.math.sc.ed

 

explanans

Doug's solution, using selectremove, is the proper approach. I'm responding mainly to point out the reason that your approach is so slow.  It isn't so much because it uses a do loop, but rather because you are building up a sequence (two sequences) term by term in the loop.  That is, as the loop progresses, R becomes

R1
R1,R2
R1,R2,R3
R1,R2,R3,R4,...

Each new sequence must be created and stored.  For a final sequence of n terms, this is O(n^2) in time and memory usage.   The standard method for avoiding this is to add the terms to a table and then, when complete, convert the table to a list.  For example

R := table(): L := table():
j := 0: k := 0:
for i to Sall do
   if S[i][2] <= cutoff then
       j := j+1;
       R[j] := S[i];
   else
      k := k + 1;
      L[k] := S[i];
   end if;
end do:
R := convert(R, list);
L := convert(L, list);

Of course, using selectremove is the better approach here...

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
}