Measuring some tree-traversing functions


In a previous article my friend Zyni wrote some variations on a list-flattening function, some of which were ‘recursive’ and some of which ‘iterative’, managing the stack explicitly. We thought it would be interesting to see what the performance differences were, both for this function and a more useful variant which searches a tree rather than flattening it.

What we measured

The code we used is here [1]. We measured several variants of each of two functions.

List flattening

All these functions use collecting to build their results forwards. They live in flatten-variants.lisp.

• flatten/implicit-stack works in the obvious recursive way, with an implicit stack. This uses iterate to express the local recursive function.
• flatten/explicit-stack uses an explicit stack (called agenda in the code) represented as a vector, and uses looping to express iteration.
• flatten/explicit-stack/adja is like the previous function but it is willing to extend the explicit stack, which it does by using adjust-array and assignment.
• flatten/explicit-stack/adjb is like flatten/explicit-stack/adja but uses a local tail-recursive function to bind the extended stack rather than assignment.
• Finally flatten/consy-stack is very close to Zyni’s original iterative solution: it represents the stack as a list. This version necessarily conses fairly copiously.
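To give a flavour of the consy-stack approach, here is a sketch of it. This is not the measured code (which builds its result with collecting and lives in flatten-variants.lisp); it is an illustrative reconstruction with an invented name:

    ;; A sketch of the consy-stack approach, not the measured code:
    ;; the stack ('agenda') is just a list, so every push conses.
    (defun flatten/consy-stack-sketch (tree)
      (let ((agenda (list tree))
            (results '()))
        (loop while agenda
              do (let ((this (pop agenda)))
                   (cond ((null this))          ;empty branch: nothing to do
                         ((consp this)          ;push cdr then car, so the
                          (push (cdr this) agenda) ;car is processed first
                          (push (car this) agenda))
                         (t (push this results)))))
        (nreverse results)))

Pushing the cdr before the car keeps the result in left-to-right order.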

Searching cons trees

These functions, in treesearch-variants.lisp, correspond to the flattening variants, except they are searching for some atomic value in the tree of conses:

• search/implicit-stack uses an implicit stack;
• search/explicit-stack uses a vector;
• search/consy-stack uses a consy stack.
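The implicit-stack variant has the obvious recursive shape. Again, this is a sketch with an invented name, not the measured code (which uses iterate):

    ;; A sketch of the recursive search, not the measured code.
    (defun search/implicit-stack-sketch (value tree)
      "Return true if VALUE (an atom, compared with EQL) occurs in TREE."
      (labels ((srch (this)
                 (cond ((eql this value) t)
                       ((consp this)
                        (or (srch (car this))
                            (srch (cdr this))))
                       (t nil))))
        (srch tree)))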

Notes on the code

The functions all have (declare (optimize (speed 3))) but specifically don’t turn off safety or use implementation-specific settings: we wanted to test code we felt we’d be happy running, and that means code compiled with reasonable settings for safety. If you turn safety off you’re brave, foolish, or both.

We did not compare looping with do or loop: we probably should. However the expansion of looping is pretty straightforward:

    (looping ((this o) (depth 0))
      (declare ...)
      ...)

turns into

    (let ((this o) (depth 0))
      (declare ...)
      (block nil
        (tagbody
         #:start
          (multiple-value-setq (this depth) ...)
          (go #:start))))

The only real question here, we think, is whether multiple-value-setq is compiled well: brief inspection implies it is. We should probably still compare the current version with more ‘native CL’ variants.
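Modulo the gensym’d tag, that expansion is what a bare loop would give you anyway, so a hand-written ‘native CL’ variant (not benchmarked; the declare and update forms are elided as in the expansion above) would look like:

    ;; Hand-written with the LOOP macro, not benchmarked: a bare LOOP
    ;; with no keywords expands to the same BLOCK/TAGBODY/GO pattern.
    (let ((this o) (depth 0))
      (declare ...)
      (loop
        (multiple-value-setq (this depth) ...)))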

The variants which use a vector as a stack maintain the current element themselves: that’s because we tested using a fill pointer and vector-push / vector-pop and it was really significantly slower in both implementations.
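Concretely, the two styles of vector stack look like this (a sketch, with hypothetical variable names):

    ;; The fill-pointer style, which was significantly slower:
    (vector-push this agenda)              ;push
    (setf this (vector-pop agenda))        ;pop

    ;; What the measured code does instead: maintain INDEX itself.
    (setf (aref agenda index) this         ;push
          index (1+ index))
    (setf index (1- index)                 ;pop
          this (aref agenda index))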

What we did

The Lisp implementations we used

We used LispWorks 8.0 and very recent SBCL builds, compiled from the master branch no more than a few days before we ran the tests in mid March 2023.

In the case of SBCL we paid attention to notes and warnings during compilation. The significant one we did not address was a vociferous complaint that it could not optimize calls to eql: since we don’t know the type of the thing we are searching for, eql has to do exactly the work the compiler is trying to avoid. Apart from this the only warnings were about the computation of the new length of the agenda, which never actually happens in the tests we ran.

The machines we benchmarked on

We both have M1-based Macbook Airs so this is what we used. In particular we have not run any benchmarks on x64.

What we ran

make-car-cdr, in common.lisp, makes a list where each element is a chain linked by cars, finally terminating in a specified element. Controlling the length of the list and the depth of the chains gives the functions more iterative or more recursive work to do respectively. The benchmarking code then made a series of suitable structures of increasing size and timed many iterations of each function on the same structure, computing the time per call. We then wrote a program in Racket to plot the results on axes of ‘breadth’ (length of the list) and ‘depth’ (depth of the car-linked chain). For the search functions the element being searched for was not in the tree so they had to do as much work as possible.
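A sketch of the sort of structure make-car-cdr builds (the real function is in common.lisp; this reconstruction, with an invented name, is just to make the shape concrete):

    ;; A reconstruction of the structure MAKE-CAR-CDR builds: a list
    ;; of LENGTH elements, each a DEPTH-long chain of conses linked
    ;; by cars, terminating in TERMINAL.
    (defun make-car-cdr-sketch (length depth terminal)
      (flet ((chain (d)
               (let ((c terminal))
                 (dotimes (i d c)
                   (setf c (cons c nil))))))
        (let ((l '()))
          (dotimes (i length l)
            (push (chain depth) l)))))

Walking down the cdrs of the result is the iterative work; descending the car chain of each element is the recursive work.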

Life was usually arranged so that the initial agenda was big enough for the functions which used a vector as the agenda, so that aspect of them was not tested, except in one case below. Apart from that case, the ‘vector stack’ timings refer to flatten/explicit-stack and treesearch/explicit-stack, not the adjustable-stack variants.

Some results

We timed 1,000 iterations of each call, for list lengths (breadth in the plots and below) from 30 to 1,000 in steps of 10, and depths (depth in the plots and below) from 10 to 300 in steps of 10, computing times in μs per iteration. Neither of us knows much about how data like this is best presented, but simply plotting the performance surfaces seemed reasonable. We used bilinear interpolation to make the surface from the points [2].
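For reference, the interpolation itself is the easy part (the bookkeeping around it is what took the time). A sketch in CL, although our actual plotting code was in Racket:

    ;; Bilinear interpolation on a grid cell: F00..F11 are the values
    ;; at the four corners, U and V are fractional coordinates in
    ;; [0, 1].  Interpolate along x at both edges, then along y.
    (defun bilerp (f00 f10 f01 f11 u v)
      (let ((fx0 (+ (* (- 1 u) f00) (* u f10)))   ;bottom edge
            (fx1 (+ (* (- 1 u) f01) (* u f11))))  ;top edge
        (+ (* (- 1 v) fx0) (* v fx1))))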

LispWorks

This is nicely linear in both breadth and depth, and so quadratic in breadth × depth. And it’s easy to see that for LW using the implicit stack is faster than the manually-managed stack.

This compares the vector stack with the consy stack, for treesearch. The consy stack is slightly faster which surprised us. This conses a list as long as the depth of the tree for each ‘leftward’ branch, and then immediately unwinds that and throws the whole list away. So it creates significant garbage, but the allocation and garbage collection overhead together is still faster than using a vector. Consing really is (almost) free.

Here is more evidence that consing is very cheap: the difference between treesearch (which does not cons) and flatten (which does) is tiny.

SBCL

So here is SBCL. For SBCL, explicitly managing the stack as a vector is significantly faster than the implicit stack. Also apparent here is how variable SBCL’s timings are compared with LW’s: we don’t know why that is, although we suspect SBCL’s garbage collector may be more intrusive than LW’s. We also don’t know whether this variation is repeatable, or whether it’s due to a single very slow run or something like that.

For SBCL the consy stack is significantly slower than the vector stack, so for SBCL the vector stack is the fastest.

SBCL has a slightly larger difference between treesearch and flatten, with flatten being slower. There are also curious ‘waves’ in the plot as depth increases.

LispWorks compared with SBCL

LW is significantly faster than SBCL for implicit stacks except for very small depths.

This compares LW using an implicit stack with SBCL using an explicit vector stack. The difference is pretty small now.

This was meant to be the worst case for both: flattening and a consy stack. But it’s not particularly informative, we think.

The outer reaches: LispWorks with a deep tree

We did one run with the maximum depth set to 10,000 with a step of 500, and maximum breadth set to 1,000 with a step of 100, averaged over 100 iterations instead of 1,000. This is too deep for LW’s stack, but LW allows stack extension, and we wrote what later became this to extend the stack as required. Note that this happens only during the first recursion into the left-hand branch of the tree so has minimal effect on performance. This also used search/explicit-stack/adjb for the vector stack.

As before, the implicit stack is much better for LW. This is much more bumpy than LW was for smaller depths: that might have been because the machine was doing other things while it was running, but we don’t think so.

Some conclusions

None of the differences were really large. In particular there’s no enormous advantage from managing the stack yourself.

Consing and the resulting garbage-collection does really seem to be very cheap, especially in LispWorks: the days of long GC pauses are long gone.

We were surprised that LispWorks was fairly reliably faster than SBCL: surprised enough that we ran everything several times to be sure. It’s also interesting how much smoother LW’s performance surface is in most cases.

It is possible that our implementations just suck, of course.

Mostly it’s just some pretty pictures.

1. All of the functions should be portable CL. Some of the mechanism for expressing dependencies and loading things is not. However it should be easy for anyone to run this if they wish to.

2. Getting the bilinear interpolation right took longer than anything else, and perhaps longer than everything else put together.