Fragments: Posts tagged 'computer'

Numerical prediction

Tim Bradshaw — Fri, 28 Jul 2023 10:39:12 UT

In late 2018, when I still worked at the Met Office, I sent a document to some people there which explained why I thought AI would come to dominate weather forecasting, and why weather forecasting organisations should be looking at AI, urgently. Today, the 28th of July 2023, there is a leader on the subject in The Economist as well as an extended article in its Science and Technology section.

2018

Here¹ is the document I wrote in 2018: if it was ever sensitive I don’t think it is now. Here are some excerpts from it²:

Neural networks are likely to provide better weather forecasts in due course than current numerical models. If this is true then weather forecasting organisations that don’t use them will be replaced by ones that do. Even though this only may be true, weather forecasting organisations should be investigating these techniques, today.

[…]

[…] NN models are likely to be highly successful for weather prediction. However they will not be trivial to design and deploy: cargo cult NN approaches are not going to work.

If NN models are successful then they will largely displace hand-crafted physics-based models (GCM models such as UM³). Weather forecasting is a service, and consumers of the service care only about how good the forecasts are rather than how they are produced.

If this happens then organisations involved in weather forecasting, such as the Met Office, will need to adopt NN models or cease to exist: NNs are an existential threat to weather forecasting organisations.

This means that such organisations should be investigating NN models very seriously now so that, in the likely case that they are successful, they are not left behind.

[…]

The traditional approach [to weather forecasting] is to understand the physics and write a system which numerically solves the equations to a lesser or greater degree of accuracy. This has been pretty successful of course.

An alternative approach is to not do that at all, but rather build a system which can, itself, learn to simulate the weather: a system which can be trained to simulate the weather, in other words, based on observations. As far as I’m aware such an approach has not been tried on any significant scale.

[…]

There is copious training data. There is obviously a really huge amount of data which can be used to drive a model, which NNs love. But NN models need training data in general: they need to be told how well they did so they can correct their weights. And weather is almost the best example it’s possible to think of of this: if we want to predict, say, rainfall in 24 hours time, then, if we wait 24 hours, we know how much rain actually fell, and we can use that data to teach the model how to do better. And this is true for everything, all the time: every time the model makes any prediction about the state at some future time then, at that future time, we know what the state actually is and can use that information to train the model. This is the sort of situation NN people dream about.

[…]

[…] Hand-crafted models are more likely to remain sane than NN models in the early stages. There’s no rule that says that an NN won’t get some mad idea into its head and start, occasionally, making predictions which are completely physically insane.

[…]

While NN models are an almost perfect fit for weather forecasting they are, perhaps surprisingly, a terrible fit for climate modelling. This is for two reasons.

Sparseness of training data. NNs are likely to work for weather prediction because the training data is so copious: if you want to predict the weather a given time ahead then you simply predict, wait until that amount of time has elapsed and you have training data, and then you iterate this process. You can’t do that for climate: if you want to predict the climate a century ahead you can neither wait for a century for the training data nor can you iterate the process.

Opacity of NN models. Even if climate modelling by an NN is technically practical it’s an absolutely terrible answer to the questions people actually want to answer. If I run some NN model and it predicts 4 degrees of warming by 2100 the first thing people will ask is ‘why does it predict that?’. And the best answer to that question is ‘because some opaque blob of weights which neither I nor any human understands told me that’, which is a terrible answer: it’s essentially the same as ‘a voice in my head told me’. Given the political sensitivity of climate modelling this is not going to be an answer anyone will accept, and nor should they.

So climate modelling is a really good example of a place where a transparent physics-based model is the only reasonable answer. And that’s ultimately because the people who are interested in climate are not just interested in a statistically-good prediction (whatever that even means in this case): they’re interested in why the prediction is what it is. Climate modelling requires hand-crafted physics-based models, and there’s no way around that.

2023

Here is an excerpt from The Economist’s leader:

The application of machine learning and other forms of artificial intelligence (AI) will improve things further. The supercomputers used for NWP calculate the next days’ weather on the basis of current conditions, the laws of physics and various rules of thumb; doing so at a high resolution eats up calculations by the trillion with ridiculous ease. Now machine-learning systems trained simply on past weather data can more or less match their forecasts, at least in some respects. If advances in AI elsewhere are any guide, that is only the beginning.

Well, I am not some unique genius: many people could, and probably did, see what was coming when I wrote the 2018 document. I predicted that neural network approaches would come to dominate weather forecasting, and it looks like they will.

But what I also realised remains, I think, important, and is not addressed at all in the articles in The Economist. And that is this:

AI, in the form of neural networks, is not a suitable approach to climate prediction both because the training data is inadequate, but more importantly because it is critical that climate models not only predict the climate but allow people to understand why they are predicting what they predict, rather than simply being an opaque blob;
currently climate models, at least in the Met Office and I am sure elsewhere, are to a great extent parasitic on weather models, sharing a great deal of their code with those models.

This means that if weather forecasting becomes dominated by opaque NN models, climate modellers will have to bear the entire cost of funding development of their models. Chances are they can’t do that.

An even worse outcome would be that climate modellers leap into using opaque NN models without thinking through what this means. This would hand the climate denialists who increasingly dominate the politics of the UK a weapon which they would certainly not hesitate to use.

When I sent the 2018 document to people in the Met Office I did not even receive an acknowledgement: I am quite sure nobody read it. I think this says a great deal about the nature of organisations like the Met Office.

Despite how all this might read, I’m not at all embittered by this: if I cared about the Met Office in 2018 I certainly don’t now, four years later. If anything, I’m rather pleased that what I thought, in 2018, would happen does indeed seem to be happening. Most importantly I want the other thing I realised in 2018 (that climate modelling isn’t well-suited to NN approaches and that organisations which do both weather and climate modelling need to worry about this as NN approaches to weather forecasting eat physics-based approaches alive) to exist in some form that is accessible to people. That’s why this article exists.

The location of this document might change. This post itself is a better link to remember as I will update the pointer if I move the document. ↩
Note that I used the term ‘neural network’, abbreviated to ‘NN’ in the document, as I did not then (and do not now) want to lazily consider neural networks to be the same thing as AI. ↩
UM, the Unified Model, was the model the Met Office used for both weather and climate modelling in 2018. ↩

Closed as duplicate considered harmful

Tim Bradshaw — Mon, 05 Dec 2022 16:10:07 UT

The various Stack Exchange sites, and specifically Stack Overflow, seem to be some of the best places for getting reasonable answers to questions on a wide range of topics from competent people. They would be a lot better if they were not so obsessed about closing duplicates.

Closing duplicates seems like a good idea: having a single, canonical, question on a given topic with a single, canonical, answer seems like a good thing. It’s not.

The reason it’s not is that it makes two false assumptions:

that a given question has a single best answer;
that this answer does not change over time.

Neither of these assumptions is true for a large number of interesting questions.

Questions can have several good answers. I have at least three introductory books on analysis, and not because I didn’t find the good one on the first try: I have several because they give different perspectives (different answers, in the sense of Stack Exchange) on various aspects of the subject. I have several books on introductory quantum mechanics, several books on introductory general relativity, and so it goes on. It is, simply, a delusion that there exists a single most helpful answer to many questions: pretending that there is one is stupidly limiting.

And what constitutes a good answer can change over time. If you asked, for instance, what a macro was in Lisp and what macros are good for, you would have got very different answers in 1982 than in 2022¹. The same is true for many other subjects: human knowledge is not static.

All of this is made worse as only the person asking a question can accept an answer: they may not do so at all or, worse, they may be asking in bad faith and accept wrong or misleading answers (yes, this happens in various Stack Exchanges).

The true Stack Exchange believer will now explain in great detail² why none of this matters: people should just spend their time adding improved answers to questions which already have accepted answers rather than to new questions which will be closed as duplicates. Because, of course, the accepted answer will not be the one almost everyone looks at, and even if they don’t care about increasing their karma on Stack Exchange, they will be very happy to write answers that, in the real world, almost nobody will ever look at.

Yeah, right.

This is such a shame: Stack Exchange is a good thing, but it’s seriously damaged by this unnecessary problem. The answer is not simply to allow unrestricted duplicates, but to wait for a bit and see if a question which is, or is nearly, a duplicate has attracted new and interesting answers, and to not close it as a duplicate in that case. This would not be hard to do.

And even in 2022 you will get answers from people who seem not to have learned anything since 1982. ↩
Please, don’t: I don’t have a Stack Exchange account any more and, even if I did, I would not be interested. ↩

Vector supercomputers

Tim Bradshaw — Thu, 30 Dec 2021 12:20:51 UT

There are apocryphal reports that Apple M1 systems are not as fast as people have been led to believe for general-purpose programs. That’s unsurprising.

I think what’s happened is that vector supercomputers have secretly won, and with them come all their performance weirdnesses that make a lot of code really suck: no-one wanted to run anything other than rather specialised programs on a Cray 1 or any of its descendants because it was just not very fast for that. Vector supercomputers were great at numerical loops over large arrays, but they were absolutely terrible at code which had to make lots of actual decisions.

So now we’re seeing machines which are optimised to be extremely good at mashing arrays of numbers, and much less good at general computation. Of course, unlike the 1970s & 80s machines ‘much less good’ is ‘quite good enough’ in almost all cases.

And they’ve won, really, because we’re in the middle of another AI hype-cycle: the last hype cycle gave us all sorts of weird hardware like Lisp machines, graph-reduction machines and so on: this one, which is built, really, on programs which ought to be written in Fortran, is giving us special-purpose array-mashing machines — vector supercomputers, in other words — which are really good at all the annoying machine-learning things our computers now insist on foisting on us.

Well, this AI hype cycle will be like all the other AI hype cycles: despite the idiot boosters who have conveniently forgotten what happened last time and all the times before that, we are not anywhere near some kind of strong AI based on machine learning. Already you can see this: whatever language-learning system we’re all meant to worship at the feet of has now been trained on all the natural language that exists on the internet, in order to produce results which are not, in fact, acceptable. And there’s nowhere to go from here: there is no more training data.

It remains to be seen whether array-mashing machines outlive the hype that gave rise to them: there are good uses for systems like this, just as there are good uses for machine learning, but when the bubble bursts it may yet take them with it.

The proper use of macros in Lisp

Tim Bradshaw — Thu, 11 Nov 2021 14:32:11 UT

People learning Lisp often try to learn how to write macros by taking an existing function they have written and turning it into a macro. This is a mistake: macros and functions serve different purposes and it is almost never useful to turn functions into macros, or macros into functions.

Let’s say you are learning Common Lisp¹, and you have written a fairly obvious factorial function based on the natural mathematical definition: if $n \in \mathbb{N}$, then

\[ n! = \begin{cases} 1 &n \le 1\\ n \times (n - 1)! &n > 1 \end{cases} \]

So this gives you a fairly obvious recursive definition of factorial:

(defun factorial (n)
  (if (<= n 1)
      1
    (* n (factorial (1- n)))))

And so, you think, you want to learn about macros, so can you write factorial as a macro? And you might end up with something like this:

(defmacro factorial (n)
  `(if (<= ,n 1)
      1
    (* ,n (factorial ,(1- n)))))

And this superficially seems as if it works:

> (factorial 10)
3628800

But it doesn’t, in fact, work:

> (let ((x 3))
    (factorial x))

Error: In 1- of (x) arguments should be of type number.

Why doesn’t this work and can it be fixed so it does? If it can’t what has gone wrong and how are macros meant to work and what are they useful for?

It can’t be fixed so that it works. Trying to rewrite functions as macros is a bad idea, and if you want to learn what is interesting about macros you should not start there.
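The error above comes from the macro function itself: it runs before x has any value, so what it is handed is the symbol x, and ,(1- n) then asks for one less than a symbol. A minimal sketch of this, using a hypothetical macro argument-kind and a hypothetical function argument-kind-demo (neither is part of the original example), reports what a macro actually receives:

```lisp
;;; Hypothetical helper macro, for illustration only: it expands into a
;;; keyword naming the kind of object the macro function was handed.
;;; Keywords evaluate to themselves, so the expansion is a valid form.
(defmacro argument-kind (form)
  (typecase form
    (number :literal-number)
    (symbol :symbol)
    (cons :compound-form)
    (t :other)))

(defun argument-kind-demo ()
  (let ((x 3))
    (declare (ignorable x))            ; X never survives macroexpansion
    (list (argument-kind 10)           ; the macro sees the number 10
          (argument-kind x)            ; the macro sees the symbol X, not 3
          (argument-kind (+ x 1)))))   ; the macro sees the list (+ X 1)

(argument-kind-demo)
;; => (:LITERAL-NUMBER :SYMBOL :COMPOUND-FORM)
```

The value 3 never reaches the macro: only the source text of the argument does.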

To understand why this is true you need to understand what macros actually are in Lisp.

What macros are: a first look

A macro is a function whose domain and range is syntax.

Macros are functions (quite explicitly so in CL: you can get at the function of a macro with macro-function, and this is something you can happily call the way you would call any other function), but they are functions whose domain and range is syntax. A macro is a function whose argument is a language whose syntax includes the macro and whose value, when called on an instance of that language, is a language whose syntax doesn’t include the macro. It may work recursively: its value may be a language which includes the same macro but in some simpler way, such that the process will terminate at some point.
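As a small sketch of the claim that a macro's function can be called like any other, here is a hypothetical macro my-unless and a hypothetical helper expand-by-hand (both invented for this illustration). The helper assumes its argument is a list whose first element names a macro, and passes nil as the environment argument to mean the null lexical environment:

```lisp
;;; Hypothetical example macro, used only for this illustration.
(defmacro my-unless (test &body forms)
  `(if ,test nil (progn ,@forms)))

(defun expand-by-hand (form)
  ;; Fetch the macro's function and call it directly.  A macro function
  ;; takes two arguments: the whole form and an environment.
  (funcall (macro-function (first form)) form nil))

(expand-by-hand '(my-unless (> x 1) (print x)))
;; => (IF (> X 1) NIL (PROGN (PRINT X)))
```

The argument is syntax which includes my-unless and the value is syntax which does not. In normal use you would let macroexpand-1 do this for you rather than calling the macro function yourself.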

So the job of macros is to provide a family of extended languages built on some core Lisp which has no remaining macros, only functions and function application, special operators & special forms involving them and literals. One of those languages is the language we call Common Lisp, but the macros written by people serve to extend this language into a multitude of variants.

As an example of this I often write in a language which is like CL, but is extended by the presence of a number of extra constructs, one of which is called ITERATE (but it predates the well-known one and is not at all the same):

(iterate next ((x 1))
 (if (< x 10)
     (next (1+ x))
   x))

is equivalent to

(labels ((next (x)
          (if (< x 10)
              (next (1+ x))
            x)))
 (next 1))

Once upon a time when I first wrote iterate, it used to manually optimize the recursive calls to jumps in some cases, because the Symbolics I wrote it on didn’t have tail-call elimination. That’s a non-problem in LispWorks². Anyone familiar with Scheme will recognise iterate as named let, which is where it came from (once, I think, it was known as nlet).

iterate is implemented by a function which maps from the language which includes it to a language which doesn’t include it, by mapping the syntax as above.
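A minimal sketch of such a mapping might look like the following. This is not the real implementation of iterate: the name simple-iterate is hypothetical, and the sketch assumes every binding is written as a two-element list (variable initial-value).

```lisp
;;; A sketch only: maps a named-let-like form onto LABELS.
;;; Assumes each binding has the shape (VARIABLE INITIAL-VALUE).
(defmacro simple-iterate (name bindings &body body)
  `(labels ((,name ,(mapcar #'first bindings)
              ,@body))
     (,name ,@(mapcar #'second bindings))))

(defun count-to-ten ()
  (simple-iterate next ((x 1))
    (if (< x 10)
        (next (1+ x))
        x)))

(count-to-ten)
;; => 10
```

The macro function does nothing but rearrange syntax: the looping happens in the labels form it produces.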

So compare this with a factorial function: factorial is a function whose domain is natural numbers and whose range is also natural numbers, and it has an obvious recursive definition. Well, natural numbers are part of the syntax of Lisp, but they’re a tiny part of it. So implementing factorial as a macro is, really, a hopeless task. What should

(factorial (+ x y (f z)))

actually do when considered as a mapping between languages? Assuming you are using the recursive definition of the factorial function then the answer is it can’t map to anything useful at all: a function which implements that recursive definition simply has to be called at run time. The very best you could do would seem to be this:

(defun fact (n)
 (if (<= n 1)
     1
   (* n (fact (1- n)))))

(defmacro factorial (expression)
 `(fact ,expression))

And that’s not a useful macro (but see below).

So the answer is, again, that macros are functions which map between languages and they are useful where you want a new language: not just the same language with extra functions in it, but a language with new control constructs or something like that. If you are writing functions whose range is something which is not the syntax of a language built on Common Lisp, don’t write macros.

What macros are: a second look

Macroexpansion is compilation.

A function whose domain is one language and whose range is another is a compiler for the language of the domain, especially when that language is somehow richer than the language of the range, which is the case for macros.

But it’s a simplification to say that macros are this function: they’re not, they’re only part of it. The actual function which maps between the two languages is made up of macros and the macroexpander provided by CL itself. The macroexpander is what arranges for the functions defined by macros to be called in the right places, and also it is the thing which arranges for various recursive macros to actually make up a recursive function. So it’s important to understand that the macroexpander is a critical part of the process: macros on their own only provide part of it.

An example: two versions of a recursive macro

People often say that you should not write recursive macros, but this prohibition on recursive macros is pretty specious: they’re just fine. Consider a language which only has lambda and doesn’t have let. Well, we can write a simple version of let, which I’ll call bind, as a macro: a function which takes this new language and turns it into the more basic one. Here’s that macro:

(defmacro bind ((&rest bindings) &body forms)
 `((lambda ,(mapcar #'first bindings) ,@forms)
   ,@(mapcar #'second bindings)))

And now

> (bind ((x 1) (y 2))
    (+ x y))              
(bind ((x 1) (y 2)) (+ x y))
 -> ((lambda (x y) (+ x y)) 1 2)
3

(These example expansions come via use of my trace-macroexpand package, available in a good Lisp near you: see appendix for configuration).

So now we have a language with a binding form which is more convenient than lambda. But maybe we want to be able to bind sequentially? Well, we can write a let* version, called bind*, which looks like this:

(defmacro bind* ((&rest bindings) &body forms)
 (if (null (rest bindings))
     `(bind ,bindings ,@forms)
   `(bind (,(first bindings))
      (bind* ,(rest bindings) ,@forms))))

And you can see how this works: it checks if there’s just one binding in which case it’s just bind, and if there’s more than one it peels off the first and then expands into a bind* form for the rest. And you can see this working (here both bind and bind* are being traced):

> (bind* ((x 1) (y (+ x 2)))
    (+ x y))
(bind* ((x 1) (y (+ x 2))) (+ x y))
 -> (bind ((x 1)) (bind* ((y (+ x 2))) (+ x y)))
(bind ((x 1)) (bind* ((y (+ x 2))) (+ x y)))
 -> ((lambda (x) (bind* ((y (+ x 2))) (+ x y))) 1)
(bind* ((y (+ x 2))) (+ x y))
 -> (bind ((y (+ x 2))) (+ x y))
(bind ((y (+ x 2))) (+ x y))
 -> ((lambda (y) (+ x y)) (+ x 2))
(bind* ((y (+ x 2))) (+ x y))
 -> (bind ((y (+ x 2))) (+ x y))
(bind ((y (+ x 2))) (+ x y))
 -> ((lambda (y) (+ x y)) (+ x 2))
4

You can see that, in this implementation, which is LW again, some of the forms are expanded more than once. That’s not uncommon in interpreted code: since macros should generally be functions (so, not have side-effects), it does not matter that they may be expanded multiple times. Compilation will expand macros and then compile the result, so all the overhead of macroexpansion happens ahead of run-time:

> (defun foo (x)
    (bind* ((y (1+ x)) (z (1+ y)))
      (+ y z)))
foo

> (compile *)
(bind* ((y (1+ x)) (z (1+ y))) (+ y z))
 -> (bind ((y (1+ x))) (bind* ((z (1+ y))) (+ y z)))
(bind ((y (1+ x))) (bind* ((z (1+ y))) (+ y z)))
 -> ((lambda (y) (bind* ((z (1+ y))) (+ y z))) (1+ x))
(bind* ((z (1+ y))) (+ y z))
 -> (bind ((z (1+ y))) (+ y z))
(bind ((z (1+ y))) (+ y z))
 -> ((lambda (z) (+ y z)) (1+ y))
foo
nil
nil

> (foo 3)
9

There’s nothing wrong with macros like this, which expand into simpler versions of themselves. You just have to make sure that the recursive expansion process is producing successively simpler bits of syntax and has a well-defined termination condition.

Macros like this are often called ‘recursive’ but they’re actually not: the function associated with bind* does not call itself. What is recursive is the function implicitly defined by the combination of the macro function and the macroexpander: the bind* function simply expands into a bit of syntax which it knows will cause the macroexpander to call it again.
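One way to see this without a tracing package is to ask for a single expansion step with the standard macroexpand-1. The sketch below repeats the definitions of bind and the first version of bind* so that it stands alone:

```lisp
(defmacro bind ((&rest bindings) &body forms)
  `((lambda ,(mapcar #'first bindings) ,@forms)
    ,@(mapcar #'second bindings)))

(defmacro bind* ((&rest bindings) &body forms)
  (if (null (rest bindings))
      `(bind ,bindings ,@forms)
      `(bind (,(first bindings))
         (bind* ,(rest bindings) ,@forms))))

;; One step: the macro function for BIND* is called exactly once, and
;; its result still mentions BIND*.
(macroexpand-1 '(bind* ((x 1) (y (+ x 2))) (+ x y)))
;; => (BIND ((X 1)) (BIND* ((Y (+ X 2))) (+ X Y)))

;; MACROEXPAND repeats the step until the outermost form is no longer a
;; macro form.  It does not descend into subforms, so the inner BIND*
;; is left for the evaluator or compiler to deal with later.
(macroexpand '(bind* ((x 1) (y (+ x 2))) (+ x y)))
;; => ((LAMBDA (X) (BIND* ((Y (+ X 2))) (+ X Y))) 1)
```

The single step shows that the macro function only peels off one binding: the rest of the work is left to whatever expands the result.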

It is possible to write bind* such that the macro function itself is recursive:

(defmacro bind* ((&rest bindings) &body forms)
  (labels ((expand-bind (btail)
             (if (null (rest btail))
                 `(bind ,btail
                    ,@forms)
               `(bind (,(first btail))
                  ,(expand-bind (rest btail))))))
    (expand-bind bindings)))

And now compiling foo again results in this output from tracing macroexpansion:

(bind* ((y (1+ x)) (z (1+ y))) (+ y z))
 -> (bind ((y (1+ x))) (bind ((z (1+ y))) (+ y z)))
(bind ((y (1+ x))) (bind ((z (1+ y))) (+ y z)))
 -> ((lambda (y) (bind ((z (1+ y))) (+ y z))) (1+ x))
(bind ((z (1+ y))) (+ y z))
 -> ((lambda (z) (+ y z)) (1+ y))

You can see that now all the recursion happens within the macro function for bind* itself: the macroexpander calls bind*’s macro function just once.

While it’s possible to write macros like this second version of bind*, it is normally easier to write the first version and to allow the combination of the macroexpander and the macro function to implement the recursive expansion.

Two historical uses for macros

There are two uses for macros, both now historical, in which macros were used where functions would be more natural.

The first of these is function inlining, where you want to avoid the overhead of calling a small function many times. This overhead was a lot on computers made of cardboard, as all computers were, and also if the stack got too deep the cardboard would tear and this was bad. It makes no real sense to inline a recursive function such as the above factorial: how would the inlining process terminate? But you could rewrite a factorial function to be explicitly iterative:

(defun factorial (n)
 (do* ((k 1 (1+ k))
       (f k (* f k)))
      ((>= k n) f)))

And now, if you had very many calls to factorial, wanted to optimise the function call overhead away, and it was 1975, you might write this:

(defmacro factorial (n)
 `(let ((nv ,n))
    (do* ((k 1 (1+ k))
          (f k (* f k)))
         ((>= k nv) f))))

And this has the effect of replacing (factorial n) by an expression which will compute the factorial of n. The cost of that is that (funcall #'factorial n) is not going to work, and (funcall (macro-function 'factorial) ...) is never what you want.

Well, that’s what you did in 1975, because Lisp compilers were made out of the things people found down the sides of sofas. Now it’s no longer 1975 and you just tell the compiler that you want it to inline the function, please:

(declaim (inline factorial))
(defun factorial (n) ...)

and it will do that for you. So this use of macros is now purely historical.
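Spelled out in full, with the iterative definition from earlier as the function body, a sketch of the inline version looks like this. The function choose is a hypothetical caller added for illustration, and whether the calls really are inlined is up to the implementation and its optimization settings:

```lisp
;; The declamation has to come before the DEFUN, so the compiler can
;; record the definition for later inlining.
(declaim (inline factorial))
(defun factorial (n)
  (do* ((k 1 (1+ k))
        (f k (* f k)))
       ((>= k n) f)))

;; Hypothetical caller: the compiler may inline these three calls.
(defun choose (n r)
  (/ (factorial n)
     (* (factorial r) (factorial (- n r)))))

(choose 5 2)
;; => 10

;; Unlike the macro version, this is still an ordinary function object.
(mapcar #'factorial '(0 1 5))
;; => (1 1 120)
```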

The second reason for macros where you really want functions is computing things at compile time. Let’s say you have lots of expressions like (factorial 32) in your code. Well, you could do this:

(defmacro factorial (expression)
 (typecase expression
   ((integer 0)
    (factorial/fn expression))
   (number
    (error "factorial of non-natural literal ~S" expression))
   (t
    `(factorial/fn ,expression))))

So the factorial macro checks to see if its argument is a literal natural number and will compute the factorial of it at macroexpansion time (so, at compile time or just before compile time). So a function like

(defun foo ()
 (factorial 32))

will now compile to simply return 263130836933693530167218012160000000. And, even better, there’s some compile-time error checking: code which is, say, (factorial 12.3) will cause a compile-time error.
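The macro relies on a function factorial/fn whose definition is not shown here. A sketch of a complete version follows, assuming factorial/fn is simply the iterative function from earlier under another name. The eval-when wrapper is an addition of this sketch: it makes the function available at macroexpansion time when the code is compiled from a file.

```lisp
;; Assumed definition of FACTORIAL/FN: the iterative version from
;; earlier.  EVAL-WHEN makes it callable while a file is being compiled.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun factorial/fn (n)
    (do* ((k 1 (1+ k))
          (f k (* f k)))
         ((>= k n) f))))

(defmacro factorial (expression)
  (typecase expression
    ((integer 0)
     (factorial/fn expression))
    (number
     (error "factorial of non-natural literal ~S" expression))
    (t
     `(factorial/fn ,expression))))

(macroexpand-1 '(factorial 5))
;; => 120

(macroexpand-1 '(factorial (+ a b)))
;; => (FACTORIAL/FN (+ A B))
```

A literal natural number is replaced by its factorial, and anything else becomes an ordinary call to be made at run time.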

Well, again, this is what you would do if it was 1975. It’s not 1975 any more, and CL has a special tool for dealing with just this problem: compiler macros.

(defun factorial (n)
 (do* ((k 1 (1+ k))
       (f k (* f k)))
      ((>= k n) f)))

(define-compiler-macro factorial (&whole form n)
 (typecase n
   ((integer 0)
    (factorial n))
   (number
    (error "literal number is not a natural: ~S" n))
   (t form)))

Now factorial is a function and works the way you expect: (funcall #'factorial ...) will work fine. But the compiler knows that if it comes across (factorial ...) then it should give the compiler macro for factorial a chance to say what this expression should actually be. And the compiler macro does an explicit check for the argument being a literal natural number, and if it is computes the factorial at compile time, and the same check for a literal number which is not a natural, and finally just says ‘I don’t know, call the function’. Note that the compiler macro itself calls factorial, but since the argument isn’t a literal there’s no recursive doom.
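You can watch the compiler macro make these decisions by calling its function by hand, much as you can with macro-function. This is a sketch: expand-compiler-macro is a hypothetical helper, it assumes the first element of the form names something with a compiler macro, and nil again stands for the null lexical environment. The definitions are repeated so that the example stands alone.

```lisp
(defun factorial (n)
  (do* ((k 1 (1+ k))
        (f k (* f k)))
       ((>= k n) f)))

(define-compiler-macro factorial (&whole form n)
  (typecase n
    ((integer 0)
     (factorial n))
    (number
     (error "literal number is not a natural: ~S" n))
    (t form)))

;; Hypothetical helper: call the compiler macro function directly.
(defun expand-compiler-macro (form)
  (funcall (compiler-macro-function (first form)) form nil))

(expand-compiler-macro '(factorial 5))
;; => 120

;; For a non-literal argument the original form comes back unchanged,
;; which is how a compiler macro declines to do anything.
(expand-compiler-macro '(factorial x))
;; => (FACTORIAL X)

(funcall #'factorial 5)
;; => 120
```

If this lived in a file to be compiled, the function would also need to be defined at compile time for the literal case to work.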

So this takes care of the other antique use of macros where you would expect functions. And of course you can combine this with inlining and it will all work fine: you can write functions which will handle special cases via compiler macros and will otherwise be inlined.

That leaves macros serving the purpose they are actually useful for: building languages.

Appendix: setting up `trace-macroexpand`

(use-package :org.tfeb.hax.trace-macroexpand)

;;; Don't restrict print length or level when tracing
(setf *trace-macroexpand-print-level* nil
      *trace-macroexpand-print-length* nil)

;;; Enable tracing
(trace-macroexpand)

;;; Trace the macros you want to look at ...
(trace-macro ...)

;;; ... and untrace them
(untrace-macro ...)

All the examples in this article are in Common Lisp except where otherwise specified. Other Lisps have similar considerations, although macros in Scheme are not explicitly functions in the way they are in CL. ↩
This article originated as a message on the lisp-hug mailing list for LispWorks users. References to ‘LW’ mean LispWorks, although everything here should apply to any modern CL. (In terms of tail call elimination I would define a CL which does not eliminate tail self-calls in almost all cases under reasonable optimization settings as pre-modern: I don’t use such implementations.) ↩

The best Lisp

Tim Bradshaw — Wed, 03 Nov 2021 12:03:44 UT

People sometimes ask which is the best Lisp dialect? That’s a category error, and here’s why.

Programming in Lisp — any Lisp — is about building languages: in Lisp the way you solve a problem is by building a language — a jargon, or a dialect if you like — to talk about the problem and then solving the problem in that language. Lisps are, quite explicitly, language-building languages.

This is, in fact, how people solve large problems in all programming languages: Greenspun’s tenth rule isn’t really a statement about Common Lisp, it’s a statement that all sufficiently large software systems end up having some hacked-together, informally-specified, half-working language in which the problem is actually solved. Often people won’t understand that the thing they’ve built is in fact a language, but that’s what it is. Everyone who has worked on large-scale software will have come across these things: often they are very horrible, and involve much use of language-in-a-string¹.

The Lisp difference is two things: when you start solving a problem in Lisp, you know, quite explicitly, that this is what you are going to do; and the language has wonderful tools which let you incrementally build a series of lightweight languages, ending up with one or more languages in which to solve the problem.

So, after that preface, why is this question the wrong one to ask? Well, if you are going to program in Lisp you are going to be building languages, and you want those languages not to be awful. Lisp makes it far easier to build languages which are not awful, but it doesn’t prevent you building awful ones if you want to. And again, anyone who has dealt with enough languages built on Lisps will have come across some which are, in fact, awful.

If you are going to build languages then you need to understand how languages work — what makes a language habitable to its human users (the computer does not care with very few exceptions). That means you will need to be a linguist. So the question then is: how do you become a linguist? Well, we know the answer to that, because there are lots of linguists and lots of courses on linguistics. You might say that, well, those people study natural languages, but that’s irrelevant: natural languages have been under evolutionary pressure for a very long time and they’re really good for what they’re designed for (which is not the same as what programming languages are designed for, but the users — humans — are the same).

So, do you become a linguist by learning French? Or German? Or Latin? Or Cuzco Quechua? No, you don’t. You become a linguist by learning enough about enough languages that you can understand how languages work. A linguist isn’t someone who speaks French really well: they’re someone who understands that French is a Romance language, that German isn’t but has many Romance loan words, that English is closer to German than it is to French but got a vast injection of Norman French, which in turn wasn’t that close to modern French, that Swiss German has cross-serial dependencies but Hochdeutsch does not and what that means, and so on. A linguist is someone who understands things about the structure of languages: what do you see, what do you never see, how do different languages do equivalent things? And so on.

The way you become a linguist is not by picking a language and learning it: it’s by looking at lots of languages enough to understand how they work.

If you want to learn to program in Lisp, you will need to become a linguist. The very best way to ensure you fail at that is to pick a ‘best’ Lisp and learn that. There is no best Lisp, and in order to program well in any Lisp you must be exposed to as many Lisps and as many other languages as possible.

If you think there’s a distinction between a ‘dialect’, a ‘jargon’ and a ‘language’ then I have news for you: there is. A language is a dialect with a standards committee. (This is stolen from a quote due to Max Weinreich that all linguists know:

אַ שפּראַך איז אַ דיאַלעקט מיט אַן אַרמיי און פֿלאָט

a shprakh iz a dyalekt mit an armey un flot.)

‘Language-in-a-string’ is where a programming language has another programming language embedded in strings in the outer language. Sometimes programs in that inner programming language will be made up by string concatenation in the outer language. Sometimes that inner language will, in turn, have languages embedded in its strings. It’s a terrible, terrible thing. ↩

Computer insecurity

Tim Bradshaw — Mon, 27 Sep 2021 15:35:02 UT

Making computer systems secure is very difficult. The consequences of insecure systems are already extremely serious and will be catastrophic in future if they are not already. Malignant people, often sponsored by malignant states, are actively attacking computer systems and have had considerable success doing so.

So it is surprising that companies whose stated aims are to increase security are effectively working to make their customers’ systems less secure.

Managing large, complex computing installations

For any large, complex computing installation¹, simply managing it is a problem. The way of managing a small installation — having someone (part of) whose job is to look after the installation — has terrible scaling problems: if your installation has a million OS instances, then keeping them up to date might involve a hundred thousand people. And if you could afford that many people you still haven’t solved the problem: with a large number of people whose job is to look after parts of the installation there is a vanishingly tiny chance that they will do so consistently.

For systems which are merely large this problem can be made a lot simpler: for such a system the number of components is far larger than the number of tasks the system performs, so there are many components for each task. These components can then be forced to be identical (or identical-enough). The failure of single components simply lowers the capacity of the system in almost all cases. There are still scaling problems — for a system with a huge amount of hardware, hardware failure rates will mean that more of the hardware fails and needs to be replaced, requiring people to actually do the replacement — but much of the management of such a system scales much less than linearly with its size. Finding problems which both can be solved by systems which are merely large and from which money can be made is what made the giant internet companies so rich, of course².

For systems which are both large and complex the problem is far harder: because such a system is performing a large number of distinct tasks, managing it necessarily requires people with expertise in all these tasks, and there are only so many things a person can be good at. Because of this, running such a system is never really scalable. But, if you can isolate various layers of the system — the computing and storage hardware, the operating system, the software platform on which applications live, and so on — then you can make those parts of the system into something which is merely large, and you can manage those in a way which will scale.

This, of course, is exactly what everyone with a large, complex computing installation is trying to do.

Single points of control

The trick to managing a large installation, or the parts of a large, complex installation which can be made merely large, is to have single points of control. For instance, if I want to deploy some update to a very large number of machines, I very definitely don’t want to have to access each machine individually to do that: instead I need to have some single point of control from where I can say ‘deploy this update to this set of machines’ and that will just happen, and I’ll get some kind of report about which machines it worked on and so on.
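As a minimal sketch of what such a single point of control looks like from the outside, here is one in Python. Everything in it is an assumption made for illustration: the names `deploy`, `DeploymentReport` and `fake_apply`, and the machine names, are hypothetical, and a real system would push updates in parallel and authenticate the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class DeploymentReport:
    """What the single point of control reports back after a deployment."""
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # machine -> reason


def deploy(update: str,
           machines: Iterable[str],
           apply_update: Callable[[str, str], None]) -> DeploymentReport:
    """Apply `update` to every machine from one place and report the outcome.

    `apply_update(machine, update)` is a hypothetical hook standing in for
    whatever mechanism really pushes the update; it raises on failure.
    """
    report = DeploymentReport()
    for machine in machines:
        try:
            apply_update(machine, update)
        except Exception as exc:  # record the failure and carry on
            report.failed[machine] = str(exc)
        else:
            report.succeeded.append(machine)
    return report


def fake_apply(machine: str, update: str) -> None:
    # Stand-in for the real mechanism: pretend one machine is unreachable.
    if machine == "db-02":
        raise ConnectionError("unreachable")


report = deploy("patch-1", ["web-01", "db-02", "web-03"], fake_apply)
print(report.succeeded, report.failed)
```

The point of the shape is that nobody touches the individual machines: one call changes all of them, which is exactly what makes this convenient and exactly what makes it sensitive.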

Making the management of large installations scalable requires these single points of control. They may not be rooms full of dials and flashing lights in hollowed-out volcanos staffed by people in white coats, where occasional klaxons sound (although, of course, they should be), but they have to exist, somewhere: it must be the case that changes to the system can be made in one place, or a very small number of places, and take effect over the whole system. There’s no other way to do this.

A security problem

Single points of control present a quite considerable security problem. They are necessary so that the system can be managed efficiently, but it doesn’t say anywhere that the changes made from such a single point of control are good changes. So two things are extremely important:

all the single points of control need to be known about and their number should be kept as small as possible;
all the single points of control must be very carefully managed, with extensive controls over access, carefully managed logs and so forth.

I suspect most organisations fail at both of these, unfortunately: they neither keep a careful catalogue of the single points of control nor control access to them carefully enough. This essay, however, is not about how to deal with this problem except in one respect.

Transitive closure

To understand what the single points of control are you need to understand the notion of transitive closure. This is pretty simple, fortunately: if a system $a$ controls a system $b$, and system $b$ controls systems $c, d, \ldots$, then, by transitive closure, system $a$ controls all of systems $c, d, \ldots$. And similarly, if $d$ controls $g$, then $a$ also controls $g$. What this means is that, in order to understand what the single points of control are, you need to construct graphs³ of the transitive closure of control. This is not hard to do, but it is quite hard for people to remember these graphs: they really need to exist in some explicit form. Doing this is also a good exercise in making sure you actually do think hard about what the nodes in the graph are: what are the things which grant control over some system, and how are they being managed.
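Making these graphs explicit can be as simple as a table of who directly controls whom, plus a small program to follow the arrows. The sketch below is one way of doing that in Python, using the systems $a$, $b$, $c$, $d$ and $g$ from the paragraph above; the function names and the dictionary representation are my assumptions, not a standard tool.

```python
from collections import deque
from typing import Dict, Set


def controlled_by(controls: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return every system that `start` controls, directly or indirectly."""
    seen: Set[str] = set()
    queue = deque(controls.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; this also copes with cycles
        seen.add(node)
        queue.extend(controls.get(node, ()))
    return seen


def transitive_closure(controls: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map every system to the full set of systems it ends up controlling."""
    nodes = set(controls) | {t for targets in controls.values() for t in targets}
    return {node: controlled_by(controls, node) for node in nodes}


# Direct control only: a controls b, b controls c and d, d controls g.
controls = {"a": {"b"}, "b": {"c", "d"}, "d": {"g"}}

closure = transitive_closure(controls)
print(sorted(closure["a"]))  # ['b', 'c', 'd', 'g']
```

Sorting the nodes by the size of their entry in `closure` gives a first cut at which points of control matter most.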

An important thing about this transitive closure of control is that everything gets more sensitive as you go up the graph⁴: the higher nodes in the graph control more lower nodes, and often very many more lower nodes. If the graph is a tree with a constant branching factor $b$ then a node $n$ levels above the leaves controls $b + b^2 + \cdots + b^n$ nodes, which goes up like $b^n$ as you get higher in the tree: it goes up exponentially, and that’s fast.

All of this means that for large installations the points of control near the top of the tree are extremely sensitive: they need to be very tightly controlled indeed. It would be foolish, wouldn’t it, to allow third parties to manage these points of control?

We’re all fools

Of course, we all do exactly that, all the time. We all run software we have neither written nor exhaustively checked⁵, on hardware we don’t really understand, for instance, and thus outsource our security to the people who write this software and make this hardware. And most of the time it’s OK. Most of the time. Sometimes bad things are found in the software or the hardware and we have to rush around to deal with them. Well, not so much ‘sometimes’ as ‘quite often’ in fact.

But we don’t really have much choice about this: in theory we could build our own hardware and write our own software to run on it as people did in the 1940s and 1950s, but in practice that’s absurdly impractical.

But that’s not where it ends, of course. We now all love our cloud computing: running our software on top of platforms and hardware managed by other people, and keeping our data on their storage systems. Because of course no-one could ever compromise one of these suppliers of computing resources without us realising, quietly changing the cloud platform so it recorded interesting things about what we’re doing. And of course these, very large, computing infrastructures are not managed in turn from single points of control which now, by transitive closure, have control over the computing infrastructures of a huge number of organisations. Oh, wait.

Well this, too, seems to have worked out reasonably well. So far. And this essay is not about the risks of cloud computing.

Some more than others

There are things we can do to control the risks we all take. For instance, when dealing with software we haven’t written or checked in detail, we can carefully run it first in a controlled, isolated environment to try and assess any problems with it. This doesn’t ensure safety — nothing can do that — but it does mean that we have at least some chance of finding out if the new software is broken or malignant.

What we should not be doing is blindly accepting and deploying updates to software into an environment we care about. And we should very, very definitely not be doing that when that software has access to control our systems. If we were to do that, then, by the time we know that the people we’re getting the software from have been compromised, or were perhaps always malignant, it’s far too late: the damage is done. And, worse, we probably will never know what damage has been done.

A target painted on our backs

Points of control which are both far up the graph and well-known have targets painted on their backs. If Dr Evil, President Evil or General Secretary Evil decides that they’d like to compromise a large number of organisations, the things they are going to go for are the points of control which are far up the graph. And they’ll be willing to put a great deal of time, skill and money into this.

Points of control which are far up the graph are, as a result, all but certain to be attacked, and all but certain to be attacked by people with effectively unbounded resources. The only safe assumption to make is that these points of control will be compromised in due course: assuming otherwise is hopelessly naïve.

So you should be very, very careful to test anything you get from such places — especially software, which is far more mutable than hardware. And, if you are in charge of one of these places you should certainly not be suggesting that anyone blindly take your updates: that would be extremely irresponsible.

And yet this is exactly what happens: we are all actively encouraged to blindly trust software we receive from organisations with targets painted on their backs.

And that’s what this essay is about.

Insecurity solutions

There are many good choices here, but I’ll just pick one: Qualys.

The Qualys Cloud Platform and its powerful Cloud Agent provide organizations with a single IT, security and compliance solution — from prevention to detection to response! —Qualys⁶

That sounds good, right? Except, wait: they’re providing security solutions. It’s in the nature of such solutions that they both need to be updated very frequently as new threats appear and require privileged access to systems. It almost certainly is not possible to do the kind of staged test and deploy I suggested above for software like this: if there’s a new compromise you want to know about it now, not in two weeks. Instead you really need to just accept updates from Qualys as and when they appear or, perhaps worse, allow them to pull data from your systems to check ‘in the cloud’ where you do not have control over the security of that data. That means that, if you are using Qualys tools on live systems, Qualys are a single point of control for you.

Qualys

has over 10,300 customers in more than 130 countries, including a majority of the Forbes Global 100. — Wikipedia

That means that they’re a single point of control for a large number of very high-value targets for President Evil: Qualys have a target painted on their back, are illuminated by bright searchlights and are surrounded by flashing neon arrows pointing at the target.

So, well, they’ll know about this, won’t they? And although they can’t avoid being a target to some extent⁷, they certainly will be addressing these problems to reduce the risk somehow, won’t they? Certainly they will have many documents and guides describing how to minimise the inevitable risk associated with using their products.

Not so much.

How to lose friends and alienate people

Start from https://www.qualys.com/documentation, then ‘Cloud Platform’ / ‘Scan authentication’ / ‘Unix record’ / ‘online help’ / ‘What credentials should I use?’ / ‘Learn more’ and you should find a link entitled ‘*NIX Authenticated Scan Process and Commands’ whose target is https://success.qualys.com/discussions/s/article/000006220⁸, from which:

When Qualys performs an authenticated scan against a *nix system with a properly configured authentication record we will create an ssh session using the credentials in the authentication record, check the effective UID (level of access), execute “sudo su -” (or other root delegation command configured in the record), re-check effective UID to ensure the elevation worked, then begin our checks.

sudo su - means ‘become root and spawn a shell’. Or, in other words, gain completely unconstrained access to the system with the highest possible level of privilege. Further down the same page you’ll find this:

First, customers should be strongly discouraged from placing granular controls around the Qualys service account because of the reasons stated above. […] Even if it were possible to publish this list, it would likely take a lot of effort to maintain its currency.

In other words: ‘don’t use fine-grained control to limit what our tool can do, because maintaining the list of commands it might run would be a lot of work for us.’

Yet further down the page is:

Below is a list of commands that a Qualys service account might run during a scan. Remember not every command is run every time, and *nix distributions differ. This list of commands is neither comprehensive nor actively maintained.

This is followed by a list of commands which includes awk (equivalent to uncontrolled root access), firefox (WTF?), java (root access again) and just a huge number of other commands all of which imply unconstrained root access.

That page also links to https://success.qualys.com/discussions/s/article/000006228⁹, which contains this obvious falsehood:

In a nutshell, all of our data point detections are scripts that need to be run as root. Running them as a non-root user would, in most cases, result in permission errors which cannot be distinguished from other error sources. That would result in incorrect data being returned by the scanner, which is why we do not support this. There is no way to make non-root scanning work reliably with a scanning model based on shell commands or shell scripts.

It also contains this lovely example of why sudo is no good:

sudo /usr/bin/find . -maxdepth 0 -name . -exec /bin/sh -c "su -" ";" -quit

This is truly magnificent: anyone who has looked after sudo configuration will know immediately that this is why you don’t allow unconstrained find in the commands you allow to be run. But apparently the people at Qualys don’t understand that.
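The kind of check a careful sudo administrator does in their head can be sketched in a few lines of Python. This is only an illustration: the set of program names is my assumption and is deliberately incomplete, the function name is hypothetical, and a command not being flagged does not make it safe. The programs listed are there because each can run arbitrary other programs (`find` via `-exec`, `awk` via `system()`, `java` by running arbitrary code, and so on).

```python
import posixpath
from typing import List

# Illustrative and incomplete: programs that can be made to run arbitrary
# other programs, so allowing them unconstrained under sudo is equivalent
# to allowing a root shell.
CAN_RUN_ARBITRARY_COMMANDS = {
    "find", "awk", "java", "su", "sh", "bash", "vi", "perl", "python",
}


def risky_sudo_commands(allowed: List[str]) -> List[str]:
    """Return the allowed commands whose program can spawn arbitrary commands."""
    risky = []
    for command in allowed:
        parts = command.split()
        if not parts:
            continue  # ignore blank entries
        program = posixpath.basename(parts[0])
        if program in CAN_RUN_ARBITRARY_COMMANDS:
            risky.append(command)
    return risky


allowed = ["/usr/bin/find", "/bin/ls", "/usr/bin/awk", "/usr/bin/java -version"]
print(risky_sudo_commands(allowed))
```

A list of permitted commands which contains any of these without tight constraints on their arguments is not really a list of permitted commands at all.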

The terrifying conclusion

It is hard to read this material without coming to the conclusion that the people writing it — the people on whom you are relying to check your systems for security — do not care about the security of their customers’ systems if that security might cause momentary inconvenience for them. Worse, it is hard to read this material without coming to the conclusion that the people writing it do not understand the security architecture of *nix systems¹⁰ at all.

But they have no choice

Well, the people who wrote the documents excerpted above are certainly patronising, and they also seem alarmingly incompetent. But, surely, the problem is real: I can poke fun at them all I like but that doesn’t actually help anything, does it?

This is a security scanner and this means that the things it is checking for change very fast: people who write malware do not give warning of what they are going to do in advance and do not make it easy to know when they are attacking you. When a new attack becomes known about it needs to be checked for right away. And since the nature of the attack can’t be known in advance, the techniques needed to check for it can’t be known in advance, which means both that you will need to allow the scanner to run programs it has just fetched from Qualys, and also that those programs must be able to use all the facilities of the system, at the highest privilege level, to do their work. There’s just no way around this, is there?

And, despite what might appear from reading the above material, we therefore have to assume that everyone at Qualys knows they are an enormously attractive target for President Evil and that their security is thus impeccable: we have no choice.

One of many

And Qualys are just one of many: I have picked on them only because I had to pick on someone. As another example, there’s a company — a very famous company with a three-letter name — who sell a product which, if you install it according to their recommendations, requires you to grant unconstrained root access via sudo to an entire directory containing a huge number of shell scripts some of which are tens of thousands of lines long, and some of which write other shell scripts. The chances of that system not containing security problems are close to zero. But again, we have to trust them, even though the evidence that they don’t even understand what security means is overwhelming: after all they do have a three-letter name.

And this is everywhere you look: we are trusting the security of our systems to people who do not appear to understand what security means.

Supply chain

Isn’t this all just a bit alarmist? It’s all very well for me to go on about single points of control and companies with targets painted on their backs, but surely nothing bad ever really happens?

If you think that, then you haven’t been paying attention.

SolarWinds

SolarWinds are a company which write network-management tools used by many other companies, government organisations and others. One of their products is called Orion, which is used by about 33,000 public and private-sector organisations. Most or all of those organisations download updates to the product either automatically or semi-automatically. This makes SolarWinds a very attractive target. Starting before October 2019 SolarWinds were compromised and in particular the build system for Orion was compromised in such a way that releases of the product contained malicious code. Between, perhaps, March and December 2020 the attackers used these compromised updates, together with other compromises, to attack at least 200 organisations, including multiple parts of the US federal government, NATO, the UK government, the European parliament, Microsoft and many others. A good description of this attack can be found here. The people who did the attack were the Russian Foreign Intelligence Service, Sluzhba Vneshney Razvedki¹¹. I don’t know what the results of this attack were, and perhaps no-one outside Russia knows what was taken and what will be done with it. It is certainly very safe to say that the results were extremely severe, if not catastrophic.

It’s worth noting that the result of the build system for Orion being compromised was that the compromised releases were properly digitally signed: it is not safe to rely on digital signatures to prove that software has not been compromised in the case where the organisation signing the software has been compromised.
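A toy illustration of why this is so, in Python. It is a sketch under loud assumptions: an HMAC from the standard library stands in for a real public-key signature, and the key, the function names and the 'injected' payload are all hypothetical. The only thing it is meant to show is that verification tells you who signed something, not whether what they signed is benign.

```python
import hashlib
import hmac
from typing import Tuple

SIGNING_KEY = b"vendor-release-key"  # hypothetical key held by the build system


def sign(artifact: bytes, key: bytes = SIGNING_KEY) -> bytes:
    """Sign an artifact (HMAC used here as a stand-in for a real signature)."""
    return hmac.new(key, artifact, hashlib.sha256).digest()


def verify(artifact: bytes, signature: bytes, key: bytes = SIGNING_KEY) -> bool:
    """Check that `signature` was made over `artifact` with `key`."""
    return hmac.compare_digest(sign(artifact, key), signature)


def build_and_sign(source: bytes, compromised: bool = False) -> Tuple[bytes, bytes]:
    """Build a release and sign it; a compromised build injects code first."""
    artifact = source + (b"\n# injected by attacker" if compromised else b"")
    return artifact, sign(artifact)


artifact, signature = build_and_sign(b"print('hello')", compromised=True)
print(verify(artifact, signature))           # the malicious release verifies
print(verify(b"changed later", signature))   # tampering after signing does not
```

Signatures protect against tampering after the supplier has signed. They do nothing about tampering before that point, which is where the Orion compromise happened.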

Qualys again

In early 2021 there was a security breach at Qualys. It seems that this breach didn’t compromise their security tools: they got away with it, this time.

This is not the end

These are both supply chain attacks: many others have happened, and without doubt many more will happen. In the context of this essay, supply chain attacks are a result of having single points of control for security management which are outside an organisation and which serve many organisations, making them interesting to attackers with large resources.

But what can we do? It is inevitable that these organisations will be attacked, and almost inevitable that they will be compromised. In many cases we can mitigate the risk by having a fairly long test and deployment cycle and hoping that either we find the problems or that others do before we start relying on the tool. For security scanners we can’t do that, because we can’t afford to wait. We have to trust suppliers of security products, and we have to allow them to run privileged code on our systems which we can not check, because the alternative of not checking for security compromises is even worse.

We have to trust them because, in fact, we have no other choice.

Is this the end?

So, this seems like an insoluble problem, doesn’t it? A security scanner has terrifying properties, by its nature:

it must be updated very frequently, far too frequently to perform safety checks;
it must have privileged access to live systems.

There’s just no way around that, is there? And of course, President Evil knows this too: the organisations providing these tools make extremely good targets because the nature of the tools means both that any compromise is very serious and compromises are very hard to detect. And there is therefore no way around the fact that the suppliers of these tools will be targets for President Evil, will, in due course, be compromised, and all is therefore lost.

Well, perhaps not. Perhaps it is possible to reduce the risk.

A sketch

The problem to solve is that a security scanner must be updated very frequently and must run with high privilege. Suppliers of such tools, even if they are competent, which is not always clear, are extremely valuable targets for attackers with very large resources and thus are almost certain to be compromised. So running these scanners on live systems needs to be avoided, even though the scanners need access to the live systems to run.

Well, there’s a way around that. If you could make an identical copy of any system then you could scan the copy. If the machine has a vulnerability, so will the copy. If the scanner is compromised then it will attack only the copy, which doesn’t matter, since it’s only a copy, which will be destroyed immediately after being scanned.

It is more complicated than that, of course: the copy needs actually to be running as lots of things will almost certainly only really show up when a system is running (what network ports does it have open, for instance). So the copy needs to be more than just a blob of data: it needs to be a real thing running programs. And the copy has to think it’s not a copy: enough of the world around it needs to be faked up so it thinks it’s doing real work. But all of this world must be fake — under no circumstances should the copy be able to see real data or talk to real live systems. Finally, the scanner needs to be very restricted in the data it can upload: since the whole point is that we don’t trust the scanner we can’t allow it to ship all the data on the system to who-knows-where when it’s been compromised. Ideally the scanner should return a single bit: is the thing it is scanning compromised? If it is then this tells us to look more closely at it, for instance by looking at a report stashed locally on the copy.
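The control flow of this idea fits in a few lines of Python. This is a sketch of the shape only: `Clone`, `scan_copy`, `make_clone` and `run_scanner` are hypothetical names, and all the hard parts (making the copy, faking the world around it, restricting what the scanner can send out) are hidden behind them.

```python
from typing import Callable, List, Protocol


class Clone(Protocol):
    """A running copy of a live system (hypothetical interface)."""
    def start_isolated(self) -> None: ...  # boot it inside a faked-up world
    def destroy(self) -> None: ...         # throw the copy away


def scan_copy(make_clone: Callable[[], Clone],
              run_scanner: Callable[[Clone], bool]) -> bool:
    """Scan a disposable copy and return a single bit: is it compromised?

    The untrusted scanner only ever sees the copy, and the copy is
    destroyed whether or not the scan succeeds.
    """
    clone = make_clone()
    try:
        clone.start_isolated()
        return bool(run_scanner(clone))
    finally:
        clone.destroy()


class FakeClone:
    """Stand-in used only to exercise the control flow above."""
    def __init__(self) -> None:
        self.events: List[str] = []

    def start_isolated(self) -> None:
        self.events.append("started in fake world")

    def destroy(self) -> None:
        self.events.append("destroyed")


copy = FakeClone()
compromised = scan_copy(lambda: copy, lambda clone: False)
print(compromised, copy.events)
```

The `finally` is the important part: whatever the scanner does, including crashing or misbehaving, the copy it touched does not survive.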

Doing this is not simple to arrange, but it is perfectly possible. Here are some objections with answers.

But, cloning systems like this is hard, isn’t it? Not really. For a start, if the systems concerned are virtualised then pretty much all serious hypervisors support making snapshots and clones of the virtual machines they’re running, and moving those snapshots and clones between different physical hardware. If the systems aren’t virtualised then things are harder, but this kind of ‘make a carbon copy of a system’ is what you should already be doing for backup and disaster recovery (DR). Some people, apparently, maintain DR systems by manually keeping them up to date with the live systems. If you are doing that, stop: create the DR systems by cloning the live systems. If you don’t have a good approach to cloning do it by restoring backups. If you can’t restore your backups (or you aren’t making backups) then you are already dead, so nothing matters.

But, this means doubling the size of the environment, doesn’t it? No: you only need enough extra computational resources to scan each little chunk of your environment, since you can reuse them. But, you already need enough extra resources to support DR: just use those!

But, this will be hard to set up, won’t it? Yes, it will require a fair amount of work. But if you don’t do this, or something like it, then within the next few years your systems (almost certainly) will be compromised and your data (almost certainly) will leak to bad people as a result. So the question is: is the cost of that higher, or lower, than the cost of this, or something like it?

But, the things that do the cloning can be attacked, can’t they? Yes, they can. But these tools are a tiny fragment of your infrastructure. They are, in fact, a single point of control, and one you have to be very, very careful about. This sketch doesn’t remove the problem since nothing can do that: it just makes it much less severe and much better controlled.

But, lots of details are missing, aren’t they? Yes. This is a sketch, written by some person on the internet: it’s not a complete solution. (If you want a complete solution pay me lots of money and I’ll make you one.)

But, you haven’t thought of this thing, and that thing, and …, have you? No. It’s a sketch.

Because we want to

Solving these problems, in the sense of making them much less likely to happen and the consequences when they do happen much less bad, is not easy. But it is possible, as the sketch in the previous section shows. Not solving them means that, almost certainly, in the next few years a catastrophe will happen. I said at the beginning of the essay

it is surprising that companies whose stated aims are to increase security are effectively working to make their customers’ systems less secure.

But it isn’t, not really: it is depressing, but not really surprising, because the entire history of computing has been made up of people avoiding solving problems through laziness, lack of imagination, or the desire to make a quick buck.

I think that should stop. Solving these problems will be hard, but we can solve them if we only want to.

Appendix: ‘large, complex computing installations’

I’ve used this term above without ever really defining it. Defining it is not entirely easy, and the meanings of definitions change over time: once an IBM System/360 Model 70 might have been thought of as a very large computing installation, but today it would be a very small one other than, perhaps, physically.

Every time I want to write about large computing installations I find I don’t know the right words any more: is a large computing installation one with many systems, or is it one large system? What, anyway, is a ‘system’? Once everyone knew what it meant: the system was the departmental VAX, and later there were several systems which were the VAX (still creaking along on life-support) and a bunch of Suns, some of which were workstations and some of which were fileservers.

But that meaning has dissolved away. For a while it was safe to talk about ‘servers’: everyone knew that a server was something that lived in a rack along with other servers¹². But that in turn has dissolved away as the relationship between physical hardware and the programs that run on it becomes more complicated and often more remote.

So what, today, are the right words? What is a large installation and what a small one? Here’s my attempt at a definition.

An installation is large if it has a very large number of truly concurrent threads of control. ‘Truly concurrent’ means ‘supported by hardware’, and what is meant by ‘very large’ will increase over time: at the time of writing (mid 2021) this probably means at the very least tens of thousands.
An installation is complex if it is performing a large number of conceptually distinct tasks. Again the definition of what is a large number may change over time although it will probably increase more slowly than the number of threads of control.

This definition, for instance, would make many HPC systems large, but not complex: although they have a large number of independent threads of control, they probably run a rather small number of different programs, and perhaps only one (probably several copies of that one, of course). It’s possible for a system to be complex, but not large, although that is unusual.

I’m not sure if this definition is adequate, but I think it will serve here.

In the main text I use ‘installation’ and ‘system’ interchangeably: I should probably only use ‘installation’ but I don’t. When I talk about an individual computer in a large installation I’ve tried to say ‘machine’.

See appendix. ↩
Once upon a time I worked for a then-famous company which sold holidays over the internet. We used to sneer at Amazon for picking a simple problem — mostly selling books, then — to solve: books just sit in a warehouse waiting to be bought, for decades if need be, while everyone wants a different holiday and holidays have very definite sell-by dates. One day I realised that what Amazon had done — picking a simple, scalable problem to solve — was smart, and what we were trying to do was not smart and that was why they were going to get rich and we weren’t. I didn’t get rich, and I don’t know if that company even still exists. ↩
A graph here is not a plot: it’s a drawing of some kind of network consisting of nodes (points of control, for instance) and arcs between those nodes which may or may not have arrows on them indicating direction: if a controls b then there will be a node for a, a node for b and an arrow from a to b indicating control. ↩
By ‘up’ I mean in the direction of ‘is controlled by’ while ‘down’ means in the direction of ‘has control over’. ↩
Of course we can’t exhaustively check software in any case, but we can do a lot better than ‘not checking it at all’. ↩
All the text in this essay was extracted from the linked sources in early September, 2021. Things may have changed since then, but what is here was there then. I have marked elisions with ‘[…]’. ↩
For instance, if Qualys can be compromised in such a way that their tools fail to report other compromises, then this would allow those other compromises to propagate undetected, even if the tools provided by Qualys are not themselves doing direct harm. ↩
This may formerly have been https://qualys-secure.force.com/discussions/s/article/000006220. ↩
May formerly have been https://qualys-secure.force.com/discussions/s/article/000006228 ↩
To be fair, ‘the security architecture of *nix systems’ does give the impression that there is one — that it is something made of marble and stainless steel rather than partly-dissolved mud bricks and rotting straw. ↩
In other words, this time it was indeed President Evil. ↩
Some very large or very old servers might have been whole racks, or even several. ↩

Useful idiots

Tim Bradshaw — Fri, 09 Apr 2021 13:24:38 UT

The authors of the Signal messaging system are acting as useful idiots for state security and police services: while they are almost certainly not working for them or funded by them, what they are doing is extremely convenient for them.

There is a conspiracy theory that Signal is in fact created by some state security service: this is pretty obviously silly. Instead, I think that the people who create and endorse Signal are acting as useful idiots for various state security and police services.

useful idiot, noun
a naive or credulous person who can be manipulated or exploited to advance a cause or political agenda

The art of the possible

The people who work for state security and police services, unlike their political masters, understand cryptography. And in particular they understand that the mathematics of cryptography makes it effectively impossible to stop people from using cryptographic communication systems which can not usefully be broken. The only ways this could be prevented would be either to forbid people access to general-purpose computers, which is not practical, or to ensure that all such computers are compromised at a low level which is also not practical¹.

In other words they understand that people will be able to communicate with each other in such a way that this communication can not be overheard in bulk, and that there is nothing they can do about that.

What they can do is to compromise individual communication links: once they’ve worked out that, for instance, two people who are of great interest to them are talking to each other they can work to compromise the systems that these people are using to communicate — installing things like key-loggers, rootkits or both, which will sniff the communications before they are encrypted. Doing this is a lot of work and probably requires a significant amount of traditional tradecraft: by far the easiest way to do it will be by gaining physical access to the devices they want to compromise and doing so without arousing suspicion, for instance.

Their difficulty, then, is filtering the people that they want to overhear sufficiently badly from the huge mass of people that they don’t care about. This is where Signal comes in.

Useful idiots

Signal is a tool which allows encrypted communication between individuals and groups. There is no reason to believe that this communication can be broken.

But Signal has been designed in such a way that it is inherently unsafe: it uses phone numbers for identifiers and its contact discovery works in such a way that anyone who knows your phone number can know if you are a Signal user, whether or not you know their phone number. This approach means that if you have Signal installed then you will get a notification whenever anyone who is in your phonebook installs Signal, whether or not you are in their phonebook. This was done intentionally, and presumably as an attempt to drive growth in users with the eventual aim of making money from the large userbase.

This makes Signal a seriously bad choice for, for instance, people who are suffering abuse or being stalked. The moment you install Signal in order to talk to someone who might help you, the person you are being abused by or who is stalking you can know this, and you won’t know that they know.

On the other hand this is very convenient for state security and police services. They don’t care about the cryptographic security because they know that people can use tools which they can’t attack. But finding someone’s phone number (all someone’s phone numbers) is a pretty easy thing to do if you’re a state security or police service, and Signal’s contact discovery then means that they can silently trawl through people they might be interested in and work out who has Signal installed.

What this means is that, assuming Signal tends to be used by people who really do have something to hide², it works as a filter which allows state security and police services to identify people who are likely to be of interest to them from larger lists of people.

The coronation of the idiots

Until recently it has been rather unclear how Signal’s authors intend to use the product to attempt to make themselves very rich. Well, they’ve just answered that question: they are going to glue a cryptocurrency into it, so it will be possible to make anonymous payments to and from Signal. Conveniently Signal’s authors have an ownership stake in the cryptocurrency involved: something which should not be very surprising³.

So Signal’s authors have now revealed their proposed solution to their underpants gnome problem: they intend to make money from Signal by making money from the transactions people make using it. Lots of people have been saying that this is a bad idea: why entangle a messaging system with a payment system? Well, they’re just not thinking very hard about this because the answer is terribly simple: they are being entangled so Signal’s authors can make money.

So, what kind of person would be particularly interested in a tool which allows encrypted communication (with disappearing messages, even), and allows anonymous, secure payments? People who deal in illegal goods would be. If you’re dealing in illegal drugs, or illegal pornography, or anything similar, Signal will soon look like a tool designed especially for you.

But, really, it turns out to have been designed for someone else. If you are a state security or police service, soon you will be able to look at a list of people who you suspect may be dealing in illegal goods, use Signal’s contact discovery to find the people who have it installed, and now you have a shorter list of people who are much more likely to be of interest to you.

Signal is the tool that state security or police services would have built, but they didn’t have to do so: some useful idiots built it for them.

It is, inevitably, the subject of other conspiracy theories. ↩
Rather than the sort of people who wear ‘tactical’ watches so they can pretend they are in the special forces. ↩
It does at least appear that MobileCoin, the cryptocurrency Signal will use, does not use Bitcoin’s ‘proof of work’ approach which is currently causing significant carbon emissions. ↩

How the backtrace was conquered

Tim Bradshaw — Fri, 26 Mar 2021 11:37:22 UT

Once upon a time, when the world was younger, a young and rather foolish physics student used to debug his FORTRAN programs using printed backtraces.

And I do mean printed backtraces: when the machine crashed the chain printer attached to it would vomit out many sheets of paper which had procedure names and line numbers on them. And, after restarting the machine so the next user could make it crash in their turn he would take this printout and take his printout of his program and compare the line numbers: looking at the code, trying to work out what had gone wrong and marking corrections in pencil. He spent many hours late at night in this way.

Later on, this same student (now a maths student) discovered a wonderful thing: a programming language called Lisp in which you could write programs to solve complex algebra problems which were of interest in his field. And although, in theory, if you had the kind of computer which maths departments could not afford, Lisp was an interactive language, this was not true in practice if all you had was the kind of computer that was all a maths department could afford. So things went on much as before: he would make some changes to his program, set up the equations it was going to try to solve, and then, late at night when there were no other users to inconvenience, set it off running. In the morning there might occasionally be a solution, and even more occasionally a solution which was useful. But more often there would be only the corpse of the program in the form of an elaborate backtrace after it had been mortally wounded by some fierce bug (error handling was a thing not yet thought of, at least by the student). This time, though, the backtrace would be in a log file from the run.

And the student made another discovery: there was a certain text editing environment used by some far-off people who had access to much bigger and better computers, and this editing environment purported to support Lisp programming rather well: certainly better than the rudimentary editor he used then. And he managed to get a copy of this environment (legend has it it was version 17.64) on a tape from someone, and he managed to make it run, just, on the maths department’s machine. And he taught it enough about the Lisp dialect he was using that it was indeed helpful, if often annoying to other users as it took rather a lot of the capacity of the machine to support it. And everything was a little better.

And this text editing system came with a rather wonderful tool: a program whose name may have been ‘tags’ which would, for the languages it understood, make a file which mapped between definition names and their locations in the filesystem. And he modified this tags program to understand the dialect of Lisp he was using as well. Very wonderfully, the system would also cope with the case where the definition had moved, which it almost always had, and which made things like line and column numbers so brittle and useless (source control might have been invented by then, but the student knew nothing of that). This, of course, was one of the primitive ancestors of the automatic systems which will find definitions of symbols that any reputable editor, and even some that are perhaps a little less than reputable, now has.

And now, when he came in in the morning to find a new backtrace from the previous night’s run, he would edit this backtrace in the editing system and find interesting lines in it, at which he would type the very wonderful ‘meta-dot’ or, as he knew it (not being blessed with a keyboard with a meta key), ‘escape dot’ command. And the disk light would come on for a little, and then he would be looking at the definition he was interested in.

Thus was the backtrace conquered. And from that day to this it has never dared raise its head again in polite company, but instead lurks, unheeded except by the few who now remember it, in the darker corners of the system. As for the student, well, no-one now remembers him at all.

What's wrong with Signal's contact discovery

Tim Bradshaw — Sat, 16 Jan 2021 11:35:36 UT

After WhatsApp’s threatened change to their terms of service, which may allow them to leak information to Facebook, many people are moving to Signal, a tool which purports to be more secure. If you want security which is not at least partly theatrical you should not use Signal.

On or about the 6th of January 2021, WhatsApp users were required to agree to new terms of service or to stop using the service by the 8th of February. These terms of service were at best confusing, but given that WhatsApp is owned by Facebook, a company whose entire business model is selling its users’ souls to its customers and which has been heavily implicated in that other thing that happened on the 6th of January 2021, the conclusion was not likely to be good.

I’m glad to say this seems to have been a disaster for WhatsApp: so many users changed to Signal — an app which sells itself as being more secure — that it fell over under the load for a while on the 15th of January. People are apparently leaving WhatsApp in droves, and moving to Signal and other platforms.

WhatsApp / Facebook were so alarmed by this that they’ve both issued a number of clarifications, delayed the implementation date until the 15th of May — probably in the hope that people will have forgotten by then — and made clear that the changes do not apply in Europe, where there are reasonable privacy laws, and not even, yet, in the UK which has not yet completed its transition to Boris Johnson’s hereditary feudal fiefdom.

So that’s, perhaps, good, right? Lots of people were driven to Signal which is ever so much more secure and written and run by very nice people who understand and care about security.

Signal

Well, the people who wrote Signal and run its infrastructure care about their users’ security only as far as it suits them. Yes, they make a great deal of noise about how secure and safe it is: their website is covered in quotes from people like Edward Snowden and Bruce Schneier and generally makes a very big deal about the security of the platform. If you don’t read what they write quite carefully you could be forgiven for thinking that Signal was completely safe, and completely private.

It’s not. And it’s not safe by design: the Signal people know it is not safe, and they don’t care.

Signal’s contact discovery

Here is a sketch of how contact discovery works in Signal. If you are a Signal user you have some identity on the system, and that identity is derived from your phone number. In particular, if you know the phone number you can work out the identity¹. If you allow Signal access to your contacts (which it will ask you for), then every once in a while it will work out something equivalent to identities corresponding to your contacts, upload them, ephemerally, to Signal’s infrastructure, and compute the intersection. Once it’s done that, you know which of your contacts have Signal.
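
As a rough illustration of the sketch above, here is a toy model in Racket. It is not Signal's protocol: the names number->identity, registered-users and discover are made up for this example, the identity is assumed to be a plain SHA-256 hash of the number, and the enclave, the encryption and the network are all left out.

```racket
#lang racket

;; ASSUMPTION: an identity is just a hash of the phone number. This
;; stands in for "something you can work out if you know the number";
;; it is not the real derivation.
(define (number->identity number)
  (sha256-bytes (string->bytes/utf-8 number)))

;; Hypothetical server-side state: the identities of all registered users.
(define registered-users
  (list->set (map number->identity '("+15550001" "+15550002"))))

;; Discovery: derive an identity for each of the client's contacts and
;; keep the contacts whose identity is in the set of registered users.
(define (discover contacts)
  (for/list ([number (in-list contacts)]
             #:when (set-member? registered-users (number->identity number)))
    number))

(discover '("+15550001" "+15550003")) ; => '("+15550001")
```

Note that nothing in discover depends on who is asking: the answer is the same whether or not the registered user has the asker in their own contacts.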

There are several obvious problems with this approach. The most obvious of these is that if any of the data on your contacts leaks, even in encrypted form — if someone attacks Signal’s infrastructure, or if Signal themselves are not trustworthy, say — then it is, obviously, a bad thing. And Signal have gone to heroic lengths to protect against this. Here is their initial outline of what it does (the following text comes from the link below):

Private contact discovery using SGX is fairly simple at a high level:

Run a contact discovery service in a secure SGX enclave.

Clients that wish to perform contact discovery negotiate a secure connection over the network all the way through the remote OS to the enclave.

Clients perform remote attestation to ensure that the code which is running in the enclave is the same as the expected published open source code.

Clients transmit the encrypted identifiers from their address book to the enclave.

The enclave looks up a client’s contacts in the set of all registered users and encrypts the results back to the client.

There is much more description of this. And it’s all fine: it really does go to very great lengths to make it very hard for Signal themselves or any other malicious actor who might be able to compromise their systems to gain access to your contacts, and still less to your messages. And that’s all very wonderful.

Now you’re probably expecting me to spout some conspiracy theory about how the SGX enclaves themselves have been compromised at the hardware level by some state-level entity, possibly with a three-letter name, so everything is worthless. Well, there have been rumours that that sort of thing has happened, certainly. But, well, they probably haven’t happened: the conspiracy theories probably are just conspiracy theories as they usually are. Even if they have happened, defending against state-level entities, with or without three-letter names, is generally futile: if these people are interested enough in what’s on your phone they probably will find out, either by fancy technology or by more traditional techniques, possibly involving a rubber hose.

No, that’s not the problem. The problem is laughably simpler than that.

Alice and Elizabeth

Let’s imagine two people: Alice and Elizabeth, her partner. Alice is physically violent towards Elizabeth who lives in serious fear of her, is regularly being beaten by her and is terrified that worse things will happen soon. Elizabeth desperately wants and needs to escape from the relationship before something really bad happens, but she doesn’t know how: she needs to talk to someone privately. Alice, needless to say, doesn’t want this to happen.

Elizabeth realises that she can install Signal on her phone and then use it to communicate, privately, with people who might be able to help her — the police, perhaps. She does so.

Unbeknownst to her Alice already has Signal, perhaps on a phone the number of which Elizabeth does not know. Signal’s contact discovery promptly tells Alice that Elizabeth has installed Signal, and since she’s running it on a phone which doesn’t appear in Elizabeth’s contacts, Elizabeth doesn’t know this. And this story ends with Alice beating Elizabeth to death.

Vladimir and the dissidents

Or let’s imagine Vladimir. Vladimir runs a country which was once, briefly, a democracy but now, once more and inevitably, is a kleptocracy and a police state. Many, many people in Vladimir’s country don’t like him: his problem is knowing which ones to have dealt with. Well this is easy. Vladimir extracts from the telephone companies the phone numbers of the people he’s interested in — either with bribes or with pliers, it does not matter which. He then buys a burner phone, puts all these numbers in its contact list, and installs Signal. Now he knows which of his enemies have Signal, but since his burner phone is most certainly not in their contact lists they have no idea that he knows they have it and thus cannot run. Doors are knocked on at 3 in the morning, people vanish, their assets are acquired by Vladimir who uses them to build another vast, tasteless palace.

Unsafe at any speed

What Signal have done is to produce a beautifully secure implementation of a contact discovery algorithm which is designed to be unsafe, because it allows anyone who knows your phone number to know whether you have Signal, and if you don’t know their phone number — if they are, for instance, stalking you — it will not, and can not, tell you that they know this. The contact discovery algorithm is designed to leak information.

And they know this, and they don’t care. I’ll repeat that: they know that their product enables stalking, and they do not care about that. I don’t know why they made these choices, but I don’t expect the reasons are very good ones.

Some ideas which are mostly useless

It’s tempting to say that, well, the contact discovery algorithm should be mutual: it should only tell me that you have Signal if both you are in my contacts list and I am in yours. That can’t work, because the only way to do this would be to allow my contact list (in encrypted form) to persist, indefinitely, on Signal’s infrastructure, which would leave it open to attack.

Another approach would be to have a bit you could set on your identity which says ‘this identity should not partake in contact discovery’: if it was set then Signal would not allow either it to be discoverable or it to discover others, with the second restriction existing to prevent people deliberately setting it so they could stalk other people while not themselves being discoverable. This is closer to working: it protects against users of the service, but it does not protect against people who can acquire its data: they can simply strip the privacy bits from the identities they’ve captured and run contact discovery on their own copy of the infrastructure.

Strangely, something which should make Signal’s stalking problem less serious is Facebook’s catastrophic misjudgement over WhatsApp’s privacy policy: large numbers of users have migrated from WhatsApp to Signal or, at least, have installed Signal and thus now have identifiers in the system. Stalking someone by discovering they have Signal installed now tells you a lot less about them than it did previously. Of course Elizabeth has Signal, and Vladimir may discover that both his real and potential enemies also have it². This makes things, at least, less bad, although it does not make them good.

One idea which is not useless

The underlying problem is that Signal uses phone numbers as identifiers, where phone numbers are essentially public information. This enables stalking and worse.

Well, instead, the system could use completely randomly created identifiers which were not tied in any way to phone numbers. This would make the users of the system completely anonymous: the only way you could discover someone’s identifier is if they gave it to you. For added value it might be made, optionally and not by default, possible to attach things like phone numbers and email addresses to the random identifiers, whereupon they would be discoverable, by an algorithm essentially identical to Signal’s. Using such a system you could choose either to be completely undiscoverable or, and only if you wanted to be, to be more-or-less discoverable.
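
A minimal sketch of that design in Racket, using made-up names (make-identifier, attach-number!, discover) and assuming that 16 random bytes is enough for illustration. It does not describe how any real system is implemented.

```racket
#lang racket
(require racket/random)

;; Hypothetical directory mapping an attached phone number to an
;; identifier. Only users who opt in ever appear here.
(define directory (make-hash))

;; A fresh identifier is random and unrelated to any phone number.
(define (make-identifier)
  (crypto-random-bytes 16))

;; Opting in: make this identifier discoverable via a phone number.
(define (attach-number! id number)
  (hash-set! directory number id))

;; Discovery can only ever see users who attached a number.
(define (discover contacts)
  (for/list ([number (in-list contacts)]
             #:when (hash-has-key? directory number))
    number))

(define elizabeth (make-identifier))   ; never attaches her number
(define bob (make-identifier))
(attach-number! bob "+15550002")       ; bob chooses to be discoverable

(discover '("+15550001" "+15550002"))  ; => '("+15550002")
```

In this model someone who knows Elizabeth's phone number learns nothing from discovery, because her identifier was never tied to it.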

That would be easy, wouldn’t it? The Signal people, who are clearly ever so smart, must have thought of that, and decided not to do it: I wonder why?

Well, of course, other people — people who actually care about the safety of these sorts of systems — have not only thought about doing it this way, they have done it this way. Threema is one such app³.

The theatre of the absurd

Signal’s authors make a lot of noise about how secure it is. But they know it is, by design, not safe. If you care about safety you should use tools which really are safe rather than tools whose authors treat safety as a matter of theatre.

Whether you can go the other way is not clear: ideally the answer would be ‘no’ but the space of phone numbers is so small that it’s not completely implausible to simply search by brute-force to find out which identities correspond to which numbers if you have the computational resources to do so. However this does not matter here. ↩
Vladimir is not the sort of person who has friends. ↩
This article is not an advertisement for Threema: it just happens to be a system I know of which does this. I do not personally use it although it does appear to be very competently designed and implemented by people who really do care about safety rather than merely pretending to do so. I am sure there are other similar systems. ↩

Generic interfaces in Racket

Tim Bradshaw — Fri, 08 Jan 2021 18:25:59 UT

Or: things you do to distract yourself from watching an attempted fascist coup.

A thing that exists in many languages with a notion of a sequence of objects is a function variously known as fold or reduce: this takes another function of two arguments, some initial value, and walks along the sequence successively reducing it using the function. So, for instance:

(fold + 0 '(1 2 3)) turns into (fold + (+ 0 1) '(2 3)) which turns into …
(fold + 1 '(2 3)) turns into (fold + (+ 1 2) '(3)) which turns into …
(fold + 3 '(3)) turns into (fold + (+ 3 3) '()) which turns into …
6.

It’s pretty easy to write a version of fold for lists:

(define (fold op initial l)
  (if (null? l)
      initial
      (fold op (op initial (first l)) (rest l))))

Racket calls this (or a more careful version of this) foldl: there is also foldr which works from the other end of the list and is more expensive as a result.
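
One difference worth knowing about: Racket's foldl and foldr pass the list element first and the accumulator second, which is the other way round from the fold defined above. A small comparison, using cons as the operation so the order of the calls is visible (the fold here is just the definition above, repeated so the block stands alone):

```racket
#lang racket

;; The simple list fold from above: it calls (op accumulator element).
(define (fold op initial l)
  (if (null? l)
      initial
      (fold op (op initial (first l)) (rest l))))

(fold cons '() '(1 2 3))   ; => '(((() . 1) . 2) . 3)
(foldl cons '() '(1 2 3))  ; => '(3 2 1)
(foldr cons '() '(1 2 3))  ; => '(1 2 3)
```

For commutative operations such as + the argument order makes no difference, which is why the examples here behave the same with either function.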

Well, one thing you might want to do is have a version of fold which works on trees rather than just lists. One definition of a tree is:

1. it’s a collection of nodes;
2. nodes have values;
3. nodes have zero or more unique children, which are nodes;
4. no node is the descendant of more than one node;
5. there is exactly one root node which is the descendant of no other nodes.

A variant of this (which will matter below) is that the children of a node are either nodes or any other object, and there is some way of knowing if something is a node or not¹.

You can obviously represent trees as conses, with the value of a cons being its car, and the children being its cdr. Whatever builds the tree needs to make sure that (3), (4) and (5) are true, or you get a more general graph structure.

But you might want to have other sorts of trees, and you’d want the fold function not to care about what sort of tree it was processing: just that it was processing a tree. Indeed, it would be nice if it was possible to provide special implementations for, for instance, binary trees where rather than iterating over some sequence of children you’d know there were exactly two.

So, I wondered if there was a nice way of expressing this in Racket and it turns out there mostly is. Racket has a notion of generic interfaces which are really intended as a way for different structure types to provide common interfaces, I think. But it turns out they can be (ab?)used to do this, as well.

Generic interfaces are not provided by racket but by racket/generic: everything below assumes (require racket/generic).

A generic `treelike` interface

A treelike object supports two operations:

node-value returns the value of a node;
node-children returns a list of the node’s children.

The second of these is a bit nasty: it would be better perhaps to either provide an interface for mapping over a node’s children, or to return some general, possibly lazy, sequence of children. But this is just playing, so I don’t mind.

Here is a definition of a generic treelike interface, which includes default methods for lists:

(define-generics treelike
  ;; treelike objects have values and children
  (node-value treelike)
  (node-children treelike)
  #:fast-defaults
  (((λ (t)
      (and (cons? t) (list? t)))
    ;; non-null proper lists are trees: their value is their car;
    ;; their children are their cdr.
    (define node-value car)
    (define node-children cdr))))

Notes:

This uses #:fast-defaults instead of #:defaults, which means that the dispatch to objects which satisfy list? happens before the check for structure types which implement the interface. This is fine in this case: lists are never going to be confused with any other tree type.
This relies on Racket’s (and Scheme’s?) list? predicate returning true only for proper lists rather than CL’s cheap listp which just returns true for anything which is either nil or a cons.
There are lots of other options to define-generics which I’m not using and many of which I don’t understand.

With this definition:

> (treelike? '())
#f
> (treelike? '(1 2 3))
#t
> (treelike? '(1 2 . 3))
#f
> (node-children '(1 2 3))
'(2 3)

So, OK.

A `treelike` binary tree

We could then define a binary-tree type which implements this generic interface:

(struct binary-tree (value left right)
  #:transparent
  #:methods gen:treelike
  ((define (node-value bt)
     (binary-tree-value bt))
   (define (node-children bt)
     (list (binary-tree-left bt)
           (binary-tree-right bt)))))

The #:methods gen:treelike tells the structure we’re defining the methods needed for this thing to be a treelike object.

And now we can check things:

> (treelike? (binary-tree 1 2 3))
#t
> (node-value (binary-tree 1 2 3))
1
> (node-children (binary-tree 1 2 3))
'(2 3)

OK.

Two attempts at a generic `foldable` interface

So now I want to define another interface for things which can be folded. And the first thing I tried is this:

(define-generics foldable
  ;; broken
  (fold operation initial foldable)
  #:defaults
  ((treelike?
    (define (fold op initial treelike)
      (let ([current (op initial (node-value treelike))]
            [children (node-children treelike)])
        (if (null? children)
            current
            (fold op (fold op current (first children))
                  (rest children))))))
   ((const true)
    (define (fold op initial any)
      (op initial any)))))

So this tries to define a fold generic function, which has two implementations: one for treelike objects and one for all other objects. So this means that all objects are foldable, and, for instance (fold + 0 1) simply turns into (+ 0 1). This is a bit odd but it simplifies the implementation of the interface for treelike objects on the assumption that the children of nodes may not themselves be nodes (see above).

There is another complexity: if the list of a treelike node’s children isn’t null, then it’s a treelike, so it can safely be recursed over rather than explicitly iterated over. This is a slightly questionable pun I think, but, well, I am a slightly questionable programmer.

And this … doesn’t work:

> (fold + 0 '(1 2 3))
; node-value: contract violation:
; expected: treelike?
; given: 2
; argument position: 1st

It took me a long time to understand this, and the answer is that the definitions of fold inside the define-generics form aren’t adding methods to a generic function: what they are doing is defining a little local function, fold, which then gets glued into the generic function. So references to fold in the definition refer to the little local function. It is exactly as if you had done this, in fact:

(define-generics foldable
  ;; this is why it's broken
  (fold operation initial foldable)
  #:defaults
  ((treelike?
    (define fold
      (letrec ([fold (λ (op initial treelike)
                       (let ([current (op initial (node-value treelike))]
                             [children (node-children treelike)])
                         (if (null? children)
                             current
                             (fold op (fold op current (first children))
                                   (rest children)))))])
        fold)))
   ((const true)
    (define (fold op initial any)
      (op initial any)))))

And you can see why this can’t work: the fold bound by the letrec calls itself rather than going through the generic dispatch.

The way to fix this is to use the magic define/generic form to get a copy of the generic function, and then call that. This is syntactically horrid, but you can see why it is needed given the above. So a working version of this interface purports to be:

(define-generics foldable
  ;; not broken
  (fold operation initial foldable)
  #:defaults
  ((treelike?
    (define/generic fold/g fold)
    (define (fold op initial treelike)
      (let ([current (op initial (node-value treelike))]
            [children (node-children treelike)])
        (if (null? children)
            current
            (fold op (fold/g op current (first children))
                  (rest children))))))
   ((const true)
    (define (fold op initial any)
      (op initial any)))))

And indeed it is not broken:

> (fold + 0 '(1 2 3))
6

and with some tracing added:

> (fold + 0 '(1 2 3))
fold/treelike + 0 (1 2 3)
fold/any + 1 2
fold/treelike + 3 (3)
6
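
One thing to watch in this version: because the method recurses on (rest children) as though that list were itself a node, the first element of the remaining list is combined directly with op rather than going through the generic function, and a list with an even number of elements ends up calling node-value on '(). Tracing (fold + 0 '(1 2)) by hand suggests it fails with the same kind of contract violation as before. A sketch of a variant which instead iterates over the children explicitly, sending each one through fold/g, might look like this (the treelike interface is repeated so the block stands alone, and the for/fold loop is a substitution rather than part of the original):

```racket
#lang racket
(require racket/generic)

;; The treelike interface from above, unchanged.
(define-generics treelike
  (node-value treelike)
  (node-children treelike)
  #:fast-defaults
  (((λ (t)
      (and (cons? t) (list? t)))
    (define node-value car)
    (define node-children cdr))))

(define-generics foldable
  (fold operation initial foldable)
  #:defaults
  ((treelike?
    (define/generic fold/g fold)
    (define (fold op initial tree)
      ;; combine the node's own value first, then fold each child in
      ;; turn through the generic function
      (for/fold ([acc (op initial (node-value tree))])
                ([child (in-list (node-children tree))])
        (fold/g op acc child))))
   ((const true)
    (define (fold op initial any)
      (op initial any)))))

(fold + 0 '(1 2))         ; => 3
(fold + 0 '(1 (2 3) 4))   ; => 10
```

This gives up the pun of treating the tail of the child list as a tree, at the cost of an explicit loop.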

Adding a special case to `fold` for the binary tree

So now, finally, we can add a special case of fold for the binary tree defined above, rather than needlessly consing a list of children. We will need the same explicit-generic-function hack as before, as the children of a binary tree may not be binary trees.

(struct binary-tree (value left right)
  #:transparent
  #:methods gen:treelike
  ((define (node-value bt)
     (binary-tree-value bt))
   (define (node-children bt)
     (list (binary-tree-left bt)
           (binary-tree-right bt))))
  #:methods gen:foldable
  ((define/generic fold/g fold)
   (define (fold op initial bt)
     (fold/g op
             (fold/g op (op initial (binary-tree-value bt))
                     (binary-tree-left bt))
             (binary-tree-right bt)))))

And now

> (fold + 0 (binary-tree 1
                         (binary-tree 2 3 4)
                         (binary-tree 5 6 7)))
28

and with some tracing

> (fold + 0 (binary-tree 1
                         (binary-tree 2 3 4)
                         (binary-tree 5 6 7)))
fold/bt + 0 #(struct:binary-tree 1 #(struct:binary-tree 2 3 4) #(struct:binary-tree 5 6 7))
fold/bt + 1 #(struct:binary-tree 2 3 4)
fold/any + 3 3
fold/any + 6 4
fold/bt + 10 #(struct:binary-tree 5 6 7)
fold/any + 15 6
fold/any + 21 7
28

Missing CLOS

In some ways this makes me miss CLOS: the explicit-generic-function hack is very annoying, single dispatch is annoying, not being able to define predicate-based methods separately from the define-generics form is annoying. But on the other hand predicate-based dispatch is pretty cool.

Perhaps these should be called ‘sloppy trees’ or something. ↩

Backup retention

Tim Bradshaw — Sat, 02 Jan 2021 17:33:01 UT

Or: should you keep that tape?

There is an interesting curve of backup retention.

Initially, you should definitely keep them because they’re, well, backups that you might need to restore.

Then there comes a time where you should almost certainly not keep them because they’re too old to be useful as backups.

If they survive that they become, accidentally, archives: perhaps that tape sitting in some box has the only remaining copy of whatever-it-is. So don’t erase it.

At the point where nothing will read the tape any more, well, whatever was on it is effectively gone now, so throw it away.

At some point after that, one or both of two things happen: people become willing and able to do seriously heroic things to read really old media which might have the last remaining copy of something on it and/or the media itself becomes rare enough that it’s now a historical artifact worth preserving. The second thing can’t happen unless enough copies of the media get thrown away in earlier phases of the process: I don’t think minidiscs would be interesting historical artifacts (yet), but if I still had a Fuji Eagle I would definitely not throw it away.

Later still it becomes possible to print, cheaply, replicas accurate at the atomic level of the thing, at which point its value should drop to the cost of making another clone, but in fact people start spending huge amounts of effort authenticating the original copy of the object which is held to be somehow ineffably different to all the perfect clones. At some point, people lose track: no-one now knows which the original is any more, and since there is no physical distinction no-one ever will again. The people who have paid to have their copy authenticated as the original now spend much of their time arranging to have the other people who have done that assassinated.

I forget which film this is.

MIME as a disease vector

Tim Bradshaw — Thu, 27 Aug 2020 10:34:35 UT

MIME, the Multipurpose Internet Mail Extensions, seems like a good idea: what’s not to like about being able to send arbitrary data by email? In 1996, when I wrote the below, I didn’t think it was.

Let’s say there are two computer system vendors:

Vendor 1 provides a proprietary OS of high quality and high price, with good quality support. Is committed to ‘open systems’ and publishes specifications of its interchange formats for mail, files and so forth. Its interchange formats may be ‘high value’: text with logical rather than visual markup.
Vendor 2 provides a proprietary OS of lower quality but much lower price, with essentially no support. Is completely uninterested in open systems: does not publish its interchange formats, changes them frequently and incompatibly. Its interchange formats may also be ‘low value’ — for instance text with visual rather than logical markup.

Obviously one should buy systems from vendor 1: since purchase and vendor-support cost is rather small compared to the costs caused by low-quality systems this is clearly the right thing to do.

Wrong. Circumstances can easily arise where buying from vendor 2 is the only viable option. This will increase greatly the cost of computing over what it ‘should be’, and will probably ensure that computing systems are of marginal benefit, if any. Even so it is necessary to buy these inferior systems.

How does this happen? The key is data interchange. If the systems of vendor 2 become popular — they are cheap, after all, and they will run on cheap hardware, so they are quite seductive to people who are not costing their systems thoroughly, as well as for home use — and if people who have these systems once start interchanging data — say mail messages using MIME — with the owners of vendor 1 systems, then vendor 1 is doomed.

Vendor 2 system owners will soon start getting mail in formats supported by vendor 1. But these are open standards: vendor 2 can implement displayers and editors for these formats. In fact it’s likely that free versions of these things will become available. Owners of vendor 2 equipment are happy.

Vendor 1 system owners will start getting mail in formats supported by vendor 2. These are closed, rapidly changing, formats. Vendor 1 has a problem: it has to reverse-engineer the format as it is closed, and as soon as it has done that, vendor 2 changes the format. Even if it can reverse-engineer the formats, the upward conversion from visual markup to logical markup is a hard problem which does not have a general solution.

If vendor 2 systems are common, then it becomes commercially important to owners of vendor 1 equipment to be able to deal with vendor 2’s formats. But vendor 1 cannot keep up with the vendor 2 formats.

The solution is to give up and buy from vendor 2 rather than vendor 1, and use vendor 2’s interchange standards. This will allow you to survive, since you can interchange data with other vendor 2 owners, but will mean that your computing systems are marginally useful, if at all:

data is kept in low-value formats so you cannot reuse it;
formats change so old data cannot be used even in vendor 2 systems;
support costs go up as the lower-quality systems provided by vendor 2 break more often, and the poor or nonexistent support from vendor 2 forces local support at great cost.

Of course, vendor 2 needs to be able to force its data formats on people who have vendor 1 systems. This is now easy: computer networks and email are so prevalent that almost anyone has to be able to do interchange with almost anyone else. In particular MIME opens the door: if I’m on a vendor 1 machine, and vendor 1 has implemented MIME in its MUA (after all, vendor 1 is committed to open standards), then I will shortly find vendor 2 documents arriving in my mailbox, and shortly after that I will find myself buying a vendor 2 system.

It’s all a catastrophe.

I wrote this in early August 1996: the text above has been converted to markdown from its original HTML but is otherwise essentially unchanged from then. ‘Vendor 1’ was Sun, and ‘vendor 2’ was, of course, Microsoft, with the low-value interchange format being Word.

I don’t think I was completely right, but I was at least partly so: a lot of really terrible, very low-value data formats have become very prevalent, at least in part because MIME allows them to be easily transmitted.

One thing I didn’t see coming (or saw coming but had not yet accepted) is that the disease spread by MIME would spread to even systems provided at very low or zero up-front cost, such as Linux: if you use OpenOffice or a derivative, you have been infected by the disease.

Another thing that was not obvious was that some of the low-value formats would become effectively standardised, and so would be less toxic. Rich Text Format is perhaps one good example, but even Word’s own native format may now be effectively a standard. This means that writing in these formats, while still very seriously limiting the value of your data, does not lock you in to a vendor as much as it once did.

It is still, however, a catastrophe.

Do not use Duplicacy on macOS

Tim Bradshaw — Sat, 22 Aug 2020 10:17:02 UT

Duplicacy is a backup tool. It may possibly have good uses, but if you are using it on a Mac it is probably not actually making backups.

The architecture of the application

The Duplicacy application¹ on the Mac presents itself as a little web server which you can then talk to (only via localhost, which is good) to configure, run and monitor backups.

What it does behind the scenes is more complicated. Other than some keychain entries (perhaps only one keychain entry) for a master password which is used to encrypt all the other sensitive data, all of its state lives in ~/.duplicacy-web. This includes all the configuration, logs and so on and, critically, an executable which is the actual program which runs backups, which lives in ~/.duplicacy-web/bin and has a name like duplicacy_osx_x64_2.6.1. The application simply invokes this program to run backups for it. The application will also update this executable when it notices a new one.

This itself is mildly terrifying: where did this executable come from? How safe is it? Can you be sure that the place it comes from will never be compromised? This executable is about to read all your files and copy them somewhere: you probably want to be a bit more sure about it than this.

(This is very different than the case of updating the application itself: this is, or should be, something done under human control. At least in principle you can, and should, check that the thing you have just downloaded actually is what it says it is, and if you don’t, well, that’s a risk you are consciously taking.)

It gets worse: the default configuration of the application will fetch the latest executable, not a stable one (however that is defined), thus maximising the chance that you will be running something that doesn’t work to do your backups, and also maximising the chance that you’ll get a compromised executable. If you are not frightened by now, you will be in a minute.

The annoyances of macOS

From, I think, 10.14, macOS has developed a complicated and annoying protection system which is completely orthogonal to file permissions. I do not understand this system at all, but it essentially involves various policies about what programs can read and write to what. The intention seems to be that, for instance, some application you install should not be able to read or write personally-sensitive data without your explicit permission, even if the filesystem or other permissions would allow it to do so.

‘Personally-sensitive data’ includes things like your email, your contacts, location information and so on. You can see these permissions in the ‘Privacy’ pane of the ‘Security & Privacy’ entry in ‘System Preferences’ and presumably there is some configuration file somewhere which backs all this, and the tccutil command can be useful as well. The protection system also controls various APIs, such as the one that provides location information.

Although this system is irritating in the usual Apple way, I think it’s well-motivated: my email contains personally-sensitive data about me if no-one else, and I definitely don’t want some random program I run snooping on it, or finding out where I am, without explicitly asking me first.

A place where this protection system really gets in the way is for backup tools. Backup tools really need to be able to, well, make backups, and the most important things they need to back up are often the most sensitive. I really want my backup program to be able to back up my email, for instance, as well as my calendar configuration and so on, and all the other stuff that the macOS protection mechanism would not normally let it read.

So, Apple have thought of this. If you trust some application you can grant it ‘full disk access’ which lets it read (and write, probably) the whole filesystem, only limited by filesystem permissions. This is exactly what you need for a backup program.

The first disaster

So, obviously, when you get Duplicacy, you anoint it suitably in the Privacy pane so that it can have full disk access. (It does not tell you to do this, which is a bad sign in itself.)

This doesn’t work. I think it doesn’t work because the program that is doing the backups is not the Duplicacy application, but this little executable which it downloaded. And, in fact, that’s a good thing: I would really rather not allow an application to secretly download some executable which can read (and write) all my files and send them who-knows-where. It may be that the reason it does not work is that the executable is not signed, although it does appear to be signed, so I am not sure.

In any case, what happens is that the executable fails to read sensitive data and thus fails to back it up. And it dutifully logs this, in ~/.duplicacy-web/logs/backup-*.log:

2020-08-21 15:27:40.769 WARN LIST_FAILURE Failed to list subdirectory: open /Users/tfb/Library/Application Support/com.apple.TCC: operation not permitted
2020-08-21 15:27:40.955 WARN LIST_FAILURE Failed to list subdirectory: open /Users/tfb/Library/Calendars: operation not permitted
[...]
2020-08-21 15:27:43.830 WARN LIST_FAILURE Failed to list subdirectory: open /Users/tfb/Library/Containers/com.apple.mail: operation not permitted
[...]
2020-08-21 16:26:53.142 WARN BACKUP_SKIPPED 23 directories and 20 files were not included due to access errors

In other words: the backup worked, partially, but it didn’t succeed in reading some of the most critical data. If you need to restore from this backup, all your email will be gone.
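A check for this kind of partial failure can be sketched as follows. This is a minimal Python sketch and not part of Duplicacy: the function name `find_backup_warnings` is hypothetical, and the only thing it assumes about the log format is the line shape visible in the excerpt above (date, time, level, identifier, message).

```python
import re

# Line shape assumed from the excerpt above, e.g.
#   2020-08-21 15:27:40.955 WARN LIST_FAILURE Failed to list subdirectory: ...
LOG_LINE = re.compile(r"^(\S+ \S+) (\w+) (\w+) (.*)$")

def find_backup_warnings(lines):
    """Return (timestamp, identifier, message) for every WARN line."""
    warnings = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "WARN":
            warnings.append((match.group(1), match.group(3), match.group(4)))
    return warnings

sample = [
    "2020-08-21 15:27:40.955 WARN LIST_FAILURE Failed to list subdirectory: "
    "open /Users/tfb/Library/Calendars: operation not permitted",
    "[...]",
    "2020-08-21 16:26:53.142 WARN BACKUP_SKIPPED 23 directories and 20 files "
    "were not included due to access errors",
]

for when, identifier, message in find_backup_warnings(sample):
    print(when, identifier, message)
```

To use it on real logs, pass it the lines of a file matching `~/.duplicacy-web/logs/backup-*.log`. Any `BACKUP_SKIPPED` entry means the backup is incomplete.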

Well, perhaps you could suitably anoint the downloaded executable? You could do that, if you could work out how to get the Finder to let you see directories whose names have leading .s, which is possible but fiddly. And it would work, for a while, until a new version with a new name appears, and then it will all break again and you’ll have to do it all again.

So that’s a disaster. But it’s not the most serious one.

The second disaster

So, you are configuring this thing via the web interface, like a good person. And you’ve thought to anoint the application so it can read everything, even though at no point did it tell you to do this (unlike other, competently-written, backup tools). And you run backups, and the executable dutifully logs that they failed. And there is no indication of this at all in the web interface, which simply tells you that the backup completed, by which it apparently means ‘the program ran, and after a while it stopped running, and that means everything must be OK’.

In other words: if you are using a recent macOS, then Duplicacy is almost certainly not making good backups for you, and it is certainly not telling you about it when it does not.

Don’t use Duplicacy

I don’t understand how this happened other than that, very clearly, a lot of testing simply was never done. I do understand that it tells you something very, very bad about Duplicacy. I certainly would not, ever, use it on a Mac, and I find it so alarming that I would not in fact use it on any system at all.

Backup tools need to work, because when you need them you really need them. Duplicacy is backup theatre: something that looks like a backup tool but in fact is not.

This refers to ‘Duplicacy Web Edition’ — there was an older GUI application which I don’t know anything about. ↩

The glorious work of Dominic Cummings

Tim Bradshaw — Tue, 02 Jun 2020 16:59:52 UT

Or: the Cummings-Johnson effect.

I thought it would be interesting to get an idea of how many people will die because Dominic Cummings thought it was fine to ignore the lockdown rules, and Boris Johnson agreed with him. So I wrote a program to explore this Cummings-Johnson effect.

All the reasons you had to die

Jesus don’t want me for a sunbeam,
because sunbeams are not made like me,
and don’t expect me to cry,
for all the reasons you had to die,
don’t ever ask your love of me.

There are two ways that what Cummings did in March 2020 will probably be killing people:

he drove a long distance, presumably taking breaks¹, while knowing he was infected with CV19;
now his actions are known, and now Johnson has supported them, other people’s behaviour will change.

The first of these is likely to have killed people, and still be killing people, by spreading the virus: for instance to the toilets in service stations². The second of these is likely to kill people, and perhaps has done so already, because now that it is general knowledge that Cummings & Johnson think that lockdown rules are for other people (for the little people, not people like them), people will take lockdown and social distancing less seriously, and people will die as a result of that.

It’s this second way that they are killing people that I looked at.

The simulator described below is a toy: it’s very much a physicist’s ‘spherical cow’ model. It has no notion of locality for instance: infected individuals simply randomly pick other individuals to try to infect. The results it gives may be qualitatively reasonable, but if they are quantitatively correct this is coincidence. The purpose of writing it, and of the runs described here, was simply to see if the Cummings-Johnson effect is visible, and to get some kind of qualitative estimate of how large it might be: if their actions will probably kill only a few tens of people then they are doing no more harm than common-or-garden mass murderers, while if their actions may kill thousands of people, then they’re working on a completely different scale.

Epidemic models which are far better than this exist. For instance the MRC Centre for Global Infectious Disease Analysis (Professor Neil Ferguson’s group) must have one. I would be very surprised if these people haven’t run much better versions of the scenarios I describe below. But the results of these runs don’t seem to have been published. This is sad but, perhaps, not surprising given what we know about Cummings & Johnson and their attitudes to facts which disagree with their fantasy worlds.

Still: if there are results from better models I would very much like to know them.

A mindless epidemic simulator

I wrote a very simple-minded simulator: it is unlikely to be realistic and is really a toy model. The results are unlikely to be quantitatively correct, but they may be qualitatively interesting. In the model individuals go through the standard three phases:

initially they are uninfected & hence susceptible;
once they are infected they incubate the disease for $t_l$ days, where $t_l = 7$ in all the runs below;
they are then infectious for $t_i = 14$ days;
on each of these days, they randomly pick another individual, and if that individual is susceptible they infect them with a probability which is initially $p_i = 0.14$;
at the end of the period they either die, with probability $p_d = 0.01$, or they survive but become non-susceptible.

Additionally there may be a small ‘leakage’: every day, every susceptible person in the population can stand a small chance of becoming infected. This models the infection leaking in from abroad, for instance. In all the runs here the leakage $p_l = 10^{-8}$.

Finally the initial number of seeds can be set, the idea being to start the simulation after a good few people have become infected to avoid too much uncertainty in the trajectory of the epidemic. By default $n_s = n_p/1000$, where $n_s$ is the number of seeds and $n_p$ is the population size.

All of the parameters are adjustable as is how long to run for and what the stopping criteria are (with a leaky model things can keep on happening even after the number of infectious individuals reaches zero).
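The rules above can be sketched in a few lines of Python. This is not the simulator used for the runs in this post, whose source is not published. It is a minimal reconstruction under stated assumptions: seeds start in the incubating phase, a contact may land on anyone in the population (including the dead), and new infections take effect at the end of the day. The function name `run` and the small default population are choices made for the sketch.

```python
import random

SUSCEPTIBLE, INCUBATING, INFECTIOUS, IMMUNE, DEAD = range(5)

def run(n_p=2000, n_s=20, days=400, t_l=7, t_i=14,
        p_i=0.14, p_d=0.01, p_l=1e-8, seed=1):
    """Run the toy epidemic; return cumulative deaths at the end of each day."""
    rng = random.Random(seed)
    state = [SUSCEPTIBLE] * n_p
    timer = [0] * n_p                      # days left in the current phase
    for k in rng.sample(range(n_p), n_s):  # assumption: seeds start incubating
        state[k], timer[k] = INCUBATING, t_l
    deaths, history = 0, []
    for _ in range(days):
        newly_infected = set()
        for k in range(n_p):
            if state[k] == INFECTIOUS:
                # one randomly chosen contact per infectious day
                other = rng.randrange(n_p)
                if (other != k and state[other] == SUSCEPTIBLE
                        and rng.random() < p_i):
                    newly_infected.add(other)
            elif state[k] == SUSCEPTIBLE and rng.random() < p_l:
                newly_infected.add(k)      # leakage from outside
        for k in range(n_p):
            if state[k] in (INCUBATING, INFECTIOUS):
                timer[k] -= 1
                if timer[k] == 0:
                    if state[k] == INCUBATING:
                        state[k], timer[k] = INFECTIOUS, t_i
                    elif rng.random() < p_d:
                        state[k] = DEAD
                        deaths += 1
                    else:
                        state[k] = IMMUNE
        for k in newly_infected:           # infections take effect at day's end
            state[k], timer[k] = INCUBATING, t_l
        history.append(deaths)
    return history

print(run()[-1])  # cumulative deaths at the end of one small run
```

A per-individual loop like this is far too slow for a population of a million, so a serious version would track counts per phase and day rather than individuals.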

It is straightforward to compute $R_0$ for this model: a person is infectious for $t_i$ days and each day they stand a $p_i$ chance of infecting another person if no-one is yet infected, so

\[ \begin{align} R_0 &= p_i t_i\\ &= 0.14 \times 14\\ &= 1.96 \end{align} \]

And then $R$ declines over time as more people are removed from the population. When $R < 1$ the epidemic dies out, more-or-less gradually, except for leaks causing occasional infections.
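As a worked example of this arithmetic, the sketch below evaluates $R_0 = p_i t_i$ for the $p_i$ values that appear in the runs later in this post. The function `effective_r` is an assumption of the sketch: it takes $R$ to be $R_0$ scaled by the fraction of the population still susceptible, which is how a random-contact model of this kind behaves.

```python
T_I = 14  # infectious days, as in the text

def r0(p_i, t_i=T_I):
    """Basic reproduction number for the toy model: R_0 = p_i * t_i."""
    return p_i * t_i

def effective_r(p_i, susceptible_fraction, t_i=T_I):
    """Assumed form of R: R_0 scaled by the fraction still susceptible."""
    return r0(p_i, t_i) * susceptible_fraction

for p_i in (0.14, 0.08, 0.07, 0.06):
    print(f"p_i = {p_i:.2f}  R_0 = {r0(p_i):.2f}")
```

This gives $R_0$ of 1.96, 1.12, 0.98 and 0.84 respectively, so the mitigated values sit on both sides of 1.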

Source code for this model is not currently available, although it may be in future.

How the simulations run

All of $t_l$, $t_i$, $p_i$, $p_d$ and $p_l$ can be adjusted during a run: the simulator is told to run for a few days, the values can then be adjusted and then it runs again for some given time. In practice the only parameter that I adjusted was $p_i$: the probability of infection. Changing this during the run directly changes $R_0$ and hence $R$ and alters the course of the epidemic.

There is nothing in the model which prevents any of these parameters being adjusted dynamically, based on the current behaviour of the modelled epidemic. In fact I didn’t do that but instead set up ‘configuration sequences’ which are sequences of configurations where the parameters (in practice, just $p_i$, as well as some reporting parameters) are changed at fixed times, between which the model simply runs.
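A configuration sequence of this kind can be sketched as a step function of the day number. This is a hypothetical representation, not the one the simulator uses. The name `make_schedule` is invented for the sketch, and the example values are the baseline used for the Cummings-Johnson runs later in this post.

```python
from bisect import bisect_right

def make_schedule(steps):
    """Build p_i(day) from (first_day, p_i) pairs sorted by first_day."""
    starts = [first_day for first_day, _ in steps]
    values = [p for _, p in steps]

    def p_i(day):
        if day < starts[0]:
            raise ValueError("day is before the start of the schedule")
        # index of the last step whose first_day is <= day
        return values[bisect_right(starts, day) - 1]

    return p_i

baseline = make_schedule([(0, 0.14), (40, 0.06), (120, 0.08),
                          (200, 0.06), (300, 0.08), (600, 0.07)])

for day in (0, 39, 40, 120, 599, 600, 1094):
    print(day, baseline(day))
```

The simulator would then look up `baseline(day)` at the start of each simulated day.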

Because there is inevitable variation between runs, the simulations get run several times, and the model also forks: if I wanted to look at the effect of changing parameters on, say, $d = 120$, a single simulation is run to $d = 119$ and then multiple copies are run on from then. This means that any variation before $d = 120$ is removed from the forks, since they all come from the same simulation run. This process can happen recursively if need be.
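Forking can be sketched as copying the whole simulation state and giving each copy its own random stream. The class `ToySim` below is a placeholder with made-up dynamics, there only to give `fork` something to copy. The sketch assumes only that the real simulator's state can be deep-copied.

```python
import copy
import random

class ToySim:
    """Placeholder simulator: the dynamics here are not the epidemic model."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.day = 0
        self.infected = 10

    def step(self):
        # made-up random walk, just so that forks visibly diverge
        self.infected = max(0, self.infected + self.rng.choice((-1, 0, 1, 2)))
        self.day += 1

def fork(sim, n):
    """Return n copies of sim that share its history but not its future."""
    forks = []
    for _ in range(n):
        child = copy.deepcopy(sim)
        # reseed so that each fork draws different random numbers from here on
        child.rng = random.Random(sim.rng.getrandbits(64))
        forks.append(child)
    return forks

base = ToySim(seed=42)
for _ in range(119):          # run the shared history up to d = 119
    base.step()
children = fork(base, 3)      # then run the copies on from there
for child in children:
    for _ in range(10):
        child.step()
print([child.infected for child in children])
```

Because `fork` can be applied to a fork, the recursive case needs nothing extra.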

Some example runs

Here are some simple cases which show the behaviour of the model.

Abandoning mitigation

Here is output for a model epidemic in which the mitigation is abandoned after 2 years:

Mitigated giving up after 2 years, cumulative deaths, population of 1 million

This is the output of a 4 year run of a model with

$n_p = 10^6$;
$p_i = 0.14$ initially;
$p_l = 10^{-8}$

For the unmitigated forks, $p_i$ remains at its initial value.

For the completely mitigated forks

\[ p_i = \begin{cases} 0.14&d \lt 40\\ 0.06&40 \le d \lt 120\\ 0.08&120 \le d \lt 200\\ 0.06&d \ge 200 \end{cases} \]

For the ‘giving up’ forks

\[ p_i = \begin{cases} 0.14&d \lt 40\\ 0.06&40 \le d \lt 120\\ 0.08&120 \le d \lt 200\\ 0.06&200 \le d \lt 730\\ 0.14&d \ge 730 \end{cases} \]

In other words what this is showing is a scenario where there is no vaccine, but mitigation is abandoned after about 2 years. Because some leakage happens, at some point after the mitigation is abandoned the epidemic takes off again and a lot of people die. Exactly when it takes off depends on chance, but in all 5 runs here it’s within about a year and a half.

Scaling the average results from this run to a population of 70 million³ results in the following figures, all to 3 significant figures:

551,000 deaths for the unmitigated epidemic;
40,300 deaths for the completely mitigated epidemic;
535,000 deaths for the epidemic in which mitigation is abandoned on day 730.
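The scaling and rounding step can be sketched as below. The helper names are hypothetical, and the input of 576 deaths is a made-up per-run mean chosen for illustration, not a figure from the runs.

```python
from math import floor, log10

def round_sig(x, figures=3):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0
    return round(x, figures - 1 - floor(log10(abs(x))))

def scale_deaths(mean_deaths, n_p=10**6, target=70 * 10**6):
    """Scale a mean death count from a population of n_p up to target."""
    return int(round_sig(mean_deaths * target / n_p))

print(scale_deaths(576))  # illustrative input only
```

A mean of 576 deaths per million scales to 40,320, which rounds to 40,300 at 3 significant figures.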

For the mitigated epidemic this is somewhat lower than what the UK has so far seen, but it is in the right area: the model is clearly not hopeless. In later runs I adjusted the mitigation slightly to compensate for this (see below).

What these results make clear is that, unless there is a vaccine³, mitigation has to continue essentially indefinitely, or the epidemic will take off again.

Chancy runaways

Here are two runs which have no initial infected population, $n_s = 0$: there are initially no infected people and the epidemic takes off due to leakage, with $p_l = 10^{-8}$ as before.

Firstly for a population of a million:

Unmitigated, no seeds, cumulative deaths, population of 1 million, 10 runs

Well, you can see that the epidemic takes off again after less than two years in all cases.

How likely this runaway is to happen in a given interval of time depends on the population size, as smaller populations experience fewer leakage events. Here is a run for a population of 10,000:

Unmitigated, no seeds, cumulative deaths, population of 10k, 10 runs

You can see that only one runaway happened in the three year simulation.
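The difference between the two population sizes can be checked with a back-of-envelope calculation. The sketch below assumes nearly everyone stays susceptible until a runaway starts. It uses the three year (1095 day) length for both populations, which is an assumption for the run with a million people, and the function names are hypothetical.

```python
def expected_leaks(n_p, p_l, days):
    """Expected leak infections if (nearly) everyone stays susceptible."""
    return n_p * p_l * days

def p_any_leak(n_p, p_l, days):
    """Probability of at least one leak infection in the same setting."""
    return 1 - (1 - p_l) ** (n_p * days)

for n_p in (10**6, 10**4):
    leaks = expected_leaks(n_p, 1e-8, 1095)
    chance = p_any_leak(n_p, 1e-8, 1095)
    print(f"n_p = {n_p}: {leaks:.2f} expected leaks, P(any leak) = {chance:.3f}")
```

For a million people this gives about 11 leaks per run, so some leak is essentially certain. For 10,000 people it gives about 0.11 leaks per run, so only about one run in ten sees any leak at all, which is in line with one runaway in ten runs. Not every leaked infection need start a runaway, so these are upper bounds.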

The Cummings-Johnson effect

To model this I started with an epidemic whose $p_i$ values are initially:

\[ p_i = \begin{cases} 0.14&d < 40\\ 0.06&40 \le d <120\\ 0.08&120 \le d < 200\\ 0.06&200 \le d < 300\\ 0.08&300 \le d < 600\\ 0.07&d \ge 600 \end{cases} \]

All of the models run for 3 years, or 1095 days, and in addition the unmitigated epidemic is always plotted⁴. Each model ran 5 times and quoted figures are averages, scaled to a population of 70 million, to 3 significant figures.

Cummings-Johnson on day 120

For this model

\[ p_i = \begin{cases} 0.14&d < 40\\ 0.06&40 \le d < 120\\ 0.08\times \left\{1.02, 1.05, 1.10\right\} &120 \le d < 200\\ 0.06\times \left\{1.01, 1.03, 1.06\right\} &200\le d < 300\\ 0.08\times \left\{1.005, 1.02, 1.04\right\} &300 \le d < 600\\ 0.07\times \left\{1.002, 1.01, 1.02\right\} &d \ge 600 \end{cases} \]

Where the triples of numbers represent the Cummings-Johnson effect causing weakening of social distancing of 2%, 5% and 10% respectively on day 120, with the weakening declining over time. Here are plots for this:

Cummings-Johnson on day 120, 2%, 5% and 10%, population of 1 million

Here:

the brown curves are the normal courses of the epidemic with and without mitigation;
the blue curves are 2%;
the orange curves are 5%;
the red curves are 10%.

The figures are:

551,000 deaths for the unmitigated epidemic;
63,100 deaths for the mitigated epidemic;
70,300 deaths for the 2% weakening;
86,500 deaths for the 5% weakening;
109,000 deaths for the 10% weakening.

Or in other words:

7,200 additional deaths for 2% weakening;
32,400 additional deaths for 5% weakening;
45,900 additional deaths for 10% weakening.

These numbers seemed far too high to me. And I also suspect that the epidemic in my model happens more slowly (takes more simulated days) than the real one. So I ran three more models, with the Cummings-Johnson effect taking place at successively later times.

Cummings-Johnson on day 200

For this model

\[ p_i = \begin{cases} 0.14&d < 40\\ 0.06&40 \le d < 120\\ 0.08&120\le d < 200\\ 0.06\times \left\{1.02, 1.05, 1.10\right\} &200\le d < 300\\ 0.08\times \left\{1.01, 1.03, 1.06\right\} &300 \le d < 600\\ 0.07\times \left\{1.005, 1.02, 1.04\right\} &d \ge 600 \end{cases} \]

As you can see this allows the mitigated epidemic to run until day 200, when the same decaying effect happens. Here are plots for this:

Cummings-Johnson on day 200, 2%, 5% and 10%, population of 1 million

Figures:

546,000 deaths unmitigated;
69,900 deaths mitigated;
75,100 deaths 2%;
93,700 deaths 5%;
128,700 deaths 10%.

Excess deaths:

5,200 2%;
18,600 5%;
53,600 10%.

This is a little better, but not much, and the 10% case is bizarrely bad.

Cummings-Johnson on day 300

For this model

\[ p_i = \begin{cases} 0.14&d < 40\\ 0.06&40 \le d < 120\\ 0.08&120\le d < 200\\ 0.06&200\le d < 300\\ 0.08\times \left\{1.02, 1.05, 1.10\right\} &300 \le d < 600\\ 0.07\times \left\{1.01, 1.025, 1.05\right\} &d \ge 600 \end{cases} \]

Here are plots for this:

Cummings-Johnson on day 300, 2%, 5% and 10%, population of 1 million

Figures:

551,000 deaths unmitigated;
59,800 deaths mitigated;
73,200 deaths 2%;
90,000 deaths 5%;
138,000 deaths 10%.

Excess deaths:

13,400 2%;
30,200 5%;
78,200 10%.

All these figures are worse than the day 200 case, which I think is because the big increase is happening when things are already too relaxed.

Cummings-Johnson on day 600

For this model

\[ p_i = \begin{cases} 0.14&d < 40\\ 0.06&40 \le d < 120\\ 0.08&120\le d < 200\\ 0.06&200\le d < 300\\ 0.08&300 \le d < 600\\ 0.07\times \left\{1.02, 1.05, 1.10\right\} &d \ge 600 \end{cases} \]

Here are plots for this:

Cummings-Johnson on day 600, 2%, 5% and 10%, population of 1 million

Figures:

546,000 deaths unmitigated;
61,700 deaths mitigated;
63,600 deaths 2%;
68,500 deaths 5%;
80,200 deaths 10%.

Excess deaths:

1,900 2%;
6,800 5%;
18,500 10%.

These seem a little less frightening.
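The excess figures are simply each total minus the mitigated total. A short sketch using the day 600 figures above, which also expresses each excess as a fraction of the mitigated deaths:

```python
mitigated = 61_700
with_effect = {"2%": 63_600, "5%": 68_500, "10%": 80_200}

# excess deaths: total with the effect minus the mitigated baseline
excess = {label: total - mitigated for label, total in with_effect.items()}

for label, extra in excess.items():
    print(f"{label} weakening: {extra} excess deaths, "
          f"{extra / mitigated:.1%} above the mitigated total")
```

Even in this mildest case the response is far from proportional: a 10% weakening gives about 30% more deaths.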

Why is it so fierce?

I was really surprised by how large the differences are. I think part of the answer can be seen by looking at $R$: at any point the progress of the epidemic goes something like $e^{\alpha (R -1)t}$, where $\alpha$ is some fudge factor. The only reason that the exponential runaway doesn’t continue is that $R$ is a function not only of $p_i$ but also of the proportion of people who are no longer susceptible. But if that proportion is low, which you very much want it to be, then everything is, more or less, exponential, and really tiny changes in $R$ can cause huge explosions.
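The sensitivity can be made concrete with the exponential form above. In the sketch below the value $\alpha = 1/14$ per day is purely an assumption, chosen to match the infectious period, since the text only calls $\alpha$ a fudge factor. $R = 1.12$ corresponds to $p_i = 0.08$ with almost everyone still susceptible, and the 80 days correspond to $120 \le d < 200$.

```python
from math import exp

def growth(r, days, alpha):
    """Growth factor e^(alpha * (R - 1) * t) from the text."""
    return exp(alpha * (r - 1) * days)

def relative_excess(r, weakening, days, alpha):
    """Ratio of the weakened epidemic's size to the unweakened one."""
    return growth(r * (1 + weakening), days, alpha) / growth(r, days, alpha)

ALPHA = 1 / 14  # assumed value, not taken from the model

for weakening in (0.02, 0.05, 0.10):
    ratio = relative_excess(1.12, weakening, 80, ALPHA)
    print(f"{weakening:.0%} weakening: {ratio:.2f} times as many cases")
```

Under these assumptions a 2% weakening gives about 1.14 times as many cases after 80 days and a 10% weakening about 1.90 times as many, so the response is much larger than the change that caused it.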

To control the epidemic over any length of time you need to keep $R = 1 - \epsilon$ where $\epsilon \ll 1, \epsilon > 0$: you want to do this because the epidemic will die out so long as $R < 1$, but the social and economic cost of keeping it significantly below 1 for any length of time is enormous. And for an epidemic which has infected, and therefore killed, only a relatively small proportion of the population, $R \approx R_0$. So the useful thing to look at is $\ln R$ & $\ln R_0$, as this shows small changes near $R = 1, R_0 = 1$ which is where all the action is⁵.

Here’s a plot of $\ln R$ and $\ln R_0$ for the Cummings-Johnson on day 120 2% variant, and the mitigated version without the 2% bump:

ln R, ln R0, Cummings-Johnson on day 120, 2% and mitigated

Interestingly you can see that, for $d \gtrapprox 500$ the Cummings 2% $R$ is lower than the mitigated $R$. But it’s significantly higher for $d \in [120, 200)$ and somewhat higher for $d \in [200, 300)$ (although less than 1 in the second interval).

So, well, very small changes for parameters in exponential processes can make very large differences: that should be obvious.

It certainly would be the case that runs with more principled values for things would be better. For instance my ‘decaying Cummings-Johnson effect’ is pretty ad-hoc: it would be better to model it by having some increase which exponentially decays with time, $p_i = p_{i0}e^{-(t - t_0)/\tau}$, as people forget, which would be easy to model. Maybe I will have a go at that in due course.
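One way to read that suggestion is sketched below. It applies the exponential decay to the increase rather than to $p_i$ itself, so that $p_i$ returns to its mitigated value as people forget. That reading, the function names and the value $\tau = 200$ days are all assumptions of the sketch.

```python
from math import exp

def weakening(day, start_day, initial, tau):
    """Fractional weakening: zero before start_day, then decaying."""
    if day < start_day:
        return 0.0
    return initial * exp(-(day - start_day) / tau)

def p_i_with_effect(base_p_i, day, start_day=120, initial=0.02, tau=200.0):
    """Mitigated p_i scaled up by the decaying weakening."""
    return base_p_i * (1 + weakening(day, start_day, initial, tau))

for day in (119, 120, 320, 1094):
    print(day, round(p_i_with_effect(0.08, day), 5))
```

Here `base_p_i` would come from the mitigated schedule for that day, replacing the hand-picked triples of multipliers.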

How many people will Cummings and Johnson kill?

I don’t know. This model is not adequate to give a numerically-correct answer by a long way: it’s full of assumptions, and is in any case an extremely oversimplified model⁶.

But I couldn’t get the number of people they will kill lower than 1,900, and I worked fairly hard to get it that low. I think my model is too sensitive, even though the numbers of people it kills for the mitigated epidemic are pretty reasonable and I did not fine-tune it for that, so I expect the real number will be somewhere between many hundreds and a few thousand. This is somewhere between mass murder and genocide⁷.

Did Cummings & Johnson do this deliberately? Probably not. Are these the only people they will kill, or even most of the people they will kill, due to their ideological, careless and incompetent handling of the epidemic and other things? No. Would the harm have been reduced if Johnson had promptly sacked Cummings? Yes. Would the harm still be reduced if he were to sack him now? Yes. Will he sack him? Of course not. Do either of them care that they will kill a lot of people? Definitely not: the people they have killed and will kill are only little people, like ants.

This is the glorious work of Dominic Cummings, aided and abetted by his idiot stooge, Boris Johnson.

Don’t expect me to lie,
don’t expect me to cry,
don’t expect me to die for thee.

He says he did not take breaks. This seems a deeply implausible claim given that he drove 260 miles with a small child in the car. ↩
Which, again, he claims none of his family visited. ↩
Another option is that the epidemic becomes globally extinct, when leakage would stop: this seems unlikely. ↩
This is not really helpful as it makes the plots harder to read. ↩
In my model I’m treating $R_0$ as something you adjust via changes to $p_i$, rather than a constant of the epidemic. $R_0 = p_i t_i$, and I am adjusting $p_i$. It would perhaps be better to say $R_0 = p_{i,0}t_i$ and then define $p_i = p_{i,0} - p_{i,m}$, where $p_{i,m}$ is the parameter you adjust, and use that together with the proportion of people remaining susceptible to define $R$: it doesn’t make any difference to what actually happens though. ↩
I would be extremely interested in results about the Cummings-Johnson effect from more serious models. Please get in touch if you know of any. I am happy to sign nondisclosure agreements if need be. ↩
Since we know that BAME people are disproportionately affected by CV19 this really is looking like genocide. Perhaps not a deliberate one, but I wonder how much Cummings & Johnson care that a bunch of BAME people will die because of their actions? Not much, I should think. ↩

Sexism in computer science

Tim Bradshaw — Sat, 09 May 2020 17:16:02 UT

Anyone who says that the facts show that men are innately better than women in computing either does not know the facts, does not understand them, or is lying.

The facts

In 1971, about 14% of US computer science and information science graduates were women. By 1984, about 38% were. But by 2011 the proportion had fallen to under 18%¹. Here is a graph of the proportions by year from 1971 to 2011:

CS & IS graduate ratio, US, 1971–2011

What the facts show

This entire process happened in about two generations: the proportion of women more than doubled in less than one generation, and then about halved in a generation. Some of the women studying CS in 2011 could be the daughters of the cohort of 1984, and the granddaughters of the 1971 cohort.

No genetic change in a human population can happen this fast: evolution operates on timescales of thousands to millions of years, not over a small number of decades. This means that whatever caused these changes was not a change in innate ability. There simply can be no question about that: there must be some other explanation, since the innate ability of women to do computer science, or any other innate ability, cannot have changed significantly over this period.

This means that the changes were caused by something environmental. Perhaps in 1984 there was enormous positive discrimination, or in 1971 and 2011 there was enormous negative discrimination, or some combination of the two².

This data is also perfectly compatible with the conclusion that women may be innately as good at computing as men: 38% is not very far from 50%, and if we assume some level of sexism in 1984³ it is easily possible that the underlying figure was 50%.

What this data tells us, unambiguously, is that whatever has caused these changes is environmental, and is not due to any differences in innate ability, as such changes simply cannot happen over this timescale. It also tells us that things have got a lot more skewed since 1984: progress in this area has not only stopped, it has been going backwards since the mid 1980s, and the situation now is only about 28% less skewed than it was in 1971.
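The arithmetic behind these statements can be written out. This uses the rounded figures quoted above (14%, 38%, 18%) and assumes that 'less skewed' refers to the relative change in the proportion of women, which is only one possible reading of the 28% figure:

$$
\begin{aligned}
\frac{38}{14} &\approx 2.7 && \text{(1971 to 1984: more than doubled)}\\
\frac{18}{38} &\approx 0.47 && \text{(1984 to 2011: about halved)}\\
\frac{18 - 14}{14} &\approx 0.29 && \text{(1971 to 2011: relative change)}
\end{aligned}
$$

Since the 2011 figure is 'under 18%', the last ratio comes out a little below 0.29, which is consistent with 'about 28%'.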

What the facts don’t show

What the data does not say is why this has happened, except that it is not due to changes in innate ability.

While it is almost certain that there was strong institutional discrimination against women in 1970, it seems unlikely that, in 2011, there was any kind of institutional discrimination, as this would be illegal and institutions are pretty good targets for legal action. So it seems unlikely that the decline is due to institutional discrimination. However all the data says is that there has been a decline: not why.

If we assume that most of the change is not due to institutional discrimination then it’s tempting to speculate on what did cause it. Well, I’m not going to do that: I have theories but they are based either on no evidence or on anecdotal evidence. Perhaps someone has done proper research into the causes, but I don’t know. There is a vast surfeit of theories based on little or no data, and outright made-up stuff on the internet — wild speculation, outright lies and ‘alternative facts’⁴ — and people are dying of this surfeit: I won’t add any more to it.

One possible inference is that women who, today, succeed at computing degrees, have done so against significant odds. It’s very likely that this means that they are better than men who achieve the same grades. So companies, if they are legally able to, might consider actively selecting female candidates for jobs, on the grounds that they are, probably, better.

In any area where people make a claim that some group is innately better than some other group based on some metric, and where the scores of one or both of those groups have changed radically over time, it is immediately safe to conclude that those claims are either lies, confusions or both, because either the metric is junk, or it is not measuring innate ability. The obvious example of this is racial ‘science’.

Source. Later figures may be available, but I couldn’t find them. I also don’t have the figures for other countries but I expect they are broadly similar. ↩
I worked in academic computing from shortly after 1984 to the late 1990s and although I am not female I can say with some certainty that there was not enormous positive discrimination. ↩
Again, in my experience there was some level of sexism in academia in this period. ↩
Which are, of course, lies. ↩

The U combinator

Tim Bradshaw — Mon, 09 Mar 2020 17:45:22 UT

The U combinator allows you to define recursive functions and I think it is simpler to understand than the Y combinator.

It’s not obvious how things like letrec get defined in Scheme, without using secret assignment. In fact I think they are defined using secret assignment:

(letrec ([f (λ (...) ... (f ...) ...)])
  ...)

turns into

(let ([f ...])
  (set! f (λ (...) ... (f ...) ...))
  ...)

But it’s interesting to see how you can define recursive functions without relying on assignment, including mutually-recursive collections of functions. One way is using the U combinator.

I suspect that there is lots of information about this out there, but it’s seriously hard to search for anything which looks like ’*-combinator’ now (even now I am starting a set of companies called ‘integration by parts’, ‘the quotient rule’ &c).

You can famously do this with the Y combinator, but I didn’t want to do that because Y is something I find I can understand for a few hours at a time and then I have to work it all out again. But it turns out that you can use something much simpler: the U combinator. It seems to be even harder to search for this than Y, but here is a quote about it:

In the theory of programming languages, the U combinator, $U$, is the mathematical function that applies its argument to its argument; that is $U(f) = f(f)$, or equivalently, $U = \lambda f \cdot f(f)$.

Self-application permits the simulation of recursion in the λ-calculus, which means that the U combinator enables universal computation. (The U combinator is actually more primitive than the more well-known fixed-point Y combinator.)

The expression $U(U)$ is the smallest non-terminating program.

(Text mildly edited from here, which unfortunately is not a site all about the U combinator other than this quote.)

Prerequisites

All of the following code samples are in Racket. The macros are certainly Racket-specific and some of the other code probably is as well. To make the macros work you will need syntax-parse via:

(require (for-syntax syntax/parse))

However note that my use of syntax-parse is naïve in the extreme: I’m really just an unfrozen CL caveman pretending to understand Racket’s macro system.

Also note I have not ruthlessly turned everything into λ: Rather than ((λ (...) ...) ...) there is (let ([... ...] ...) ...) in this code; there is use of multiple values including let-values; there is (define (f ...) ...) rather than (define f (λ (...) ...)) and so on.

Two versions of U

The first version of U is the obvious one:

(define (U f)
  (f f))

But this will run into some problems with an applicative-order language, which Racket is by default. To avoid that we can make the assumption that (f f) is going to be a function, and wrap that form in another function to delay its evaluation until it’s needed: this is the standard trick that you have to do for Y in an applicative-order language as well. I’m only going to use the applicative-order U when I have to, so I’ll give it a different name:

(define (U/ao f)
  (λ args (apply (f f) args)))

Note also that I’m allowing more than one argument rather than doing the pure-λ-calculus thing.

Using U to construct recursive functions

To do this we use a trick similar to the one used with Y: write a function which, if given as its argument a function which deals with the recursive cases, will return a recursive function. And obviously I’ll use the Fibonacci function as the canonical recursive function.

So, consider this thing:

(define fibber
  (λ (f)
    (λ (n)
      (if (<= n 2)
          1
          (+ ((U f) (- n 1))
             ((U f) (- n 2)))))))

This is a function which, given another function, U of which computes smaller Fibonacci numbers, will return a function which will compute the Fibonacci number for n.

In other words, U of this function is the Fibonacci function!

And we can test this:

> (define fibonacci (U fibber))
> (fibonacci 10)
55

So that’s very nice.
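For readers who don't use Racket, the same construction can be sketched in JavaScript. This is only a transliteration of the code above, with the names U and fibber carried over; it is not part of the original answer.

```javascript
// U applies its argument to itself.
const U = (f) => f(f);

// fibber takes a function f such that U(f) computes Fibonacci numbers,
// and returns a function which computes the Fibonacci number for n.
const fibber = (f) => (n) =>
  n <= 2 ? 1 : U(f)(n - 1) + U(f)(n - 2);

const fibonacci = U(fibber);
console.log(fibonacci(10)); // 55
```

The calls to U sit inside the inner function, so nothing is self-applied until a number is actually supplied: this is why the plain U is enough here.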

Wrapping U in a macro

So, to hide all this the first thing to do is to remove the explicit calls to U in the recursion. We can lift them out of the inner function completely:

(define fibber/broken
  (λ (f)
    (let ([fib (U f)])
      (λ (n)
        (if (<= n 2)
            1
            (+ (fib (- n 1))
               (fib (- n 2))))))))

Don’t try to compute U of this: it will recurse endlessly because (U fibber/broken) -> (fibber/broken fibber/broken) and this involves computing (U fibber/broken), and we’re doomed.

Instead we can use U/ao:

(define fibber
  (λ (f)
    (let ([fib (U/ao f)])
      (λ (n)
        (if (<= n 2)
            1
            (+ (fib (- n 1))
               (fib (- n 2))))))))

And this is all fine: ((U fibber) 10) is 55 (and terminates!).
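The difference between the two versions can be seen in any applicative-order language. Here is a hedged JavaScript sketch of the same pair of definitions: Uao is my rendering of U/ao, and the way the runaway recursion shows up (a thrown stack-overflow error) is a property of the JavaScript runtime, not of the Racket code above.

```javascript
const U = (f) => f(f);
// Applicative-order U: delay f(f) until the result is actually called.
const Uao = (f) => (...args) => f(f)(...args);

// Computing U(f) eagerly while building the function never bottoms out.
const fibberBroken = (f) => {
  const fib = U(f);
  return (n) => (n <= 2 ? 1 : fib(n - 1) + fib(n - 2));
};

// The same thing with the self-application delayed.
const fibber = (f) => {
  const fib = Uao(f);
  return (n) => (n <= 2 ? 1 : fib(n - 1) + fib(n - 2));
};

let brokenFailed = false;
try {
  U(fibberBroken);
} catch (e) {
  // The endless recursion exhausts the call stack and surfaces as an error.
  brokenFailed = true;
}
console.log(brokenFailed);  // true
console.log(U(fibber)(10)); // 55
```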

Purists can then turn let into λ in the usual way:

(define fibber
  (λ (f)
    ((λ (fib)
       (λ (n)
         (if (<= n 2)
             1
             (+ (fib (- n 1))
                (fib (- n 2))))))
     (U/ao f))))

And this is really all you need to be able to write the macro:

(define-syntax (with-recursive-binding stx)
  (syntax-parse stx
    [(_ (name:id value:expr) form ...+)
     #'(let ([name (U (λ (f)
                        (let ([name (U/ao f)])
                          value)))])
         form ...)]))

Or, for the pure of heart:

(define-syntax (with-recursive-binding stx)
  (syntax-parse stx
    [(_ (name:id value:expr) form ...+)
     #'((λ (name)
          form ...)
        (U (λ (f)
             ((λ (name)
                value)
              (U/ao f)))))]))

And this works fine:

(with-recursive-binding (fib (λ (n)
                               (if (<= n 2)
                                   1
                                   (+ (fib (- n 1))
                                      (fib (- n 2))))))
  (fib 10))

A caveat on bindings

One fairly obvious thing here is that there are two bindings constructed by this macro: the outer one, and an inner one of the same name. And these are not bound to the same function in the sense of eq?:

(with-recursive-binding (ts (λ (it)
                              (eq? ts it)))
  (ts ts))

is #f. This matters only in a language where bindings can be mutated: a language with assignment in other words. Both the outer and inner bindings, unless they have been mutated, are to functions which are identical as functions: they compute the same values for all values of their arguments. In fact, it’s hard to see what purpose eq? would serve in a language without assignment.

This caveat will apply below as well.

Two versions of U for many functions

The obvious generalization of U, U*, to many functions is that $U^*(f_1, \ldots, f_n)$ is the tuple $(f_1(f_1, \ldots, f_n), f_2(f_1, \ldots, f_n), \ldots)$. And a nice way of expressing that in Racket is to use multiple values:

(define (U* . fs)
  (apply values (map (λ (f)
                       (apply f fs))
                     fs)))

And we need the applicative-order one as well:

(define (U*/ao . fs)
  (apply values (map (λ (f)
                       (λ args (apply (apply f fs) args)))
                     fs)))

Note that U* is a true generalization of U: (U f) and (U* f) are the same.

Using U* to construct mutually-recursive functions

I’ll work with a trivial pair of functions:

an object is a numeric tree if it is a cons and its car and cdr are numeric objects;
an object is a numeric object if it is a number, or if it is a numeric tree.

So we can define ‘maker’ functions (with an ’-er’ convention: a function which makes an x is an xer, or, if x has hyphens in it, an x-er) which will make suitable functions:

(define numeric-tree-er
  (λ (nter noer)
    (λ (o)
      (let-values ([(nt? no?) (U* nter noer)])
        (and (cons? o)
             (no? (car o))
             (no? (cdr o)))))))

(define numeric-object-er
  (λ (nter noer)
    (λ (o)
      (let-values ([(nt? no?) (U* nter noer)])
        (cond
          [(number? o) #t]
          [(cons? o) (nt? o)]
          [else #f])))))

Note that for both of these I’ve raised the call to U* a little, simply to make the call to the appropriate value of U* less opaque.

And this works:

(define-values (numeric-tree? numeric-object?)
  (U* numeric-tree-er numeric-object-er))

And now:

> (numeric-tree? 1)
#f
> (numeric-object? 1)
#t
> (numeric-tree? '(1 . 2))
#t
> (numeric-tree? '(1 2 . (3 4)))
#f
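The same pair of mutually-recursive predicates can be sketched in JavaScript. This is a transliteration under some assumptions of mine: JavaScript has no multiple values, so Ustar returns an array, and a cons is modelled by a hypothetical { car, cdr } object with null standing in for the empty list.

```javascript
// U* for many functions: each function is applied to the whole collection.
const Ustar = (...fs) => fs.map((f) => f(...fs));

// Hypothetical cons cells: plain objects with car and cdr fields.
const cons = (car, cdr) => ({ car, cdr });
const isCons = (o) =>
  o !== null && typeof o === "object" && "car" in o && "cdr" in o;

const numericTreeEr = (nter, noer) => (o) => {
  const [, isNumericObject] = Ustar(nter, noer);
  return isCons(o) && isNumericObject(o.car) && isNumericObject(o.cdr);
};

const numericObjectEr = (nter, noer) => (o) => {
  const [isNumericTree] = Ustar(nter, noer);
  return typeof o === "number" || (isCons(o) && isNumericTree(o));
};

const [isNumericTree, isNumericObject] = Ustar(numericTreeEr, numericObjectEr);

console.log(isNumericTree(1));                      // false
console.log(isNumericObject(1));                    // true
console.log(isNumericTree(cons(1, 2)));             // true
console.log(isNumericTree(cons(1, cons(2, null)))); // false: the final null is not numeric
```

As in the Racket version, the calls to Ustar are inside the inner functions, so the delayed variant is not needed here.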

Wrapping U* in a macro

The same problem as previously happens when we raise the inner call to U* with the same result: we need to use U*/ao. In addition the macro becomes significantly more hairy and I’m moderately surprised that I got it right so easily. It’s not conceptually hard: it’s just not obvious to me that the pattern-matching works.

(define-syntax (with-recursive-bindings stx)
  (syntax-parse stx
    [(_ ((name:id value:expr) ...) form ...+)
     #:fail-when (check-duplicate-identifier (syntax->list #'(name ...)))
     "duplicate variable name"
     (with-syntax ([(argname ...) (generate-temporaries #'(name ...))])
       #'(let-values
             ([(name ...) (U* (λ (argname ...)
                                (let-values ([(name ...)
                                              (U*/ao argname ...)])
                                  value)) ...)])
           form ...))]))

And now, in a shower of sparks, we can write:

(with-recursive-bindings ((numeric-tree?
                           (λ (o)
                             (and (cons? o)
                                  (numeric-object? (car o))
                                  (numeric-object? (cdr o)))))
                          (numeric-object?
                           (λ (o)
                             (cond [(number? o) #t]
                                   [(cons? o) (numeric-tree? o)]
                                   [else #f]))))
  (numeric-tree? '(1 2 3 (4 (5 . 6) . 7) . 8)))

and get #t.

As I said, I am sure there are well-known better ways to do this, but I thought this was interesting enough not to lose. This originated as an answer to this Stack Overflow question.

Polkit: wat

Tim Bradshaw — Mon, 24 Feb 2020 16:41:11 UT

What polkit is, why you should worry about it, some ways to defang it.

What polkit is

Polkit¹ is part of the freedesktop.org project. The documentation for polkit describes what it does:

polkit provides an authorization API intended to be used by privileged programs (“MECHANISMS”) offering service to unprivileged programs (“SUBJECTS”) often through some form of inter-process communication mechanism. In this scenario, the mechanism typically treats the subject as untrusted. For every request from a subject, the mechanism needs to determine if the request is authorized or if it should refuse to service the subject. Using the polkit APIs, a mechanism can offload this decision to a trusted party: The polkit authority.

In other words, polkit provides a mechanism by which applications can run parts of themselves with elevated privilege, in a similar way to sudo and other mechanisms. There are no limits to the privilege that can be gained using polkit, and in particular there is nothing preventing it from allowing programs to run as any user, including root, via the pkexec utility. As well as polkit’s own documentation, the Wikipedia article on it is fairly good.

An example of the sort of problem that polkit wants to solve, I think, is that it’s desirable that someone using a desktop system should be able to turn it off without needing to be a privileged user. But it’s rather undesirable that someone using the same machine via ssh for instance should be able to turn it off, even if they are the same user. So there needs to be some framework which lets you express the idea that ‘if this person is using a GUI on the console of this machine, they should be able to shut it down, but they should not be able to do that if they are not using the GUI on the console (for instance, they should almost certainly not be able to set up a cron or at job to turn the machine off)’. There are enough other such operations, such as connecting USB disks to machines, which need to have similar controls around them to make a general framework worth having.

Polkit ships as part of the basic installs of several Linux distributions, including (but not limited to):

RHEL 7;
Ubuntu 19.10 (older version of polkit);
CentOS 7 & 8.

Polkit is included as part of server as well as desktop installs of these platforms. I’m not sure what purpose it serves on server installs: I suspect that it may be used for device management.

A simple example of pkexec

pkexec is a command-line tool which uses polkit to decide whether a user is allowed to run a command as another user, with that other user being, by default, root:

$ groups
tfb wheel
$ id -u
1000
$ pkexec id -u
==== AUTHENTICATING FOR org.freedesktop.policykit.exec ====
Authentication is needed to run `/usr/bin/id' as the super user
Authenticating as: Tim Bradshaw (tfb)
Password:
==== AUTHENTICATION COMPLETE ====
0
$

So you can see that pkexec is doing the same thing that sudo would do: it has some rules which say that tfb is allowed to do things as root and is then asking that user to authenticate themselves. In fact, as configured on the machine this ran on, tfb is allowed to become root by virtue of being in the wheel group (sudo has equivalent rules on this machine).

Enough polkit to be dangerous

Polkit is a big complicated system and part of an even bigger and more complicated system: in order to understand it you need to read the manuals, and also to understand how things like D-Bus work. I don’t understand all of those things, but here is enough information to be able to poke around in the configuration files and get some idea about what is going on. This is not a definitive guide: reading the manuals or the source is the only way to get that.

There have been at least two versions of polkit: I’m mostly describing the newer one here. As of 19.10, Ubuntu still uses an older version.

The names of things

An unprivileged program making a request to polkit to do something is known as a subject.
What the unprivileged program is asking for is an action.
A privileged program which performs an action is a mechanism.
The thing that verifies whether a given subject can get a given mechanism to perform a given action is the authority.
An authentication agent is something which is asked by the authority to get someone or something to authenticate themselves.

An overview of polkit

Polkit overview

In this figure:

links in red are (usually?) mediated by dbus;
polkitd is the authority at the centre of the process, and deals with checking if an action is allowed, and getting authentication for it;
the policies files describe what actions exist;
the rules files provide rules which tell you if a given requested action should be allowed.

The most important part of the process is polkitd, together with the rules and policies files it consults.

I am fairly sure that the requesting program (subject) and the privileged program (mechanism) can be the same: this is the case for pkexec for instance. However it could be the intent is that the subject is whatever invoked pkexec in this case.

polkitd

polkitd is the daemon which is at the centre of polkit. Its job is to serve as the authority: it answers the question of whether a given request should be allowed or not and deals with any required authentication by talking to an authentication agent. polkitd does not itself have any particular privilege, and runs as the polkitd user: the questions it answers can be very critical to security however.

polkitd is configured by two sets of files:

policy files, also known as action files, which describe what sort of ‘actions’ polkit knows about;
rules files, which describe the conditions under which a given action should be allowed.

Policy files

Policy files live in the /usr/share/polkit-1/actions/ directory, and have extension policy. All the files in that directory are read, and I’m reasonably sure that polkitd watches for changes in the directory and reads or rereads things appropriately.

Policy files are XML, and their content is described in polkit(8). The important elements are <action>s, which specify what the actions are. A given policy file can specify many actions. Because the files are XML and also because they often have a lot of internationalisation support they are fairly hard to read. However there’s a nice utility called pkaction which will tell you what actions exist and display them in a more readable format: pkaction on its own will list all of the available actions and pkaction --verbose will display details about them. You can also use the --action-id option to specify an individual action to display, as here:

$ pkaction --verbose --action-id org.freedesktop.policykit.exec
org.freedesktop.policykit.exec:
  description:       Run a program as another user
  message:           Authentication is required to run a program as another user
  vendor:            The polkit project
  vendor_url:        http://www.freedesktop.org/wiki/Software/polkit/
  icon:
  implicit any:      auth_admin
  implicit inactive: auth_admin
  implicit active:   auth_admin

This corresponds to the following XML fragment²:

<action id="org.freedesktop.policykit.exec">
  <description>Run a program as another user</description>
  <message>Authentication is required to run a program as another user</message>
  <defaults>
    <allow_any>auth_admin</allow_any>
    <allow_inactive>auth_admin</allow_inactive>
    <allow_active>auth_admin</allow_active>
  </defaults>
</action>

The org.freedesktop.policykit.exec action is the one that pkexec uses to do things: the policy file that specifies it is probably /usr/share/polkit-1/actions/org.freedesktop.policykit.policy.

The interesting part of action specifications in policy files is their defaults: these tell you what is required to perform the action in various circumstances. pkaction reports these defaults as implicit ... at the end. It’s not completely clear from the documentation, but I strongly assume that these are minimum requirements for the action to be performed. In the example above, anything requesting the action is required to authenticate as an administrative user, and that authentication is not remembered for any period.

Additionally there can be annotations added, which are key/value pairs which let you specify various things like paths.

Rules files

Rules files live in two locations: /etc/polkit-1/rules.d and /usr/share/polkit-1/rules.d, and have extension rules. All files in both directories are read, after being sorted in lexical order by filename, with files in /etc being read first when there’s a tie. The daemon watches for changes in the directories and rereads everything in that case.
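The ordering described above can be sketched as a small sort function. This is an illustration, not polkitd's code: I am assuming that 'lexical order by filename' means plain string comparison of the last path component, and the list of files in the usage example is hypothetical.

```javascript
// Last component of a path.
const basename = (p) => p.slice(p.lastIndexOf("/") + 1);
// Files in /etc win ties against files elsewhere.
const dirRank = (p) => (p.startsWith("/etc/") ? 0 : 1);

// Return the paths in the order in which they would be read.
function rulesFileOrder(paths) {
  return [...paths].sort((a, b) => {
    const [x, y] = [basename(a), basename(b)];
    if (x !== y) return x < y ? -1 : 1;
    return dirRank(a) - dirRank(b);
  });
}

const ordered = rulesFileOrder([
  "/usr/share/polkit-1/rules.d/50-default.rules",
  "/etc/polkit-1/rules.d/50-default.rules",
  "/usr/share/polkit-1/rules.d/00-pkexec.rules",
  "/etc/polkit-1/rules.d/49-polkit-pkla-compat.rules",
]);
console.log(ordered.join("\n"));
```

With this input the 00- file comes first even though it lives in /usr/share, and the two 50-default.rules files are read with the /etc one first.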

The contents of rules files are JavaScript. Polkit defines an object called polkit and there are various methods on this object which do useful things:

addRule(fn) adds a rule, which is a function which, given arguments representing an action and a subject, is responsible for saying if the action is allowed and what authorisation is needed to run it;
addAdminRule(fn) adds a rule, a function again, which gets to say what counts as being an administrator;
log(message) will log things in some suitable way;
spawn(argv) will spawn a program, capturing its output.

The functions added by addRule are called in the order they were added, until one returns a non-null result, which can either unconditionally allow or deny the action, or require authorisation of various kinds.

The functions added by addAdminRule are called in the order they were added until one returns a description of what an administrator is.

These functions can call polkit.log(...) to log things and polkit.spawn(...) to run programs.

There are bounds on how long a rule may run for, and also on how long programs spawned by polkit.spawn(...) can run for.

More details on the rules files are in the documentation.
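The 'first non-null result wins' behaviour of addRule can be illustrated with a toy stand-in for the polkit object. This is emphatically not polkitd's implementation: makeToyPolkit, its check method, the string values in Result and the org.example action ids are all assumptions made for the sake of the example.

```javascript
// A toy model of the polkit object, for experimenting with rule order only.
function makeToyPolkit() {
  const rules = [];
  return {
    Result: { NO: "no", YES: "yes", AUTH_ADMIN: "auth_admin", NOT_HANDLED: null },
    addRule(fn) {
      rules.push(fn);
    },
    log() {},
    // Call the rules in the order they were added until one gives a result.
    check(action, subject) {
      for (const rule of rules) {
        const result = rule(action, subject);
        if (result !== null && result !== undefined) return result;
      }
      return null; // no rule made a decision
    },
  };
}

const polkit = makeToyPolkit();
polkit.addRule((action, subject) =>
  action.id === "org.example.first" ? polkit.Result.NO : polkit.Result.NOT_HANDLED);
polkit.addRule((action, subject) => polkit.Result.YES);

console.log(polkit.check({ id: "org.example.first" }, {})); // "no"
console.log(polkit.check({ id: "org.example.other" }, {})); // "yes"
```

The second rule never sees the first action, which is why the position of a rule in the sort order matters so much.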

Example rules and actions

Here is a sample rule which tries to require administrator authentication to run pkexec:

polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.policykit.exec") {
        polkit.log("pkexec rule hit\n");
        return polkit.Result.AUTH_ADMIN;
    } else {
        polkit.log("pkexec rule missed\n");
        return polkit.Result.NOT_HANDLED;
    }});

If this is installed as, for instance /usr/share/polkit-1/rules.d/00-pkexec.rules then it will try to ensure that anyone trying to use pkexec requires administrator authorisation (equivalently: is required to authenticate themselves as an administrator). Since it is almost certainly first in the sort order, it also gets to control things before any other rules get their hands on things.

Except this rule does not work: it does catch actions whose id is org.freedesktop.policykit.exec, but these are not the only actions which pkexec can use: it can also use actions which have an org.freedesktop.policykit.exec.path annotation. For instance this policy file

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE policyconfig PUBLIC "-//freedesktop//DTD polkit Policy Configuration 1.0//EN"
"http://www.freedesktop.org/software/polkit/policyconfig-1.dtd">
<policyconfig>
  <vendor>The sinister TFEB organisation</vendor>
  <vendor_url>https://www.tfeb.org/</vendor_url>
  <action id="org.tfeb.tc.explode">
    <description>Explode</description>
    <message>Authentication is not required to explode</message>
    <annotate
        key="org.freedesktop.policykit.exec.path">/usr/sbin/explode</annotate>
    <defaults>
      <allow_any>yes</allow_any>
      <allow_inactive>yes</allow_inactive>
      <allow_active>yes</allow_active>
    </defaults>
  </action>
</policyconfig>

will allow /usr/sbin/explode to be run by pkexec with no authentication at all:

$ /usr/sbin/explode
exploded as UID 1000 GID 1000
$ pkexec /usr/sbin/explode
exploded as UID 0 GID 0

To catch this, one approach is to rely on the fact that the Action objects passed to the rule have properties which can be looked up with a lookup method, and pkexec sets a program property. So the following version of the above rule should catch all pkexec actions:

polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.policykit.exec"
        || action.lookup("program")) {
        polkit.log("pkexec rule hit\n");
        return polkit.Result.AUTH_ADMIN;
    } else {
        polkit.log("pkexec rule missed\n");
        return polkit.Result.NOT_HANDLED;
    }});

A similar rule can simply disable pkexec altogether³:

polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.policykit.exec"
        || action.lookup("program")) {
        polkit.log("pkexec rule hit\n");
        return polkit.Result.NO;
    } else {
        polkit.log("pkexec rule missed\n");
        return polkit.Result.NOT_HANDLED;
    }});

And now:

$ pkexec /usr/sbin/explode
Error executing command as another user: Not authorized

This incident has been reported.
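The gap between the first rule and the corrected one can be checked without touching a live system, by calling the rule functions on mock actions. This is a sketch under assumptions: mockAction is a hypothetical helper which only models the id property and the lookup method, and the result constants are plain stand-ins rather than the real polkit.Result values.

```javascript
// A mock Action: an id plus a lookup method over a table of properties.
const mockAction = (id, props = {}) => ({
  id,
  lookup: (key) => props[key],
});

const AUTH_ADMIN = "auth_admin"; // stand-in for polkit.Result.AUTH_ADMIN
const NOT_HANDLED = null;        // stand-in for polkit.Result.NOT_HANDLED

// The first rule: matches on the action id only.
const idOnlyRule = (action, subject) =>
  action.id == "org.freedesktop.policykit.exec" ? AUTH_ADMIN : NOT_HANDLED;

// The corrected rule: also matches anything which has a program property.
const idOrProgramRule = (action, subject) =>
  action.id == "org.freedesktop.policykit.exec" || action.lookup("program")
    ? AUTH_ADMIN
    : NOT_HANDLED;

const plain = mockAction("org.freedesktop.policykit.exec", { program: "/usr/bin/id" });
const annotated = mockAction("org.tfeb.tc.explode", { program: "/usr/sbin/explode" });

console.log(idOnlyRule(plain, {}));          // "auth_admin"
console.log(idOnlyRule(annotated, {}));      // null: the annotated action slips through
console.log(idOrProgramRule(annotated, {})); // "auth_admin"
```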

Why polkit is a security disaster

There are at least two reasons why the way polkit works is a security disaster:

expressing rules in JavaScript (or any general programming language) is a terrible idea;
the implementation is deficient.

Writing rules in a general-purpose language is a terrible idea

It might seem like a clever idea to write rules in JavaScript:

using a general-purpose programming language means that very general rules can be implemented;
given that decision JavaScript is a common language which is not entirely awful.

But in fact this is a terrible idea, just because it means that very general rules can be implemented. In particular it is not possible, even in principle, to statically determine what polkit will allow or deny. JavaScript is a fully-fledged programming language which means that the only way you can know what a program will do, in general, is to run it. There is, at least, no halting problem since the execution time of the rules is bounded, but all of the other problems associated with general-purpose programming languages are still present.

What this means is that any kind of security analysis of a system needs to

check the rules are valid JavaScript, which can be done statically;
check what the rules do, which can’t be done statically, but requires the rules to be run.
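The difference between the two kinds of check can be made concrete. In this sketch, parses uses the Function constructor to compile rule source without running it, while makeRule builds a rule whose answer depends entirely on what an external command prints. Both names, the stubbed spawn functions and the path /usr/local/bin/oracle are hypothetical.

```javascript
// Static check: does the source parse? The body is compiled but not run.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// A rule whose decision depends on the output of an external program.
const makeRule = (spawn) => (action, subject) =>
  spawn(["/usr/local/bin/oracle"]).trim() === "ok" ? "yes" : "no";

const source = "polkit.addRule(function(action, subject) { return polkit.Result.YES; });";
console.log(parses(source));                                       // true
console.log(parses("polkit.addRule(function(action, subject) {")); // false

// The same rule text gives different answers depending on the outside world.
console.log(makeRule(() => "ok\n")({}, {}));   // "yes"
console.log(makeRule(() => "nope\n")({}, {})); // "no"
```

Nothing in the text of the rule says which answer it will give: that is only known once the external program has been run.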

A possible counter argument to this is

Well, only very simple rules will ever be written: no-one is actually going to make use of all this power. In particular the rules people actually write will be so simple that they can in fact be analysed statically.

That’s exactly the same argument as

Well, no-one is ever going to do anything bad, so they can all have the root password.

and it’s equally stupid. Secure systems should make it impossible to do things which are undesirable, not rely on people just not doing them. The language in which rules are expressed should be just expressive enough to allow the options needed, but no more expressive than that, and it should certainly always be possible to statically analyse a rule to know what it will allow. Using a general-purpose programming language for rules is just dumb.

Just to drive home this point it turns out that the rules supplied with the system are indeed mildly hard to analyse: here is /etc/polkit-1/rules.d/49-polkit-pkla-compat.rules from a CentOS 8 system:

polkit.addAdminRule(function(action, subject) {
        //polkit.log('Starting pkla-admin-identities\n');
        // Let exception, if any, propagate to the JS authority
        var res = polkit.spawn(['/usr/bin/pkla-admin-identities']);
        //polkit.log('Got "' + res.replace(/\n/g, '\\n') + '"\n');
        if (res == '')
                return null;
        var identities = res.split('\n');
        //polkit.log('Identities: ' + identities.join(',') + '\n');
        if (identities[identities.length - 1] == '')
                identities.pop()
        //polkit.log('Returning: ' + identities.join(',') + '\n');
        return identities;
});

polkit.addRule(function(action, subject) {
        var params = ['/usr/bin/pkla-check-authorization',
                      subject.user, subject.local ? 'true' : 'false',
                      subject.active ? 'true' : 'false', action.id];
        //polkit.log('Starting ' + params.join(' ') + '\n');
        var res = polkit.spawn(params);
        //polkit.log('Got "' + res.replace(/\n/g, '\\n') + '"\n');
        if (res == '')
                return null;
        return res.replace(/\n$/, '');
});

Well, it’s possible to work out what this is doing, if you try hard. But note that, in particular, what it is doing is deferring to completely separate programs both to work out who administrative users are, and whether an action should be allowed. So now you need to understand those programs as well. And yes, it is doing all sorts of string hacking to parse the output of those programs, which is always a really good sign.

The implementation is deficient

Even given the design, polkit’s implementation is deficient.

The first and most obvious sign of deficiency is that rules can invoke external programs: those programs run as the polkitd user and can do anything it can do, including writing to the filesystem.

If SELinux is enabled on the system (which can be checked with sestatus), and if the correct policy is loaded, then it may well prohibit this, as polkit’s rules run under a policy which prevents them writing to the filesystem. But polkitd doesn’t check that SELinux is enforcing, or that the correct policy is in place: it just blunders on, trusting whatever external programs it runs to be well-behaved.

But this is only the start of the horrors. The actions, and even more so the rules, that polkitd uses are security-critical. If I can install an early rule such as, for instance

polkit.addRule(function(action, subject) {
    return polkit.Result.YES;
});

then I have completely bypassed security on the system, because pkexec will let me do anything with no authentication at all.

So polkit, and specifically polkitd, should be very careful about the ownership and permissions of the files and directories it looks at. In particular everything in the path down to any file it looks at should be owned by a privileged user and writable only by that user and polkitd. That user should almost certainly be root. polkitd should check this every time it reads anything.

It doesn’t do that. In fact it doesn’t check at all:

$ id
uid=1000(tfb) gid=1000(tfb) groups=1000(tfb),10(wheel)
$ pwd
/usr/share/polkit-1/rules.d
$ ls -ld .
drwxrwx---. 2 polkitd tfb 80 Feb 24 14:23 .
$ cat > 00-bypass.rules
polkit.addRule(function(action, subject) {
    return polkit.Result.YES;
});
$ pkexec
#

In the presence of a massive, easily detectable security compromise like this, polkitd should refuse to do anything at all and log security alerts. It doesn’t: it just blunders on.
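
The kind of check argued for above is not hard to write. Here is a minimal sketch, in JavaScript for Node.js rather than anything taken from polkitd, of walking every component of a path and reporting the first one that looks wrong. The names componentIsSafe, pathPrefixes and firstUnsafeComponent are made up for this illustration, and the policy (owned by root, not writable by group or other) is an assumption which is stricter than the default ownership described in this article. It ignores symbolic links, ACLs and races between checking and reading, so it is a sketch of the idea and not a complete answer.

```javascript
// Sketch only: these names and this policy are assumptions made for
// illustration, not anything polkit provides.
const fs = require("fs");
const path = require("path").posix; // polkit paths are POSIX paths

// A component is acceptable if it is owned by the trusted uid (root by
// default) and is writable by neither group nor other.
function componentIsSafe(stat, trustedUid = 0) {
  const groupOrOtherWritable = (stat.mode & 0o022) !== 0;
  return stat.uid === trustedUid && !groupOrOtherWritable;
}

// Every prefix of an absolute path, from the root down to the path itself.
function pathPrefixes(absolutePath) {
  const prefixes = [];
  let current = path.resolve(absolutePath);
  for (;;) {
    prefixes.unshift(current);
    const parent = path.dirname(current);
    if (parent === current) break; // reached the root
    current = parent;
  }
  return prefixes;
}

// Return the first component which fails the check, or null if none does.
// The stat function is a parameter so the check can be tried on fake data.
function firstUnsafeComponent(absolutePath, stat = fs.statSync) {
  for (const prefix of pathPrefixes(absolutePath)) {
    if (!componentIsSafe(stat(prefix))) return prefix;
  }
  return null;
}
```

A daemon doing this would call something like firstUnsafeComponent on each rules directory and file before reading it, and would refuse to go on, and log loudly, if the result was not null.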

Finally, the default owner of, for instance, /usr/share/polkit-1/rules.d/ is polkitd: this might seem reasonable, except that it means that any external program spawned by a rule could, for instance, write a rule (unless SELinux prevents this, which it will only do if it’s enabled). This is an acceptable risk only if you assume that no external program is ever compromised, even momentarily, and that if it is then all is immediately lost. It would also help if rules were easy to analyse: it’s quite possible to imagine a rule which could be persuaded to execute some program of an attacker’s choosing. This is all just extremely brittle: secure systems are not brittle.

I found these problems on rather casual inspection of polkit. There may very well be others, and I’d assume since I found these so easily that there are.

Conclusion

Polkit is yet another mechanism which allows privilege escalation on Linux systems: it has functionality broadly equivalent to programs like sudo. Every additional mechanism for privilege escalation increases the attack surface of the system and increases the burden on people who need to ensure the security of systems, and is thus undesirable of itself.

Additionally, polkit:

is significantly complicated;
has rules which govern privileged access which can’t be statically analysed in general by design, and which can invoke arbitrary programs during their evaluation;
has serious security problems in its implementation.

Polkit almost certainly contains other security problems. Red Hat, and probably other vendors, now ship polkit as part of core installs and will not support systems without it⁴. This means it’s hard to remove: a safe approach is therefore to defang it by installing a rule which simply denies access altogether: install a file in /etc/polkit-1/rules.d/00-defang.rules which contains

polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});

Such a rule should minimise the security risk from polkit, if it can’t be removed.

Appendices

Disclaimer

All of this is what I’ve worked out by playing around with polkit. Any of it may be wrong, and in particular all of the rules or actions above are only samples: you should check them yourself, and I’m not responsible if they don’t work.

If this happens⁵:

$ pkexec id -u
==== AUTHENTICATING FOR org.freedesktop.policykit.exec ====
Authentication is needed to run `/usr/bin/id' as the super user
Authenticating as: Tim Bradshaw (tfb)
Password:
polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
==== AUTHENTICATION FAILED ====
Error executing command as another user: Not authorized

This incident has been reported.

then this seems to be because of some problem with the authentication agent. Here is a terrible hack to make it work so you can test things.

Open another terminal window to the same machine.
In the main terminal window find the PID of the shell by echo $$.
In the second window run pkttyagent --process PID, using the PID from the previous step.
When you authenticate you will now get prompted by the pkttyagent running in the second window.

Yes, this is as horrid as it sounds, but it’s enough to get by.

Wat?

Wat.

Previously known as ‘PolicyKit’. ↩
The actual XML is more complicated than this as it includes versions of the description & message in several languages. The <action> element is also not the top-level element. ↩
DISCLAIMER: while I believe this rule disables pkexec completely, I don’t warrant that it does: caveat emptor. ↩
This raises questions about the approach of these companies to security, of course, which I’m not addressing here. ↩
This seems to be a problem with RHEL 8, but not RHEL 7 (based on experiments with CentOS 8 & 7 respectively). ↩

Function calling conventions and bindings

Tim Bradshaw — Fri, 04 Jan 2019 10:19:36 UT

An attempt to describe three well-known function calling conventions in terms of bindings.

A little while ago I wrote an article on bindings which, in turn, was based on my answer to this Stack Overflow question. I have since written another answer to a more recent question and I thought it would be worth summarising part of that to describe how three famous function calling conventions can be described in terms of bindings¹.

Bindings in brief

A binding is an association between a name (a variable) and a value, where the value can be any object the language can talk about. In most Lisps (and other languages) bindings are not first-class: the language can not talk about bindings directly, and in particular bindings can not be values. Bindings are, or may be, mutable: their values (but not their names) can be changed by assignment. Many bindings can share the same value. Bindings have scope (where they are accessible) and extent (how long they are accessible for) and there are rules about that.

Call by value

In call by value the value of a binding is passed to a procedure. This means that the procedure can not mutate the binding itself. If the value is a mutable object it can be altered by the procedure, but the binding can not be.

Call by value is the convention used by all Lisps I know of. Here is a function which demonstrates that call by value can not mutate bindings:

(defun pbv (&optional (fn #'identity))
  ;; If FN returns then the first value of this function will be T
  (let ((c (cons 0 0)))                 ;first binding
    (let ((cc c))                       ;second binding, shares value with first
      (funcall fn c)                    ;FN gets the *value* of C
      (values (eq c cc) c))))           ;C and CC still refer to the same object
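
The same experiment can be run in other call-by-value languages. Here is a rough JavaScript transcription of pbv (JavaScript also passes the values of bindings). The names tryToRebind and mutateValue, and the use of a plain object in place of a cons, are assumptions made just for this illustration.

```javascript
// Assigning to a parameter changes only the callee's own binding.
function tryToRebind(pair) {
  pair = { car: 99, cdr: 99 };
}

// Mutating the object is visible through every binding which shares it.
function mutateValue(pair) {
  pair.car = 99;
}

function pbv(fn) {
  let c = { car: 0, cdr: 0 }; // first binding
  let cc = c;                 // second binding, shares value with first
  fn(c);                      // fn gets the *value* of c
  return [c === cc, c];       // c and cc still refer to the same object
}
```

Calling pbv with tryToRebind leaves both bindings referring to the original, unchanged object, while calling it with mutateValue shows the shared object altered but both bindings still intact.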

Call by reference

In call by reference, procedures get the bindings themselves as arguments. If a procedure modifies the binding by assignment, then it is modified in the calling procedure as well.

Lisp does not use call by reference: Fortran does, or can, use a calling mechanism which is equivalent to call by reference².

It is possible to implement what is essentially call by reference in Lisp (here Common Lisp, but any Lisp with lexical scope, indefinite extent & macros can do this) using some macrology:

(defmacro capture-binding (var)
  ;; Construct an object which captures a binding
  `(lambda (&optional (new-val nil new-val-p))
     (when new-val-p
       (setf ,var new-val))
     ,var))

(declaim (inline captured-binding-value
                 (setf captured-binding-value)))

(defun captured-binding-value (cb)
  ;; value of a captured binding
  (funcall cb))

(defun (setf captured-binding-value) (new cb)
  ;; change the value of a captured binding
  (funcall cb new))

And now, given

(defun mutate-binding (b v)
  (setf (captured-binding-value b) v))

(defun sort-of-call-by-reference ()
  (let ((c (cons 1 1)))
    (let ((cc c))
      (mutate-binding (capture-binding cc) 3)
      (values c cc))))

> (sort-of-call-by-reference)
(1 . 1)
3

The trick here is that the procedure created by the capture-binding macro has access to the binding being captured, and can mutate it.
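
For comparison, here is a sketch of the same trick in JavaScript. It has no macros, so the closures which capture the binding have to be written out by hand at the point of capture, which is exactly the boilerplate the capture-binding macro hides. The function names are simply carried over from the Lisp, and the object with a value getter and setter is an assumption of this sketch.

```javascript
// Assign through a captured binding: b is an object whose 'value'
// property reads and writes somebody else's variable.
function mutateBinding(b, v) {
  b.value = v;
}

function sortOfCallByReference() {
  const c = { car: 1, cdr: 1 };
  let cc = c;
  // Hand-written equivalent of (capture-binding cc): both accessors
  // close over the binding of cc itself, not over its current value.
  const captured = {
    get value() { return cc; },
    set value(newValue) { cc = newValue; },
  };
  mutateBinding(captured, 3);
  return [c, cc];
}
```

As in the Lisp version, the first result is the untouched object and the second is 3: the binding of cc was changed from inside mutateBinding.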

Call by name

Call by name is the same as call by value, except the value of a binding is only computed at the point it is needed. Call by name is a form of delayed evaluation or normal-order evaluation strategy.

Lisp (at least Common Lisp: Lisps which have normal-order evaluation strategies exist) does not have call by name, but again it can be emulated with some macrology:

(defmacro delay (form)
  ;; simple-minded DELAY.  FORM is assumed to return a single value,
  ;; and will be evaluated no more than once.
  (let ((fpn (make-symbol "FORCEDP"))
        (vn (make-symbol "VALUE")))
    `(let ((,fpn nil) ,vn)
       (lambda ()
         (unless ,fpn
           (setf ,fpn t
                 ,vn ,form))
         ,vn))))

(declaim (inline force))

(defun force (thunk)
  ;; force a thunk
  (funcall thunk))

(defmacro funcall/delayed (fn &rest args)
  ;; call a function with a bunch of delayed arguments
  `(funcall ,fn ,@(mapcar (lambda (a)
                            `(delay ,a))
                          args)))

And now

(defun return-first-thunk-value (t1 t2)
  (declare (ignorable t2))
  (force t1))

(defun surprisingly-quick ()
  (funcall/delayed #'return-first-thunk-value
                   (cons 1 2)
                   (loop repeat 1000000
                         collect
                         (loop repeat 1000000
                               collect
                               (loop repeat 1000000
                                     collect 1)))))

> (time (surprisingly-quick))
Timing the evaluation of (surprisingly-quick)

User time    =        0.000
System time  =        0.000
Elapsed time =        0.001
Allocation   = 224 bytes
3 Page faults
(1 . 2)

The second argument to return-first-thunk-value was never forced, and so the function completes in reasonable time.
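
Here is a sketch of the same emulation in JavaScript. Without macros the caller has to wrap each argument in a function by hand, which is what funcall/delayed does automatically. Note that, like the Lisp delay above, this version remembers its value once forced, which strictly speaking makes it call by need: pure call by name would re-evaluate the argument on every use. The names are transliterations of the Lisp ones, and the expensive function is an invented stand-in for the nested loops.

```javascript
// Simple-minded delay: compute is a zero-argument function which is
// called at most once, the first time the thunk is forced.
function delay(compute) {
  let forced = false;
  let value;
  return () => {
    if (!forced) {
      value = compute();
      forced = true;
    }
    return value;
  };
}

// Force a thunk.
function force(thunk) {
  return thunk();
}

function returnFirstThunkValue(t1, t2) {
  return force(t1); // t2 is never forced
}

// Stand-in for a slow computation, counting how often it is run.
let expensiveCalls = 0;
function expensive() {
  expensiveCalls += 1;
  let total = 0;
  for (let i = 0; i < 1e9; i++) total += i;
  return total;
}

function surprisinglyQuick() {
  return returnFirstThunkValue(delay(() => [1, 2]), delay(expensive));
}
```

After calling surprisinglyQuick the counter expensiveCalls is still zero, because the second thunk was never forced.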

This, in turn, is distantly descended from a post on comp.lang.lisp by Erik Naggum. ↩
I think Fortran is allowed to implement its ‘by reference’ calls by copying any modified bindings back to the bindings in the parent procedure, and this is largely equivalent, at least for single-threaded code. ↩

Call by value in Scheme and Lisp

Tim Bradshaw — Tue, 11 Dec 2018 10:50:28 UT

I find the best way to think about this is to think in terms of bindings, rather than environments or frames, which are simply containers for bindings.

Bindings

A binding is an association between a name and a value. The name is often called a ‘variable’ and the value is, well, the value of the variable. The value of a binding can be any object that the language can talk about at all. Bindings, however, are behind-the-scenes things (sometimes this is called ‘not being first-class objects’): they’re not things that can be represented in the language but rather things that you can use as part of the model of how the language works. So the value of a binding can’t be a binding, because bindings are not first-class: the language can’t talk about bindings.

There are some rules about bindings:

there are forms which create them, of which the most important two are lambda and define;
bindings are not first-class — the language can not represent bindings as values;
bindings are, or may be, mutable — you can change the value of a binding once it exists — and the form that does this is set!;
there is no operator which destroys a binding;
bindings have lexical scope — the bindings available to a bit of code are the ones you can see by looking at it, not ones you have to guess by running the code and which may depend on the dynamic state of the system;
only one binding for a given name is ever accessible from a given bit of code — if more than one is lexically visible then the innermost one shadows any outer ones;
bindings have indefinite extent — if a binding is ever available to a bit of code, it is always available to it.

Obviously these rules need to be elaborated significantly (especially with regard to global bindings & forward-referenced bindings) and made formal, but these are enough to understand what happens. In particular I don’t really think you need to spend a lot of time worrying about environments: the environment of a bit of code is just the set of bindings accessible to it, so rather than worry about the environment just worry about the bindings.

Call by value

So, what ‘call by value’ means is that when you call a procedure with an argument which is a variable (a binding) what is passed to it is the value of the variable binding, not the binding itself. The procedure then creates a new binding with the same value. Two things follow from that:

the original binding can not be altered by the procedure — this follows because the procedure only has the value of it, not the binding itself, and bindings are not first-class so you can’t cheat by passing the binding itself as the value;
if the value is itself a mutable object (arrays & conses are examples of objects which usually are mutable, numbers are examples of objects which are not) then the procedure can mutate that object.

Examples of the rules about bindings

So, here are some examples of these rules.

(define (silly x)
  (set! x (+ x 1))
  x)

(define (call-something fn val)
  (fn val)
  val)

> (call-something silly 10)
10

So, here we are creating two top-level bindings, for silly and call-something, both of which have values which are procedures. The value of silly is a procedure which, when called:

creates a new binding whose name is x and whose value is the argument to silly;
mutates this binding so its value is incremented by one;
returns the value of this binding, which is one more than the value it was called with.

The value of call-something is a procedure which, when called:

creates two bindings, one named fn and one named val;
calls the value of the fn binding with the value of the val binding;
returns the value of the val binding.

Note that whatever the call to fn does, it can not mutate the binding of val, because it has no access to it. So what you can know, by looking at the definition of call-something is that, if it returns at all (it may not return if the call to fn does not return), it will return the value of its second argument. This guarantee is what ‘call by value’ means: a language (such as Fortran) which supports other call mechanisms can’t always promise this.

(define (outer x)
  (define (inner x)
    (+ x 1))
  (inner (+ x 1)))

Here there are four bindings: outer is a top-level binding whose value is a procedure which, when it is called, creates a binding for x whose value is its argument. It then creates another binding called inner whose value is another procedure, which, when it is called, creates a new binding for x to its argument, and then returns the value of that binding plus one. outer then calls this inner procedure with the value of its binding for x.

The important thing here is that, in inner, there are two bindings for x which are potentially lexically visible, but the closest one — the one established by inner — wins, because only one binding for a given name can ever be accessible at one time.

Here is the previous code (this would not be equivalent if inner was recursive) expressed with explicit lambdas:

(define outer
  (λ (x)
    ((λ (inner)
       (inner (+ x 1)))
     (λ (x)
       (+ x 1)))))

And finally an example of mutating bindings:

(define (make-counter val)
  (λ ()
    (let ((current val))
      (set! val (+ val 1))
      current)))

> (define counter (make-counter 0))
> (counter)
0
> (counter)
1
> (counter)
2

So, here, make-counter (is the name of a binding whose value is a procedure which, when called,) establishes a new binding for val and then returns a procedure it has created. This procedure makes a new binding called current which catches the current value of val, mutates the binding for val to add one to it, and returns the value of current. This code exercises the ‘if you can ever see a binding, you can always see it’ rule: the binding for val created by the call to make-counter is visible to the procedure it returns for as long as that procedure exists (and that procedure exists at least as long as there is a binding for it), and it also mutates a binding with set!.
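
One thing worth checking for yourself is that each call to make-counter creates a fresh binding for val, so separate counters do not interfere with each other. Here is a sketch of the same counter in JavaScript, whose closures behave the same way in this respect. makeCounter is just a transliteration of the Scheme, and the starting values are arbitrary.

```javascript
function makeCounter(val) {
  return () => {
    const current = val; // catch the current value of val
    val = val + 1;       // mutate the binding made by this call to makeCounter
    return current;
  };
}

const a = makeCounter(0);
const b = makeCounter(10);

// Each counter sees only its own binding for val.
console.log(a(), a(), b(), a(), b()); // 0 1 10 2 11
```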

Why not environments?

SICP, in chapter 3, introduces the ‘environment model’, where at any point there is an environment, consisting of a sequence of frames, each frame containing bindings. Obviously this is a fine model, but it introduces three kinds of thing (the environment, the frames in the environment and the bindings in the frame), two of which are utterly intangible. At least for a binding you can get hold of it in some way: you can see it being created in the code and you can see references to it. So I prefer not to think in terms of these two extra sorts of thing which you can never get any kind of handle on.

However this is a choice which makes no difference in practice: thinking purely in terms of bindings helps me, thinking in terms of environments, frames & bindings may well help other people more.

Shorthands

In what follows I am going to use a shorthand for talking about bindings, especially top-level ones:

‘x is a procedure which …’ means ‘x is the name of a binding whose value is a procedure which, when called, …’;
‘y is …’ means ‘y is the name of a binding the value of which is …’;
‘x is called with y’ means ‘the value of the binding named by x is called with the value of the binding named by y’;
‘… binds x to …’ means ‘… creates a binding whose name is x and whose value is …’;
‘x’ means ‘the value of x’;
and so on.

Describing bindings like this is common, as the fully-explicit way is just painful: I’ve tried (but probably failed in places) to be fully explicit above.

The answer

And finally, after this long preamble, here’s the answer to the question you asked¹.

(define (make-withdraw balance)
  (λ (amount)
    (if (>= balance amount)
        (begin (set! balance (- balance amount))
               balance)
        "Insufficient funds")))

make-withdraw binds balance to its argument and returns a procedure it makes. This procedure, when called:

binds amount to its argument;
compares amount with balance (which it can still see because it could see it when it was created);
if there’s enough money then it mutates the balance binding, decrementing its value by the value of the amount binding, and returns the new value;
if there’s not enough money it returns "Insufficient funds" (but does not mutate the balance binding, so you can try again with a smaller amount: a real bank would probably suck some money out of the balance binding at this point as a fine).

Now

(define x (make-withdraw 100))

creates a binding for x whose value is one of the procedures described above: in that procedure balance is initially 100.

(define (f y) (y 25))

f is a procedure (is the name of a binding whose value is a procedure, which, when called) which binds y to its argument and then calls it with an argument of 25.

(f x)

So, f is called with x, x being (bound to) the procedure constructed above. In f, y is bound to this procedure (not to a copy of it, to it), and this procedure is then called with an argument of 25. This procedure then behaves as described above, and the results are as follows:

> (f x)
75
> (f x)
50
> (f x)
25
> (f x)
0
> (f x)
"Insufficient funds"

Note that:

no first-class objects are copied anywhere in this process: there is no ‘copy’ of a procedure created;
no first-class objects are mutated anywhere in this process;
bindings are created (and later become inaccessible and so can be destroyed) in this process;
one binding is mutated repeatedly in this process (once for each call);
I have not anywhere needed to mention ‘environments’, which are just the set of bindings visible from a certain point in the code and I think not a very useful concept.

I hope this makes some kind of sense.

A more elaborate version of the above code

Something you might want to be able to do is to back out a transaction on your account. One way to do that is to return, as well as the new balance, a procedure which undoes the last transaction. Here is a procedure which does that (this code is in Racket):

(define (make-withdraw/backout
         balance
         (insufficient-funds "Insufficient funds"))
  (λ (amount)
    (if (>= balance amount)
        (let ((last-balance balance))
          (set! balance (- balance amount))
          (values balance
                  (λ ()
                    (set! balance last-balance)
                    balance)))
        (values
         insufficient-funds
         (λ () balance)))))

When you make an account with this procedure, then calling it returns two values: the first is the new balance, or the value of insufficient-funds (by default "Insufficient funds"), the second is a procedure which will undo the transaction you just did. Note that it undoes it by explicitly putting back the old balance, because I think you can’t necessarily rely on (= (- (+ x y) y) x) being true in the presence of floating-point arithmetic. If you understand how this works then you probably understand bindings.
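
The floating-point worry is easy to check. Here is a small sketch in JavaScript, whose numbers are IEEE 754 double floats. The particular values 0.1 and 0.2 are just a convenient example, and other pairs may well round-trip exactly.

```javascript
const x = 0.1;
const y = 0.2;

// Add y and take it away again: mathematically this is x.
const roundTrip = (x + y) - y;

console.log(roundTrip === x); // false
console.log(roundTrip);       // 0.10000000000000003

// Putting back a saved value, as the backout procedure does, is exact.
const saved = x;
console.log(saved === x);     // true
```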

This originated as an answer to this Stack Overflow question. ↩

Worse is better

Tim Bradshaw — Wed, 28 Nov 2018 12:46:50 UT

In 1990, Richard Gabriel gave a talk from which Jamie Zawinski later extracted a section called ‘worse is better’ which he distributed widely. It’s strange but, perhaps, interesting, how prescient this idea was.

The paper describes two approaches to design¹.

The Right Thing

Designs must be simple, both in implementation and interface. It is more important for the interface to be simple than the implementation.
Designs must be correct in all observable aspects. Incorrectness is simply not allowed.
Designs must be consistent. A design is allowed to be slightly less simple and less complete to avoid inconsistency. Consistency is as important as correctness.
Designs must be complete and cover as many important situations as is practical. All reasonably expected cases must be covered. Simplicity is not allowed to overly reduce completeness.

Worse Is Better

Designs must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
Designs must be correct in all observable aspects. It is slightly better to be simple than correct.
Designs must not be overly inconsistent. Consistency can be sacrificed for simplicity in some cases, but it is better to drop those parts of the design that deal with less common circumstances than to introduce either implementational complexity or inconsistency.
Designs must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.

Today

Today I felt it necessary to complain about a particularly stupid bit of behaviour in a filesystem, & I wrote, without conscious thought

[…] that something like this is even possible in 2018 means that, really, the sort of computing environment which seemed like it would happen in 1980 and still seemed possible into the late 1990s is just dead: worse is not just better, worse has taken better, killed it, buried it in a pit and erased any memory that it ever existed.

Of course no-one is listening (I expect none of the people I sent this to would even have recognised the term) just as no-one, or no-one who counted, listened to the original paper. But everything that is making modern computing systems so horrible (all the hardware bugs, all the systemic insecurity that is going to cost us very dearly if it hasn’t already, all of it) is because no-one listened and worse won by default as a result.

Today few people even remember that there was once an option to do things a better way. Soon, no-one will.

Oh, well.

These descriptions are stolen almost directly from the original: any errors I have introduced by rewording things are my own. ↩

Vellum

Tim Bradshaw — Thu, 22 Jun 2017 14:58:37 UT

The UK keeps its laws on vellum: this seems to be a ludicrously archaic thing to do: is it?

Don’t preserve physical artifacts: preserve information

People who deal with archives are used to dealing with physical objects and worrying about their longevity. So they worry about how long paper and vellum last, what their decay mechanisms are and how they can be minimised. Everything is kept in controlled conditions so that the physical objects last as long as they can. Thus it is tempting to think that preserving information is the same thing as preserving the physical objects in which it resides: to preserve digital information you must preserve the media (tape, disks and so on) on which it resides. But we know that these media have rather short lifetimes (perhaps a few tens of years at the outside) and even when the media survive, there may be no way of reading them since the infrastructure on which they relied has gone.

This is, of course, confused: to preserve information you do not need to preserve the media on which it resides for any length of time. Since digital information can be copied without loss (or with a very low chance of loss), what you do instead is repeatedly copy the information onto current media. Preserving information is not the same as preserving physical artifacts: rather than a sacred disk rotting in a vault you keep the data spinning all the time on many copies of current media. I have files which originated on Fujitsu Eagles: I doubt there are very many Eagles still spinning or machines which can use them, but the information isn’t in any danger of being lost.

Don’t preserve information: preserve physical artifacts

Everything above is wrong, because it makes a critical assumption which is not true.

You can always keep information on current media.

This is true only if you are continually working on the system: in order to keep information spinning you need to be willing to buy new systems, transfer the information to the new systems, and keep the power on. But there is no evidence that we can keep the power on for any length of time, and plenty of evidence that we can’t.

This isn’t just dealing with a possible collapse of advanced civilisation, although archivists should worry about that: it’s happened before, and there is no reason to believe it won’t happen again. If we go through a period of several hundred years where our society retreats to some preindustrial (or just pre–1970) level, how much of our digitally-stored information will survive? My guess is that almost none will. And such a collapse is likely.

But much less than that is needed for information to be lost. Consider some large scientific data set — climate data for instance. What happens if political power gets into the hands of people for whom that data is inconvenient, and who remove funding from the organisations which look after that data? It may persist for a while, on ageing disk arrays and tapes, until enough of the redundancy goes away; it may persist for a while even after the power is removed from the systems which hold it. But it will not persist when the rent isn’t paid on the buildings in which those systems live. Within quite a short time that information will be irretrievably lost.

The archivists turn out to be right: if you want to preserve information it needs to live on media which remain readable for long periods of time with minimal requirements. In particular there must be no requirement for frequent replacement of hardware, for human intervention, or for power. Choosing a medium, samples of which have already survived for long periods, is a good idea as well. Vellum is not such a bad choice if you only need to preserve a small amount of information. Large scientific data sets present a different problem, but ‘just keep the data spinning’ is probably not a very good solution.

Dynamic scope and macros

Tim Bradshaw — Thu, 26 Jan 2017 13:56:36 UT

I’ve recently been writing some Emacs Lisp code to do some massaging of files. Quite apart from having forgotten how primitive elisp is, I hadn’t realised before how hostile dynamic scope was for macros in particular.

A very common pattern for macros is call-with-* / with-*, in which there is a functional level which is wrapped by a more syntactically-friendly macro level. For instance, in Common Lisp you can map over lists with mapcar:

(mapcar
 (lambda (e)
   ...)
 ...)

but you might want to map over them with a syntax like

(mapping (e ...)
  ...)

Well, it’s easy to implement this:

(defmacro mapping ((e l) &body forms)
  `(mapcar (lambda (,e) ,@forms) ,l))

Even with CL’s unhygienic macro system & without a mass of gensymmery such a macro is safe.

A good example where CL exposes one side of a pattern like this is with-open-file: you can easily see how to implement this in terms of a function:

(defun call/open-file (fn filespec &rest keys
                          &key &allow-other-keys)
  (let ((s nil))
    (unwind-protect
        (progn
          (setf s (apply #'open filespec keys))
          (funcall fn s))
      (when s (close s)))))

(defmacro with-open-file* ((sn filespecn &rest keysn 
                               &key &allow-other-keys)
                           &body forms)
  `(call/open-file (lambda (,sn) ,@forms)
                   ,filespecn ,@keysn))

(This is probably not completely robust code: it’s just meant to get the idea across.)
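
For readers more used to other languages, the functional half of this pattern can be sketched in JavaScript, which has no macros and so has only the call-with-* layer. The name callWithResource, the fake resource, and the protocol of an object with a close() method are all assumptions made for this example, not a real API.

```javascript
// Open a resource, pass it to fn, and close it however fn exits.
// 'open' is any zero-argument function returning an object with a
// close() method: that protocol is an assumption of this sketch.
function callWithResource(open, fn) {
  let resource = null;
  try {
    resource = open();
    return fn(resource);
  } finally {
    if (resource !== null) resource.close();
  }
}

// A fake resource which records whether it has been closed.
function openFake() {
  return {
    closed: false,
    close() { this.closed = true; },
  };
}
```

A call looks like callWithResource(openFake, (r) => ...): the function written out at the call site is what a with-* macro would write for you.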

Scheme exposes the other side of this pattern with call/cc:

(define-syntax-rule (with-cc (c) form ...)
  (call/cc (λ (c) form ...)))

(define-syntax-rule may be specific to Racket but, again, this is just meant to get the idea across.)

Well, now think about something like the above call/open-file / with-open-file* in a Lisp dialect with dynamic scope. In particular, what does this do:

(let ((s t))
  (with-open-file* (h ...)
    (when s ...)))

This expands to

(let ((s t))
  (call/open-file (lambda (h) (when s ...)) ...))

But call/open-file binds s: so the binding of s in the called function is different than the outer binding, and nothing works.

Well, of course, this is something that happens pervasively with dynamically-scoped languages: every binding above you (or below you, depending on your viewpoint) matters, and can infect your namespace. But it’s particularly toxic for macros, because macros very often interpose bits of code into your code, and that code can include bindings which are dynamically, but not lexically, visible, even in the expansion of the macro. Dynamic scope enormously increases the hygiene problems of a macro system.
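
The capture can be modelled in a few lines. This is a toy model of dynamic scope in JavaScript (which is itself lexically scoped): a single stack of bindings which is searched from the most recent end. Everything here, including the names dynamicLet, dynamicRef, callOpenFile and userCode and the placeholder stream value, is invented for the illustration.

```javascript
// A toy model of dynamic scope: one global stack of bindings.
const dynamicBindings = [];

// Run body with name dynamically bound to value, undoing the binding after.
function dynamicLet(name, value, body) {
  dynamicBindings.push({ name, value });
  try {
    return body();
  } finally {
    dynamicBindings.pop();
  }
}

// Look a name up: the most recently established binding wins.
function dynamicRef(name) {
  for (let i = dynamicBindings.length - 1; i >= 0; i--) {
    if (dynamicBindings[i].name === name) return dynamicBindings[i].value;
  }
  throw new ReferenceError(name + " is unbound");
}

// Stand-in for call/open-file: like the Lisp version it binds s internally.
function callOpenFile(fn) {
  return dynamicLet("s", "<stream>", () => fn(dynamicRef("s")));
}

// Stand-in for the user's code: it binds its own s and expects to see it
// inside the body which the macro wrapped in a function.
function userCode() {
  return dynamicLet("s", true, () =>
    callOpenFile((h) => dynamicRef("s")));
}
```

Here userCode returns the placeholder stream and not true: the body sees the binding of s made by callOpenFile, which the person writing the body had no way of knowing about.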

Dynamic scope is really useful as an option, and systems written in languages which don’t have it generally have to reinvent it, usually badly. But it’s just toxic and horrible as the only option. I can’t understand any more how I managed to use lisps with dynamic scope at all: perhaps I never wrote macros or just expected things to behave in a mysterious and strange way occasionally. Fortunately, even elisp now has the option of being lexically scoped.

Attacks on financial market infrastructure

Tim Bradshaw — Tue, 26 Jul 2016 12:10:30 UT

A recent article in The Economist talks about a plausible attack on the financial system: If financial systems were hacked: Joker in the pack. I liked this article, although I think it was a little naïve in two ways.

Firstly it wasn’t clear enough that the ‘recover from a serious incident in two hours’ claim is fantasy. Of course everyone would like to be able to do that and will state to regulators that they can do so, and perhaps some people in the organisations concerned really believe that they can do so. And there are mechanisms in place (DR systems, business continuity volumes and so on) which, for a suitably nice incident, will indeed allow very rapid recovery if everyone is on the ball. But for the sort of incidents described in the article — for instance an incident where you don’t trust your data and soon realise that all your backups for some unknown but long interval are also suspect — the recovery time is likely to be much longer than two hours. Indeed, the important question would be whether recovery is possible at all. There have been much smaller incidents, not caused by malice, where complete recovery was never achieved in the sense that some transactions were lost altogether: there is no reason to assume that full recovery is even possible from a really major attack.

Secondly and more seriously the article perpetuates the myth of ‘state sponsored actors’: the assumption being that only with the resources of a state would such an attack be possible, and since even malignant states have no interest in this kind of chaos these attacks are not a real worry. This is a touchingly 1950s view: although everyone knows how to make, say, a fission weapon, to actually make one you need to be able to mine huge quantities of ore, run vast numbers of centrifuges and so on, and do this secretly and securely, and only states have that kind of ability. The argument seems to be that breaking into computer systems is somehow a similarly industrial enterprise: perhaps you need vast caverns with serried ranks of hacker drones, relentlessly typing billions of lines of code or something, or enormous super-powerful computers to brute-force encryption. Well, of course, you don’t: you need a small number (possibly one) of sufficiently motivated people with the right skills who can find and exploit a weakness (probably a human weakness) in the system rather than launching the primitive industrial-scale brute-force attack that seems to be what the article imagines. And while states may not be interested in chaos, these tiny groups may well be.

In summary: it’s a good article but it understates the consequences of such attacks, and misrepresents the likely attackers in a way which makes such attacks seem much less plausible.

I hope that these confusions exist only in the minds of journalists, but I fear that the people actually responsible for the security of financial infrastructure also believe them, or at least pretend to do so as such beliefs are very convenient. I have certainly heard both myths repeated by people who ought to know better.

This is derived from a comment I made on an article in Bruce Schneier’s blog, in turn based on some personal experience in the financial services industry.

Python instead of Lisp

Tim Bradshaw — Thu, 09 Jun 2016 18:43:40 UT

Lots of people, even famous Lisp hackers, like to claim that ‘Python can be seen as a dialect of Lisp with “traditional” syntax’.

Being famous does not make them right.

Python is nothing like Lisp

Expression language. Lisp is an expression language: everything in the language is an expression and has a value, and there is no distinction between expressions and statements, because there are no statements. Python is not: it has expressions, such as 2+3 and lambda x: x*2, and statements, such as x = 3. If expressions and statements are different things then writing macros and any kind of general-purpose lambda becomes very difficult.
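The split can be checked mechanically. This is a small sketch using the standard `ast` module; the helper name `is_expression` is made up for illustration. It asks Python's own parser whether a piece of source is a single expression.

```python
import ast

def is_expression(source):
    """Return True if source parses as a single Python expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False

examples = ["2+3", "lambda x: x*2", "x = 3", "lambda x: x = 3"]
results = {src: is_expression(src) for src in examples}
# results:
# {"2+3": True, "lambda x: x*2": True, "x = 3": False, "lambda x: x = 3": False}
```

The last case is the one that matters for lambda: an assignment statement can't appear in its body, because the body must be an expression.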

Conses. Lisp has conses, Python does not. Conses are not everything¹, but unless you have them you can’t implement them reasonably, and they are extremely useful data structures for many purposes. In particular for conses to be useful you need two things:

a good syntax for them and for lists built from them;
good performance — conses should be extremely cheap, so you can’t implement them as a special case of some heavyweight data structure such as a Python list, because there is an enormous header.

This means that conses need to be wired into the language: you can’t take a language without conses and add them, because even if you can get the first (you can’t in Python) you can’t get the second.

Symbols. Lisp has symbols, Python does not. You can use strings, and this works sometimes.

Lambda. Lisp has lambda, Python has an extremely limited version. Not being an expression language (see above) and the lack of scoping and block constructs in Python cripple its lambda.

Source code available as a low-commitment data structure. Lisp has this, Python does not. ‘Low-commitment’ means that it is available before it has been decided what it means, but after it has been turned from a stream of characters into something more interesting. This matters because it makes macros possible: macros which work by transforming streams of characters are doomed to the sort of unspeakable horror of which Jinja2 is a good example, while macros which work after it has been decided what the code means can’t then make their own decision about what it means, which is half the point of macros.

Scoping. Lisp has a multiplicity of scoping constructs and all modern Lisps have lexical scope, with some (Scheme) extending this to control constructs. Binding and assignment are irreparably confused in Python: scope does not work properly and this can never be fixed. A language which requires a global declaration is not going to be fixed by adding nonlocal.
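A minimal sketch of the confusion between binding and assignment, using only standard Python (the function names are illustrative). Because assignment also serves as binding, assigning to a name anywhere in a function makes that name local to it, unless a declaration says otherwise.

```python
def make_counter_broken():
    count = 0
    def increment():
        # This assignment makes count local to increment, so the read
        # on the right-hand side fails: the local is not yet bound.
        count = count + 1
        return count
    return increment

def make_counter():
    count = 0
    def increment():
        nonlocal count  # declare that the assignment is not a new binding
        count = count + 1
        return count
    return increment

try:
    make_counter_broken()()
    outcome = "ok"
except UnboundLocalError:
    outcome = "UnboundLocalError"

counter = make_counter()
values = [counter(), counter(), counter()]
```

Here `outcome` is `"UnboundLocalError"` and `values` is `[1, 2, 3]`. A Lisp needs no such declarations because let and set! (or setq) are different operations.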

Macros. Lisp has them, Python doesn’t. Since macros are the point of Lisp, it is really hard to see how the above quote makes any kind of sense.

There is a terrible truth about the perceived arrogance of Lisp hackers that it has taken me a long time to understand. The arrogance is justified: Lisp is, in fact, a better programming language.

In particular conses are not a useful universal data structure in the way that, perhaps, early Lisp people thought they were. ↩

Macros in Racket, part three: checking boolean operators

Tim Bradshaw — Sat, 12 Dec 2015 10:59:54 UT

I wanted to see if I could write a mildly complicated macro in Racket without becoming too confused. I can, although I am not sure it is terribly idiomatic.

This is the third part of a series on writing macros in Racket for someone used to Common Lisp, although it is mostly independent of the previous parts. The previous parts are part one & part two.

One of the nice things about Lisp-family languages is that you can write your own control constructs, and it’s essentially easy to do so: if when did not exist then you could write it:

(define-syntax-rule (when test form ...)
  (and test
       (begin form ...)))

This kind of extensibility is one of the wonders of Lisp and Scheme: it’s tempting to say that it makes them better than programming languages which can’t do this but that’s not correct: it makes them incomparable to such languages: Lisp¹ programs can reason about themselves and often do². Everything about Lisp really leads to this ability.

When I taught (Common) Lisp to people one of the things I would try to get across was this ability of macros to extend the control constructs in the language: people often thought of macros as a way of essentially inlining code³, but that’s not what they’re actually good for. If you can add control constructs to your language, then you can make a new language, and that’s what Lisp macros are about, and therefore what Lisp is about.

A good way to get this across to people is to pretend that Lisp doesn’t have some control construct, and write it as a macro. This is easier than inventing new control constructs both because it doesn’t require thinking of a domain where they might be useful and because the existing control constructs have clear semantics. Reimplementing existing control constructs also demonstrates how the language is already built up from a more primitive language by macros and how the approach to solving problems in Lisp is to design and implement a language in which to talk about the problem, where that language is seamlessly built on the underlying Lisp, and can inherit all of its power and flexibility, including the ability to extend the language.

An advantage of reimplementing existing control constructs for teaching Lisp is that you can compare the new construct to the existing one, and with some small constraints you can do this exhaustively, so you can know whether you have actually implemented it right. This is, obviously, not possible in general, but if the operator has trivial syntax (so not cond) and if you limit the arguments of the operator to booleans then you can enumerate all the possible arguments in the obvious way, and so long as it returns a result for all combinations of arguments (does not fail to halt in other words) and is deterministic then there are only two things you need to check:

does the operator produce the same result for all combinations of arguments ($2^n$ possibilities for $n$ arguments) as the existing one?
does the operator evaluate its arguments the same number of times as the existing one for all these combinations?

So, for instance, if takes three arguments (in Racket) and should evaluate the first exactly once, and the others at most once, as well as returning the correct value.
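The two checks can be sketched in a language without macros by passing the arguments as thunks (zero-argument functions), so that the operator decides which ones to evaluate. This Python version is only an illustration of the idea, not the Racket macro developed below; `run_counted`, `equivalent`, `real_if` and `broken_if` are names invented for the sketch.

```python
from itertools import product

def run_counted(op, values):
    """Call op with one thunk per value; return its result and the
    number of times each thunk was evaluated."""
    counts = [0] * len(values)
    def thunk_for(i):
        def thunk():
            counts[i] += 1
            return values[i]
        return thunk
    result = op(*(thunk_for(i) for i in range(len(values))))
    return result, counts

def equivalent(op1, op2, nargs):
    """Exhaustively compare op1 and op2 over all 2**nargs boolean inputs,
    checking both the results and the evaluation counts."""
    return all(run_counted(op1, values) == run_counted(op2, values)
               for values in product([False, True], repeat=nargs))

def real_if(test, then, otherwise):
    return then() if test() else otherwise()

def broken_if(test, then, otherwise):
    # Wrong when test is true and then is false: otherwise gets evaluated.
    return (test() and then()) or otherwise()
```

With these definitions `equivalent(real_if, real_if, 3)` is true and `equivalent(real_if, broken_if, 3)` is false: the results agree everywhere, but the evaluation counts differ when the test is true and the consequent is false.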

Obviously such a check is not a full check of the operator — it does not tell you what it does with non-boolean arguments for instance. But I was interested in writing the check largely because it’s clearly a reasonably hairy macro which I know how to write in CL and wanted to see if I could write in Racket (I’m not very likely to teach people Lisp again).

What the macro needs to do

The idea is that to compare two boolean operators o1 and o2 which take n arguments you need to generate code which looks like this:

(for/and ([c (expt 2 n)])
  (let ([a1 (bitwise-bit-set? c 0)] ...)
    (let ([o1c1 0] ...)
      (let ([o2c1 0] ...)
        (and (eq? (o1 (begin (set! o1c1 (+ o1c1 1)) a1) ...)
                  (o2 (begin (set! o2c1 (+ o2c1 1)) a1) ...))
             (= o1c1 o2c1) ...)))))

So a1 is the first argument, o1c1 counts how many times o1 evaluates it, and o2c1 counts how many times o2 evaluates it, and so on. I decided to compare the operators with eq? rather than eqv? for no very good reason except that it works for operators whose results are booleans, which is what I was interested in. I should almost certainly use eqv? I think — certainly the -equivalent in the name would imply that — but I’m not.

It’s clear that a loop like that checks all of the $2^n$ possibilities for the arguments, where each argument can be either #f or #t only. So this does an exhaustive check of all the possibilities, and provided o1 and o2 are deterministic and halt on all their arguments it will tell you whether they are equivalent.

And finally, this must be written as a macro, because the operators it is testing are themselves not generally functions: in particular things like if and or are obviously themselves not functions.

Things I did not know how to do

The big thing I didn’t know how to do here was to make up new identifiers: all the counters need to be created, and possibly also the argument names. In CL you’d do this with make-symbol or gensym or something like that. Assuming I want to use syntax-case rather than writing a CL-style construct-the-form-with-backquote-and-use-datum->syntax macro (which I very much do want to do) then there are two problems:

(1) constructing the names of the counters;
(2) making them available as pattern variables.

Well, (2) is easy: you can use nested syntax-cases, or equivalently but much more prettily, with-syntax to bind the pattern variables. And it turns out that with-syntax is willing to do a lot of work on your behalf: if you give it something which is not a syntax object it will massage it into one for you. So, in particular, this works:

(with-syntax ([(o1c ...) (list ...)])
  ...)

It takes the list it is given, turns it into a syntax object (with datum->syntax I suppose) and then does the matching. So you can be really lazy here: all you need to invent is a list of identifier syntax objects, and with-syntax will do the rest, making the program a lot less noisy. This is a really neat feature, although it might lead you to get confused about what is, and what is not, a syntax object I suppose. Anyway, I used it ruthlessly.

So this leaves (1). You could obviously do this with something like (datum->syntax ctx (string->symbol (format ...))), but Racket provides a nice shorthand for that in the form of format-id: (format-id ctx "~a-count" v) will construct an identifier syntax object from v using ctx as lexical context. And it will do the appropriate magic if v is an identifier syntax object: extract the symbol from it and use it as the argument to format in the appropriate way.

So it looks pretty straightforward to construct lists of identifiers and bind them to pattern variables. The final thing that confuses me is what lexical context to use for the identifiers. The macro should be hygienic, which means they can’t have the context of the syntax object it is working on, but I think can have more-or-less any other context where they have no existing meaning: I just invented an object for them, which I think is safe, although I am a bit confused about this.

What users see

I spent a really long time stuck on what the syntax of the macro should be: this is entirely stupid because it just does not matter that much. The reason I got stuck is that it would matter if this was a real library and I am constitutionally incapable of writing things without worrying about that kind of thing. Eventually I decided that it would be best if the user provided the argument names as a list, because they generally make sense to users and because I didn’t want to get into something which looked as if you could pass it an integer when in fact what it needs is a literal integer. So I decided on a syntax like this:

(boolean-operators-equivalent? o1 o2 (a1 ...))

So, for instance:

(boolean-operators-equivalent? if my-if (test then else))

I still don’t really like this; but I’m just playing so, well, it will do.

Additional cleverness

I wanted to report syntax errors in a reasonable way: apparently the proper way to do this is using syntax-parse but I am not ready to understand that yet, so I used wrong-syntax and the current-syntax-context parameter to get reasonable-looking errors.

I thought it would be nice to be able to report failures of equivalence, so there is a parameter which controls that and the expansion of the macro includes a check for the parameter and prints the failed cases if it’s true. All this happens at run time (phase 0) of course.

The macro itself

So, finally, here it is.

(require (for-syntax (only-in racket/syntax format-id
                              current-syntax-context wrong-syntax)))

(define boe-report-failure? (make-parameter #f))

(define-syntax (boolean-operators-equivalent? stx)
  ;; Given the names of two boolean operators and a list of argument
  ;; names, expand to a form which tests that they are equivalent, by
  ;; evaluating them with arguments bound to all the combinations of #t
  ;; and #f, and also checking that they evaluate the same arguments
  ;; in each case.
  ;;
  (parameterize ([current-syntax-context stx])
    (syntax-case stx ()
      [(_ o1 o2 (v ...))
       (let* ([vars (syntax->list #'(v ...))]
              [nvars (length vars)])
         ;; This check could be a guard, but we need the bindings
         ;; anyway, so.
         (for ([var vars])
           (unless (identifier? var)
             (wrong-syntax var "not an identifier")))
         ;; vars is now a list of identifiers, and nvars is how many
         ;; there are.  We need to construct syntax for check
         ;; variables for each var and operator, as well as
         ;; construct 2^n and a list of bit numbers.  This is being
         ;; fairly fast and loose: it turns out that various things
         ;; get automagically converted into syntax objects, and I
         ;; have not cared about the context for numbers (what is
         ;; it?).  In general I am a bit confused about what the
         ;; context should be here, but it clearly should *not* be
         ;; stx.
         ;;
         (with-syntax ([(o1c ...) (for/list ([v vars])
                                    (format-id #'boe "~a-1-eval-count" v))]
                       [(o2c ...) (for/list ([v vars])
                                    (format-id #'boe "~a-2-eval-count" v))]
                       [2^n (expt 2 nvars)]
                       [(b ...) (for/list ([i nvars]) i)])
           ;; And now just write the pattern we want.  '...' is pretty
           ;; clever, it turns out
           #'(for/and ([c 2^n])
               (let ([v (bitwise-bit-set? c b)] ...)
                 (let ([o1c 0] ...)
                   (let ([o2c 0] ...)
                     (or (and (eq? (o1 (begin (set! o1c (+ o1c 1)) v) ...)
                                   (o2 (begin (set! o2c (+ o2c 1)) v) ...))
                              (= o1c o2c) ...)
                         (begin
                           (when (boe-report-failure?)
                             (eprintf "Not equivalent:~% ~a~% ~a~%"
                                      (list 'o1 `(,v ,o1c) ...)
                                      (list 'o2 `(,v ,o2c) ...)))
                           #f))))))))]
      [else
       (wrong-syntax #'else "expecting o1 o2 (a1 ...)")])))

To my astonishment, this worked pretty much first time (it did not initially have the wrong-syntax stuff, but this was easy compared to the rest of it):

> (define-syntax-rule (if/broken test then else)
    (or (and test then) else))
> (boe-report-failure? #t)
> (boolean-operators-equivalent? if if/broken (test then else))
Not equivalent:
 (if (#t 1) (#f 1) (#f 0))
 (if/broken (#t 1) (#f 1) (#f 1))
#f

The macro, complete with some tests and other infrastructure can be found here⁴.

Notes and queries

I still don’t know whether this is really idiomatic Racket, although I am reasonably happy that I understand what is going on. There are a couple of things I am not sure about:

is the context for the count variables right? I think it is, but I am not sure;
the macro relies heavily on Racket’s extremely smart behaviour with ... — I am still unclear just how smart this is and whether I am relying on things which are not actually specified to happen;
similarly it relies on with-syntax being willing to convert things to syntax objects for you, which I am not sure is safe.

However, even with these worries, I think it’s pretty clear that Racket macros are significantly nicer than CL macros, if also significantly more opaque.

I am going to use ‘Lisp’ to mean ‘Lisp-family’ from now on. This is not meant to denigrate Scheme — this post is about Racket, after all — I just need a term which is not too clumsy. ↩
Of course, programs in other languages often do end up reasoning about themselves: people end up writing little languages all the time. But you only have to look at most examples of this sort of thing to realise how far ahead Lisp is: I’m currently having to deal with a system whose configuration files are in a mutant version of Windows ini file syntax, with a preprocessor which is entirely unaware of that syntax, and an entire other language which lives in strings in the base language. The preprocessor does not know about the string syntax so it pokes down into this inner language as well. I’d like to say that Greenspun’s tenth law applies, but that would imply a level of sophistication entirely missing in this horrible thing: all I want to do is leave this job and never think about it again. ↩
Macros were often used to inline code in the days of primitive compilers of course, but that’s a long time ago now. ↩
I may move it somewhere more permanent in due course, so bookmark this at your peril. ↩

The weakest passwords you can get away with

Tim Bradshaw — Wed, 14 Oct 2015 16:55:22 UT

Or: why password strength checkers are useless.

A lot of people work in environments where they have to change password every few months, and where there are restrictions on what passwords must look like. Here is how to deal with that, if you don’t care about security.

Pick two strings which are complicated enough to keep the password checker happy, which I’ll call $s_1$ and $s_2$. Remember them.
Also remember a two-digit count, starting from $00$.
The first password is $0s_10$, the second is $0s_20$, the third is $0s_11$, the fourth $0s_21$ and so on: each time you need to change passwords you swap between the two strings, and every other time you increment the count.

This gives you two hundred passwords, at the cost of remembering two strings and a two-digit count: if you have to change password every three months this will last you fifty years.
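The scheme is simple enough to write down. This is a sketch in Python: the function name `password` and the zero-based index `k` are conventions invented here, and the two strings in the example are placeholders.

```python
def password(k, s1, s2):
    """Return the k-th password (k = 0 is the first) in the scheme:
    alternate between s1 and s2, and bump a two-digit count every
    other change, wrapping the count's digits around the string."""
    if not 0 <= k < 200:
        raise ValueError("the scheme only provides 200 passwords")
    tens, units = divmod(k // 2, 10)
    s = s1 if k % 2 == 0 else s2
    return f"{tens}{s}{units}"

first_four = [password(k, "Xy!zzy", "Pl*ugh") for k in range(4)]
# first_four == ["0Xy!zzy0", "0Pl*ugh0", "0Xy!zzy1", "0Pl*ugh1"]

all_passwords = {password(k, "Xy!zzy", "Pl*ugh") for k in range(200)}
# len(all_passwords) == 200, so no hash ever repeats
```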

This works because the thing that is forcing you to change password can know two things:

the current and new passwords, in plain;
the hashes of all your previous passwords.

So what you need to ensure is that each password change changes enough to keep the checker happy, and that all the hashes are different. This algorithm achieves that, while also ensuring that you have to remember almost nothing. The count is wrapped around the strings just in case the checker is looking for things that look like they have trailing counts: you might need to obfuscate it in other ways if checkers get more clever¹.

Of course these passwords are terribly weak: if you know one of them you know half of them, and if you know any sequential pair you know all of them. But, if you don’t care about security but merely the appearance of security, you can use tricks like this.

Counting in hex or base 36 is a good trick: the only thing that matters is to have something you can easily remember and which changes each time. ↩

Greenspunning

Tim Bradshaw — Thu, 08 Oct 2015 15:16:56 UT

Three approaches to solving problems on computers.

When faced with a computational problem there are three common approaches:

write a program to solve the problem;
write a tool to solve the problem and other problems of the same kind;
write a programming language in which you can then write tools which solve problems of the same, and other, kinds.

Most people start by doing the first. Bradshaw’s corollary to Greenspun’s tenth law states:

for problems of size $s \ge s_1$, then, regardless of the initial approach, the final result is as if the third approach had been taken, even if this is not understood by the people solving the problem;
there is a problem size $s_0$ above which it is most efficient to take the third approach from the beginning;
$s_0 \lt s_1$.

What this means is that, if you have a sufficiently large problem ($s \ge s_1$) to solve then, whatever your intentions, you will inevitably end up creating a programming language as part of the solution. And there is a range of problems smaller than this ($s \in (s_0, s_1)$) for which the quickest way to solve the problem is to design and implement a programming language.

So, when approaching a problem, it is important to understand the values of $s_0$ & $s_1$ and how they compare to $s$. These values are hard to discover: a good trick is to start with a platform which makes $s_0$ very small and always take the third approach.

Fog computing

Tim Bradshaw — Thu, 23 Jul 2015 09:57:01 UT

Fog computing is like cloud computing except that no-one can see what you are doing.

A basket of eggs

Here is an interesting quote from the website of a company which provides an ‘enterprise content collaboration platform’:

80% of central government departments use [our system], making it the most trusted cloud-collaboration solution for UK government and public sector organisations.¹

There are several ways of understanding this.

What they want you to think. ‘Gosh, all these government people will be very fussy about security and extremely competent, and we’re a big corporate/government type place too: we should be using this product ourselves.’

What Dr. Evil is thinking. ‘80% of UK central government departments are using these people? That’s a lot of data that I am sure my customers would be willing to pay a great deal for, all in one place. Minions: to your keyboards!’

What President Evil is thinking. ‘80% of UK central government departments are using these people? That fool Dr. Evil is probably wasting a lot of effort trying to break in to sell me the data. Minions: buy that company for me!’

What the government is thinking. ‘Minions: another bottle! And send up another boy: I seem to have broken this one.’

The desert of the real

We all like to talk about ‘the cloud’ as if it is something new, but it isn’t: all it is is centrally-managed and outsourced storage and processing of our data. The only new thing about this is the outsourcing, and that’s not very new.

Central management holds out the hope of saving money and improving security, but means that there is a single point of failure: if the system fails then it fails for everyone, and if it is compromised then it is compromised for everyone. Information can also leak between regions which should be isolated from each other: in particular a hostile user who succeeds in compromising the system can obtain other users’ information.

Outsourcing means that small organisations or individuals don’t have to have expertise in data management but can rely on an external provider to do it for them. Large organisations may think they can save money by outsourcing and occasionally they can. Outsourcing means you are protected only by a contract and lose direct control over the system: this is fine so long as you are sure that the provider is honest, competent, and not subject to a malevolent legislative framework. Well, they may at least be honest.

The thing that makes the economics of cloud computing work is that there will be a relatively small number of relatively large specialist providers who can become really expert at providing these services and exploit economies of scale to make doing so cheap². Unfortunately this is also what makes cloud computing dangerous: if a lot of sensitive data is centralised in a small number of organisations this is like painting targets on the backs of those organisations. Anyone who is interested in that data — bad people, governments (are they different than bad people?) and competitors — will stand to gain enormously by compromising cloud providers.

Of course, they will tell you how secure they are, and imply that they can never be compromised like this. If you believe that you can stop reading now.

Obscured by clouds

So let’s assume that you don’t trust your cloud service providers and you care about your data: can you still make use of them? The answer is that you can in limited but, I think, still useful ways.

There are two assumptions that you must not make:

don’t assume the cloud provider is reliable — your data and any associated services can vanish at any time and that must not be catastrophic;
don’t assume the cloud provider can be trusted — assume that either they are themselves not trustworthy, or that they have been compromised, legally or illegally, and that anything you store or process there is visible to bad people as a result.

It’s fairly easy to deal with the first point: if the data might go away you need to make sure that you have other copies of it, and ideally copies that you have full control over. Similarly with services: make sure you can survive if things go away.

The second case is harder. If you can’t trust your provider what use are they? Well, still some use. In particular, if all the data that you store on the cloud is encrypted and the encryption keys are not available to the provider then, even if bad people get access to this data there is rather little that they can do with it: it’s just a huge blob of meaningless bits to them. To decrypt the data they must attack your systems, where the encryption keys are held.

Encrypting data like this fairly seriously limits what can be done with the data in the cloud: in fact all that can be done with it is to ship it to and from clients and store it in the meantime. No kind of processing which depends on the content of the data can be done at all on the provider’s systems. For many purposes this is a less crippling restriction than it seems: globally-available storage is quite a useful thing to have, in its own right.

For instance, a government agency might want to keep sensitive documents in the cloud: it can do this quite happily so long as the documents are always encrypted before they leave the client with keys which also never leave the client. To edit a document it is fetched, decrypted, edited and encrypted again on the client, and then sent back to the cloud³.
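The fetch, decrypt, edit, encrypt and send-back cycle can be sketched as follows. This is a toy in Python under loud assumptions: `UntrustedStore` and `Client` are invented names, the ‘cloud’ is a dictionary, and the cipher is a one-time pad (a fresh random key as long as the data, used once) chosen only because it needs nothing beyond the standard library. A real system would use an authenticated cipher from a vetted library and would have to manage keys properly.

```python
import secrets

class UntrustedStore:
    """Stands in for the cloud provider: it only ever sees opaque bytes."""
    def __init__(self):
        self.blobs = {}

    def put(self, name, blob):
        self.blobs[name] = blob

    def get(self, name):
        return self.blobs[name]

class Client:
    """Holds the keys; they never leave this object."""
    def __init__(self, store):
        self.store = store
        self.keys = {}

    def save(self, name, plaintext):
        # Toy one-time pad: a new key for every save, never reused.
        key = secrets.token_bytes(len(plaintext))
        self.keys[name] = key
        self.store.put(name, bytes(p ^ k for p, k in zip(plaintext, key)))

    def load(self, name):
        key = self.keys[name]
        return bytes(c ^ k for c, k in zip(self.store.get(name), key))

    def edit(self, name, change):
        # Fetch and decrypt, edit on the client, encrypt and send back.
        self.save(name, change(self.load(name)))

store = UntrustedStore()
client = Client(store)
client.save("memo", b"secret plan")
client.edit("memo", lambda text: text + b" v2")
recovered = client.load("memo")
```

Note that even in this toy the store still learns the name and length of each document and when it changes: the contents are hidden, the metadata is not.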

What a system like this can’t do, by design, is process data in the cloud in any way which depends on its content: if you want, say, a shared calendar with server-side appointment management then you can’t have it, because it requires the server to be able to see the content of the data.

The illusion of security

Cloud service providers are very anxious to tell you how secure they are: they will use terms like ‘encrypted at rest’, ‘AES-256’, and ‘military-grade security’, all of which signify nothing. There are only two questions that matter:

do they have the keys to whatever encryption system they are using?
if they do, are you and they the same person?

If the answer to the first of these is yes, then the answer to the second must also be yes: if it’s not then you should not trust them. Yes, they might mean well, and they might even be competent, but even if they are they can be subject to attacks which they will not be able to withstand: when the people who won’t say who they work for come calling with their bit of paper then the keys will be handed over and they won’t tell you that this has happened.

The only way that your data is safe is if you put it in a box to which only you have the key⁴, and that means that you must encrypt it with keys you control and live with the consequences of that.

In the fog

Fog computing is more-or-less this: it is the use of cloud-based shared storage to share data which is encrypted and decrypted only on the client, providing the possibility of real security rather than the illusion of it that cloud providers currently offer.

One good thing about fog computing is that you can implement it yourself: you do not need to rely on a provider offering the service. A tool which encrypts data on the client can sit on top of any kind of cloud storage provider. This is, indeed, beginning to happen: there are backup tools (notably Arq) which do this client-side encryption and can indeed be configured to sit on top of many different cloud storage providers.

However even encrypting the data like this is not really enough. The bad people can still look at your patterns of access and (if you are not careful to obscure it) metadata such as file names and deduce more than you would like: for instance they can work out who you talk to by noticing who else accesses your data, and so on. This can be avoided by obfuscating these access patterns but it is much harder to do. But just encrypting the data with keys you control is a big step in the right direction.

Fog computing is inherently limited: since the data in the cloud is entirely opaque, no useful computation can be done with it there. You can not have shared calendars with conflict detection in the cloud, you can not edit documents which live entirely in the cloud, and so on. But it is, or can be, secure, and if you care about security this is what you should be doing.

The quote is current at the time of writing, but edited to remove names. ↩
If you are a large enough organisation to get computers custom-made to your designs then you can make them very cheap, and some cloud providers do just that. Almost all of them will be building custom datacentres. ↩
Documents which are not sensitive or which should be public can of course be left in plain text in the cloud. ↩
And even then the shabby people with their bits of paper and police escort can come calling, but at least you will know they have called, which is the best you can hope for. ↩

Contracts

Tim Bradshaw — Sat, 14 Mar 2015 15:52:45 UT

Do not eat the free lunch: it has probably been poisoned.

On 2015–03–12, Google announced the closure of Google Code, the latest in a succession of services they have switched off over the last few years. This is a perfectly reasonable thing for them to do: they are a commercial organisation and need to focus on the things that make them money — selling advertising and acquiring as much personal data as possible from users of their services to help them do that — and hosting source code repositories is probably not a very efficient way of scraping such data off people.

So there is no reason to complain about this, however annoying it is: it was a service that was being offered for free, after all. But of course, a number of people will be significantly inconvenienced when things like this go away because they have come to rely on them, either personally or as part of their business: this turns out not to have been the smartest idea. The interesting question is whether they will learn from the experience and what they’ll do to stop it happening again.

Too cheap to meter

The cost of many things related to computers and networking has fallen dramatically over time, and continues to fall. We’ve also found out that more things are related to computers and networking than we realised: music, still and moving images, books and so on. In particular the marginal cost — the cost of making an additional copy of something — has often become extremely low because the cost of storing and moving data around has become very low indeed.

It’s quite tempting to think that ‘very small’ is the same as ‘zero’¹, but this is a fatal mistake: if it costs nothing to do something then it costs nothing to do an arbitrary amount of it, while if it merely costs a very small amount then you can make the cost arbitrarily large by doing enough of it. If something with a non-zero cost, however small, is given away for no cost then the giver is in a dangerous situation: nothing is too cheap to meter unless it is free², and nothing is completely free. So if an organisation is giving away a service ‘for free’ there is reason to be suspicious: either things are what they seem, in which case they are going to run out of money at some point and disappear, or things are not what they seem.

If things are what they seem there’s a fairly obvious problem: you probably don’t want to build anything substantial around a service which is inevitably going to evaporate when the organisation providing it falls off a cliff.

Things are more interesting when they are not what they seem: how is the organisation making money if they’re providing something for free?

The first hit is free

One approach is the one traditionally used by people who sell recreational drugs: you get a free taste of the service, but the taste will be limited in ways which make it annoying to use and probably will prevent you from doing some things altogether. Eventually, all being well, you become both dependent on whatever it is they are pushing and frustrated with the limitations of the free version and decide to pay for the unrestricted version.

There is nothing very wrong with this approach: you’re getting something for free, after all: just not what you really wanted. And you have the option of paying for that if you choose to: that’s what the supplier wants you to do, after all. This is not, however, a very good long-term solution: the supplier could always simply stop offering the limited version or, worse, stop offering any version at all.

The place where there is no darkness

Another approach is one you might associate with a person wearing suspiciously well-cut clothes that you once met late at night at a crossroads somewhere in the deep south. Now you can play the guitar pretty well, but can you remember just what it was that you bargained for your new talent and when the debt will become due?

This is not the sort of bargain you want to make³. But it is exactly this sort of bargain on which a lot of large companies have built their businesses: they provide you with some service, and in return you provide them with your soul, which they then package with a lot of other souls and sell on to you know not whom. They’re not, in fact, interested in providing the service: they’re in the soul collection and resale business.

A lot of people quite clearly think this is all just fine. They’re quite happy to trade their souls for an endless set of distractions: perhaps the point of the distractions is so they don’t realise just what it is they’ve lost and what exactly it was they gained in return if anything; or perhaps they have souls which are not very valuable and the bargain is a perfectly reasonable one, after all.

There is worse. When you met someone late at night to make this sort of bargain, you made very sure that you got a bit of paper with signatures on it detailing just exactly what the deal was⁴. That’s not how the deals that are made so willingly now work: you get something momentarily useful or amusing, and in return you irrevocably give away something of yourself, and that’s as far as it goes. If, later, it becomes convenient for the entity you did the deal with to stop providing whatever entertainment it was, then one day it simply goes away and you have bargained your soul for air and darkness, and precious little of that.

Better living through chemistry

The answer is quite conventional. If there is something you want and on which you might come to rely, then you sign a contract for it: a document which obliges you to pay for it, and in return obliges the provider to actually provide the service.

Contracts really do three things.

They make it clear what exactly is being bought and sold, and avoid the ‘too cheap to meter’ fallacy I talked about above: the contract should detail what you get and what the limits on it are — how much bandwidth or storage you can use for instance — and what you are paying for it, which should generally not be ‘your immortal soul’.
They ensure that the interests of the consumer and the provider are the same, or at least similar: the consumer wants a service or a product that works well, and the provider gets paid if they provide that.
They specify what happens if the contract is terminated: what the responsibilities of each party are and what they are not. For instance the organisation providing your cloud storage might be obliged to give you a way to recover your data.

The second point is particularly important: for a contract to be of any use at all both parties have to get something out of it: you can sign a contract with someone to provide you some service for free, but if they decide to stop doing that what are you going to do — perhaps you could ask them for your money back?

But, well, this is a very conventional and rather boring answer: surely we all live in a future where all this awful tedium is no longer needed. Wasn’t the internet meant to do away with all that? What happened to the gift economy? Are there no flying cars, after all? Sadly, no, the internet didn’t change all that: it simply enabled a collection of large corporations with toxic business models to fool a really large number of people. There are no flying cars.

On 2015–07–16 SourceForge fell over: perhaps it will recover, this time. Once upon a time it was the bright future of source code hosting: who knows what will be lost when it finally goes away?

It is particularly tempting to people who want to make the argument that ‘no harm is done to the artists if I just download this song, because it costs nothing for them to deliver an extra copy: they have already been paid’. I am not sure if this argument is ever made in good faith, but it’s very easy to see that it doesn’t work by reductio ad absurdum: what would happen if everyone made it? However I don’t want to get sidetracked by that here. ↩
‘Metering’ may be simply restriction of supply: for instance a limit to the amount of data you can transfer, which may not seem like metering although it is. In the limiting case the limit may be the physical capacity of the system: you can only transfer so much data per month over a link with a given bandwidth. I suspect that the original ‘too cheap to meter’ claim was made based on this assumption for domestic electricity usage (if it was ever really made at all). ↩
Well, perhaps it is a bargain worth making, but probably not in exchange for anything related very closely to computers. ↩
Perhaps in the hope of later renegotiation, although it generally seemed to turn out that the counterparty had rather better negotiation skills than you and, obviously, expensive lawyers with dead eyes. ↩

Rumours of my death

Tim Bradshaw — Sun, 01 Feb 2015 20:54:34 UT

When I first used Lisp, the common refrain was that Lisp was dead.

There was a single free implementation of CL (which required you to physically sign a license of some kind and return it, in exchange for a tape) which was deficient in many respects. The two or three commercial implementations cost about a year’s salary each. Enormous effort had been spent on implementations which ran on special hardware. One variant of these cost more than your house: the other rather less, but turned out to have been implemented by the fey — you seriously did not want to spend too much time with it if you did not want problems involving having your firstborn somehow changed into a strange and somehow absent creature.

(And there was a terrible, unspeakable truth about even the expensive hardware: the people who implemented it didn’t understand computer performance very well with the result you would expect. The systems were faster than a VAX, but everything was faster than a VAX, including some PDP–11s. A Sun 3/260 ate them alive, and you could buy several of those for the cost of a house, with bundled licenses.)

Performance was pretty grim: of course nothing was fast on machines that, on a good day, could execute a few million instructions a second, but Lisp implementations were problematic at best. You spent a lot of time turning recursive code into iterative code by hand and writing macros (no inlining) to get performance to be reasonable and worrying about the primitive garbage collectors.

There was no standard: existing implementations differed in basic details like error handling (not in the aluminium book) and a standard object system was a distant dream. The news from the standards committee was ominous: the special-hardware people were exerting pressure and there were serious worries that the object system would not be efficiently implementable on stock hardware. The language was going to be huge.

Standard or semi-standard libraries were not really thought of.

Everyone knew Lisp was dead: the coming thing was, perhaps, Scheme — tail-call elimination in the language, a small language (yet MIT Scheme somehow had a bigger footprint than the CLs we used) — or C++ or some functional language whose name no-one now remembers. But Lisp was dead: no question about it.

Fast forward.

I have two high-quality CL implementations on my machine and one Scheme-derived system, also of very high quality, which created this blog: I have long ago stopped counting the number of good-quality free implementations. One of the implementations I use is commercial: the annual support is about 10% of my monthly rent. I can run dozens of instances of each without the machine noticing, and I could happily run a full CL development system on a system less powerful and smaller than my phone. Performance is a solved problem: yes, highly-optimised code is, perhaps, slower than optimised C or Fortran but since almost all performance problems are design problems no-one older than about 19 cares any more. CL has an advanced, performant and standard object system and, in effect, a standard metaobject system as well. The library problem has been solved by Quicklisp and a large number of good-quality standard libraries. I am still using code I wrote over twenty-five years ago with essentially no modification: meanwhile the Python code I wrote ten years ago is long rendered obsolete by gratuitous changes in the language (the Perl code I wrote at the same time is doing fine, however).

And yet still the cry goes up: Lisp is dead; Lisp is dead.

Macros in Racket, part two

Tim Bradshaw — Wed, 28 Jan 2015 19:31:18 UT

The second part of my notes on writing macros in Racket.

This is the second part of at least three: the first part is here, and the third part is here. This won’t make much sense unless you’ve read the first part. As before I make no claims to be an expert in Racket’s macro system although I am familiar with Lisp macros in general: this is just some more notes I wrote while learning it.

The unwashed Lisp hacker’s version of `collecting`

So, we can write clet: can we write collecting? Yes, we can:

(require (for-syntax racket/list))

(define-syntax (collecting stx)
  (datum->syntax
   (quote-syntax collecting)
   `(let ([r '()])
      (define (,(datum->syntax stx 'collect) it)
        (set! r (cons it r)) it)
      ,@(rest (syntax->list stx))
      (reverse r))))

This works because, in the internal definition of collect, we’ve intentionally given it a name which uses the context of the syntax object we’re transforming, not the context of the macro. It’s easy to confirm that this works the way you would expect, and in particular that it’s safe in both directions: for instance

> (let ((reverse (λ (x) x)))
    (collecting (collect 1) (collect 2)))
'(1 2)

shows that the binding of reverse when the macro is called has not ‘infected’ the macro definition.

It seems as if that should be all you need, so long as you are careful about which context you choose and make sure that the ‘default’ context is the one from the macro, not the one from where it is used. In fact it isn’t, quite: see below. However even if it were, it’s clearly a pain to write macros this way.

Pattern matching

Pretty much all macros do two things:

deconstruct their arguments in some more-or-less complicated way, but almost always in a way which is significantly more complicated than anything that needs to be done for the arguments of a function;
construct a form which is the result of the macro and which, again, may be complicated.

The beauty of traditional Lisp macros is that since the arguments and results of the macro were just what the reader spat out — lists and symbols and so on — and since Lisp was kind of good at doing things to these structures as it was designed for that, and finally since the whole power of the language was available in the macro, this was not horrible even without special tools, although it was not particularly pleasant for complicated macros.

Hygienic macros make this much less pleasant because the objects that need to be deconstructed and constructed are now opaque syntax objects, and there is additional worrying about context to do. The answer to this is to provide special tools which do the boring bits for you: this makes everything simpler, at the cost of making it still more opaque what is actually happening. In almost all cases that’s a tradeoff worth making. Pattern matching is also a fashionable thing amongst the young and hip, of course.

The way this is done in Racket is via syntax-case, its slightly simpler friend syntax-rules, and by syntax and variants on it.

syntax-case takes a bit of syntax and matches it against patterns, binding matches, which can then be used in syntax forms lexically within it to return syntax objects, whose context is that of the syntax-case form (so hygienic). There is syntactic sugar for syntax: (syntax ...) can be written #'... in the same way that (quote ...) can be written '.... There is also quasisyntax which works the same way as quasiquote, except that the various unquoting things are preceded with #. quasisyntax, unsurprisingly, also has syntactic sugar coating: (quasisyntax ...) can be written #`....
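As a quick check of the sugar just described, here is a minimal sketch that can be run as a module. The names s1, s2, args and s3 are purely illustrative and do not come from the original notes.

```racket
#lang racket
;; (syntax ...) and #'... are the same form, just as (quote ...) and '... are
(define s1 (syntax (+ 1 2)))
(define s2 #'(+ 1 2))

;; quasisyntax (#`) with unsyntax (#,) and unsyntax-splicing (#,@)
(define args (list #'2 #'3))
(define s3 #`(+ 1 #,(car args) #,@(cdr args)))

(syntax->datum s1) ; '(+ 1 2)
(syntax->datum s2) ; '(+ 1 2)
(syntax->datum s3) ; '(+ 1 2 3)
```

syntax->datum is used here only to strip the syntax objects back to plain lists so the results can be inspected.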

I’m not going to describe the patterns in any detail, largely because I only understand the simple cases. However the simple cases are relatively easy to understand and pleasant to use.

Once a case has matched in syntax-case the corresponding expression is evaluated, and its value is the value of the form. Generally that wants to be a bit of syntax.

The first important thing to understand is that syntax is not quote-for-syntax: it interpolates things which matched in a lexically surrounding syntax-case, if there is one (if there isn’t, then I think it is quote-for-syntax).
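A small sketch of that difference, using syntax-case in an ordinary function rather than a macro (the function name swap is an assumption made for illustration):

```racket
#lang racket
;; inside syntax-case, names bound by the pattern are substituted into #'...
(define (swap stx)
  (syntax-case stx ()
    [(a b) #'(b a)]))

(syntax->datum (swap #'(1 2)))   ; '(2 1)

;; with no surrounding syntax-case, a and b are just identifiers
(syntax->datum #'(b a))          ; '(b a)
```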

The second important thing to understand is that syntax-case and syntax turn Racket into a sort of bodged Lisp–2: the things matched by syntax-case can be used only in syntax forms. But it’s not actually a separate namespace, because if you refer to them outwith such a form you get a compile-time error. I don’t know why this is — perhaps to avoid accidentally naming matches outside a syntax form — but it is certainly annoying.

So, here are some examples.

A simple while form:

(define-syntax (while stx)
  (syntax-case stx ()
    [(_ test body ...)
     #'(let loop ()
         (when test
           body ...
           (loop)))]))
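To see this while in use, and to see that the loop it introduces does not collide with a user's own loop, here is a hedged sketch. The helper functions count-up and sum-with-own-loop are invented for the example.

```racket
#lang racket
(define-syntax (while stx)
  (syntax-case stx ()
    [(_ test body ...)
     #'(let loop ()
         (when test
           body ...
           (loop)))]))

;; ordinary use
(define (count-up n)
  (let ([i 0] [acc '()])
    (while (< i n)
      (set! acc (cons i acc))
      (set! i (+ i 1)))
    (reverse acc)))

;; the user's own variable called loop is not captured by the macro's loop
(define (sum-with-own-loop)
  (let ([loop 10] [i 0] [total 0])
    (while (< i 3)
      (set! total (+ total loop))
      (set! i (+ i 1)))
    total))

(count-up 3)          ; '(0 1 2)
(sum-with-own-loop)   ; 30
```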

A simple implementation of let, leaving out the named-let case, which shows how good the pattern matching is:

(define-syntax (with stx)
  (syntax-case stx ()
    [(_ ([var val] ...) body ...)
     #'((λ (var ...) body ...) val ...)]))

A better implementation which deals with the empty body case ((λ (...)) is illegal in Racket) and also optimises a simple case:

(define-syntax (with stx)
  (syntax-case stx ()
    [(_ () body ...)
     ;; no vars: trivial case
     #'(begin body ...)]
    [(_ ([var val] ...))
     ;; null body: make sure vars are evaluated
     #'(begin val ... (void))]
    [(_ ([var val] ...) body ...)
     #'((λ (var ...) body ...) val ...)]))
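A few uses of this version of with, one per clause, as a sketch. The variable evaluated? exists only to show that the values are still evaluated when the body is empty.

```racket
#lang racket
(define-syntax (with stx)
  (syntax-case stx ()
    [(_ () body ...)
     ;; no vars: trivial case
     #'(begin body ...)]
    [(_ ([var val] ...))
     ;; null body: make sure vars are evaluated
     #'(begin val ... (void))]
    [(_ ([var val] ...) body ...)
     #'((λ (var ...) body ...) val ...)]))

(with ([a 1] [b 2]) (+ a b))      ; 3
(with () 'no-bindings)             ; 'no-bindings

(define evaluated? #f)
(with ([a (set! evaluated? #t)]))  ; null body: the value is still evaluated
evaluated?                         ; #t
```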

One thing which syntax-case allows is the notion of literal names which must occur in the source. So for instance let’s say I wanted to write some mutant loop macro whose syntax was (loop for x in y do ...): where for, in, do are literals. Well, I can write something to match this:

> (define-syntax (loop stx)
    (syntax-case stx (for in do)
      [(_ for v in l do body ...)
       #'(for ([v (in-list l)]) body ...)]))
> (loop for x in '(1 2 3) do (print x))
123
> (loop with x in '(1 2 3) do (print x))
loop: bad syntax in: (loop with x in (quote (1 2 3)) do (print x))

The syntax object that corresponds to stx here is the whole form: the equivalent to CL’s &WHOLE. It’s almost never necessary to worry about the car of this since it will obviously be loop. However I’m always tempted to provide it as a literal.

syntax-rules is (almost: there is some complexity I think) a wrapper around syntax-case which provides the function wrapper for it and which implicitly wraps the right hand side of the cases, which must be just one form, in a syntax form. So the above definition of with could be written:

(define-syntax with
  (syntax-rules ()
    [(_ () body ...)
     ;; no vars: trivial case
     (begin body ...)]
    [(_ ([var val] ...))
     ;; null body: make sure vars are evaluated
     (begin val ... (void))]
    [(_ ([var val] ...) body ...)
     ((λ (var ...) body ...) val ...)]))

syntax-rules can be defined something like this (this is due to bmastenbrook):

(require (for-syntax 
          (rename-in racket 
                     [syntax-rules racket:syntax-rules])))

(begin-for-syntax
  (define-syntax syntax-rules
    (racket:syntax-rules ()
      [(_ literals (pattern expansion) ...)
       (lambda (s)
         (syntax-case s literals
           (pattern #'expansion) ...))])))

define-syntax-rule combines define-syntax and a single rule for syntax-rules. I think it might be equivalent to this:

(define-syntax define-syntax-rule
  (syntax-rules ()
    [(_ (name pat ...) expansion)
     (define-syntax name
       (syntax-rules ()
         [(name pat ...) expansion]))]))

although I am probably missing some complexity here.

There is a useful variant on syntax-case called with-syntax: it looks more like a let-style thing, and all the patterns in the clauses must match, at which point all the pattern variables are bound.
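A minimal sketch of with-syntax on its own, assuming a made-up macro define-getter which builds a new name at expansion time and binds it as a pattern variable:

```racket
#lang racket
;; define-getter is a hypothetical example macro:
;; (define-getter answer 42) defines a function get-answer
(define-syntax (define-getter stx)
  (syntax-case stx ()
    [(_ name val)
     (identifier? #'name)
     ;; getter is bound like a pattern variable for the #'... below
     (with-syntax ([getter (datum->syntax
                            #'name
                            (string->symbol
                             (format "get-~a" (syntax-e #'name))))])
       #'(define (getter) val))]))

(define-getter answer 42)
(get-answer) ; 42
```

The new identifier is given the context of name so that it is visible where the macro is used.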

So, what about our desirable macros?

collecting is pretty easy. Here are two different versions. The first uses quasisyntax:

(define-syntax (collecting stx)
  (syntax-case stx ()
    [(_) #'(void)]
    [(_ body ...)
     #`(let ([r '()])
         (define (#,(datum->syntax stx 'collect) it)
           (set! r (cons it r)) it)
         body ...
         (reverse r))]))

The second uses with-syntax:

(define-syntax (collecting stx)
  (syntax-case stx ()
    [(_) #'(void)]
    [(_ body ...)
     (with-syntax ([collect (datum->syntax stx 'collect)])
       #'(let ([r '()])
           (define (collect it)
             (set! r (cons it r)) it)
           body ...
           (reverse r)))]))

This is pretty nice, I think. Note that you could not do this with syntax-rules, or at least I can’t see how to do it: syntax-rules is quite a lot less general than syntax-case.
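For completeness, a short usage sketch of the with-syntax version. The function evens-below is invented for the example.

```racket
#lang racket
(define-syntax (collecting stx)
  (syntax-case stx ()
    [(_) #'(void)]
    [(_ body ...)
     (with-syntax ([collect (datum->syntax stx 'collect)])
       #'(let ([r '()])
           (define (collect it)
             (set! r (cons it r)) it)
           body ...
           (reverse r)))]))

;; collect can be called from anywhere inside the body
(define (evens-below n)
  (collecting
    (for ([i n])
      (when (even? i) (collect i)))))

(evens-below 6) ; '(0 2 4)
```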

clet is harder, because each element of the binding list may be either an identifier or a two-element list. If we insisted on a two-element list it would be easy (see above). Here is the best I can do:

(require racket/undefined)

(define-syntax (clet stx)
  (syntax-case stx ()
    [(_ ()) #'(void)]
    [(_ () body ...) #'(begin body ...)]
    [(_ (b ...) body ...)
     (let-values ([(vars vals)
                   (for/lists (as vs) ([binding (syntax->list #'(b ...))])
                     (syntax-case binding ()
                       [(var val) 
                        (identifier? #'var)
                        (values #'var #'val)]
                       [var
                        (identifier? #'var)
                        (values #'var #'undefined)]
                       [_ (raise-syntax-error #f "bad binding" stx)]))])
       #`((λ #,vars body ...) #,@vals))]))

Well, this is still quite hairy, but almost all of the hair involves processing the binding list, which is done using syntax-case again, using an additional feature of it whereby it can use a ‘guard’ expression to decide whether a clause matches: identifier? returns true if a syntax object refers to an identifier. I think there must be a way of using with-syntax to avoid the quasisyntax form.

Even with all this hair, this version of clet is far easier to read than the previous one, and not harder to read than the CL equivalent.

A better version of clet would, I think, need a proper parser for syntax. I think that is what syntax-parse is, although I have not investigated that.
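As a rough sketch of what that might look like, and not something taken from the original notes, here is clet written with syntax-parse. The syntax class name binding and its attributes var and val are assumptions made for illustration, and this version insists on a non-empty body.

```racket
#lang racket
(require (for-syntax syntax/parse) racket/undefined)

(begin-for-syntax
  ;; a binding is either a bare identifier or a two-element list
  (define-syntax-class binding
    #:attributes (var val)
    (pattern var:id #:with val #'undefined)
    (pattern (var:id val:expr))))

(define-syntax (clet stx)
  (syntax-parse stx
    [(_ (b:binding ...) body ...+)
     #'((λ (b.var ...) body ...) b.val ...)]))

(clet ([a 1] [b 2]) (+ a b)) ; 3
(clet (x [y 2]) y)           ; 2
```

The parsing of each binding moves into the syntax class, so the macro itself is reduced to a single template.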

Macro composition

As mentioned above, we don’t yet have quite all the tools we need to write some kinds of macros: specifically macros which are intentionally slightly unhygienic, such as collecting. As an example, let’s suppose we wanted a general purpose, intentionally-unhygienic, with-abort macro which provided an abort function which would, well, abort. Without thinking too hard about the implications of call/cc we could write this as:

(define-syntax (with-abort stx)
  (syntax-case stx ()
    [(_ body ...)
     #`(call/cc (λ (#,(datum->syntax stx 'abort))
                  body ...))]))

So now (with-abort (abort 2) (end-the-world)) returns 2 and does not end the world.

Well, we might want to use this macro in another macro:

(define-syntax-rule (while/abort test body ...)
  (with-abort
    (let loop ([r test])
      (when r
        body ...
        (loop test)))))

Now something like the following will work:

> (let ([x 0])
    (while/abort (< x 10) (set! x (+ x 1)) (print x)))
12345678910

But the whole point was to be able to use abort in the body, and that doesn’t work:

> (let ([x 0])
    (while/abort (< x 10) (set! x (+ x 1)) (when (> x 1) (abort 'done))))
abort: undefined;
 cannot reference an identifier before its definition

Oh, dear. The problem here is that while/abort is hygienic, so the abort binding that is introduced by with-abort is not visible in the body.

We could fix this by better design:

(define-syntax-rule (with-named-abort (abort) body ...)
  ;; a better macro
  (call/cc (λ (abort) body ...)))

(define-syntax (with-abort stx)
  ;; backwards compatible
  (syntax-case stx ()
    [(_ body ...)
     #`(with-named-abort (#,(datum->syntax stx 'abort)) body ...)]))

(define-syntax (while/abort stx)
  ;; the end result
  (syntax-case stx ()
    [(_ test body ...)
     #`(with-named-abort (#,(datum->syntax stx 'abort))
         (let loop ([r test])
           (when r
             body ...
             (loop test))))]))

But that’s not the solution we’re after.

Racket’s answer to this is syntax parameters. I don’t completely understand these, but they are at least close to dynamic variables, except at macro-expansion time. What you do is to define a syntax parameter, and then rebind it during the expansion: the rebound value is visible to macros which are expanded dynamically within the rebinding form. As with Racket’s ordinary special variables these look like functions (yet another namespace in disguise).

So we can define a syntax parameter called abort using define-syntax-parameter:

(require racket/stxparam)

(define-syntax-parameter abort
  (λ (stx)
    (raise-syntax-error #f "not available" stx)))

So now any reference to abort will result in a syntax error:

> (abort)
abort: not available in: (abort)
> abort
abort: not available in: abort

And we can now try to use syntax-parameterize, to rebind abort as a macro:

(define-syntax with-abort
  (syntax-rules (with-abort)
    [(with-abort) (void)]
    [(with-abort body ...)
     (call/cc
      (λ (a)
        (syntax-parameterize ([abort
                               (syntax-rules ()
                                 [(_ ...) (a ...)])])
          body ...)))]))

And this fails horribly, because the outer syntax-rules thinks it owns the patterns and sees ...s that it does not expect. So much for that.

Well, we could at least check this works with a specific number of arguments:

(define-syntax with-abort
  (syntax-rules (with-abort)
    [(with-abort) (void)]
    [(with-abort body ...)
     (call/cc
      (λ (a)
        (syntax-parameterize ([abort
                               (λ (stx)
                                 (syntax-case stx (abort)
                                   [(abort) #'(a)]
                                   [(abort x) #'(a x)]
                                   [_ (raise-syntax-error #f "I give up" stx)]))])
          body ...)))]))

But this is obviously just a rubbish answer.

Well, there is an answer to this: all we really need to do is to make the abort macro attach itself to a, and there is a special hack, make-rename-transformer, to do this:

(define-syntax with-abort
  (syntax-rules (with-abort)
    [(with-abort) (void)]
    [(with-abort body ...)
     (call/cc
      (λ (a)
        (syntax-parameterize ([abort (make-rename-transformer #'a)])
          body ...)))]))

And this now works:

> (with-abort (abort 1 2 3))
1
2
3
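The point of the exercise was composition, so here is a sketch of while/abort built on the syntax-parameter version of with-abort. The definitions repeat the ones above, except that the literal list is left empty and _ is used in the patterns; first-over is a made-up test function.

```racket
#lang racket
(require racket/stxparam)

(define-syntax-parameter abort
  (λ (stx)
    (raise-syntax-error #f "not available" stx)))

(define-syntax with-abort
  (syntax-rules ()
    [(_) (void)]
    [(_ body ...)
     (call/cc
      (λ (a)
        (syntax-parameterize ([abort (make-rename-transformer #'a)])
          body ...)))]))

;; a hygienic macro layered on with-abort: abort is still usable in the body
(define-syntax-rule (while/abort test body ...)
  (with-abort
    (let loop ([r test])
      (when r
        body ...
        (loop test)))))

(define (first-over limit)
  (let ([x 0])
    (while/abort #t
      (set! x (+ x 1))
      (when (> x limit) (abort x)))))

(first-over 3) ; 4
```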

And we can use this to write a really robust version of collecting:

(require racket/stxparam)

(define-syntax-parameter collect
  (λ (stx)
    (raise-syntax-error #f "not collecting" stx)))

(define-syntax collecting
  (syntax-rules ()
    [(_) (void)]
    [(_ body ...)
     (let ([r '()])
       (define (clct it)
         (set! r (cons it r)) it)
       (syntax-parameterize ([collect (make-rename-transformer #'clct)])
         body ...
         (reverse r)))]))
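A sketch of why this version is more robust: another hygienic macro can expand into collecting and the body can still use collect. The macro collecting-from is hypothetical and exists only for this example.

```racket
#lang racket
(require racket/stxparam)

(define-syntax-parameter collect
  (λ (stx)
    (raise-syntax-error #f "not collecting" stx)))

(define-syntax collecting
  (syntax-rules ()
    [(_) (void)]
    [(_ body ...)
     (let ([r '()])
       (define (clct it)
         (set! r (cons it r)) it)
       (syntax-parameterize ([collect (make-rename-transformer #'clct)])
         body ...
         (reverse r)))]))

;; a macro which expands into collecting: collect still works in its body
(define-syntax-rule (collecting-from seq var body ...)
  (collecting (for ([var seq]) body ...)))

(collecting-from '(1 2 3 4) x
  (when (even? x) (collect (* x x)))) ; '(4 16)
```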

As far as I can see there is still a problem, however: it is very hard to write macros which expand to other macros which themselves do pattern-matching, since the patterns get acquired by the outer macros. There must be some answer to this, but I can’t see what it is.
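One candidate answer, as far as the Racket documentation goes, is the ellipsis escape: in a template, (... ...) produces a literal ... rather than being treated as an ellipsis by the outer macro. The following is a sketch under that assumption, reusing the abort syntax parameter from above; the function demo is invented for the example.

```racket
#lang racket
(require racket/stxparam)

(define-syntax-parameter abort
  (λ (stx)
    (raise-syntax-error #f "not available" stx)))

(define-syntax with-abort
  (syntax-rules ()
    [(_) (void)]
    [(_ body ...)
     (call/cc
      (λ (a)
        (syntax-parameterize ([abort
                               ;; (... ...) becomes ... in the inner macro
                               (syntax-rules ()
                                 [(_ arg (... ...)) (a arg (... ...))])])
          body ...)))]))

(define (demo)
  (with-abort
    (+ 1 (abort 10))
    99))

(demo) ; 10
```

With this the inner syntax-rules gets its own ellipses, so abort accepts any number of arguments.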

On the other hand, this is also extremely painful in CL: here is a version of collecting where collect is a local macro:

(defmacro collecting (&body forms)
  ;; collect lists forwards using a tail pointer
  ;; local macro version
  (let ((rn (make-symbol "R"))
        (rtn (make-symbol "RT"))
        (itn (make-symbol "IT")))
    `(let ((,rn '())
           (,rtn nil))
       (macrolet ((collect (form)
                    `(let ((,',itn ,form))
                       (if (not (null ,',rn))
                           (setf (cdr ,',rtn) (cons ,',itn nil)
                                 ,',rtn (cdr ,',rtn))
                         (setf ,',rn (cons ,',itn nil)
                               ,',rtn ,',rn))
                       ,',itn)))
         ,@forms)
       ,rn)))

This is not easy to understand.

Additionally, the problem almost always comes from ellipses, and in many interesting cases they can be avoided by using dotted pairs as patterns — here is yet another version of with-abort that does this:

(require racket/stxparam)

(define-syntax-parameter abort
  (λ (stx)
    (raise-syntax-error #f "not available" stx)))

(define-syntax with-abort
  (syntax-rules (with-abort)
    [(with-abort) (void)]
    [(with-abort body ...)
     (call/ec
      (λ (a)
        (syntax-parameterize ([abort
                               (syntax-rules (abort)
                                 [(abort . args) (a . args)])])
          body ...)))]))

This is clearly better than the CL version.

Summary

Well, I think I now know enough about Racket’s macros to be going on with: I can certainly write the macros I need to be able to write now without it just being cargo-cult programming. There are still things I don’t understand, and the whole system smells to me as if, by trying to remain ideologically pure, it has become vast and essentially incomprehensible. This seems to be a common problem with Scheme, unfortunately.

Small notes

Macro definitions scope properly, so you can define a local macro the same way you can define a local function, so this works:

(define (foo ...)
  (define-syntax-rule (while test body ...)
    (let loop ()
      (when test
        body ...
        (loop))))
  ... (while ... ...) ...)

This makes the equivalent of CL’s MACROLET easy to do.
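A runnable instance of the schematic definition above, as a sketch. The function countdown is made up for the example; the while macro is visible only inside it.

```racket
#lang racket
(define (countdown n)
  ;; local macro: not visible outside countdown
  (define-syntax-rule (while test body ...)
    (let loop ()
      (when test
        body ...
        (loop))))
  (define acc '())
  (while (> n 0)
    (set! acc (cons n acc))
    (set! n (- n 1)))
  (reverse acc))

(countdown 3) ; '(3 2 1)
```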

For fun, here is a version of with which can deal with named-let. There must be a way of implementing this without assignment, but I can never work out what it is.

(define-syntax (with stx)
  (syntax-case stx ()
    [(_ ())
     ;; all null
     #'(void)]
    [(_ () body ...)
     ;; no vars: trivial case
     #'(begin body ...)]
    [(_ ([var val] ...))
     ;; null body: make sure vars are evaluated
     #'(begin val ... (void))]
    [(_ ([var val] ...) body ...)
     ;; normal let
     #'((λ (var ...) body ...) val ...)]
    [(_ n ())
     (identifier? #'n)
     ;; named null
     #'(void)]
    [(_ n ([var val] ...))
     (identifier? #'n)
     ;; named null body
     #'(begin val ... (void))]
    [(_ n ([var val] ...) body ...)
     ;; named let with arguments
     ;; (is there an implementation without assignment?)
     (identifier? #'n)
     #'((λ (n)
          ((λ (l)
             (set! n l)
             (l val ...))
           (λ (var ...) body ...)))
        #f)]
    [_ (raise-syntax-error #f "bad syntax" stx)]))
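One possible way to avoid the explicit set!, assuming letrec is acceptable as a primitive, is the usual expansion of named let into letrec. This is only a sketch of the named case with a non-empty body, and the macro name with/named is invented to keep it separate from with.

```racket
#lang racket
(define-syntax (with/named stx)
  (syntax-case stx ()
    [(_ n ([var val] ...) body ...)
     (identifier? #'n)
     ;; bind n recursively with letrec, then call it on the initial values
     #'((letrec ([n (λ (var ...) body ...)]) n) val ...)]))

(with/named loop ([i 0] [acc '()])
  (if (= i 3)
      (reverse acc)
      (loop (+ i 1) (cons i acc)))) ; '(0 1 2)
```

Whether this really counts as being without assignment depends on how letrec itself is implemented, of course.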

Things I still do not know or understand

At this point I’m mostly comfortable writing macros in Racket, but there are things I still do not understand:

protecting and arming syntax objects — I just don’t understand what this is about at all;
syntax-parse is, I think, not difficult but I have not bothered to learn about it as it seems to add yet another layer;
there are probably other things that I don’t even know I don’t know.

At some point I might write a further part of this series on some of that.

Pointers

Eli Barzilay’s paper on syntax-parameterize.

Fear of Macros, again.

Macros in Racket, part one

Tim Bradshaw — Tue, 13 Jan 2015 14:45:48 UT

I’ve written in Lisp for a long time, but I’ve never used a hygienic macro system in any way other than the most simple. Here are some initial notes on my experiences learning Racket’s macro system.

This is the first part of several: see part two and part three. I’m not completely fluent with Racket macros yet: there are almost certainly mistakes and confusions here. Despite appearances, I also have no axe to grind: I’m learning Racket because I want to and I have time. Finally this is not a tutorial: look at Greg Hendershott’s Fear of Macros for something closer to that. This is just some notes which were useful to me, and might be useful to other CL people.

Macros in Common Lisp

Common Lisp’s macro system is, in essence, simple: it’s what you’d end up writing if you had to write a macro system for a Lisp. That’s not surprising because it is the descendant of the first macro systems people wrote for Lisp. In CL what happens is this:

the reader ingests the source text and produces data structures which represent the source of the program;
these structures are possibly transformed by macros, which are simply Lisp functions which are given the Lisp representation of the source and return some other representation;
once all macros are expanded, then the code is compiled, evaluated or both.

(I have missed out some subtleties here, but they don’t matter for my purposes.)

In CL, what the reader produces is exactly what you would expect. If it reads "(defun foo (a) a)" then, with standard settings, it returns a list whose car is the symbol DEFUN (in the CL package) and so on. It is this structure that macros transform.

CL provides relatively limited support for writing macros: there is backquote, which is critical to being able to write macros which are even slightly readable, limited pattern matching in the form of destructuring, and there are mechanisms to generate unique names as well as a few other things. There is a semi-standard way of enquiring about bindings in the environment at macro expansion time, although this is not in the standard.

In practice, CL’s macro system has turned out to work very well; in theory it has all sorts of problems, the most important being that the programmer is entirely responsible for making sure that macros don’t introduce or accidentally use names they should not. Consider this:

(defmacro collecting (&body forms)
  ;; collect lists forwards using a tail pointer
  ;; polluting version
  `(let ((r '())
         (rt nil))
     (flet ((collect (form)
              (if (not (null r))
                  (setf (cdr rt) (cons form nil)
                        rt (cdr rt))
                (setf r (cons form nil)
                      rt r))
              form))
       ,@forms)
     r))

This intentionally introduces a function binding, collect, but also accidentally introduces bindings for r and rt.

(let ((r 2))
  (collecting
    (+ r r)))

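This does not do what it should: inside the body r refers to the macro’s own binding, which is the empty list, so (+ r r) signals an error. One right way to write the collecting macro is like this: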

(defmacro collecting (&body forms)
  ;; collect lists forwards using a tail pointer
  ;; non-polluting version
  (let ((rn (make-symbol "R"))
        (rtn (make-symbol "RT")))
    `(let ((,rn '())
           (,rtn nil))
       (flet ((collect (form)
                (if (not (null ,rn))
                    (setf (cdr ,rtn) (cons form nil)
                          ,rtn (cdr ,rtn))
                  (setf ,rn (cons form nil)
                        ,rtn ,rn))
                form))
         ,@forms)
       ,rn)))

And now the above form does not signal an error and correctly returns ().

Note that the problem is with names and not just bindings. Consider this CL code:

(defvar *stashes* '())
(defvar *mark* nil)
  
(defun stash (name thing)
  ;; Stash something under a name
  (setf *stashes* (acons name thing *stashes*))
  (values name thing))

(defun retrieve (name)
  ;; Retrieve the value of a name, dropping everything stashed more
  ;; recently, and stopping at the mark, if any.
  (let ((mark *mark*))
    (labels ((rl (tail)
               (if (or (null tail)
                       (eq (first tail) mark))
                   (values nil nil)
                 (destructuring-bind ((n . v) . r) tail
                   (if (eql n name)
                       (progn
                         (setf *stashes* r)
                         (values v t))
                     (rl r))))))
      (rl *stashes*))))

(defmacro with-marked-stash (&body forms)
  ;; mark the stack of stashes for the dynamic extent of FORMS
  (let ((mn (make-symbol "MARK")))
    `(let ((*stashes* (cons ',mn *stashes*))
           (*mark* ',mn))
       ,@forms)))

In this code the marks on the stack of stashes established by with-marked-stash are not bound anywhere: they are just names. But it’s important to the correct functioning of the code that they are unique names. (There are better ways of doing this such as using a fresh cons for the mark: I just wanted an example where a name mattered other than as the name of a variable.)

The politically correct way of saying that we’re talking about names is to talk about ‘lexical context’ or ‘lexical information’: it’s the same thing but more confusing to those not initiated into the cult, which is always good.

The disadvantages of the CL macro system are this problem with hygiene and the lack of any clever tools to do pattern matching on macro forms. The second of these is easily overcome by using any of a number of tools, while the first is generally not a problem in practice: CL being a Lisp-2 (separate namespaces for functions and variables) helps here.

The advantage of the CL macro system is that there is no magic: macros get passed the things that the source code looks like — generally a structure whose interesting parts are lists and symbols — which you process using the normal list-processing tools to produce some other structure which is the expansion of the macro. It’s easy enough that you could write it yourself: there are no special opaque objects being handed around.

That being said, having a standard set of tools for pattern matching in macros and a way of dealing with the hygiene problems which is less ugly than in CL might well be worth the cost in transparency.

Macros in Scheme

I am not a native Scheme person, but it has clearly taken the whole hygiene thing very seriously: Scheme, as a set of languages, treats purity as much more important than CL, which revels in being a fairly grungy language, does. However these posts are not about Scheme: the only reason I am mentioning it is to say that I have not cared at all whether anything here applies generally to Scheme or is specific to Racket.

Macros in Racket: baby steps

For a long time the only kind of macros that I’ve really been able to define in Racket are annoyingly trivial ones using define-syntax-rule, things like:

(define-syntax-rule (while test body ...)
  (let loop ()
    (when test
      body ...
      (loop))))

That’s all very well, but the ‘obvious’ (and obviously wrong) definition of collecting then looks like this:

(define-syntax-rule (collecting body ...)
  ;; horribly wrong
  (let ([s '()])
    (define (collect it)
      (set! s (cons it s))
      it)
    body ...
    (reverse s)))

(There’s no obvious way to build lists forwards with a tail pointer in Racket: reversing the list is probably as cheap as anything.) This is either introducing a spurious binding for s or not introducing a deliberate one for collect, and in fact, of course, it’s the latter.

Quite apart from this, define-syntax-rule gives the strong impression that it lets you write only the sort of macros that would give people who write C++ great pride: simple ones. (Actually you can do reasonably hairy things even with this because the pattern matching is very competent:

(define-syntax-rule (mlet ([var val] ...) body ...)
  ((λ (var ...) body ...) val ...))

is an implementation of simple let, for instance. Indeed we can define named let as well:

(define-syntax-rule (nlet label ([var val] ...) body ...)
  (mlet ()
    (define (label var ...) body ...)
    (label val ...)))

What I can’t work out how to do is to make mlet do both things: I think this is too hard for define-syntax-rule although I might be wrong.)

But for a long time I was stuck with that: whenever I looked at Racket macros in more detail I walked into a wall of opaque terminology and just decided that I had better things to do that year. This year, I don’t.

Two desirable macros

There are many ways people use macros in Lisp: some of them are good. I decided that if I could write two macros and understand them then I would be well on my way.

collecting / collect. This is the macro given above in CL. It’s interesting not for what it does — the tail-pointer stuff is less interesting now than it once was and is hard to implement in Racket anyway — but because it introduces a binding: it is intentionally not completely hygienic, while having an essentially trivial expansion: no complicated destructuring is needed.
CL’s let, which I’ll call clet. This is interesting because it requires destructuring of arguments which is not completely simple, but it does not present problems of hygiene. The reason it’s not just a subset of Racket’s let is that CL allows variables with no initial value, which get bound to nil and should, I think, become undefined in Racket. So (clet ((x 1) y) body ...) should expand to (let ([x 1] [y undefined]) body ...) or something equivalent to that.

Here is a simple implementation of clet in CL, missing any error checking:

(defmacro clet (bindings &body forms)
  (multiple-value-bind (args vals)
      (loop for binding in bindings
            for consp = (consp binding)
            collect (if consp (first binding) binding) into as
            collect (if consp (second binding) nil) into vs
            finally (return (values as vs)))
    `((lambda (,@args) ,@forms) ,@vals)))

Like most macros in CL it’s not particularly pretty but it is reasonably clear what it does.

I will use these two macros as examples below.

Phases

To understand macros in any Lisp you need to develop a strong idea of the various ‘times’ that things happen and the relationships between them: for CL these are things like read time, macro expansion time, compilation time (compiler-macro expansion time), load time, run time and so on. Racket has formalised the parts of this after read time into a notion of ‘phase’:

phase 0 is run-time;
phase 1 is macro expansion time;
phase 2 would, I think, be macros used in macro expansion;
and so on.

However I am not sure how this ties in to read time: is that phase 1? For CL read time is before macro expansion time although the two are, or may be, interleaved at the granularity of forms (rather than per file or per compilation unit). Also there are negative phases which I don’t understand, although I think they must be to do with code which exists at macro expansion time (phase 1) wanting to make things available at run time (phase 0). All of this is integrated into the module system (and CL gets away without it mostly because it does not have a formalised module system).

Bindings exist at a phase, and the same name can have different bindings at different phases.

Modules can say what they provide at which phase, and, importantly, the racket module does indeed provide different things at different phases: if you look at it you’ll find:

(provide ...
         (for-syntax (all-from-out racket/base)))

Which means that, at phase 1, what is available is racket/base: a significantly smaller language than racket itself. If you need things in macros which are in racket but not racket/base you need to require them:

(require (for-syntax ...))

An example of this is first & rest, both of which are provided at phase 0 by racket but not at phase 1: if you want them you need to say (require (for-syntax racket/list)).
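As a small sketch of that requirement, here is a macro whose transformer calls first and rest while it runs, so racket/list has to be available at phase 1. The macro keep-first and the variable kept are invented for the illustration.

```racket
#lang racket

;; first and rest are called while the macro runs (phase 1),
;; so racket/list is required for-syntax.
(require (for-syntax racket/list))

;; keep-first is a hypothetical macro: it expands to its first
;; argument form and throws the other forms away unexpanded.
(define-syntax (keep-first stx)
  (define forms (rest (syntax->list stx)))
  (first forms))

(define kept (keep-first (+ 1 2) (error "never evaluated")))
```

After this kept is 3, and the call to error is never reached because that form does not survive expansion. Without the for-syntax require, first and rest would be unbound in the transformer.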

Syntax objects

As in CL, Racket macros are source-to-source functions. The difference is that in Racket the source is represented by a syntax object and a macro needs to produce another syntax object, while in CL source is represented as it looks: usually as nested lists.

So then a Racket macro is simply a function which maps from syntax objects to other syntax objects. The reason for having an opaque syntax object is that it can carry all sorts of information around with it, and in particular it can carry information about names, which helps the system maintain hygiene. (There is also information about source location and so on, but this isn’t so important.)

So the Racket macro system needs tools to transform syntax objects into other syntax objects, ultimately by digging around inside them to find out what the source code actually was. This is necessarily more complicated than it is in CL both because the objects are opaque and because they contain information which is not present at all in the objects CL macros get.

Additionally, and mostly independently, there is a layer on top of this which does not exist in CL (without libraries) at all: pattern matching and template filling. This means that for many purposes you can write macros in Racket simply by specifying patterns that the source must match and filling templates with the results of those matches. This is a very nice way of writing macros, although it renders what is actually going on even more opaque. For a CL person, used to feeling the bits between their toes, this can be quite disconcerting at first since what is actually happening can become entirely obscure.

Syntax objects for the unwashed Lisp hacker

Well, of course it is possible to ignore all this terrifyingly modern pattern matching stuff and write macros almost the way you do in CL, and it’s worth doing that at least once, perhaps. So here is clet:

(require (for-syntax racket/list)
         racket/undefined)

(define-syntax clet
  (λ (stx)
    (define ctx (quote-syntax clet))
    (define top-level (syntax->list stx))
    (define bindings (second top-level))
    (define body (rest (rest top-level)))
    (define-values (args vals)
      (for/lists (as vs) ([binding (syntax->list bindings)])
        (define it (syntax->list binding))
        (if it
            (values (first it) (second it))
            (values binding (datum->syntax ctx 'undefined)))))
    (datum->syntax 
     ctx
     `((λ (,@args) ,@body) ,@vals))))

So how does this work? Well, it uses some functions provided by Racket to look inside the syntax object (getting the ‘datum’ in the syntax object) and in turn to construct a new one:

syntax->list takes a syntax object which wraps a proper list and unpacks one level of it, returning a list of syntax objects, or #f if it does not wrap a proper list;
datum->syntax takes a context object and a datum and wraps it into a syntax object, leaving any syntax objects in the datum as they are;
quote-syntax is like quote but it creates a syntax object, and this object contains the lexical information present in the source.
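These functions can also be tried outside any macro, since syntax objects are ordinary values at run time. The following is a sketch for poking at them; the variable names are invented, and clet here is just a symbol inside a quoted piece of syntax.

```racket
#lang racket/base

;; A syntax object built at run time, just to look inside.
(define stx #'(clet ((x 1) y) (+ x 1)))

;; One level of unpacking: a list of three syntax objects.
(define parts (syntax->list stx))

;; The binding list unpacks again; a bare identifier does not.
(define bindings (syntax->list (cadr parts)))
(define not-a-list (syntax->list (cadr bindings)))  ; y is not a list

;; datum->syntax wraps a plain list and leaves the syntax object
;; inside it alone; syntax->datum strips everything, recursively.
(define rebuilt
  (datum->syntax #f (list 'begin (caddr parts))))
```

Here parts has three elements, bindings has two, not-a-list is #f, and (syntax->datum rebuilt) is '(begin (+ x 1)).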

So the macro pulls apart the syntax object in a fairly straightforward way: it makes it into a list, extracts the second element, which will be the binding specifications, and all the remaining elements, which will be the body, and then grinds over the binding specifications, using syntax->list both to work out if each binding is a list or not and to extract the variable and value if it is. It then reassembles everything as a call to an anonymous function.

The critical trick is that the context that datum->syntax needs is a syntax object and you need to pick the right one: you can use the syntax object you got given, which provides the context of the place where the macro was expanded, or you can use a syntax object of your own devising which provides that object’s context. And in this case we want our own context, not the context of the place where the macro was expanded. This is what ctx is for: providing a suitable context.

Notice the require:

we need racket/list at phase 1 (macro expansion time) because the macro uses first and so on;
we need racket/undefined at phase 0 (run time) as the expansion of the macro uses undefined.

So we can try this:

> (clet ((x 12) y) (values x y))
12
#<undefined>
> (let ((undefined 'hello)) (clet (x) x))
#<undefined>
> (clet ((undefined 'hello)) (clet (x) x))
#<undefined>
> (clet ((x 1)))
λ: bad syntax in: (λ (x))
> (clet (1) 1)
λ: not an identifier, identifier with default, or keyword in: 1

The second and third examples show why we need the macro context: we don’t want a binding of undefined to alter what the clet picks as the undefined value. The fourth and fifth examples show that the macro isn’t very robust, and has terrible error reporting.
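The effect of the context argument can be isolated with two tiny macros which differ only in which syntax object they hand to datum->syntax. This is a sketch: both macros and both variables are invented for the illustration, and neither macro looks at its arguments.

```racket
#lang racket
(require racket/undefined)

;; Both macros are hypothetical: each expands to the identifier
;; undefined, and they differ only in the context they give it.
(define-syntax (undefined-here stx)
  ;; context of the macro definition
  (datum->syntax (quote-syntax here) 'undefined))

(define-syntax (undefined-there stx)
  ;; context of the place where the macro is used
  (datum->syntax stx 'undefined))

(define from-here  (let ([undefined 'hello]) (undefined-here)))
(define from-there (let ([undefined 'hello]) (undefined-there)))
```

With the definition-site context the local binding is ignored and from-here is the value from racket/undefined, while with the use-site context the local binding captures the identifier and from-there is 'hello.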

Some notes:

I’ve deliberately written (define-syntax clet (λ (stx) ...)) rather than the more pleasant (define-syntax (clet stx) ...) to make it clear that clet is a function which transforms a syntax object;
but I’ve used internal define where in CL there would be let* or nested lets — I’m not sure why other than reducing indentation;
the destructuring of the syntax object is done in a way which is primitive even by the standards of CL;
it should be evident that the macro is not very robust — something like (clet ((x 1) 2) ...) will fail horribly;
it’s not much less clear than the CL version, although I think it is a bit less clear.

I am fairly but not completely sure that this macro is right: I am slightly confused by the handling of undefined: although it is easy to check, by wrapping clet into a module, that clients of that module don’t themselves need to import racket/undefined and do get the right initial values in forms like (clet (x) ...) I am still a bit queasy about what it’s doing.

What is very clear is that this macro is just horrible: even by the standards of CL macros it’s horrible, because there is so much explicit unpacking and repacking going on. Things would be even worse if there was any significant error checking. Something better than this is needed to deal with syntax objects, in a way that it isn’t needed for CL macros. In next week’s exciting episode I’ll look at ways of making this better.

Pointers

Writing ‘syntax-case’ Macros by Eli Barzilay. This was the article that first helped me understand what was going on.

Fear of Macros by Greg Hendershott. This is an introduction to macros, and macros in Racket in particular, by the author of Frog.

The cult of programming

Tim Bradshaw — Mon, 05 Jan 2015 19:24:26 UT

Programming is not meant to be easy and it’s important to make sure that it is as cryptic as possible otherwise people other than cult members might be able to understand it. Of course, you also need to make sure it’s pure, because otherwise cult members will laughingly throw you into a pit full of spikes and the rotting remains of other heretics.

For instance, you can’t be writing this sort of thing:

(defun ss (n)
  (let ((s 0) (i 0))
    (tagbody
     loop
     (when (> i n) (go done))
     (setf s (+ s (* i i))
           i (+ i 1))
     (go loop)
     done
     (return-from ss s))))

This is just terrible code. Non cult members may well be able to understand it, and the cultists will have you in the pit before you know it.

You might think this was better:

(defun ss (n)
  (loop for i from 0 to n
    summing (* i i)))

But in fact it’s far worse. Fellow cultists will definitely still be at the laughing and pit-throwing, and the others will certainly understand it and laugh at you because you don’t know the closed form.

Instead, you must write this:

(define (ss n)
  (let-values ([(a i l) (call/cc (λ (c) (values 0 0 c)))])
    (l (+ a (* i i))
       (+ i 1)
       (if (< i (- n 1))
           l
           (λ (a i l) a)))))

This is almost a perfect solution. It’s so achingly pure and cryptic that you will be immediately appointed king of the cult and be able to do your own laughing, and throw other members into pits you have first made them dig, for which they will thank you as they slide down the spikes. Non cult members stand essentially no chance of understanding what it does and sniping about the whole silly closed-form thing: certainly the only way they will be able to learn what it does is by first joining the cult, at which point, as king, you can just throw them straight into the pit.

It’s important you understand this.

Rerooting Frog

Tim Bradshaw — Mon, 29 Dec 2014 17:15:25 UT

Frog wants to create blogs which hang directly under /. I want mine to live under a subdirectory, and to have all its data living under that directory. I’ve made some changes to Frog to support that. As of 20150702 these changes have been merged to the main frog repo: you no longer need to refer to mine, which is obsolete.

What I did was to add a new parameter, uri-prefix (implemented in the code as current-uri-prefix) and write a function which converts between the original name and whatever external name is wanted: at the moment this just adds the prefix but it has ambitions. Most of the problem was then finding all the places where absolute URIs were assumed in the code, and I’m not sure I’ve done that — Racket does not seem to have very good tools for understanding the structure of any significant body of code, which I found surprising: perhaps I am spoiled by the very wonderful LispWorks code browsing tools.

These fixes could be found on GitHub, on the uri-root-fix branch: this is no longer needed as improved versions are now in the main frog repo.

A theory of names

The underlying problem here is that you need a theory of names to do this sort of thing: rather than saying ‘things of type x live in /things/x/...’ and then discovering that in fact they should live in /x/things/... or something, the right answer is to keep the location in some representation which:

doesn’t commit you to what the final pathname, URI or whatever is;
has all the information you need to generate the final representation, including the ability to carry around completely arbitrary information;
cannot be confused for the final representation by the program.

Then you can write mapping functions, including extensible mapping functions, to invent the names you actually need from the objects you have.
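A minimal sketch of what such a representation might look like in Racket follows. Everything in it is hypothetical: the logical-name struct, its fields, the example-uri-prefix parameter and the mapping function are illustrations of the idea and are not taken from Frog.

```racket
#lang racket/base
(require racket/string)

;; A logical name is an opaque struct: it is neither a string nor a
;; path, so it cannot be emitted as a URI by accident. The info
;; field can carry arbitrary extra information.
(struct logical-name (kind parts info))

;; The external prefix lives in one place (hypothetical parameter).
(define example-uri-prefix (make-parameter "/blog"))

;; One mapping function: the only code which knows the final layout.
(define (logical-name->uri name)
  (string-append (example-uri-prefix)
                 "/"
                 (symbol->string (logical-name-kind name))
                 "/"
                 (string-join (logical-name-parts name) "/")))

(define tag-page
  (logical-name 'tags '("computer.html") (hash 'title "computer")))
```

Here (logical-name->uri tag-page) is "/blog/tags/computer.html", and changing the layout means changing only the mapping function or the parameter. Since (string? tag-page) is #f, code which wants a string is forced to go through a mapping.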

Common Lisp’s logical pathnames are an early effort in this direction: they offer the ability to translate a logical pathname into a physical pathname in various ways. But they’re not the right answer simply because they are pathnames: they can (and are designed to) leak into functions which expect pathnames, and can also leak into places where strings are expected, since pathnames have representations as strings. It’s important that whatever representation is used for logical names is not compatible with code which wants, for instance, to emit URIs, so that you are forced to map things everywhere they are needed. In addition the mappings you can define for logical pathnames are not really general enough.

Note that it’s not enough to have a good approach to manipulating structured pathnames, URIs or whatever, because those are the wrong type of thing to manipulate.
