- SECD: about
- Part I and Part Ia describe high-level, formal definition of a SECD machine;
- Part II discusses low-level details of SECD implementation in C;
- Part III (TODO): how to compile a Scheme subset into SECD code.
A SECD machine is a stack-based abstract machine built around lists. That determines the design of its implementation.
First of all, the machine must allocate and free cells as fast as possible. The simplest and the fastest solution is to manage a list of free cells: access to its head is O(1) and (unless you’re Jeff Dean, who can do it in O(1/n)) it’s very good. This solution has severe limitations though: all cells must have fixed size. That limits us to CONS cells, numbers and characters; what about symbols? Theoretically, I could make them lists of characters (string as lists of characters? Ugly, though survivable: just look at Haskell/Erlang), but initially I chose a way that provides simpler interoperation with C: symbols are just cells that point to C strings in the process heap. This choice was good to get started, but it would show its limitations later (I delayed writing any explicit string storage intently).
So, the memory layout is primitive: a large array of cells that are linked internally and must belong to one of the following main lists:
- list representing the stack;
- list of frames of the environment;
- list of operations to execute, the control path;
- list of “dumped” stacks/environments/control paths to use later;
- list of free cells.
(let’s forget for a while about zero-terminated C strings referenced by symbol cells).
What’s in the cell?
What is the cell format? If I wrote a SECD machine in a functional programming language, I would use sum types; the closest counterpart in C is tagged unions. Ok, some tagged unions to go (also, it’s a good representation for AST nodes). Let
cell_t structure start with a tag of cell type and its body be a union of possible tag types. What types are those? For CONS cells, it’s a couple of
cdr. For a symbol, it’s a pointer to a C string with its name. For a number, just hold an integer. It’s a quite restrictive scheme, so some time later I was really surprised that there is an actual Lisp implementation that intently limits itself to the same set of cell types: Picolisp.
Being under influence of the Henderson’s book I’ve made a bad design choice: I divided cells into
CELL_ATOMs, sub-tagging atoms with their own subtypes (
atom_type). In the hindsight, it is a choice that hinders adoption of more diverse cell types and uses more space without actual need, ATOM_* and CELL_* enumerations may be combined into a single type painlessly. The only advantage is that checking for atomness/consness is slightly easier, but this advantage is questionable if there are many compound cell types, not just CONS.
Boolean representation. I chose to use the
NIL cell (
secd->nil) as False value and to have a
#t symbol in the global environment for True. Pretty asymmetric solution: allows omitting implementation of
empty? routine for lists, but is ugly and adds redundant (and expensive) look-ups in the environment. Also it’s an ugly mix of Common Lisp-style
NIL and Schemish
Another choice was keeping pointers to cells vs. indexing cells by numbers. Sophomore C programmers like me often tend to overuse pointers where array indexes are simpler and more human-readable. My initial choice of pointers (“they’re faster”, though now I doubt that, index-based access is still a single command in x86/x86_64) has greatly obscured debugging (looking at something like
*0xffffff17a2e260 in gdb won’t say you much), even with ubiquitous
cell_index() calls in debug output. Also indexing gives you a choice between
Handling native code errors
Most of opcode operations may fail. For example, CAR may encounter an empty list, AP may encounter not a closure on the stack – almost any opcode handler is vulnerable to malformed arguments and empty stack. How to handle that? At the time I introduced
CELL_ERROR type, still unused, though now it’s clear that it is a good idea to accumulate a stack trace of native code that failed. But I explicitly chose returning
NULL as a sign of error (stack traces have to use some managed memory, I used a simpler alternative). That’s why NIL has a representation as the last cell in all lists and the last cell in the cell array (back then I thought it’s a neat trick to not store the size:
secd->nil - secd->data would be enough); now it’s obvious that stack traces are a must-have. I’ll leave discussion of more high-level mechanisms like exceptions/conditions for later.
Another choice was to have a global
secd_t structure or propagate
secd_t* throughout routines. As a functional aficionado who despises even C++
this for its implicitness, now I stubbornly pull it through arguments almost everywhere, though my initial solution was to put
secd_t* in every cell (that was terribly redundant, each of thousands or of a million cells had a copy of the same pointer). Even innocent
is_nil ends up using
secd_t* pointer. Is it worth it? I don’t know. It certainly is for having secd-as-a-library, to have many instances in the same process, to be thread-safe, etc, but it’s cumbersome.
As a programmer with C/C++ background, I still can’t overcome some atavisms like distrust for any memory management schemes that looks non-deterministic to a C programmer; although, having written my own heap implementation, I know that any dynamic memory management scarcely can be deterministic and reference counting does not prevent large pauses too, since the graph to be freed may have arbitrary size, reference counting just alleviates the problem by doing that work often in small chunks. Anyway, reference counting seemed to be a simpler memory management scheme at the time I started writing the code, so I explicitly left a more sophisticated GC implementation for later. So, every cell in the machine must keep number of references to it, that number of references (I abbreviate it as
nref from now on) has to be managed by
drop_cell() routines, the latter frees the memory if
nref drops to zero.
For interaction with the world I also hacked the simplest possible solution: let’s encapsulate parser and printer into special opcodes,
display operating on S-expressions. Yes, it’s a madness to write a parser in a language without even a VM yet, so I did all the heavy lifting in C and hacked together a simple recursive-descent parser (like my first one). This introduced first non-pure operations after playing with some simple hard-coded control paths in C source code (like
(LDC 2 LDC 2 ADD)) and made the machine actually capable of interpreting input.
1.13 Yegge. To be continued