SECD implementation in C: general design

Contents


A SECD machine is a stack-based abstract machine built around lists. That determines the design of its implementation. 

Memory layout


First of all, the machine must allocate and free cells as fast as possible. The simplest and the fastest solution is to manage a list of free cells: access to its head is O(1) and (unless you’re Jeff Dean, who can do it in O(1/n)) it’s very good. This solution has severe limitations though: all cells must have fixed size. That limits us to CONS cells, numbers and characters; what about symbols? Theoretically, I could make them lists of characters (string as lists of characters? Ugly, though survivable: just look at Haskell/Erlang), but initially I chose a way that provides simpler interoperation with C: symbols are just cells that point to C strings in the process heap. This choice was good to get started, but it would show its limitations later (I delayed writing any explicit string storage intently).

So, the memory layout is primitive: a large array of cells that are linked internally and must belong to one of the following main lists:

  1. list representing the stack;
  2. list of frames of the environment;
  3. list of operations to execute, the control path;
  4. list of “dumped” stacks/environments/control paths to use later;
  5. list of free cells.

(let’s forget for a while about zero-terminated C strings referenced by symbol cells).

The SECD structure must hold pointers to all these lists and the actual data array.

What’s in the cell?

What is the cell format? If I wrote a SECD machine in a functional programming language, I would use sum types; the closest counterpart in C is tagged unions. Ok, some tagged unions to go (also, it’s a good representation for AST nodes). Let cell_t structure start with a tag of cell type and its body be a union of possible tag types. What types are those? For CONS cells, it’s a couple of cell_t* pointers, car and cdr. For a symbol, it’s a pointer to a C string with its name. For a number, just hold an integer. It’s a quite restrictive scheme, so some time later I was really surprised that there is an actual Lisp implementation that intently limits itself to the same set of cell types: Picolisp.

Being under influence of the Henderson’s book I’ve made a bad design choice: I divided cells into CELL_CONSes and CELL_ATOMs, sub-tagging atoms with their own subtypes (atom_type). In the hindsight, it is a choice that hinders adoption of more diverse cell types and uses more space without actual need, ATOM_* and CELL_* enumerations may be combined into a single type painlessly. The only advantage is that checking for atomness/consness is slightly easier, but this advantage is questionable if there are many compound cell types, not just CONS.

Boolean representation. I chose to use the NIL cell (secd->nil) as False value and to have a #t symbol in the global environment for True. Pretty asymmetric solution: allows omitting implementation of empty? routine for lists, but is ugly and adds redundant (and expensive) look-ups in the environment. Also it’s an ugly mix of Common Lisp-style NIL and Schemish #t.

Another choice was keeping pointers to cells vs. indexing cells by numbers. Sophomore C programmers like me often tend to overuse pointers where array indexes are simpler and more human-readable. My initial choice of pointers (“they’re faster”, though now I doubt that, index-based access is still a single command in x86/x86_64) has greatly obscured debugging (looking at something like *0xffffff17a2e260 in gdb won’t say you much), even with ubiquitous cell_index() calls in debug output. Also indexing gives you a choice between uint32_t and uint64_t on 64-bit machines: why 8 bytes where 4 may be enough? Also, it’s a more portable solution: e.g. JavaScript and any decent functional programming language have arrays, but not pointers (interoperation with the hosted language would be easier too). So changing pointers to indexes is on my TO-DO list.

Handling native code errors

Most of opcode operations may fail. For example, CAR may encounter an empty list, AP may encounter not a closure on the stack – almost any opcode handler is vulnerable to malformed arguments and empty stack. How to handle that? At the time I introduced CELL_ERROR type, still unused, though now it’s clear that it is a good idea to accumulate a stack trace of native code that failed. But I explicitly chose returning NULL as a sign of error (stack traces have to use some managed memory, I used a simpler alternative). That’s why NIL has a representation as the last cell in all lists and the last cell in the cell array (back then I thought it’s a neat trick to not store the size: secd->nil - secd->data would be enough); now it’s obvious that stack traces are a must-have. I’ll leave discussion of more high-level mechanisms like exceptions/conditions for later.

Another choice was to have a global secd_t structure or propagate secd_t* throughout routines. As a functional aficionado who despises even C++ this for its implicitness, now I stubbornly pull it through arguments almost everywhere, though my initial solution was to put secd_t* in every cell (that was terribly redundant, each of thousands or of a million cells had a copy of the same pointer). Even innocent is_nil ends up using secd_t* pointer. Is it worth it? I don’t know. It certainly is for having secd-as-a-library, to have many instances in the same process, to be thread-safe, etc, but it’s cumbersome.

Garbage collection

As a programmer with C/C++ background, I still can’t overcome some atavisms like distrust for any memory management schemes that looks non-deterministic to a C programmer; although, having written my own heap implementation, I know that any dynamic memory management scarcely can be deterministic and reference counting does not prevent large pauses too, since the graph to be freed may have arbitrary size, reference counting just alleviates the problem by doing that work often in small chunks. Anyway, reference counting seemed to be a simpler memory management scheme at the time I started writing the code, so I explicitly left a more sophisticated GC implementation for later. So, every cell in the machine must keep number of references to it, that number of references (I abbreviate it as nref from now on) has to be managed by share_cell()/drop_cell() routines, the latter frees the memory if nref drops to zero.

Input-output, parsing

For interaction with the world I also hacked the simplest possible solution: let’s encapsulate parser and printer into special opcodes, READ and PRINT, the opcode representation of Scheme read and display operating on S-expressions. Yes, it’s a madness to write a parser in a language without even a VM yet, so I did all the heavy lifting in C and hacked together a simple recursive-descent parser (like my first one). This introduced first non-pure operations after playing with some simple hard-coded control paths in C source code (like (LDC 2 LDC 2 ADD)) and made the machine actually capable of interpreting input.

1.13 Yegge. To be continued

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s