[Openmcl-devel] ARM testing

Thu Jan 27 20:52:15 PST 2011

On Thu, 27 Jan 2011, David Brown wrote:

>
> One interesting tidbit is that there is a mapping at about 1.3MB below
> the stack.  Perhaps that mapping is preventing the stack from growing.
>
> beda8000-bedda000 r-xp 00000000 00:00 0
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

CCL put that there; note that the "r-xp" indicates that pages in that region
don't have write permission.  If lisp code writes to a protected page in that
region, we expect to get a SIGSEGV, (partially) handle that signal on another
stack, write-enable some of those pages, and signal a STACK-OVERFLOW condition
in Lisp:

? (defun foo (x) (abs (foo x))) ; infinitely recursive
FOO
? (FOO 0)
> Error: Stack overflow on control stack.
> While executing: FOO, in process listener(1).

In that case, we're running into the write-protected guard pages at the end
of the listener thread's control stack; the same sequence of events happens
for me if I do:

? (process-interrupt ccl::*initial-process* #'foo 0)
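
For readers who want to see the guard-page mechanism in isolation, here is a minimal C sketch of the sequence described above: take a SIGSEGV on a write-protected page, handle it on an alternate signal stack, write-enable the page, and let the faulting store be retried. This is not CCL's code. The names guard_page, on_guard_fault and guard_page_demo are hypothetical, the "stack" is just an anonymous two-page mapping, and calling mprotect() from a signal handler is assumed to be acceptable (it works on Linux, but POSIX does not list it as async-signal-safe).

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping for this sketch; not CCL data structures. */
static char *guard_page;
static size_t guard_size;
static volatile sig_atomic_t guard_faults_handled;

/* Runs on the alternate stack.  If the fault address lies in the guard
   page, write-enable the page and return, so that the faulting store
   is restarted and succeeds. */
static void on_guard_fault(int signo, siginfo_t *info, void *context)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t lo = (uintptr_t)guard_page;

    (void)signo;
    (void)context;
    if (addr >= lo && addr < lo + guard_size &&
        mprotect(guard_page, guard_size, PROT_READ | PROT_WRITE) == 0) {
        guard_faults_handled++;
        return;
    }
    _exit(1);               /* a fault this sketch does not understand */
}

/* Returns the number of guard-page faults handled, or -1 on setup error. */
int guard_page_demo(void)
{
    static char altstack_mem[64 * 1024];
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa;
    stack_t ss;
    char *region;
    int handled;

    /* The handler must not run on the stack that just overflowed. */
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = altstack_mem;
    ss.ss_size = sizeof altstack_mem;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_guard_fault;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    /* Two pages; the lower one plays the role of the guard page. */
    region = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    guard_page = region;
    guard_size = (size_t)page;
    guard_faults_handled = 0;
    if (mprotect(guard_page, guard_size, PROT_READ) != 0)
        return -1;

    *(volatile char *)guard_page = 1;   /* faults once, then is retried */

    handled = guard_faults_handled;
    munmap(region, 2 * (size_t)page);
    return handled;
}
```

On a system that reports this kind of fault as SIGSEGV, guard_page_demo() should return 1. A real runtime would also have to turn the event into a language-level condition and re-protect the page afterwards, which this sketch leaves out.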

If foreign code (including the GC, including rmark()) tries to write to 
those guard pages we expect to get a SIGSEGV; in general, it's harder to
recover from an exception in foreign code, and I think that we just drop
into the kernel debugger in that case.  (Or at least try to.)

Do you get a STACK-OVERFLOW condition signaled in lisp ?  Or does this just
die with SIGBUS ? Or does something else happen ?

(AFAIK, Unix signal names for synchronous signals are derived from the
hardware exception names on some concrete piece of hardware - perhaps a 
DEC VAX or even PDP-11.  The mapping between hardware exceptions and
Unix signals on other machines isn't always 1:1; whether "attempt to
write to a write-protected page" maps to SIGBUS or SIGSEGV is - again,
AFAIK - up to the OS.)

I haven't seen any kind of memory fault on any ARM Linux that I've used
raise anything but SIGSEGV, but that isn't conclusive and OS kernels
have been known to change their minds about this over time.  If some
ARM Linux kernels raise SIGBUS, we should certainly try to handle SIGBUS.

> beda8000-bedda000 r-xp 00000000 00:00 0
> beddb000-bef0c000 rwxp 00000000 00:00 0          [stack]

Note that the stack region now ends (= has its low address) at an address
one page above the high end of the protected region; GCstack_limit should
have been #xbeddb000 in this case.
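
The one-page gap can be checked mechanically from the two map lines quoted above. The C sketch below parses the "low-high perms" prefix of a /proc/<pid>/maps line and computes the distance between the top of one mapping and the bottom of the next; struct map_range, parse_map_line and gap_between are hypothetical helper names made up for this example.

```c
#include <stdio.h>

/* Address range and permission string of one mapping. */
struct map_range {
    unsigned long lo;
    unsigned long hi;
    char perms[5];          /* e.g. "r-xp", plus the terminating NUL */
};

/* Parse the "lo-hi perms" prefix of one maps line.
   Returns 0 on success, -1 if the line does not match. */
int parse_map_line(const char *line, struct map_range *out)
{
    if (sscanf(line, "%lx-%lx %4s", &out->lo, &out->hi, out->perms) != 3)
        return -1;
    return 0;
}

/* Bytes between the high end of `below` and the low end of `above`. */
unsigned long gap_between(const struct map_range *below,
                          const struct map_range *above)
{
    return above->lo - below->hi;
}
```

Fed the lines "beda8000-bedda000 r-xp ..." and "beddb000-bef0c000 rwxp ... [stack]", gap_between() gives 0x1000: the protected region ends at 0xbedda000 and the stack mapping starts at 0xbeddb000, one 4096-byte page higher.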

Here's another theory that makes so much sense (at the moment) that it's probably
completely wrong: it's possible that recent Linux kernels are refusing to map
the last page of a stack region and signaling SIGBUS (at least on ARM) when
attempts are made to write to that page.  (That's actually reminiscent of a
Linux kernel change made last summer, where mmap() with the MAP_GROWSDOWN option
refused to map the lowest page in the region it returned; that redefinition of
mmap's behavior was - according to my possibly garbled understanding - related
to stack growth/overflow detection.)

This theory would explain another mystery: the stack pointer would have gone
just past the GCstack_limit (onto the last page before the guard pages) and
rmark() could have written to that page (and triggered a SIGBUS) before checking
to see if the stack pointer's past GCstack_limit.

At the moment, I like this theory (but of course I liked the one from the other
day, too.)  One way of testing it is to move GCstack_limit a page higher; it's
set near the start of the function gc() in lisp-kernel/gc-common.c:

/* ignore the other case of the containing 'if'.  This is around
    line 1394 */

     GCstack_limit = (natural)(tcr->cs_limit)+(natural)page_size;

If we change 'page_size' to '2*page_size' in that line and recompile
the kernel, does the problem (loading the bootstrapping image) persist ?
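
To make the arithmetic behind that experiment concrete, here is a small C sketch using the addresses from the map lines quoted above. It assumes a 4096-byte page and that tcr->cs_limit was 0xbedda000 (which is what a GCstack_limit of #xbeddb000 implies); uintptr_t stands in for the kernel's 'natural' type, and every name in the block is hypothetical.

```c
#include <stdint.h>

/* Values read off the quoted map lines; CS_LIMIT_ASSUMED and
   ASSUMED_PAGE_SIZE are assumptions, not values reported by the kernel. */
#define STACK_LO           ((uintptr_t)0xbeddb000u)  /* low end of [stack] */
#define STACK_HI           ((uintptr_t)0xbef0c000u)  /* high end of [stack] */
#define CS_LIMIT_ASSUMED   ((uintptr_t)0xbedda000u)  /* top of guard region */
#define ASSUMED_PAGE_SIZE  ((uintptr_t)0x1000u)

/* Same shape as the line in gc(): cs_limit plus some number of pages. */
uintptr_t gc_stack_limit(uintptr_t cs_limit, uintptr_t page_size,
                         unsigned pages)
{
    return cs_limit + (uintptr_t)pages * page_size;
}

/* Does `addr` fall inside the mapping that the kernel labels [stack]? */
int in_mapped_stack(uintptr_t addr)
{
    return addr >= STACK_LO && addr < STACK_HI;
}

/* The first word a downward-growing stack touches once it has gone
   just past `limit`. */
uintptr_t first_address_past(uintptr_t limit)
{
    return limit - sizeof(uintptr_t);
}
```

With one page of slack the limit is 0xbeddb000, so the first word past it lies in the gap page between the guard region and [stack], where a write would fault before any limit check runs. With two pages the limit is 0xbeddc000, and the first word past it is still inside the mapped stack, so the write succeeds and the check can then notice the overflow.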

>
> David
>
>