[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Gary Byers gb at clozure.com
Wed Oct 30 17:15:41 PDT 2013



On Wed, 30 Oct 2013, Paul Meurer wrote:
> 
> I run it now with --no-init and in the shell, with no difference. Immediate failure with :consing in *features*,
> bogus objects etc. after several rounds without :consing.

So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
anyone reading this won't be subjected to me doing so?

Oh well.

I was able to reproduce the problem by running your test 100 times, so apparently
I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
since my ability to diagnose problems that only occur on 16-core machines depends
on my ability to borrow such machines for a few months.)

> 
> My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL 1.9 (I did svn update +
> rebuild of everything yesterday).
> 
> I also observed that the problem goes away when I replace the constant string in the library by a freshly
> allocated string:
> 
> char *getstring() {
>   int index;
>   char *buffer = (char *)calloc(100 + 1, sizeof(char));
>   for (index = 0; index < 100; index++) {
>     buffer[index] = 'a';
>   }
>   buffer[100] = '\0';
>   return buffer;
> }
> 
> One should expect the strings in the Postgres library to be freshly allocated, but nevertheless they behave like
> the constant string example.

It's unlikely that this change directly avoids the bug (whatever it is); it's more
likely that it affects timing (exactly what happens when.)  I don't yet know what
the bug is, but I think it's fair to characterize it
as "timing-sensitive".  (For example: from the GC's point of view, it matters whether
a thread is running Lisp or foreign code when that thread is suspended by the GC.
The transition between Lisp and foreign code takes a few instructions, and if
a thread is suspended in the middle of that instruction sequence and the GC
misinterprets its state, very bad things like what you're seeing could occur.
That's not supposed to be possible, but something broadly similar seems to be
happening.)
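
For anyone following along, here is a rough C sketch of the two variants being
compared.  The constant-string version of the test library isn't quoted in this
thread, so getstring_constant(), its contents, and the name getstring_allocated()
are assumptions made for illustration only; the allocated version follows the
function quoted above, with a NULL check added.

```c
#include <stdlib.h>
#include <string.h>

#define STRING_LENGTH 100

/* Hypothetical constant-string variant (not taken from the original
   library): every call returns the same pointer into static storage,
   and nothing is allocated. */
const char *getstring_constant(void) {
    static const char text[] = "aaaaaaaaaa";
    return text;
}

/* Freshly allocated variant, modeled on the function quoted above.
   Each call goes through the allocator, so it takes longer and the
   caller is responsible for calling free() on the result. */
char *getstring_allocated(void) {
    char *buffer = calloc(STRING_LENGTH + 1, sizeof(char));
    if (buffer == NULL) {
        return NULL;                      /* allocation failed */
    }
    memset(buffer, 'a', STRING_LENGTH);   /* fill with 'a' characters */
    buffer[STRING_LENGTH] = '\0';         /* already zero from calloc; kept explicit */
    return buffer;
}
```

The two functions hand back equivalent data as far as the Lisp side is
concerned; the practical difference is that the second spends extra time in
calloc() on every call, which is the kind of thing that can shift timing.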


