[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Paul Meurer Paul.Meurer at uni.no
Thu Oct 31 02:48:49 PDT 2013


On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:

> On Wed, 30 Oct 2013, Paul Meurer wrote:
>> I have now run it with --no-init and in the shell, with no difference: immediate failure with :consing in
>> *features*, and bogus objects etc. after several rounds without :consing.
> 
> So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
> anyone reading this won't be subjected to me doing so ?
> 
> Oh well.
> 
> I was able to reproduce the problem by running your test 100 times,

I am not able to provoke it at all on the MacBook, and I tried a lot.

> so apparently
> I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
> since my ability to diagnose problems that only occur on 16-core machines depends
> on my ability to borrow such machines for a few months.)

I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing right):

(dotimes (j 100)
  (print (ccl::all-processes))
  (dotimes (i 8)
    (process-run-function
     (format nil "getstring-~a-~a" j i)
     (lambda (i)
       (let ((list ()))
         (dotimes (i 500000)
           (push (getstring) list)))
       (print i))
     i))
  (print (list :done j))
  (sleep 1))

If you really need a 16-core machine to debug this I can give you access to mine. :-)

>> My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL 1.9 (I did svn update +
>> rebuild of everything yesterday).
>> I also observed that the problem goes away when I replace the constant string in the library by a freshly
>> allocated string:
>> char *getstring() {
>>   int index;
>>   char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>   for (index = 0; index < 100; index++) {
>>     buffer[index] = 'a';
>>   }
>>   buffer[100] = '\0';
>>   return buffer;
>> }
>> One should expect the strings in the Postgres library to be freshly allocated, but nevertheless they behave like
>> the constant string example.
> 
> It's unlikely that this change directly avoids the bug (whatever it is); it's more
> likely that it affects timing (exactly what happens when.)  I don't yet know what
> the bug is, but I think that it's likely that it's fair to characterize the bug
> as being "timing-sensitive".  (For example: from the GC's point of view, whether
> a thread is running Lisp or foreign code when that thread is suspended by the GC.
> The transition between Lisp and foreign code takes a few instructions, and if
> a thread is suspended in the middle of that instruction sequence and the GC
> misinterprets its state, very bad things like what you're seeing could occur.
> That's not supposed to be possible, but something broadly similar seems to be
> happening.)
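
For reference, here is a minimal sketch of the two getstring variants being compared. The constant-string version is only an assumption about what the original library function looked like (it is not shown in this message), and the names getstring_constant, getstring_allocated and check_variants are hypothetical, chosen so both variants can sit in one file:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LEN 100

/* ASSUMED shape of the original variant: returns a pointer to static,
   read-only storage (10 x 10 = 100 'a' characters). Never free this. */
const char *getstring_constant(void) {
    return "aaaaaaaaaa" "aaaaaaaaaa" "aaaaaaaaaa" "aaaaaaaaaa" "aaaaaaaaaa"
           "aaaaaaaaaa" "aaaaaaaaaa" "aaaaaaaaaa" "aaaaaaaaaa" "aaaaaaaaaa";
}

/* Variant quoted above: a fresh heap buffer on every call.
   The caller owns the result and is responsible for free(). */
char *getstring_allocated(void) {
    /* calloc zero-fills, so the terminating '\0' is already in place. */
    char *buffer = calloc(LEN + 1, sizeof(char));
    if (buffer == NULL) {
        return NULL;
    }
    memset(buffer, 'a', LEN);
    return buffer;
}

/* Hypothetical self-check: both variants yield identical contents,
   so the Lisp side sees the same data and only ownership differs. */
void check_variants(void) {
    char *fresh = getstring_allocated();
    assert(fresh != NULL);
    assert(strcmp(getstring_constant(), fresh) == 0);
    free(fresh);
}
```

Both functions return the same bytes; they differ only in where the storage lives and who owns it, which fits the suggestion above that the change affects timing rather than addressing the bug directly. Note that the Lisp test loop never frees the result, so with the allocated variant each call leaks its buffer.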

-- 
Paul
