[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql
Jeff Caldwell
jcaldwell at clozure.com
Thu Oct 31 04:50:17 PDT 2013
I wonder if you can detect the first time you see corruption, and do so
without causing a crash?
I know this would change the timing, and perhaps hide the problem or
perhaps not, but consider changing the C function to examine the string
upon entry to the function. Go back to the statically-allocated version so
the pointer to the string doesn't change. During initialization, save that
pointer somewhere "far" from where the pointer that is returned to Lisp is
stored. (E.g. one statically allocated, one on the heap. That is so that if
there is corruption, hopefully not both pointers are corrupted.)
Upon entry to the C function, compare the pointer that will be returned to
the saved pointer. If they are not equal, try and generate a breakpoint. If
they are equal, compare the string to the original string (also saved
elsewhere). If the strings differ in content or length, generate a
breakpoint.
Because of multiple threads, perhaps have a guard variable. When corruption
is detected, set the guard variable. Upon entry to the function, if the
guard variable is set, perhaps return a constant NULL from code, avoiding
referencing the string in memory, so that subsequent calls from other
threads don't continue to corrupt the corrupted memory, or don't reference
the corrupted memory, which may cause other exceptions.
Jeff
On Thu, Oct 31, 2013 at 5:48 AM, Paul Meurer <Paul.Meurer at uni.no> wrote:
>
> Am 31.10.2013 um 01:15 schrieb Gary Byers <gb at clozure.com>:
>
> On Wed, 30 Oct 2013, Paul Meurer wrote:
>
> I run it now with --no-init and in the shell, with no difference.
> Immediate failure with :consing in *features*,
> bogus objects etc. after several rounds without :consing.
>
>
> So, I can't rant and rave about the sorry state of 3rd-party CL libraries,
> and
> anyone reading this won't be subjected to me doing so ?
>
> Oh well.
>
> I was able to reproduce the problem by running your test 100 times,
>
>
> I am not able to provoke it at all on the MacBook, and I tried a lot.
>
> so apparently
> I won't be able to blame this on some aspect of your machine. (Also
> unfortunate,
> since my ability to diagnose problems that only occur on 16-core machines
> depends
> on my ability to borrow such machines for a few months.)
>
>
> I think you can do without a 16-core machine. I am able to reproduce the
> failure quite reliably on an older 4-core machine with Xeon CPUs and SuSE,
> with slightly different code (perhaps to get the timing right):
>
> (dotimes (j 100)
> (print (ccl::all-processes))
> (dotimes (i 8)
> (process-run-function
> (format nil "getstring-~a-~a" j i)
> (lambda (i)
> (let ((list ()))
> (dotimes (i 500000)
> (push (getstring) list)))
> (print i))
> i))
> (print (list :done j))
> (sleep 1))
>
> If you really need a 16-core machine to debug this I can give you access
> to mine. :-)
>
> My machine has 16 true cores and hyperthreading; I am running CentOS 6.0,
> and a recent CCL 1.9 (I did svn update +
> rebuild of everything yesterday).
> I also observed that the problem goes away when I replace the constant
> string in the library by a freshly
> allocated string:
> char *getstring() {?
> ? int index;
> ? char *buffer = (char *)calloc(100 + 1, sizeof(char));
> ? for (index = 0; index < 100; index++) {
> ? ? ? buffer[index] = 'a';
> ? ? }
> ? buffer[100] = '\0';
> ? return buffer ;
> }
> One should expect the strings in the Postgres library to be freshly
> allocated, but nevertheless they behave like
> the constant string example.
>
>
> It's unlikely that this change directly avoids the bug (whatever it is);
> it's more
> likely that it affects timing (exactly what happens when.) I don't yet
> know what
> the bug is, but I think that it's likely that it's fair to characterize
> the bug
> as being "timing-sensitive". (For example: from the GC's point of view,
> whether
> a thread is running Lisp or foreign code when that thread is suspended by
> the GC.
> The transition between Lisp and foreign code takes a few instructions, and
> if
> a thread is suspended in the middle of that instruction sequence and the GC
> misintrprets its state, very bad things like what you're seeing could
> occur.
> That's not supposed to be possible, but something broadly similar seems to
> be
> happening.)
>
>
> --
> Paul
>
>
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clozure.com/pipermail/openmcl-devel/attachments/20131031/fa0214d5/attachment.htm>
More information about the Openmcl-devel
mailing list