[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Paul Meurer Paul.Meurer at uni.no
Thu Oct 31 13:00:18 PDT 2013


Am 31.10.2013 um 12:50 schrieb Jeff Caldwell <jcaldwell at clozure.com>:

> I wonder if you can detect the first time you see corruption, and do so without causing a crash?
> 
> I know this would change the timing, and perhaps hide the problem or perhaps not, but consider changing the C function to examine the string upon entry to the function. Go back to the statically-allocated version so the pointer to the string doesn't change. During initialization, save that pointer somewhere "far" from where the pointer that is returned to Lisp is stored. (E.g. one statically allocated, one on the heap. That is so that if there is corruption, hopefully not both pointers are corrupted.) 
> 
> Upon entry to the C function, compare the pointer that will be returned to the saved pointer. If they are not equal, try and generate a breakpoint. If they are equal, compare the string to the original string (also saved elsewhere). If the strings differ in content or length, generate a breakpoint. 
> 
> Because of multiple threads, perhaps have a guard variable. When corruption is detected, set the guard variable. Upon entry to the function, if the guard variable is set, perhaps return a constant NULL from code, avoiding referencing the string in memory, so that subsequent calls from other threads don't continue to corrupt the corrupted memory, or don't reference the corrupted memory, which may cause other exceptions.

I tried what you suggested, and I still get nice crashes, but I cannot detect any corruption from the C side. Here is my code; please have a look at it, perhaps I did a silly mistake.
I set breakpoints in gdb on set_guard, but they were never reached.

---
static char *asdf = "asdfasdfasdfasdfasdfasdfasdfasdf" ;

char **asdf_pointer;
char *asdf_copy;
int guard = 0;

void initialize() {
  
  char *buffer = (char *)calloc(1000000, sizeof(char));
  
  asdf_copy = calloc(1+strlen(asdf), sizeof(char));
  strcpy(asdf_copy, asdf);
  asdf_pointer = calloc(1, sizeof(char*)) ;
  *asdf_pointer = asdf;

  fprintf(stderr, "\n asdf         = %p\n", asdf);
  fprintf(stderr, " asdf_pointer = %p\n", asdf_pointer);
  fprintf(stderr, "*asdf_pointer = %p\n", *asdf_pointer);
  fprintf(stderr, "*asdf         = %s\n", asdf);
  fprintf(stderr, " asdf_copy    = %p\n", asdf_copy);
  fprintf(stderr, "*asdf_copy    = %s\n", asdf_copy);
}

void set_guard() { guard = 1; }

char *getstring() {
  if (guard == 1) {
    return NULL;
  } else if (asdf != *asdf_pointer) {
    set_guard();
    fprintf(stderr, "asdf(%p) != *asdf_pointer(%p)\n",
	    asdf, *asdf_pointer);
    fflush(stderr);
    return NULL;
  } else if (strcmp(asdf, asdf_copy) != 0) {
    set_guard();
    fprintf(stderr, "asdf != *asdf_copy: %s\n",
	    asdf, *asdf_pointer);
    fflush(stderr);
    return NULL;
  } else {
    return asdf;
  };
}
---

and the Lisp code has some trivial changes:

---
(external-call "initialize" :address)

(defun getstring ()
  (let ((val (external-call "getstring" :address)))
    (when val (%get-cstring val))))
---

Typical output from the fprintf() lines looks like

 asdf         = 0x7fd3c385db78
 asdf_pointer = 0x7fd3c4005b90
*asdf_pointer = 0x7fd3c385db78
*asdf         = asdfasdfasdfasdfasdfasdfasdfasdf
 asdf_copy    = 0x7fd3c4005b60
*asdf_copy    = asdfasdfasdfasdfasdfasdfasdfasdf

and I normally get backtraces for two threads like this one before everything stops. The second one looks really funny. The bogus pointers are from a location far away from the correct one.

  0: (%STR-FROM-PTR #<BOGUS object @ #x302008FC008D> 32 "asdfasdfasdfasdfasdfasdfasdfasdf")
      Locals:
        POINTER = #<BOGUS object @ #x3020057E369D>
        LEN = 32
        DEST = "asdfasdfasdfasdfasdfasdfasdfasdf"
        I = 0

  0: (%STR-FROM-PTR "asdfasdfasdfasdfasdfasdfasdfasdf" 32 #<BOGUS object @ #x30200811AB6D>)
      Locals:
        POINTER = "asdfasdfasdfasdfasdfasdfasdfasdf"
        LEN = 32
        DEST = #<BOGUS object @ #x302004F97F0D>
        I = 0

> 
> Jeff
> 
> 
> 
> On Thu, Oct 31, 2013 at 5:48 AM, Paul Meurer <Paul.Meurer at uni.no> wrote:
> 
> Am 31.10.2013 um 01:15 schrieb Gary Byers <gb at clozure.com>:
> 
>> On Wed, 30 Oct 2013, Paul Meurer wrote:
>>> I run it now with --no-init and in the shell, with no difference. Immediate failure with :consing in *features*,
>>> bogus objects etc. after several rounds without :consing.
>> 
>> So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>> anyone reading this won't be subjected to me doing so ?
>> 
>> Oh well.
>> 
>> I was able to reproduce the problem by running your test 100 times,
> 
> I am not able to provoke it at all on the MacBook, and I tried a lot.
> 
>> so apparently
>> I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
>> since my ability to diagnose problems that only occur on 16-core machines depends
>> on my ability to borrow such machines for a few months.)
> 
> I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing right):
> 
> (dotimes (j 100)
>   (print (ccl::all-processes))
>   (dotimes (i 8)
>     (process-run-function
>      (format nil "getstring-~a-~a" j i)
>      (lambda (i)
>        (let ((list ()))
>          (dotimes (i 500000)
>            (push (getstring) list)))
>        (print i))
>      i))
>   (print (list :done j))
>   (sleep 1))
> 
> If you really need a 16-core machine to debug this I can give you access to mine. :-)
> 
>>> My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL 1.9 (I did svn update +
>>> rebuild of everything yesterday).
>>> I also observed that the problem goes away when I replace the constant string in the library by a freshly
>>> allocated string:
>>> char *getstring() {?
>>> ? int index;
>>> ? char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>> ? for (index = 0; index < 100; index++) {
>>> ? ? ? buffer[index] = 'a';
>>> ? ? }
>>> ? buffer[100] = '\0';
>>> ? return buffer ;
>>> }
>>> One should expect the strings in the Postgres library to be freshly allocated, but nevertheless they behave like
>>> the constant string example.
>> 
>> It's unlikely that this change directly avoids the bug (whatever it is); it's more
>> likely that it affects timing (exactly what happens when.)  I don't yet know what
>> the bug is, but I think that it's likely that it's fair to characterize the bug
>> as being "timing-sensitive".  (For example: from the GC's point of view, whether
>> a thread is running Lisp or foreign code when that thread is suspended by the GC.
>> The transition between Lisp and foreign code takes a few instructions, and if
>> a thread is suspended in the middle of that instruction sequence and the GC
>> misintrprets its state, very bad things like what you're seeing could occur.
>> That's not supposed to be possible, but something broadly similar seems to be
>> happening.)
> 
> -- 
> Paul
> 
> 
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
> 
> 

-- 
Paul




More information about the Openmcl-devel mailing list