[Openmcl-devel] Binary IO...

Sun Jun 7 17:04:30 PDT 2009

A stream's buffer is nailed down in foreign memory, so we can safely
read from and write to it without worrying about the GC moving it
around (because of activity in other threads.)  That's not generally
true of some arbitrary lisp vector, so in that general case we have
to copy the vector's data into the buffer before writing (and copy
from the buffer to the vector after reading.)

Here's a little bit of profiling output from oprofile on Linux:

samples  %        symbol name
10823    92.8774  <Compiled-function.%COPY-IVECTOR-TO-IVECTOR-BYTES.0x30004001234F>
110       0.9440  <Compiled-function.%IOBLOCK-IO-FILE-POSITION.0x3000405C2F6F>
79        0.6779  <Compiled-function.FD-STREAM-FORCE-OUTPUT.0x3000403ADBBF>
61        0.5235  <Compiled-function.%IOBLOCK-BINARY-STREAM-WRITE-VECTOR.0x30004039764F>
54        0.4634  <Compiled-function.IO-FILE-FORCE-OUTPUT.0x3000405D4BEF>
51        0.4377  <Compiled-function.FD-WRITE.0x3000400831BF>
[a lot more of it omitted]

I think that we can fairly safely ignore everything but
%COPY-IVECTOR-TO-IVECTOR-BYTES for the time being.  In your example,
that function's being used to copy the contents of the sequence being
(repeatedly) read or written to the stream's buffer.  You might be
able to do that copying at least marginally faster, but I think that
it's likely that the C code is just reading/writing the vector
directly (without copying), and since the lisp code is spending >90%
of its time copying, it's likely that this copying accounts for most
of the user-mode time difference.
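
To make the difference concrete, here is a rough C sketch of the two
paths. It is only an illustration: it is not CCL's actual code and not
the enclosed C program, and the names write_all, write_via_buffer,
write_direct and STAGE_SIZE (along with its value) are made up for the
example. The first path stages the caller's data in a fixed buffer before
each write, which is roughly what the stream code has to do; the second
hands the caller's memory straight to write(2).

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical size of the fixed staging ("stream") buffer. */
enum { STAGE_SIZE = 4096 };

/* Write exactly n bytes to fd, retrying on short writes.
   Returns 0 on success, -1 on error. */
static int write_all(int fd, const unsigned char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Copying path: every byte is first copied into a fixed buffer and
   only then written.  The memcpy is pure user-mode work.
   (The static buffer makes this sketch non-reentrant.) */
static int write_via_buffer(int fd, const unsigned char *data, size_t n)
{
    static unsigned char stage[STAGE_SIZE];

    assert(data != NULL || n == 0);
    while (n > 0) {
        size_t chunk = n < STAGE_SIZE ? n : STAGE_SIZE;
        memcpy(stage, data, chunk);
        if (write_all(fd, stage, chunk) != 0)
            return -1;
        data += chunk;
        n -= chunk;
    }
    return 0;
}

/* Direct path: the caller's memory goes straight to the kernel,
   which is safe in C because nothing will move the array. */
static int write_direct(int fd, const unsigned char *data, size_t n)
{
    return write_all(fd, data, n);
}
```

Both functions put the same bytes on the descriptor; the only difference
is the extra memcpy per chunk in the first one, which shows up as
user-mode time.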

There are ways to inhibit the GC, obtain the (absolute, non-relocatable)
address of the vector, and do I/O directly (bypassing the buffer).  Whether
that's better overall depends on what the cost of inhibiting the GC would
be (which in turn depends on what kind of consing activity is going on
in other threads.)

On Sun, 7 Jun 2009, Jon S. Anthony wrote:

> Hi,
>
> As part of a porting job for my graph store, I'm experimenting with
> various binary IO variations (which need random access as well).
> Originally, this was done in C.  I suppose I can still do that, but the
> CCL version is so much better than the ACL variant that getting rid of
> the C for it seems like a good - simpler - idea.  There are some other
> possibilities as well but that is irrelevant here.  But, there is still
> a head scratching aspect to the CL variant.
>
> Enclosed are two simple programs, one in C and one in CL.  They both do
> the same thing and the CL pretty much mimics the "C level" form of the
> C.  All they do is write and read a binary file in 8MB chunks (320MB
> worth).
>
> Setting aside the "elapsed time" aspects (which seem to pretty clearly
> be tied to the OS disk caching behavior) they are both pretty fast.  But
> the C is still around 3X faster in general.  This seems to be due to the
> fact that the CL burns up typically ~1.5 seconds in user mode.  The C
> version typically runs with 0 (!) user mode time.  The system level time
> of both is about the same ~800ms or so.
>
> I would have thought that read-sequence and write-sequence basically map
> to fread and fwrite respectively - especially when (as in this case) the
> sequence involved is a simple array of unsigned byte 8.
>
> That was a long-winded prelude to the question: what would the CL variant
> be doing in user mode (read/write-sequence) that the C version is able to
> skip (perhaps "dangerously")?  Offhand, it doesn't seem like the type
> determination/resolution in read/write-sequence would burn up this much
> time.
>
> C is compiled simply with gcc.  Here are a couple example timings:
>
> C: gcc (GCC) 4.1.0 20060304 (x8632)
>
> $ time ../megabyte-binary-io
>
> 0.004u 0.840s 0:00.86 97.6%     0+0k 0+0io 0pf+0w
>
> $ time ../megabyte-binary-io
>
> 0.000u 0.736s 0:00.73 100.0%    0+0k 0+0io 0pf+0w
>
>
> CL: (ccl-1.3 x8632)
>
> ? (time (main))
> (MAIN) took 2,661 milliseconds (2.661 seconds) to run
>                    with 2 available CPU cores.
> During that period, 1,680 milliseconds (1.680 seconds) were spent in user mode
>                    904 milliseconds (0.904 seconds) were spent in system mode
> 6 milliseconds (0.006 seconds) was spent in GC.
> 8,394,104 bytes of memory allocated.
> NIL
> ? (time (main))
> (MAIN) took 7,590 milliseconds (7.590 seconds) to run
>                    with 2 available CPU cores.
> During that period, 1,716 milliseconds (1.716 seconds) were spent in user mode
>                    768 milliseconds (0.768 seconds) were spent in system mode
> 6 milliseconds (0.006 seconds) was spent in GC.
> 8,394,104 bytes of memory allocated.
>
>
> Thanks,
>
> /Jon