[Openmcl-devel] Speed of concatenate vs format

Fri May 1 16:12:05 PDT 2009

Looking at some profiling output for the (CONCATENATE 'STRING ...)
case:

samples  %        image name               app name                 symbol name
20960    22.3674  lx86cl64                 lx86cl64                 misc_set_common
15148    16.1651  profccl64                profccl64                <Compiled-function.CONCAT-TO-SIMPLE*.0x30004016C01F>
12780    13.6381  lx86cl64                 lx86cl64                 misc_ref_common
12281    13.1056  lx86cl64                 lx86cl64                 _SPbuiltin_aset1
11181    11.9317  lx86cl64                 lx86cl64                 _SPsubtag_misc_set
8086      8.6289  lx86cl64                 lx86cl64                 _SPbuiltin_aref1
7731      8.2501  lx86cl64                 lx86cl64                 _SPsubtag_misc_ref

shows that the biggest culprits are some combination of
CCL::CONCAT-TO-SIMPLE* (which may be a bit of a dog; I haven't looked
at it yet ...) and things that implement AREF and (SETF AREF) on
1-dimensional arrays whose element type isn't known at compile-time.
If the element-type of the result sequence in a call to CONCATENATE is
known at compile-time, then those things - which are written in
assembler and aren't grossly inefficient, for what they do - then
those things can be replaced with simpler, inlined, type-specific
things.  It certainly takes FORMAT or PRINC a while to get to the
point where it realizes that it's transferring characters from one
string to another than it should take some flavor of CONCATENATE, but
the loop that actually transfers those characters - 4000 references
and 4000 assignments on each iteration - can use those simpler,
type-specific reference and assignment operations that the more
generic CONCATENATE and its aux functions can be.  (Saving the extra
generic AREF overhead 4000 times and the generic (SETF AREF) overhead
another 4000 times adds up; you probably wouldn't see results like
this with shorter strings.)

Some uses of CONCATENATE are more common than others, and using
concatenate with a result-type that's equivalent to STRING (and with
source arguments that're very often STRINGs and likely SIMPLE-STRINGs)
may be the most common usage of CONCATENATE.  A fairly straightforward
way of trying to exploit that is:

(defun concatenate-to-string (&rest sequences)
   (declare (dynamic-extent sequences))
   (let* ((len 0))
     (declare (fixnum len))
     (dolist (seq sequences)
       (incf len (length seq)))
     (let* ((result (make-string len))
            (out 0))
       (declare (simple-string result) (fixnum out)
                (optimize (speed 3) (safety 0)))
       (dolist (seq sequences result)
         (etypecase seq
           (simple-string
            (locally (declare (simple-string seq))
              (dotimes (i (length seq))
                (setf (schar result out) (schar seq i))
                (incf out))))
           (string
            (let* ((slen (length seq)))
              (declare (fixnum slen))
              (multiple-value-bind (data offset)
                  (ccl::array-data-and-offset seq)
                (declare (fixnum offset) (simple-string data))
                (dotimes (i slen)
                  (setf (schar result out) (schar seq offset))
                  (incf offset)
                  (incf out)))))
           (vector
            (dotimes (i (length seq))
              (setf (schar result out) (aref seq i))
              (incf out)))
           (list
            (dolist (elt seq)
              (setf (schar result out) elt)
              (incf out))))))))

(I'm sure that this could be made faster still by BLTing things around
and in other ways.)

Since a sequence function result that's of type STRING is always a SIMPLE-STRING,
that recognizes that it can always use (SETF SCHAR) to store into the result. If
we're correct in assuming that the source sequences are often STRINGs (and often
SIMPLE-STRINGs), the function tries to arrange to use SCHAR to access source
characters in those cases.  That's likely to be substantially faster than the
more generic CONCATENATE (though it could surely call this code if it recognizes
that the result type is appropriate, and a compiler macro could do that recognition
at compile time.)

This should also be faster than using FORMAT or PRINC to write to a
STRING-OUTPUT-STREAM, but when the strings involved are large may not be as
much faster as one might expect: the dominant inner loop in the case of doing
WRITE-STRING to a STRING-OUTPUT-STREAM is pretty much the same SCHAR/(SETF SCHAR)
stuff as happens above.

In all of these cases, the memory-management overhead is pretty significant (and
relatively constant).  If I'm doing the math right and factoring that out correctly,
then it looks like CONCATENATE-TO-STRING is around 7X faster than CONCATENATE on
your test case (and a few X faster than the WRITE-STRING versions.)  For shorter
strings, the speedup isn't likely to be that dramatic, but it may still noticeable
and it's unlikely that CONCATENATE-TO-STRING would ever be slower than the other
approaches.

On Fri, 1 May 2009, Daniel Dickison wrote:

> I've been working with some legacy code that contains many instances
> of --
>
> 	(format nil "~a~a" string1 string2)
>
> for concatenating strings.  This code is in some code that is a
> performance bottleneck, and I thought, surely, this would be more
> efficient if rewritten as --
>
> 	(concatenate 'string string1 string2)
>
> but this turns out not to be the case.  In fact, using format for
> string concatenation is significantly faster.  So assuming the
> overhead for parsing the format control string is negligible, the
> format version probably becomes two calls to princ wrapped in a with-
> output-to-string.  How come this is so much faster than concatenate?
>
> Now that I've done the profiling, I know which code to use, but I'm
> still really curious why this is the case.  I would appreciate any
> insight.  Below is the results of some quick-and-dirty profiling on
> CCL Version 1.3-r11987M  (DarwinX8664), an iMac Core 2 Duo.
>
> Thanks!
>
> Daniel
>
>
> CL-USER> (defvar *str* (make-string 2000 :initial-element #\a))
> *STR*
>
> CL-USER> (time (dotimes (i 100000)
> 		 (format nil "~a~a" *str* *str*)))
> (DOTIMES (I 100000) (FORMAT NIL "~a~a" *STR* *STR*)) took 5,547,598
> microseconds (5.547598 seconds) to run
>                     with 2 available CPU cores.
> During that period, 3,288,825 microseconds (3.288825 seconds) were
> spent in user mode
>                     770,732 microseconds (0.770732 seconds) were
> spent in system mode
> 1,076,633 microseconds (1.076633 seconds) was spent in GC.
>  1,606,416,016 bytes of memory allocated.
> NIL
>
> CL-USER> (time (dotimes (i 100000)
> 		 (concatenate 'string *str* *str*)))
> (DOTIMES (I 100000) (CONCATENATE 'STRING *STR* *STR*)) took 10,373,570
> microseconds (10.373570 seconds) to run
>                     with 2 available CPU cores.
> During that period, 8,092,791 microseconds (8.092791 seconds) were
> spent in user mode
>                     878,887 microseconds (0.878887 seconds) were
> spent in system mode
> 1,097,819 microseconds (1.097819 seconds) was spent in GC.
>  1,601,600,000 bytes of memory allocated.
> NIL
>
> CL-USER> (time (dotimes (i 100000)
> 		 (with-output-to-string (s)
> 		   (princ *str* s)
> 		   (princ *str* s))))
> (DOTIMES (I 100000) (WITH-OUTPUT-TO-STRING (S) (PRINC *STR* S) (PRINC
> *STR* S))) took 5,284,978 microseconds (5.284978 seconds) to run
>                     with 2 available CPU cores.
> During that period, 3,153,681 microseconds (3.153681 seconds) were
> spent in user mode
>                     752,712 microseconds (0.752712 seconds) were
> spent in system mode
> 1,087,328 microseconds (1.087328 seconds) was spent in GC.
>  1,606,416,016 bytes of memory allocated.
>
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>