[Openmcl-devel] Speed of concatenate vs format

Fri May 1 18:36:44 PDT 2009

Gary,

I really appreciate your detailed and insightful response (as  
always)!  I hadn't thought about the difference between char and  
schar, and I didn't realize that it would make that much of a  
difference.  Sure enough, your concatenate-to-string is significantly  
faster than format or concatenate.

I'm on a different machine now, but for concatenating two 2000- 
character simple-strings 100,000 times, I get 4.0 vs 5.4 vs 10.8  
seconds for concat-to-string, format, and concatenate, respectively.   
With 100-character strings, it's about 0.23 vs 0.83 vs 0.63.

I imagine concatenating short simple-strings is quite a common  
operation, so a compiler-macro to automate this optimization might  
help a lot of applications.  Regardless, using concatenate-to-string  
explicitly should give a boost.  Thanks again.

Daniel

On May 1, 2009, at 7:12 PM, Gary Byers wrote:

> Looking at some profiling output for the (CONCATENATE 'STRING ...)
> case:
>
>
> samples  %        image name               app name                  
> symbol name
> 20960    22.3674  lx86cl64                 lx86cl64                  
> misc_set_common
> 15148    16.1651  profccl64                profccl64                 
> <Compiled-function.CONCAT-TO-SIMPLE*.0x30004016C01F>
> 12780    13.6381  lx86cl64                 lx86cl64                  
> misc_ref_common
> 12281    13.1056  lx86cl64                 lx86cl64                  
> _SPbuiltin_aset1
> 11181    11.9317  lx86cl64                 lx86cl64                  
> _SPsubtag_misc_set
> 8086      8.6289  lx86cl64                 lx86cl64                  
> _SPbuiltin_aref1
> 7731      8.2501  lx86cl64                 lx86cl64                  
> _SPsubtag_misc_ref
>
> shows that the biggest culprits are some combination of
> CCL::CONCAT-TO-SIMPLE* (which may be a bit of a dog; I haven't looked
> at it yet ...) and things that implement AREF and (SETF AREF) on
> 1-dimensional arrays whose element type isn't known at compile-time.
> If the element-type of the result sequence in a call to CONCATENATE is
> known at compile-time, then those things - which are written in
> assembler and aren't grossly inefficient, for what they do - then
> those things can be replaced with simpler, inlined, type-specific
> things.  It certainly takes FORMAT or PRINC a while to get to the
> point where it realizes that it's transferring characters from one
> string to another than it should take some flavor of CONCATENATE, but
> the loop that actually transfers those characters - 4000 references
> and 4000 assignments on each iteration - can use those simpler,
> type-specific reference and assignment operations that the more
> generic CONCATENATE and its aux functions can be.  (Saving the extra
> generic AREF overhead 4000 times and the generic (SETF AREF) overhead
> another 4000 times adds up; you probably wouldn't see results like
> this with shorter strings.)
>
> Some uses of CONCATENATE are more common than others, and using
> concatenate with a result-type that's equivalent to STRING (and with
> source arguments that're very often STRINGs and likely SIMPLE-STRINGs)
> may be the most common usage of CONCATENATE.  A fairly straightforward
> way of trying to exploit that is:
>
> (defun concatenate-to-string (&rest sequences)
>  (declare (dynamic-extent sequences))
>  (let* ((len 0))
>    (declare (fixnum len))
>    (dolist (seq sequences)
>      (incf len (length seq)))
>    (let* ((result (make-string len))
>           (out 0))
>      (declare (simple-string result) (fixnum out)
>               (optimize (speed 3) (safety 0)))
>      (dolist (seq sequences result)
>        (etypecase seq
>          (simple-string
>           (locally (declare (simple-string seq))
>             (dotimes (i (length seq))
>               (setf (schar result out) (schar seq i))
>               (incf out))))
>          (string
>           (let* ((slen (length seq)))
>             (declare (fixnum slen))
>             (multiple-value-bind (data offset)
>                 (ccl::array-data-and-offset seq)
>               (declare (fixnum offset) (simple-string data))
>               (dotimes (i slen)
>                 (setf (schar result out) (schar seq offset))
>                 (incf offset)
>                 (incf out)))))
>          (vector
>           (dotimes (i (length seq))
>             (setf (schar result out) (aref seq i))
>             (incf out)))
>          (list
>           (dolist (elt seq)
>             (setf (schar result out) elt)
>             (incf out))))))))
>
> (I'm sure that this could be made faster still by BLTing things around
> and in other ways.)
>
> Since a sequence function result that's of type STRING is always a  
> SIMPLE-STRING,
> that recognizes that it can always use (SETF SCHAR) to store into  
> the result. If
> we're correct in assuming that the source sequences are often  
> STRINGs (and often
> SIMPLE-STRINGs), the function tries to arrange to use SCHAR to  
> access source
> characters in those cases.  That's likely to be substantially faster  
> than the
> more generic CONCATENATE (though it could surely call this code if  
> it recognizes
> that the result type is appropriate, and a compiler macro could do  
> that recognition
> at compile time.)
>
> This should also be faster than using FORMAT or PRINC to write to a
> STRING-OUTPUT-STREAM, but when the strings involved are large may  
> not be as
> much faster as one might expect: the dominant inner loop in the case  
> of doing
> WRITE-STRING to a STRING-OUTPUT-STREAM is pretty much the same SCHAR/ 
> (SETF SCHAR)
> stuff as happens above.
>
> In all of these cases, the memory-management overhead is pretty  
> significant (and
> relatively constant).  If I'm doing the math right and factoring  
> that out correctly,
> then it looks like CONCATENATE-TO-STRING is around 7X faster than  
> CONCATENATE on
> your test case (and a few X faster than the WRITE-STRING versions.)   
> For shorter
> strings, the speedup isn't likely to be that dramatic, but it may  
> still noticeable
> and it's unlikely that CONCATENATE-TO-STRING would ever be slower  
> than the other
> approaches.
>
>
>
>
> On Fri, 1 May 2009, Daniel Dickison wrote:
>
>> I've been working with some legacy code that contains many instances
>> of --
>>
>> 	(format nil "~a~a" string1 string2)
>>
>> for concatenating strings.  This code is in some code that is a
>> performance bottleneck, and I thought, surely, this would be more
>> efficient if rewritten as --
>>
>> 	(concatenate 'string string1 string2)
>>
>> but this turns out not to be the case.  In fact, using format for
>> string concatenation is significantly faster.  So assuming the
>> overhead for parsing the format control string is negligible, the
>> format version probably becomes two calls to princ wrapped in a with-
>> output-to-string.  How come this is so much faster than concatenate?
>>
>> Now that I've done the profiling, I know which code to use, but I'm
>> still really curious why this is the case.  I would appreciate any
>> insight.  Below is the results of some quick-and-dirty profiling on
>> CCL Version 1.3-r11987M  (DarwinX8664), an iMac Core 2 Duo.
>>
>> Thanks!
>>
>> Daniel
>>
>>
>> CL-USER> (defvar *str* (make-string 2000 :initial-element #\a))
>> *STR*
>>
>> CL-USER> (time (dotimes (i 100000)
>> 		 (format nil "~a~a" *str* *str*)))
>> (DOTIMES (I 100000) (FORMAT NIL "~a~a" *STR* *STR*)) took 5,547,598
>> microseconds (5.547598 seconds) to run
>>                    with 2 available CPU cores.
>> During that period, 3,288,825 microseconds (3.288825 seconds) were
>> spent in user mode
>>                    770,732 microseconds (0.770732 seconds) were
>> spent in system mode
>> 1,076,633 microseconds (1.076633 seconds) was spent in GC.
>> 1,606,416,016 bytes of memory allocated.
>> NIL
>>
>> CL-USER> (time (dotimes (i 100000)
>> 		 (concatenate 'string *str* *str*)))
>> (DOTIMES (I 100000) (CONCATENATE 'STRING *STR* *STR*)) took  
>> 10,373,570
>> microseconds (10.373570 seconds) to run
>>                    with 2 available CPU cores.
>> During that period, 8,092,791 microseconds (8.092791 seconds) were
>> spent in user mode
>>                    878,887 microseconds (0.878887 seconds) were
>> spent in system mode
>> 1,097,819 microseconds (1.097819 seconds) was spent in GC.
>> 1,601,600,000 bytes of memory allocated.
>> NIL
>>
>> CL-USER> (time (dotimes (i 100000)
>> 		 (with-output-to-string (s)
>> 		   (princ *str* s)
>> 		   (princ *str* s))))
>> (DOTIMES (I 100000) (WITH-OUTPUT-TO-STRING (S) (PRINC *STR* S) (PRINC
>> *STR* S))) took 5,284,978 microseconds (5.284978 seconds) to run
>>                    with 2 available CPU cores.
>> During that period, 3,153,681 microseconds (3.153681 seconds) were
>> spent in user mode
>>                    752,712 microseconds (0.752712 seconds) were
>> spent in system mode
>> 1,087,328 microseconds (1.087328 seconds) was spent in GC.
>> 1,606,416,016 bytes of memory allocated.
>>
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
>>
>>