[Openmcl-devel] A faster read-line

Ron Garret ron at flownet.com
Wed Oct 20 10:27:52 PDT 2010


Thanks Gary.  I guess the pitfall here is remembering that :ascii is not actually a translation-free format, at least not when it's done right.

rg

On Oct 20, 2010, at 10:13 AM, Gary Byers wrote:

> In the most general case, READ-LINE is something like:
> 
> 
>  (let* ((temp (make-string-with-fill-pointer)))
>    (loop
>      (let* ((ch (read-char stream nil nil)))
>        (cond ((null ch) (return (values (copy-seq temp) t)))
>              ((eql ch #\newline) (return (values (copy-seq temp) nil)))
>              (t (vector-push-extend ch temp))))))
> 
> where a string-with-fill-pointer might or might not be the best way
> to accumulate characters.
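
For anyone who wants to run the general case as quoted, here is a minimal sketch. It assumes that "make-string-with-fill-pointer" stands for an adjustable character vector with a fill pointer; the function name SIMPLE-READ-LINE is made up for illustration and is not CCL's implementation.

```lisp
;; Sketch of the general case: collect characters until newline or EOF.
;; SIMPLE-READ-LINE is a hypothetical name, not part of CCL.
(defun simple-read-line (stream)
  "Return the next line of STREAM, plus a flag that is true if EOF ended it."
  (let ((temp (make-array 0 :element-type 'character
                            :adjustable t
                            :fill-pointer 0)))
    (loop
      (let ((ch (read-char stream nil nil)))
        (cond ((null ch)                  ; EOF: return what we have, flag T
               (return (values (copy-seq temp) t)))
              ((eql ch #\Newline)         ; end of line: flag NIL
               (return (values (copy-seq temp) nil)))
              (t (vector-push-extend ch temp)))))))

;; Usage: two lines, the second terminated by EOF rather than a newline.
(with-input-from-string (s (format nil "foo~%bar"))
  (list (multiple-value-list (simple-read-line s))
        (multiple-value-list (simple-read-line s))))
;; => (("foo" NIL) ("bar" T))
```

Unlike the standard READ-LINE, this sketch returns an empty string and T when called at EOF instead of signalling an error or returning an eof-value.
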
> 
> If the stream is buffered (and you know things about how it's buffered),
> no newline translation is going on, and the mapping between octets and
> characters is simple enough, you can do better: you can look for an octet
> with value #xa in the buffer and if you find one, know how many octets are
> used to encode the string (and therefore know the length of the string in
> characters), and there are other things that you can do that can be a lot
> faster than the "just collect characters until EOF or newline" approach.
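
A rough sketch of that buffer-scanning idea, under the stated assumptions: one octet per character (iso-8859-1), unix line termination, and an octet buffer already filled from the stream. LINE-FROM-OCTETS and its arguments are hypothetical names; this is not the code CCL uses, and it also assumes CODE-CHAR maps octets 0-255 to the matching Latin-1 characters, as it does in Unicode-based implementations.

```lisp
;; Sketch only: find the newline octet, then build the string in one pass.
;; LINE-FROM-OCTETS is a hypothetical helper, not part of CCL.
(defun line-from-octets (buffer start end)
  "Look for octet #xA in BUFFER between START and END.
Return the line as a string and the index just past the newline,
or NIL if the buffer holds no newline in that range."
  (let ((pos (position #xa buffer :start start :end end)))
    (when pos
      ;; One octet per character, so the string length is known up front.
      (let ((line (make-string (- pos start))))
        (loop for i from start below pos
              for j from 0
              do (setf (schar line j) (code-char (aref buffer i))))
        (values line (1+ pos))))))

;; Usage: the octets for "ab", newline, "cd".
(let ((buf (coerce '(97 98 10 99 100) '(vector (unsigned-byte 8)))))
  (line-from-octets buf 0 (length buf)))
```

When this returns NIL the caller would have to refill the buffer or fall back to the character-at-a-time loop, which is where most of the real complexity lives.
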
> 
> The code used in that case (iso-8859-1 encoding, unix line-termination) is
> faster than the general case; it's still likely to be slower than #_fgets
> (read at most N octets into a preallocated buffer, confuse the concepts of
> "characters" and "octets", etc.).
> 
> There's a lot of room in between the very simple iso-8859-1/unix case and
> the general one (e.g., ASCII/unix is almost as simple as iso-8859-1), but
> CCL doesn't try to do anything special to handle those cases.  Most of those
> special things involve trying to determine whether there's a newline in
> the buffer, which depends on what character(s) are used to represent #\newline
> and on what octet(s) are used to represent those characters.
> 
> On Tue, 19 Oct 2010, Ron Garret wrote:
> 
>> Thanks.
>> 
>> It seems to be unicode conversion that is taking all the time.  Python yields similar disparities depending on whether you're reading a file opened with open or codecs.open.
>> 
>> READ-SEQUENCE is nice and zippy.
>> 
>> rg
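
For reference, a minimal sketch of the READ-SEQUENCE approach mentioned above, assuming the file is opened as a binary stream so that no character decoding happens at all. SLURP-OCTETS is a made-up name, and counting #xA octets is only a stand-in for what wc -l does on a file with unix line endings.

```lisp
;; Sketch only: read a whole file as octets with a single READ-SEQUENCE call.
;; SLURP-OCTETS is a hypothetical helper name.
(defun slurp-octets (path)
  "Return the contents of the file at PATH as an octet vector,
plus the number of octets actually read."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let* ((buffer (make-array (file-length in)
                               :element-type '(unsigned-byte 8)))
           (n (read-sequence buffer in)))
      (values buffer n))))

;; Usage: count newline octets, roughly what wc -l reports.
;; (multiple-value-bind (buffer n) (slurp-octets "some-file.txt")
;;   (count #xa buffer :end n))
```

Turning the octets into a Lisp string afterwards brings the decoding cost back, so this is only a fair comparison with wc if the octets are all that is needed.
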
>> 
>> On Oct 19, 2010, at 4:36 PM, Greg Pfeil wrote:
>> 
>>> On 19 Oct 2010, at 19:27, Ron Garret wrote:
>>> 
>>>> Without doing anything special, read-line is, empirically, about fifteen times slower than the equivalent C code, even with :external-format :ascii.  (My benchmark is comparing (loop while (read-line f nil nil)) with wc.)  Lisp also seems to be CPU bound during read-line.  What is it doing with all those cycles?  Are there any easy ways to speed this up?  What's the fastest way to ingest a file in CCL?
>>> 
>>> I don't know what CCL is doing, but I remember seeing this forever ago: http://www.ymeme.com/slurping-a-file-common-lisp-83.html
>> 



