[Openmcl-devel] It's too long to take in READ-LINE !

Wed Jan 18 12:43:40 PST 2012

The case that's as bad as reported involves reading a (very long)
line from a stream that defaults to :ISO-8859-1/:UNIX encoding, and it's
worse on Windows than on other platforms.

That case -can- be much faster than the general case: there's typically
a very good chance that a terminating #\newline is sitting in the stream's
input buffer, and when that's true it's easy to know exactly how long
the string needs to be and trivial to initialize it.  If it's not true
(if the buffer doesn't contain a #\newline), the expectation is that
the next buffer will, and that's still likely faster than other approaches.
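
A minimal sketch of that fast path, assuming the stream's buffer is visible as an ordinary Lisp string with a current index and a fill limit. The name LINE-FROM-BUFFER and its argument convention are made up for illustration; the real buffer code in CCL works on encoded octets and differs in detail.

```lisp
(defun line-from-buffer (buffer start end)
  "If BUFFER contains a #\\Newline between START and END, return the line
as a fresh string of exactly the right length, plus the index just past
the newline.  Otherwise return NIL and START so the caller can fall back."
  (let ((newline (position #\Newline buffer :start start :end end)))
    (if newline
        ;; The length is known up front, so one allocation is enough.
        (values (subseq buffer start newline) (1+ newline))
        (values nil start))))
```

When the search fails, the caller has to refill the buffer and either try again or switch to a different strategy.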

If the line is very very long, this approach breaks down: we keep reading
buffers (and growing the string by the size of the buffer) until we've finally
gotten to a #\newline or EOF.  On Unix systems, we ask the OS for a "good"
buffer size for the underlying file descriptor; on Windows, we just use
the constant #$BUFSIZ, which was defined (possibly in the 1980s) to be 512.
Growing the string to a length of 2^20 in increments of 512 elements causes
a lot of pointless memory allocation, copying, and GC time, as reported.
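
To put rough numbers on that, here is a small model of the copying involved. It assumes every growth step allocates a new string and copies the whole current contents into it; ELEMENTS-COPIED is a hypothetical helper, and the model is not meant to reproduce the byte counts in the report below.

```lisp
(defun elements-copied (final-length next-size &optional (initial-size 512))
  "Count the elements copied while growing a working string from
INITIAL-SIZE to at least FINAL-LENGTH.  NEXT-SIZE maps the current size
to the next one; each step copies the whole current contents."
  (loop with size = initial-size
        while (< size final-length)
        sum size into copied
        do (setq size (funcall next-size size))
        finally (return copied)))

;; Grow by a constant 512 elements per step.
(elements-copied (expt 2 20) (lambda (size) (+ size 512)))  ; => 1073217536

;; Grow by doubling.
(elements-copied (expt 2 20) (lambda (size) (* size 2)))    ; => 1048064
```

Under these assumptions, constant growth copies about a thousand times as many elements as doubling does for a 2^20-element line.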

The approach of looking for a #\newline in the buffer can be much faster
than the generic approach, but if it doesn't win quickly (if the first or
second buffer doesn't contain a #\newline) it should probably be abandoned.
A lot of the overhead here - since the length of the line isn't known -
involves growing a working copy of the eventual string (replacing that working
copy with a new, larger one whenever it fills up.)  It's generally productive
to grow vectors by an amount proportional to their size, and it's clearly not
productive to repeatedly grow a large string (approaching 1M elements) by 512
elements ...
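
A sketch of the proportional-growth idea in portable Common Lisp, assuming it's acceptable to read one character at a time. READ-LINE-GROWING is a made-up name, not anything in CCL, and a real implementation would copy from the buffer in bulk rather than call READ-CHAR per character.

```lisp
(defun read-line-growing (stream &optional (initial-size 128))
  "Read characters from STREAM up to #\\Newline or EOF into a working
string that doubles in size whenever it fills up.  Returns the line and
a second value that is true if EOF ended it; returns NIL and T if the
stream was already at EOF."
  (let ((line (make-array initial-size :element-type 'character
                                       :adjustable t :fill-pointer 0)))
    (loop
      (let ((ch (read-char stream nil nil)))
        (cond ((null ch)
               (return (if (zerop (length line))
                           (values nil t)
                           (values (copy-seq line) t))))
              ((char= ch #\Newline)
               (return (values (copy-seq line) nil)))
              (t
               ;; The third argument is the minimum extension: grow by
               ;; the current size rather than by a fixed amount.
               (vector-push-extend ch line (max 1 (length line)))))))))
```

The final COPY-SEQ trims the working copy to a simple string of the exact length, so the over-allocation from doubling doesn't leak to the caller.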

(There are other approaches; on a FILE-STREAM, it may be faster to remember
the current position, read and count characters until #\Newline or EOF is
encountered, make a string of the exact length, and then back up and do
a READ-SEQUENCE and maybe a READ-CHAR to eat the trailing #\Newline.)
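
That two-pass idea might look something like the following, using only standard functions. READ-LINE-BY-COUNTING is a hypothetical name, and the sketch assumes FILE-POSITION can both report and restore the position on the stream in question (it can return NIL when it can't).

```lisp
(defun read-line-by-counting (stream)
  "Read a line from the FILE-STREAM STREAM by counting characters up to
#\\Newline or EOF, allocating a string of exactly that length, and then
re-reading.  Returns the line and a true second value if EOF ended it;
returns NIL and T if the stream was already at EOF."
  (let ((start (file-position stream))
        (count 0)
        (hit-eof nil))
    ;; First pass: count characters without storing them.
    (loop
      (let ((ch (read-char stream nil nil)))
        (cond ((null ch) (setq hit-eof t) (return))
              ((char= ch #\Newline) (return))
              (t (incf count)))))
    (if (and hit-eof (zerop count))
        (values nil t)
        (let ((line (make-string count)))
          ;; Second pass: back up and fill the exactly sized string.
          (file-position stream start)
          (read-sequence line stream)
          ;; Eat the trailing #\Newline, if there was one.
          (unless hit-eof (read-char stream))
          (values line hit-eof)))))
```

This decodes every character twice, so whether it beats growing a working copy would have to be measured.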

As I've said, I think that it does make sense to try to optimize for what's
very likely to be a more typical case (where the length of a line is much
smaller than a stream's buffer size).  Pessimizing the very-long-line case
has some value (if only for its deterrent effect: reading 1M element lines is
probably never going to be a really good idea), but the current code does so
far more than it needs to in order to make that point ...

On Wed, 18 Jan 2012, R. Matthew Emerson wrote:

>
> On Jan 18, 2012, at 7:43 AM, Xiaofeng Yang wrote:
>
>> I created a simple 1MB file containing nothing but spaces. Then I opened it and read it using READ-LINE. It takes more than 1 minute!
>>
>> CL-USER> (lisp-implementation-type)
>> "Clozure Common Lisp"
>> CL-USER> (lisp-implementation-version)
>> "Version 1.8-dev-r14962-trunk  (WindowsX8632)"
>>
>> The time I took in READ-LINE:
>> (READ-LINE F NIL) took 67,439 milliseconds (67.439 seconds) to run
>>                     with 8 available CPU cores.
>> During that period, 63,391 milliseconds (63.391 seconds) were spent in user mode
>>                     2,921 milliseconds (2.921 seconds) were spent in system mode
>> 23,977 milliseconds (23.977 seconds) was spent in GC.
>>  17,185,637,938 bytes of memory allocated.
>
> That does sound horrible, but I'm not seeing that result on my system.
>
> I used this code:
>
> (in-package "CL-USER")
>
> (defun make-spaces-file (pathname)
>   (with-open-file (s pathname :direction :output :if-exists :supersede
>                      :if-does-not-exist :create
>                      :external-format :utf-8)
>     (let ((spaces (make-string 1024 :initial-element #\space
>                                :element-type 'character)))
>       (dotimes (i 1024)
>         (write-sequence spaces s)))))
>
> (defun read-spaces-line (pathname)
>   (let ((line nil))
>     (with-open-file (s pathname :external-format :utf-8)
>       (setq line (read-line s nil nil)))
>     (format t "~&read line of ~d characters" (length line))))
>
> And got these results:
>
> CL-USER> (make-spaces-file "spaces")
> NIL
> CL-USER> (time (read-spaces-line "spaces"))
> read line of 1048576 characters
> (READ-SPACES-LINE "spaces") took 234 milliseconds (0.234 seconds) to run
>                    with 1 available CPU core.
> During that period, 141 milliseconds (0.141 seconds) were spent in user mode
>                    63 milliseconds (0.063 seconds) were spent in system mode
> 62 milliseconds (0.062 seconds) was spent in GC.
> 14,681,920 bytes of memory allocated.
> NIL
> CL-USER>
>
> This was a Windows XP system running under VMware.