[Openmcl-devel] File-length bug?

Gary Byers gb at clozure.com
Sun Aug 4 20:10:29 PDT 2013



On Sat, 3 Aug 2013, Ron Garret wrote:

>
>> I agree that we lack a good POSIX oriented CL implementation.
>
> I dunno, I think CCL is a splendid POSIX implementation (and actually SBCL and CLisp don't suck too badly either).  It works for me 99.9% of the time, and the remaining 0.1% the Clozure folks always handle the situation in one of two ways: either they agree that there's a bug and they fix it, or they point out why it's not a bug and how the problem should be solved instead.  I am quite confident that one of those two things will happen as soon as gb gets around to reading this thread.
>
> rg
>

I read the original message, and started to think about how to respond
to it.  I was curious about how exactly "tail -f" worked, so I started
Googling for an answer to that question or a pointer to tail's source
code.

I forget exactly what search term I used (I could probably find it if
anyone cares), but one of the results that I looked at was a thread
where someone asked how to implement "tail -f" in "POSIX C/C++".

Predictably, the first several responses wondered what "C/C++" was.
Once the respondents concluded that the "/" should be read as "or",
more helpful responses followed.  One of the most helpful came from
Chris Torek, who was one of the primary authors of BSD 4.4 and
has probably thought about this sort of thing more than all but a
very small number of people in the world.  He pointed out that "tail -f"
needed functionality that's not defined in POSIX.

I then saw the response to your message, and have spent most of the
time since curled up in the fetal position and moaning "please God,
just make it stop ..."  Perhaps it has.

Anyway ... an input-only FILE-STREAM in CCL determines the size of the
underlying file when the stream is opened and FILE-LENGTH on such a
stream just returns this cached value.  In what is likely the vast
majority of cases (where there isn't some other entity modifying the
underlying file while the input stream is open) this is correct and
adequate.  I think that the original motivation was to avoid the overhead
of a system call that was very likely to always return the same value, and
(IIRC) determining the FILE-LENGTH of the stream may have been something
that occurred more regularly when reading from the file.  I don't think
that that motivation still exists, and if FILE-LENGTH always did a system
call (and returned, e.g., what "ls -l" would return, divided by the size
of an element in the case of binary streams), that'd only negatively
affect code that called FILE-LENGTH a lot.

Incidentally: (CCL:STREAM-DEVICE stream direction) - where DIRECTION
is one of :INPUT or :OUTPUT - should return NIL or a (relatively)
small integer.  On Unix-based systems, that small integer is the
underlying file descriptor (fd); on Windows, it's an identifier for an
open file (a "file handle") that serves a similar purpose.  Given an
FD (or Windows file handle) associated with a FILE-STREAM,
(CCL::FD-SIZE FD) will return the non-negative size of the underlying
file in octets (or may return a negative number in case of error.)  That's
not likely to become exported/supported in the near future, but it's not
likely to disappear or change soon either.  If your use of this function
leads to a sudden outbreak in street crime, I can't be held responsible.
I understand that "grownups" can make reasonable decisions about whether
to use functions like this (or so I've heard), but once you go down the
Path To Non-Conformant Code ... well, if you're lying in the gutter clutching
a bottle of cheap wine a year from now, I'll at least know that my conscience
is clear.
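
As a concrete illustration of the two calls just described, here is a
minimal sketch.  CURRENT-SIZE-IN-OCTETS is a hypothetical helper name,
not something CCL provides, and the #-ccl branch is only an assumed
fallback for other Lisps (it reopens the file and asks FILE-LENGTH).
The CCL branch relies on the internal, unsupported CCL::FD-SIZE exactly
as described above, so the same caveats apply.

```lisp
;;; Hypothetical helper: the size in octets of the file behind an open
;;; input FILE-STREAM, as the OS reports it right now (not the value
;;; cached when the stream was opened).  Returns NIL if no size can be
;;; determined.
(defun current-size-in-octets (stream)
  #+ccl
  (let ((fd (ccl:stream-device stream :input)))
    (when fd
      ;; CCL::FD-SIZE is internal; a negative result signals an error.
      (let ((size (ccl::fd-size fd)))
        (if (minusp size) nil size))))
  #-ccl
  ;; Assumption for non-CCL Lisps: a freshly opened binary stream
  ;; reports the file's current length.
  (with-open-file (s (pathname stream) :element-type '(unsigned-byte 8))
    (file-length s)))
```

Comparing this value against an earlier result is one way to notice
that a file has grown, or that it has been truncated.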

Where was I?  Oh yes ... even if FILE-LENGTH on input file streams always
asked the OS for its idea of the underlying file's size, cases where
other entities are modifying a file while it's being read can still
lead to inconsistencies.  (If the external entity truncates a file
while buffered data obtained from a later position is being processed,
FILE-POSITION could return a greater value than FILE-LENGTH, for
instance.)  The case that you're concerned about - where the
external entity is appending to a file and FILE-LENGTH increases
monotonically - is more tractable than other cases, but I think that
it's reasonable to assume that reading from a file that's being modified
by some "external entity" could lead to inconsistencies in general.

To try to finally get back to your question of how to do something
like "tail -f" on a CCL FILE-STREAM: if a file descriptor is
associated with a real file (isn't a socket or pipe or tty or ..., can
have its position/length queried and possibly set), then reading from
the fd won't block.  (It could conceivably take an unpredictable
amount of time if the disk is bad or an unreliable network transport,
such as some versions of NFS, is involved.)  In general the OS can either
return some positive number of octets from a #_read call (where that
number is <= the number of octets requested), return 0 (to indicate EOF),
or return a negative integer in case of error.  The operation that I'm
calling #_read is called something else on Windows, and it's triggered
whenever READ-CHAR or READ-LINE or ... is called on an input stream that
doesn't have (enough) data in its input buffer.

If a #_read operation triggered by a CL function on a CCL input stream returns
0 bytes, the CL function that triggered that operation processes the EOF but
(unlike the analogous case with stdio FILE* streams) the stream isn't marked
as being in an "at EOF" state and there's therefore no need to clear such
a state before doing something that could trigger another #_read.  In the
case where an external entity is appending data to the file, some such subsequent
#_read could eventually return a positive value.  A crude first approximation
of a first approximation of a "tail -f" like function could be:


;;; Assume that we've already read to EOF but expect more data.
(defun tail-f (stream &optional (out *standard-output*))
   (let* ((prefix nil))
     (loop
       (multiple-value-bind (string missing-newline) (read-line stream nil nil)
         (when string
           (cond (missing-newline
                   (if prefix
                     (setq prefix (concatenate 'string prefix string))
                     (setq prefix string)))
                 (t
                   (when prefix (setq string (concatenate 'string prefix string)))
                   (setq prefix nil)
                   (write-line string out))))))))

The business with "prefix" in the code above has to deal with the fact that
we might get EOF in the middle of a line because the rest of the line hasn't
been written yet by the "external entity".  Assuming that I got that right,
the code above should work but it's horribly inefficient: we're basically
busy-waiting for data to appear.  This is what made me curious to see how
tail -f was actually implemented.

It might be tempting to try to use #_select or #_poll to sleep until data
is available on the underlying fd, but these functions tell us when it's
possible to #_read without blocking and don't distinguish between the cases
where that's possible because data is available or because the fd is currently
at EOF.

What "tail -f" actually does is closer to:

(defun tail-f (stream &optional (out *standard-output*))
   (let* ((prefix nil)
          (path (pathname stream))
          (last-written (file-write-date path)))
     (loop
      (sleep some-fairly-short-interval)
      (let* ((current-mod-time (file-write-date path)))
        (when (> current-mod-time last-written)
           (setq last-written current-mod-time)
           ;; File has changed.  Worth trying to read lines here.
           ...)))))

I haven't actually tried the code above (beware of typos), but I did
verify the underlying idea (it's possible to read sequentially from a
FILE-STREAM after the stream has reached EOF) and I don't know of an
in-practice-relatively-portable way of sleeping until the file changes
that's better than the loop above.  One could check that the file's
length has changed instead of/as well as that its write-date has
changed (and might want to scream bloody murder if the file has been
truncated), and you'd currently have to use CCL::FD-SIZE to do that.
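
One way the polling loop could be filled in, as a rough sketch rather
than a description of what tail(1) really does.  DRAIN-LINES,
TAIL-F-POLLING and the INTERVAL parameter are hypothetical names
introduced for illustration.  The sketch depends on the behavior
described earlier (a read after EOF on a CCL FILE-STREAM can succeed
once more data has been appended), and it swaps in a simpler technique
than the write-date comparison: it just retries the read after each
sleep.  FILE-WRITE-DATE returns a universal time, which has one-second
resolution, so two appends within the same second could look like one.

```lisp
;;; Read everything currently available from STREAM.  Complete lines are
;;; written to OUT.  If EOF arrives in the middle of a line, the partial
;;; line is returned so the caller can pass it back in as PREFIX later;
;;; otherwise NIL is returned.
(defun drain-lines (stream out &optional prefix)
  (loop
    (multiple-value-bind (string missing-newline) (read-line stream nil nil)
      (cond ((null string)
             ;; Immediate EOF: nothing new, keep any pending partial line.
             (return prefix))
            (missing-newline
             ;; EOF in mid-line: remember the fragment for next time.
             (return (if prefix (concatenate 'string prefix string) string)))
            (t
             (write-line (if prefix (concatenate 'string prefix string) string)
                         out)
             (setq prefix nil))))))

;;; Poll forever: sleep, then copy any newly appended lines to OUT.
;;; INTERVAL is in seconds; 0.5 is an arbitrary illustrative default.
(defun tail-f-polling (stream &key (out *standard-output*) (interval 0.5))
  (let ((prefix nil))
    (loop
      (setq prefix (drain-lines stream out prefix))
      (force-output out)
      (sleep interval))))
```

Keeping the line-assembly logic in DRAIN-LINES separate from the
sleeping loop makes it possible to exercise the partial-line handling
on a string stream without waiting on a real file.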


