[Openmcl-devel] File-length bug?

Ron Garret ron at flownet.com
Mon Aug 5 00:21:52 PDT 2013


Thank you, Gary, as always for this very informative response.  (FWIW, I also looked at the source for tail, but I couldn't make heads nor tails of it -- no pun intended.)

I hereby absolve you of all responsibility for any horrible fate that might befall me as a result of my calling CCL::FD-SIZE.  :-)

rg

On Aug 4, 2013, at 8:10 PM, Gary Byers wrote:

> 
> 
> On Sat, 3 Aug 2013, Ron Garret wrote:
> 
>> 
>>> I agree that we lack a good POSIX oriented CL implementation.
>> 
>> I dunno, I think CCL is a splendid POSIX implementation (and actually SBCL and CLisp don't suck too badly either).  It works for me 99.9% of the time, and the remaining 0.1% the Clozure folks always handle the situation in one of two ways: either they agree that there's a bug and they fix it, or they point out why it's not a bug and how the problem should be solved instead.  I am quite confident that one of those two things will happen as soon as gb gets around to reading this thread.
>> 
>> rg
>> 
> 
> I read the original message, and started to think about how to respond
> to it.  I was curious about how exactly "tail -f" worked, so I started
> Googling for an answer to that question or a pointer to tail's source
> code.
> 
> I forget exactly what search term I used (I could probably find it if
> anyone cares), but one of the results that I looked at was a thread
> where someone asked how to implement "tail -f" in "POSIX C/C++".
> 
> Predictably, the first several responses wondered what "C/C++" was.
> Once the respondents concluded that the "/" should be read as "or",
> more helpful responses followed.  One of the most helpful was from
> Chris Torek, who was one of the primary authors of BSD 4.4 and
> has probably thought about this sort of thing more than all but a
> very small number of people in the world.  He pointed out that "tail -f"
> needed functionality that's not defined in POSIX.
> 
> I then saw the response to your message, and have spent most of the
> time since curled up in the fetal position and moaning "please God,
> just make it stop ..."  Perhaps it has.
> 
> Anyway ... an input-only FILE-STREAM in CCL determines the size of the
> underlying file when the stream is opened and FILE-LENGTH on such a
> stream just returns this cached value.  In what is likely the vast
> majority of cases (where there isn't some other entity modifying the
> underlying file while the input stream is open) this is correct and
> adequate.  I think that the original motivation was to avoid the overhead
> of a system call that was very likely to always return the same value, and
> (IIRC) determining the FILE-LENGTH of the stream may have been something
> that occurred more regularly when reading from the file.  I don't think
> that that motivation still exists, and if FILE-LENGTH always did a system
> call (and returned, e.g., what "ls -l" would return, divided by the size
> of an element in the case of binary streams), that'd only negatively affect
> code that called FILE-LENGTH a lot ...
> 
> Incidentally: (CCL:STREAM-DEVICE stream direction) - where DIRECTION
> is one of :INPUT or :OUTPUT - should return NIL or a (relatively)
> small integer.  On Unix-based systems, that small integer is the
> underlying file descriptor (fd); on Windows, it's an identifier for an
> open file (a "file handle") that serves a similar purpose.  Given an
> FD (or Windows file handle) associated with a FILE-STREAM,
> (CCL::FD-SIZE FD) will return the non-negative size of the underlying
> file in octets (or may return a negative number in case of error.)  That's
> not likely to become exported/supported in the near future, but it's not
> likely to disappear or change soon either.  If your use of this function
> leads to a sudden outbreak in street crime, I can't be held responsible.
> I understand that "grownups" can make reasonable decisions about whether
> to use functions like this (or so I've heard), but once you go down the
> Path To Non-Conformant Code ... well, if you're lying in the gutter clutching
> a bottle of cheap wine a year from now, I'll at least know that my conscience
> is clear.
> 
> Where was I ?  Oh yes ... if FILE-LENGTH on input file streams always
> asked the OS for its idea of the underlying file's size, cases where
> other entities are modifying a file while it's being read can still
> lead to inconsistencies.  (If the external entity truncates a file
> while buffered data obtained from a later position is being processed,
> FILE-POSITION could return a greater value than FILE-LENGTH, for
> instance.)  The case that you're concerned about - where the
> external entity is appending to a file and FILE-LENGTH increases
> monotonically - is more tractable than other cases, but I think that
> it's reasonable to assume that reading from a file that's being modified
> by some "external entity" could lead to inconsistencies in general.
> 
> To try to finally get back to your question of how to do something
> like "tail -f" on a CCL FILE-STREAM: if a file descriptor is
> associated with a real file (isn't a socket or pipe or tty or ..., can
> have its position/length queried and possibly set), then reading from
> the fd won't block.  (It could conceivably take an unpredictable
> amount of time if the disk is bad or an unreliable network transport
> (some versions of NFS) is involved.)  In general, the OS can either
> return some positive number of octets from a #_read call (where that
> number is <= the number of octets requested), return 0 (to indicate EOF),
> or return a negative integer in case of error.  The operation that I'm
> calling #_read is called something else on Windows, and it's triggered
> whenever READ-CHAR or READ-LINE or ... is called on an input stream that
> doesn't have (enough) data in its input buffer.
> 
> If a #_read operation triggered by a CL function on a CCL input stream returns
> 0 bytes, the CL function that triggered that operation processes the EOF but
> (unlike the analogous case with stdio FILE* streams) the stream isn't marked
> as being in an "at EOF" state and there's therefore no need to clear such
> a state before doing something that could trigger another #_read.  In the
> case where an external entity is appending data to the file, some such subsequent
> #_read could eventually return a positive value.  A crude first approximation
> of a first approximation of a "tail -f" like function could be:
> 
> 
> ;;; Assume that we've already read to EOF but expect more data.
> (defun tail-f (stream &optional (out *standard-output*))
>  (let* ((prefix nil))
>    (loop
>      (multiple-value-bind (string missing-newline) (read-line stream nil nil)
>        (when string
>          (cond (missing-newline
>                  (if prefix
>                    (setq prefix (concatenate 'string prefix string))
>                    (setq prefix string)))
>                (t
>                  (when prefix (setq string (concatenate 'string prefix string)))
>                  (setq prefix nil)
>                  (write-line string out))))))))
> 
> The business with "prefix" in the code above has to deal with the fact that
> we might get EOF in the middle of a line because the rest of the line hasn't
> been written yet by the "external entity".  Assuming that I got that right,
> the code above should work but it's horribly inefficient: we're basically
> busy-waiting for data to appear.  This is what made me curious to see how
> tail -f was actually implemented.
> 
> It might be tempting to try to use #_select or #_poll to sleep until data
> is available on the underlying fd, but these functions tell us when it's
> possible to #_read without blocking and don't distinguish between the cases
> where that's possible because data is available or because the fd is currently
> at EOF.
> 
> What "tail -f" actually does is closer to:
> 
> (defun tail-f (stream &optional (out *standard-output*))
>  (let* ((prefix nil)
>         (path (pathname stream))
>         (last-written (file-write-date path)))
>    (loop
>     (sleep some-fairly-short-interval)
>     (let* ((current-mod-time (file-write-date path)))
>       (when (> current-mod-time last-written)
>          (setq last-written current-mod-time)
>          ;; File has changed.  Worth trying to read lines here.
>          ...)))))
> 
> I haven't actually tried the code above (beware of typos), but I did
> verify the underlying idea (it's possible to read sequentially from a
> FILE-STREAM after the stream has reached EOF) and I don't know of an
> in-practice-relatively-portable way of sleeping until the file changes
> that's better than the loop above.  One could check that the file's
> length has changed instead of/as well as that its write-date has
> changed (and might want to scream bloody murder if the file has been
> truncated), and you'd currently have to use CCL::FD-SIZE to do that.
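
For anyone reading this in the archive, here is a minimal, untested-on-CCL
sketch of how the pieces described above might fit together: poll on a
timer, compare the file's size and write date with the last values seen,
and read whatever new lines have appeared.  The names CURRENT-FILE-SIZE,
DRAIN-LINES and TAIL-F-POLL are made up for illustration, as are the :OUT
and :INTERVAL keywords.  It assumes that (CCL::FD-SIZE FD) and
(CCL:STREAM-DEVICE stream :INPUT) behave as described in the quoted message,
and that reading from the stream after EOF picks up appended data.
Checking the size as well as the write date matters because FILE-WRITE-DATE
returns a universal time, which only has one-second resolution.

```lisp
;;; Sketch only.  CURRENT-FILE-SIZE, DRAIN-LINES and TAIL-F-POLL are
;;; hypothetical names; none of them is part of CCL.

(defun current-file-size (stream)
  "Return the current size, in octets, of the file behind STREAM."
  ;; On CCL, ask the OS via the unexported function from the message.
  #+ccl (let ((size (ccl::fd-size (ccl:stream-device stream :input))))
          (when (minusp size)
            (error "Cannot determine the size of ~A." stream))
          size)
  ;; Elsewhere (assumption): a freshly opened stream reports a fresh length.
  #-ccl (with-open-file (probe (pathname stream)
                               :element-type '(unsigned-byte 8))
          (file-length probe)))

(defun drain-lines (stream out prefix)
  "Copy every complete line currently readable from STREAM to OUT.
PREFIX is a partial line left over from an earlier call, or NIL.
Returns the new partial line, or NIL if the last line was complete."
  (loop
    (multiple-value-bind (string missing-newline) (read-line stream nil nil)
      (cond ((null string)
             ;; Nothing more to read right now.
             (return prefix))
            (missing-newline
             ;; Hit EOF mid-line: remember the fragment for next time.
             (setq prefix (if prefix
                              (concatenate 'string prefix string)
                              string)))
            (t
             (write-line (if prefix
                             (concatenate 'string prefix string)
                             string)
                         out)
             (setq prefix nil))))))

(defun tail-f-poll (stream &key (out *standard-output*) (interval 0.5))
  "Loop forever, copying lines appended to the file behind STREAM to OUT."
  (let* ((path (pathname stream))
         ;; Sample size and date BEFORE draining, so that data appended
         ;; in between is noticed on the next pass.
         (last-size (current-file-size stream))
         (last-written (or (file-write-date path) 0))
         (prefix (drain-lines stream out nil)))
    (loop
      (sleep interval)
      (let ((size (current-file-size stream))
            (written (or (file-write-date path) 0)))
        (when (< size last-size)
          (error "~A was truncated from ~D to ~D octets."
                 path last-size size))
        (when (or (> size last-size) (> written last-written))
          (setq last-size size
                last-written written
                prefix (drain-lines stream out prefix))
          (finish-output out))))))
```

Something like (with-open-file (in "/some/log/file") (tail-f-poll in)) would
then run until interrupted.  Splitting out DRAIN-LINES keeps the partial-line
bookkeeping separate from the polling, so it can be tried on a string stream
without involving the file system at all.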



