[Openmcl-devel] new snapshot tarballs (finally)
gb at clozure.com
Wed Oct 25 14:23:42 PDT 2006
I'm pretty sure that this is a cut-and-paste bug.
On a bivalent stream (one that supports both character and binary I/O), things
like READ-CHAR, UNREAD-CHAR, and READ-BYTE have to interact (somehow ...), and
there need to be specialized functions to handle that. (I don't know that all
of the various locked/private variants of those functions do the same thing or
do the right thing, but after confusing myself about this several times I think
that I concluded that the right thing is to clear any pending unread char
whenever READ-BYTE is called.)
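
To make that rule concrete, here's a minimal sketch in portable Common Lisp. It uses a toy structure rather than the real ioblock machinery in l1-streams.lisp; TOY-BIVALENT, TOY-READ-BYTE, TOY-READ-CHAR and TOY-UNREAD-CHAR are hypothetical names, and characters are decoded one octet at a time (as in ISO-8859-1) purely to keep the example short:

```lisp
;; Toy model of a bivalent input stream: a vector of octets, a read
;; position, and at most one pending unread character.
;; All names are hypothetical; this is not OpenMCL's implementation.
(defstruct toy-bivalent
  (octets #() :type vector)
  (pos 0 :type fixnum)
  (unread nil))

(defun toy-read-byte (s)
  "Return the next octet (or NIL at EOF), discarding any pending unread character."
  (setf (toy-bivalent-unread s) nil)
  (when (< (toy-bivalent-pos s) (length (toy-bivalent-octets s)))
    (prog1 (aref (toy-bivalent-octets s) (toy-bivalent-pos s))
      (incf (toy-bivalent-pos s)))))

(defun toy-read-char (s)
  "Return the pending unread character if there is one, else decode one octet."
  (let ((pending (toy-bivalent-unread s)))
    (cond (pending
           (setf (toy-bivalent-unread s) nil)
           pending)
          (t (let ((octet (toy-read-byte s)))
               (and octet (code-char octet)))))))

(defun toy-unread-char (ch s)
  "Remember CH so that the next TOY-READ-CHAR returns it."
  (setf (toy-bivalent-unread s) ch)
  nil)

;; Read a character, unread it, then read a byte: the unread character
;; is dropped and the next octet in the data is returned.
(let ((s (make-toy-bivalent :octets (vector 65 66 67))))
  (toy-unread-char (toy-read-char s) s)
  (toy-read-byte s))                    ; => 66
```

Under the older behavior described in the release notes, the final call would have returned the code of the unread character (65 here) instead.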
I can't think of any reason for WRITE-CHAR and WRITE-BYTE to have to interact
specially on a bivalent stream, so (unless there's something I'm not thinking
of) the gigantic nested CASE statement in CCL::SETUP-IOBLOCK-OUTPUT has one
less case to worry about (it should have probably just selected some flavor
The good news is that whatever the fix is it won't involve some horrible
bootstrapping cycle ... let me look at this, check something into CVS, and
get back to you.
On Wed, 25 Oct 2006, Erik Pearson wrote:
> It compiled my stuff with no problems... but then I ran into:
> Undefined function CCL::%BIVALENT-IOBLOCK-WRITE-U8-BYTE
> It does not appear to exist where it might be expected, in l1-streams.lisp.
> In fact, none of the bivalent write functions appear to be there...
> --On October 24, 2006 1:39:41 PM -0600 Gary Byers <gb at clozure.com> wrote:
>> There are now new (061024) tar archives for DarwinPPC (32 and 64-bit),
>> LinuxPPC (32 and 64-bit), LinuxX8664 (64-bit), DarwinX8664 (64-bit), and
>> FreeBSDX8664 (64-bit) in ftp://clozure.com/pub/testing
>> These archives are all self-contained (contain sources, binaries,
>> interfaces, the CVS ChangeLog, and release notes); the release-notes
>> entry for this snapshot is included below.
>> I'm sorry that it's taken so long to get things back in synch; now that
>> they are, I hope that they'll stay that way for a while and that people
>> who want to track the bleeding edge will have an easier time doing so.
>> Please report bugs!
>> OpenMCL 1.1-pre-061024
>> - The FASL version changed (old FASL files won't work with this
>> lisp version), as did the version information which tries to
>> keep the kernel in sync with heap images.
>> - Linux users: it's possible (depending on the distribution that
>> you use) that the lisp kernel will claim to depend on newer
>> versions of some shared libraries than the versions that you
>> have installed. This is mostly just an artifact of the GNU
>> linker, which adds version information to dependent library
>> references even though no strong dependency exists. If you
>> run into this, you should be able to simply cd to the appropriate
>> build directory under ccl/lisp-kernel and do a "make".
>> - There's now a port of OpenMCL to FreeBSD/amd64; it claims to be
>> of beta quality. (The problems that made it too unstable
>> to release as of a few months ago have been fixed; I still run
>> into occasional FreeBSD-specific issues, and some such issues
>> may remain.)
>> - The Darwin X8664 port is a bit more stable (no longer generates
>> obscure "Trace/BKPT trap" exits or spurious-looking FP exceptions.)
>> I'd never want to pass up a chance to speak ill of Mach, but both
>> of these bugs seemed to be OpenMCL problems rather than Mach kernel
>> problems, as I'd previously more-or-less assumed.
>> - I generally don't use SLIME with OpenMCL, but limited testing
>> with the 2006-04-20 version of SLIME seems to indicate that no
>> changes to SLIME are necessary to work with this version.
>> - CHAR-CODE-LIMIT is now #x110000, which means that all Unicode
>> characters can be directly represented. There is one CHARACTER
>> type (all CHARACTERs are BASE-CHARs) and one string type (all
>> STRINGs are BASE-STRINGs.) This change (and some other changes
>> in the compiler and runtime) made the heap images a few MB larger
>> than in previous versions.
>> - As of Unicode 5.0, only about 100,000 of 1114112./#x110000 CHAR-CODEs
>> are actually defined; the function CODE-CHAR knows that certain
>> ranges of code values (notably #xd800-#xdfff) will never be valid
>> character codes and will return NIL for arguments in that range,
>> but may return a non-NIL value (an undefined/non-standard CHARACTER
>> object) for other unassigned code values.
>> - The :EXTERNAL-FORMAT argument to OPEN/LOAD/COMPILE-FILE has been
>> extended to allow the stream's character encoding scheme (as well
>> as line-termination conventions) to be specified; see more
>> details below. MAKE-SOCKET has been extended to allow an
>> :EXTERNAL-FORMAT argument with similar semantics.
>> - Strings of the form "u+xxxx" - where "x" is a sequence of one
>> or more hex digits - can be used as character names to denote
>> the character whose code is the value of the string of hex digits.
>> (The + character is actually optional, so #\u+0020, #\U0020, and
>> #\U+20 all refer to the #\Space character.) Characters with codes
>> in the range #xa0-#x7ff (IIRC) also have symbolic names (the
>> names from the Unicode standard with spaces replaced with underscores),
>> so #\Greek_Capital_Letter_Epsilon can be used to refer to the character
>> whose CHAR-CODE is #x395.
>> - The line-termination convention popularized with the CP/M operating
>> system (and used in its descendants) - e.g., CRLF - is now supported,
>> as is the use of Unicode #\Line_Separator (#\u+2028).
>> - About 15-20 character encoding schemes are defined (so far); these
>> include UTF-8/16/32 and the big-endian/little-endian variants of
>> the latter two and ISO-8859-* 8-bit encodings. (There is not
>> yet any support for traditional (non-Unicode) ways of externally
>> encoding characters used in Asian languages, support for legacy
>> MacOS encodings, legacy Windows/DOS/IBM encodings, ...) It's hoped
>> that the existing infrastructure will handle most (if not all) of
>> what's missing; that may not be the case for "stateful" encodings
>> (where the way that a given character is encoded/decoded depends
>> on context, like the value of the preceding/following character.)
>> - There isn't yet any support for Unicode-aware collation (CHAR>
>> and related CL functions just compare character codes, which
>> can give meaningless results for non-STANDARD-CHARs), case-inversion,
>> or normalization/denormalization. There's generally good support
>> for this sort of thing in OS-provided libraries (e.g., CoreFoundation
>> on MacOSX), and it's not yet clear whether it'd be best to duplicate
>> that in lisp or leverage library support.
>> - Unicode-aware FFI functions and macros are still in a sort of
>> embryonic state if they're there at all; things like WITH-CSTRs
>> continue to exist (and continue to assume an 8-bit character
>> - Characters that can't be represented in a fixed-width 8-bit
>> character encoding are replaced with #\Sub (= (code-char 26) =
>> ^Z) on output, so if you do something like:
>> ? (format t "~a" #\u+20a0)
>> you might see a #\Sub character (however that's displayed on
>> the terminal device/Emacs buffer) or a Euro currency sign or
>> practically anything else (depending on how lisp is configured
>> to encode output to *TERMINAL-IO* and on how the terminal/Emacs
>> is configured to decode its input).
>> On output to streams with character encodings that can encode
>> the full range of Unicode - and on input from any stream -
>> "unencodable characters" are represented using the Unicode
>> #\Replacement_Character (= #\U+fffd); the presence of such a
>> character usually indicates that something got lost in translation
>> (data wasn't encoded properly or there was a bug in the decoding
>> - Streams encoded in schemes which use more than one octet per code unit
>> (UTF-16, UTF-32, ...) and whose endianness is not explicit will be
>> written with a leading byte-order-mark character on (new) output and
>> will expect a BOM on input; if a BOM is missing from input data,
>> that data will be assumed to have been serialized in big-endian order.
>> Streams encoded in variants of these schemes whose endianness is
>> explicit (UTF-16BE, UCS-4LE, ...) will not have byte-order-marks
>> written on output or expected on input. (UTF-8 streams might also
>> contain encoded byte-order-marks; even though UTF-8 uses a single
>> octet per code unit - and possibly more than one code unit per
>> character - this convention is sometimes used to advertise that the
>> stream is UTF-8-encoded. The current implementation doesn't skip
>> over/ignore leading BOMs on UTF8-encoded input, but it probably
>> If the preceding paragraph made little sense, a shorter version is
>> that sometimes the endianness of encoded data matters and there
>> are conventions for expressing the endianness of encoded data; I
>> think that OpenMCL gets it mostly right, but (even if that's true)
>> the real world may be messier.
>> - By default, OpenMCL uses ISO-8859-1 encoding for *TERMINAL-IO*
>> and for all streams whose EXTERNAL-FORMAT isn't explicitly specified.
>> (ISO-8859-1 just covers the first 256 Unicode code points, where
>> the first 128 code points are equivalent to US-ASCII.) That should
>> be pretty much equivalent to what previous versions (that only
>> supported 8-bit characters) did, but it may not be optimal for
>> users working in a particular locale. The default for *TERMINAL-IO*
>> can be set via a command-line argument (see below) and this setting
>> persists across calls to SAVE-APPLICATION, but it's not clear that
>> there's a good way of setting it automatically (e.g., by checking
>> the POSIX "locale" settings on startup.) Things like POSIX locales
>> aren't always set correctly (even if they're set correctly for
>> the shell/terminal, they may not be set correctly when running
>> under Emacs ...) and in general, *TERMINAL-IO*'s notion of the
>> character encoding it's using and the "terminal device"/Emacs
>> subprocess's notion need to agree (and fonts need to contain glyphs
>> for the right set of characters) in order for everything to "work".
>> Using ISO-8859-1 as the default seemed to increase the likelihood that
>> most things would work even if things aren't quite set up ideally
>> (since no character translation occurs for 8-bit characters in
>> - In non-Unicode-related news: the rewrite of OpenMCL's stream code
>> that was started a few months ago should now be complete (no more
>> "missing method for BASIC-STREAM" errors, or at least there shouldn't
>> be any.)
>> - I haven't done anything with the Cocoa bridge/demos lately, besides
>> a little bit of smoke-testing.
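
A rough sketch, in portable Common Lisp, of two of the Unicode-related behaviors described in the notes above: substituting #\Sub for characters that an 8-bit encoding can't represent, and picking a byte order for UTF-16 input from a leading byte-order mark (falling back to big-endian when the mark is absent). ENCODE-LATIN-1 and UTF-16-BYTE-ORDER are hypothetical helper names, not OpenMCL functions, and the code assumes that character codes are Unicode code points (true of this snapshot, but not something the CL standard promises):

```lisp
;; Hypothetical helpers for illustration; not part of OpenMCL.

(defun encode-latin-1 (string)
  "Encode STRING as ISO-8859-1 octets, writing 26 (#\\Sub, ^Z) for any
character whose code does not fit in 8 bits."
  (map '(vector (unsigned-byte 8))
       (lambda (ch)
         (let ((code (char-code ch)))
           (if (< code 256) code 26)))
       string))

(defun utf-16-byte-order (octets)
  "Return two values: the byte order (:BIG-ENDIAN or :LITTLE-ENDIAN) and
the number of leading octets taken up by a byte-order mark (0 or 2)."
  (if (< (length octets) 2)
      (values :big-endian 0)
      (let ((b0 (aref octets 0))
            (b1 (aref octets 1)))
        (cond ((and (= b0 #xfe) (= b1 #xff)) (values :big-endian 2))
              ((and (= b0 #xff) (= b1 #xfe)) (values :little-endian 2))
              ;; No BOM: assume big-endian, as the notes describe.
              (t (values :big-endian 0))))))

;; A character outside the 8-bit range becomes a single octet, 26.
(encode-latin-1 (string (code-char #x20a0)))
;; A little-endian BOM followed by one UTF-16 code unit.
(utf-16-byte-order #(#xff #xfe #x41 #x00))   ; => :LITTLE-ENDIAN, 2
```

Neither helper deals with line termination, and the real encoders presumably work on stream buffers rather than whole strings; the point is only to show the two conventions in isolation.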
>> Some implementation/usage details:
>> Character encodings.
>> CHARACTER-ENCODINGs are objects (structures) that're named by keywords
>> (:ISO-8859-1, :UTF-8, etc.). The structures contain attributes of
>> the encoding and functions used to encode/decode external data, but
>> unless you're trying to define or debug an encoding there's little
>> reason to know much about the CHARACTER-ENCODING objects and it's
>> generally desirable (and sometimes necessary) to refer to the encoding
>> via its name.
>> Most encodings have "aliases"; the encoding named :ISO-8859-1 can
>> also be referred to by the names :LATIN1 and :IBM819, among others.
>> Where possible, the keywordized name of an encoding is equivalent
>> to the preferred MIME charset name (and the aliases are all registered
>> IANA charset names.)
>> NIL is an alias for the :ISO-8859-1 encoding; it's treated a little
>> specially by the I/O system.
>> The function CCL:DESCRIBE-CHARACTER-ENCODINGS will write descriptions
>> of all defined character encodings to *terminal-io*; these descriptions
>> include the names of the encoding's aliases and a doc string which
>> briefly describes each encoding's properties and intended use.
>> Line-termination conventions.
>> As noted in the <=1.0 documentation, the keywords :UNIX, :MACOS, and
>> :INFERRED can be used to denote a stream's line-termination conventions.
>> (:INFERRED is only useful for FILE-STREAMs that're open for :INPUT or
>> :IO.) In this release, the keyword :CR can also be used to indicate
>> that a stream uses #\Return characters for line-termination (equivalent
>> to :MACOS), the keyword :UNICODE denotes that the stream uses Unicode
>> #\Line_Separator characters to terminate lines, and the keywords :CRLF,
>> :CP/M, :MSDOS, :DOS, and :WINDOWS all indicate that lines are terminated
>> via a #\Return #\Linefeed sequence.
>> In some contexts (when specifying EXTERNAL-FORMATs), the keyword :DEFAULT
>> can also be used; in this case, it's equivalent to specifying the value
>> of the variable CCL:*DEFAULT-LINE-TERMINATION*. The initial value of
>> this variable is :UNIX.
>> Note that the set of keywords used to denote CHARACTER-ENCODINGs and
>> the set of keywords used to denote line-termination conventions are
>> disjoint: a keyword denotes at most a character encoding or a line
>> termination convention, but never both.
>> EXTERNAL-FORMATs are also objects (structures) with two read-only
>> fields that can be accessed via the functions
>> EXTERNAL-FORMAT-LINE-TERMINATION and EXTERNAL-FORMAT-CHARACTER-ENCODING;
>> the values of these fields are line-termination-convention-names and
>> character-encoding names as described above.
>> An EXTERNAL-FORMAT object can be created via the function MAKE-EXTERNAL-FORMAT:
>> MAKE-EXTERNAL-FORMAT &key domain character-encoding line-termination
>> (Despite the function's name, it doesn't necessarily create a new,
>> unique EXTERNAL-FORMAT object: two calls to MAKE-EXTERNAL-FORMAT
>> with the same arguments made in the same dynamic environment will
>> return the same (eq) object.)
>> Both the :LINE-TERMINATION and :CHARACTER-ENCODING arguments default
>> to :DEFAULT; if :LINE-TERMINATION is specified as or defaults to
>> :DEFAULT, the value of CCL:*DEFAULT-LINE-TERMINATION* is used to
>> provide a concrete value.
>> When the :CHARACTER-ENCODING argument is specified as/defaults to
>> :DEFAULT, the concrete character encoding name that's actually used
>> depends on the value of the :DOMAIN argument to MAKE-EXTERNAL-FORMAT.
>> The :DOMAIN argument's value can be practically anything; when it's
>> the keyword :FILE and the :CHARACTER-ENCODING argument's value is
>> :DEFAULT, the concrete character encoding name that's used will be
>> the value of the variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING*; the
>> initial value of this variable is NIL (which is an alias for :ISO-8859-1).
>> If the value of the :DOMAIN argument is :SOCKET and the
>> :CHARACTER-ENCODING argument's value is :DEFAULT, the value of
>> CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* is used as a concrete character
>> encoding name. The initial value of
>> CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* is NIL, again denoting the
>> :ISO-8859-1 encoding.
>> If the value of the :DOMAIN argument is anything else, :ISO-8859-1 is
>> also used (but there's no way to override this.)
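
As a hedged illustration of the defaulting rules just described, the forms below assume that MAKE-EXTERNAL-FORMAT and the two accessors are reachable as external symbols of the CCL package (the notes name them without a package prefix) and that none of the default variables have been changed from their initial values. The comments restate what the notes say should happen; they aren't recorded output:

```lisp
;; Requires an OpenMCL with the external-format support described above.
(let ((ef (ccl:make-external-format :domain :file
                                    :character-encoding :utf-8
                                    :line-termination :crlf)))
  ;; Both fields hold names (keywords), not encoding objects.
  (list (ccl:external-format-character-encoding ef)
        (ccl:external-format-line-termination ef)))

;; With :LINE-TERMINATION defaulted, the concrete value comes from
;; CCL:*DEFAULT-LINE-TERMINATION*, whose initial value is :UNIX.
(ccl:external-format-line-termination
 (ccl:make-external-format :domain :file))

;; With :CHARACTER-ENCODING defaulted and :DOMAIN :SOCKET, the encoding
;; name comes from CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* (initially
;; NIL, which denotes :ISO-8859-1).
(ccl:external-format-character-encoding
 (ccl:make-external-format :domain :socket))

;; Two calls with the same arguments in the same dynamic environment
;; are described as returning the same (EQ) object.
(eq (ccl:make-external-format :domain :file :character-encoding :utf-8)
    (ccl:make-external-format :domain :file :character-encoding :utf-8))
```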
>> The result of a call to MAKE-EXTERNAL-FORMAT can be used as the value
>> of the :EXTERNAL-FORMAT argument to OPEN, LOAD, COMPILE-FILE, and
>> MAKE-SOCKET; it's also possible to use a few shorthand constructs
>> in these contexts:
>> * if ARG is unspecified or specified as :DEFAULT, the value of the
>> variable CCL:*DEFAULT-EXTERNAL-FORMAT* is used. Since the value
>> of this variable has historically been used to name a default
>> line-termination convention, this case effectively falls into
>> the next one:
>> * if ARG is a keyword which names a concrete line-termination convention,
>> an EXTERNAL-FORMAT equivalent to the result of calling
>> (MAKE-EXTERNAL-FORMAT :line-termination ARG)
>> will be used
>> * if ARG is a keyword which names a character encoding, an EXTERNAL-FORMAT
>> equivalent to the result of calling
>> (MAKE-EXTERNAL-FORMAT :character-encoding ARG)
>> will be used
>> * if ARG is a list, the result of (APPLY #'MAKE-EXTERNAL-FORMAT ARG)
>> will be used
>> (When MAKE-EXTERNAL-FORMAT is called to create an EXTERNAL-FORMAT
>> object from one of these shorthand designators, the value of the
>> :DOMAIN keyword argument is :FILE for OPEN, LOAD, and COMPILE-FILE
>> and :SOCKET for MAKE-SOCKET.)
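
The shorthand designators listed above might be used with OPEN (here via WITH-OPEN-FILE) roughly as follows. WRITE-SAMPLE and the file name "notes.txt" are made up for the example, and the last call assumes MAKE-EXTERNAL-FORMAT is reachable as CCL:MAKE-EXTERNAL-FORMAT:

```lisp
;; Hypothetical helper: write one line to "notes.txt" using the given
;; :EXTERNAL-FORMAT designator and return the stream's external format.
(defun write-sample (designator)
  (with-open-file (out "notes.txt" :direction :output
                                   :if-exists :supersede
                                   :external-format designator)
    (write-line "hello" out)
    (stream-external-format out)))

(write-sample :crlf)                         ; line-termination keyword
(write-sample :utf-8)                        ; character-encoding keyword
(write-sample '(:character-encoding :utf-8   ; argument list that is
                :line-termination :crlf))    ;   applied to MAKE-EXTERNAL-FORMAT
(write-sample (ccl:make-external-format      ; explicit EXTERNAL-FORMAT object
               :domain :file
               :character-encoding :utf-8))
```

In the first three calls the :DOMAIN is supplied implicitly (:FILE, since the caller is OPEN); only the last call names it explicitly.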
>> The CL function STREAM-EXTERNAL-FORMAT - which is portably defined
>> on FILE-STREAMs - can be applied to any open stream in this release
>> and will return an EXTERNAL-FORMAT object when applied to an open
>> CHARACTER-STREAM. For open CHARACTER-STREAMs (other than STRING-STREAMs),
>> SETF can be used with STREAM-EXTERNAL-FORMAT to change the stream's
>> character encoding, line-termination, or both.
>> If a "shorthand" external-format designator is used in a call to
>> (SETF STREAM-EXTERNAL-FORMAT), the "domain" used to construct an
>> EXTERNAL-FORMAT is derived from the class of the stream in the
>> obvious way (:FILE for FILE-STREAMs, :SOCKET for ... well, for
>> sockets ...)
>> Note that the effect of doing something like:
>> (let* ((s (open "foo" ... :external-format :utf-8)))
>>   (unread-char ch s)
>>   (setf (stream-external-format s) :us-ascii)
>>   (read-char s))
>> might or might not be what was intended. The current behavior is
>> that the call to READ-CHAR will return the previously unread character
>> CH, which might surprise any code which assumes that the READ-CHAR
>> will return something encodable in 7 or 8 bits. Since functions
>> like READ may call UNREAD-CHAR "behind your back", it may or may
>> not be obvious that this has even occurred; the best approach to
>> dealing with this issue might be to avoid using READ or explicit
>> calls to UNREAD-CHAR when processing content encoded in multiple
>> external formats.
>> There's a similar issue with "bivalent" streams (sockets) which
>> can do both character and binary I/O with an :ELEMENT-TYPE of
>> (UNSIGNED-BYTE 8). Historically, the sequence:
>> (unread-char ch s)
>> (read-byte s)
>> caused the READ-BYTE to return (CHAR-CODE CH); that made sense
>> when everything was implicitly encoded as :ISO-8859-1, but may not
>> make any sense anymore. (The only thing that seems to make sense
>> in that case is to clear the unread character and read the next
>> octet; that's implemented in some cases but I don't think that
>> things are always handled consistently.)
>> Command-line argument for specifying the character encoding to
>> be used for *TERMINAL-IO*.
>> Shortly after a saved lisp image starts up, it creates the standard
>> CL streams (like *STANDARD-OUTPUT*, *TERMINAL-IO*, *QUERY-IO*, etc.);
>> most of these streams are usually SYNONYM-STREAMS which reference
>> the TWO-WAY-STREAM *TERMINAL-IO*, which is itself comprised of
>> a pair of CHARACTER-STREAMs. The character encoding used for
>> any CHARACTER-STREAMs created during this process is the one
>> named by the value of the variable CCL:*TERMINAL-CHARACTER-ENCODING-NAME*;
>> this value is initially NIL.
>> The -K or --terminal-encoding command-line argument can be used to
>> set the value of this variable (the argument is processed before the
>> standard streams are created.) The string which is the value of
>> the -K/--terminal-encoding argument is uppercased and interned in
>> the KEYWORD package; if an encoding named by that keyword exists,
>> CCL:*TERMINAL-CHARACTER-ENCODING-NAME* is set to the name of that
>> encoding. For example:
>> shell> openmcl -K utf-8
>> will have the effect of making the standard CL streams use :UTF-8
>> as their character encoding.
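
The name handling described above (upcase the argument string, then intern it in the KEYWORD package) amounts to something like the following. ENCODING-KEYWORD is a hypothetical helper, not an OpenMCL function, and the sketch leaves out the separate check that an encoding of that name actually exists:

```lisp
;; Hypothetical helper: turn a -K/--terminal-encoding argument string
;; into the keyword that would name the encoding.
(defun encoding-keyword (arg)
  (intern (string-upcase arg) :keyword))

(encoding-keyword "utf-8")        ; => :UTF-8
(encoding-keyword "iso-8859-1")   ; => :ISO-8859-1
```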
>> (It's probably possible - but a bit awkward - to use (SETF
>> STREAM-EXTERNAL-FORMAT) from one's init file or --eval arguments or similar to
>> change existing streams' character encodings; the hard/awkward parts of
>> doing so include the difficulty of determining which standard streams are
>> "real" character streams and which are aliases/composite streams.)
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
> Erik Pearson