[Openmcl-devel] new snapshot tarballs (finally)
Gary Byers
gb at clozure.com
Tue Oct 24 12:39:41 PDT 2006
There are now new (061024) tar archives for DarwinPPC (32 and 64-bit), LinuxPPC (32
and 64-bit), LinuxX8664 (64-bit), DarwinX8664 (64-bit), and FreeBSDX8664 (64-bit)
in ftp://clozure.com/pub/testing
These archives are all self-contained (contain sources, binaries,
interfaces, the CVS ChangeLog, and release notes); the release-notes
entry for this snapshot is included below.
I'm sorry that it's taken so long to get things back in synch; now that they are,
I hope that they'll stay that way for a while and that people who want to track
the bleeding edge will have an easier time doing so.
Please report bugs!
OpenMCL 1.1-pre-061024
- The FASL version changed (old FASL files won't work with this
lisp version), as did the version information which tries to
keep the kernel in sync with heap images.
- Linux users: it's possible (depending on the distribution that
you use) that the lisp kernel will claim to depend on newer
versions of some shared libraries than the versions that you
have installed. This is mostly just an artifact of the GNU
linker, which adds version information to dependent library
references even though no strong dependency exists. If you
run into this, you should be able to simply cd to the appropriate
build directory under ccl/lisp-kernel and do a "make".
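For example, on a 64-bit x86 Linux system (the exact subdirectory
name is platform-dependent; "linuxx8664" is my assumption here):
shell> cd ccl/lisp-kernel/linuxx8664
shell> make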
- There's now a port of OpenMCL to FreeBSD/amd64; it claims to be
of beta quality. (The problems that made it too unstable
to release as of a few months ago have been fixed; I still run
into occasional FreeBSD-specific issues, and some such issues
may remain.)
- The Darwin X8664 port is a bit more stable (no longer generates
obscure "Trace/BKPT trap" exits or spurious-looking FP exceptions.)
I'd never want to pass up a chance to speak ill of Mach, but both
of these bugs seemed to be OpenMCL problems rather than Mach kernel
problems, as I'd previously more-or-less assumed.
- I generally don't use SLIME with OpenMCL, but limited testing
with the 2006-04-20 version of SLIME seems to indicate that no
changes to SLIME are necessary to work with this version.
- CHAR-CODE-LIMIT is now #x110000, which means that all Unicode
characters can be directly represented. There is one CHARACTER
type (all CHARACTERs are BASE-CHARs) and one string type (all
STRINGs are BASE-STRINGs.) This change (and some other changes
in the compiler and runtime) made the heap images a few MB larger
than in previous versions.
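For example, at the REPL (1114112 is #x110000 printed in decimal):
? char-code-limit
1114112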
- As of Unicode 5.0, only about 100,000 of the 1114112 (#x110000)
possible CHAR-CODEs are actually defined; the function CODE-CHAR
knows that certain ranges of code values (notably the surrogate
range #xd800-#xdfff) will never be valid
character codes and will return NIL for arguments in that range,
but may return a non-NIL value (an undefined/non-standard CHARACTER
object) for other unassigned code values.
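To illustrate (a minimal sketch; the printed results for
undefined-but-non-NIL cases will vary):
? (code-char #xd800)  ; in the surrogate range; never a valid code
NIL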
- The :EXTERNAL-FORMAT argument to OPEN/LOAD/COMPILE-FILE has been
extended to allow the stream's character encoding scheme (as well
as line-termination conventions) to be specified; see more
details below. MAKE-SOCKET has been extended to allow an
:EXTERNAL-FORMAT argument with similar semantics.
- Strings of the form "u+xxxx" - where "x" is a sequence of one
or more hex digits - can be used as character names to denote
the character whose code is the value of the string of hex digits.
(The + character is actually optional, so #\u+0020, #\U0020, and
#\U+20 all refer to the #\Space character.) Characters with codes
in the range #xa0-#x7ff (IIRC) also have symbolic names (the
names from the Unicode standard with spaces replaced with underscores),
so #\Greek_Capital_Letter_Epsilon can be used to refer to the character
whose CHAR-CODE is #x395.
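For example (both forms below name characters as described above):
? (char-code #\U+0020)  ; i.e., #\Space
32
? (char-code #\Greek_Capital_Letter_Epsilon)
917                     ; = #x395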
- The line-termination convention popularized with the CP/M operating
system (and used in its descendants) - e.g., CRLF - is now supported,
as is the use of Unicode #\Line_Separator (#\u+2028).
- About 15-20 character encoding schemes are defined (so far); these
include UTF-8/16/32, the big-endian/little-endian variants of
the latter two, and the ISO-8859-* 8-bit encodings. (There is not
yet any support for traditional (non-Unicode) ways of externally
encoding characters used in Asian languages, support for legacy
MacOS encodings, legacy Windows/DOS/IBM encodings, ...) It's hoped
that the existing infrastructure will handle most (if not all) of
what's missing; that may not be the case for "stateful" encodings
(where the way that a given character is encoded/decoded depends
on context, like the value of the preceding/following character.)
- There isn't yet any support for Unicode-aware collation (CHAR>
and related CL functions just compare character codes, which
can give meaningless results for non-STANDARD-CHARs), case-inversion,
or normalization/denormalization. There's generally good support
for this sort of thing in OS-provided libraries (e.g., CoreFoundation
on MacOSX), and it's not yet clear whether it'd be best to duplicate
that in lisp or leverage library support.
- Unicode-aware FFI functions and macros are still in a sort of
embryonic state if they're there at all; things like WITH-CSTRs
continue to exist (and continue to assume an 8-bit character
encoding.)
- Characters that can't be represented in a fixed-width 8-bit
character encoding are replaced with #\Sub (= (code-char 26) =
^Z) on output, so if you do something like:
? (format t "~a" #\u+20a0)
you might see a #\Sub character (however that's displayed on
the terminal device/Emacs buffer) or a Euro currency sign or
practically anything else (depending on how lisp is configured
to encode output to *TERMINAL-IO* and on how the terminal/Emacs
is configured to decode its input.)
On output to streams with character encodings that can encode
the full range of Unicode - and on input from any stream -
"unencodable characters" are represented using the Unicode
#\Replacement_Character (= #\U+fffd); the presence of such a
character usually indicates that something got lost in translation
(data wasn't encoded properly or there was a bug in the decoding
process.)
- Streams encoded in schemes which use more than one octet per code unit
(UTF-16, UTF-32, ...) and whose endianness is not explicit will be
written with a leading byte-order-mark character on (new) output and
will expect a BOM on input; if a BOM is missing from input data,
that data will be assumed to have been serialized in big-endian order.
Streams encoded in variants of these schemes whose endianness is
explicit (UTF-16BE, UCS-4LE, ...) will not have byte-order-marks written
on output or expected on input. (UTF-8 streams might also contain
encoded byte-order-marks; even though UTF-8 uses a single octet per
code unit - and possibly more than one code unit per character - this
convention is sometimes used to advertise that the stream is UTF-8-
encoded. The current implementation doesn't skip over/ignore leading
BOMs on UTF-8-encoded input, but it probably should.)
If the preceding paragraph made little sense, a shorter version is
that sometimes the endianness of encoded data matters and there
are conventions for expressing the endianness of encoded data; I
think that OpenMCL gets it mostly right, but (even if that's true)
the real world may be messier.
- By default, OpenMCL uses ISO-8859-1 encoding for *TERMINAL-IO*
and for all streams whose EXTERNAL-FORMAT isn't explicitly specified.
(ISO-8859-1 just covers the first 256 Unicode code points, where
the first 128 code points are equivalent to US-ASCII.) That should
be pretty much equivalent to what previous versions (that only
supported 8-bit characters) did, but it may not be optimal for
users working in a particular locale. The default for *TERMINAL-IO*
can be set via a command-line argument (see below) and this setting
persists across calls to SAVE-APPLICATION, but it's not clear that
there's a good way of setting it automatically (e.g., by checking
the POSIX "locale" settings on startup.) Thing like POSIX locales
aren't always set correctly (even if they're set correctly for
the shell/terminal, they may not be set correctly when running
under Emacs ...) and in general, *TERMINAL-IO*'s notion of the
character encoding it's using and the "terminal device"/Emacs subprocess's
notion need to agree (and fonts need to contain glyphs for the
right set of characters) in order for everything to "work". Using
ISO-8859-1 as the default seemed to increase the likelihood that
most things would work even if things aren't quite set up ideally
(since no character translation occurs for 8-bit characters in
ISO-8859-1.)
- In non-Unicode-related news: the rewrite of OpenMCL's stream code
that was started a few months ago should now be complete (no more
"missing method for BASIC-STREAM" errors, or at least there shouldn't
be any.)
- I haven't done anything with the Cocoa bridge/demos lately, besides
a little bit of smoke-testing.
Some implementation/usage details:
Character encodings.
CHARACTER-ENCODINGs are objects (structures) that're named by keywords
(:ISO-8859-1, :UTF-8, etc.). The structures contain attributes of
the encoding and functions used to encode/decode external data, but
unless you're trying to define or debug an encoding there's little
reason to know much about the CHARACTER-ENCODING objects and it's
generally desirable (and sometimes necessary) to refer to the encoding
via its name.
Most encodings have "aliases"; the encoding named :ISO-8859-1 can
also be referred to by the names :LATIN1 and :IBM819, among others.
Where possible, the keywordized name of an encoding is equivalent
to the preferred MIME charset name (and the aliases are all registered
IANA charset names.)
NIL is an alias for the :ISO-8859-1 encoding; it's treated a little
specially by the I/O system.
The function CCL:DESCRIBE-CHARACTER-ENCODINGS will write descriptions
of all defined character encodings to *terminal-io*; these descriptions
include the names of the encoding's aliases and a doc string which
briefly describes each encoding's properties and intended use.
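For example, to see what's available (the output is lengthy and
isn't reproduced here):
? (ccl:describe-character-encodings)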
Line-termination conventions.
As noted in the <=1.0 documentation, the keywords :UNIX, :MACOS, and
:INFERRED can be used to denote a stream's line-termination conventions.
(:INFERRED is only useful for FILE-STREAMs that're open for :INPUT or
:IO.) In this release, the keyword :CR can also be used to indicate
that a stream uses #\Return characters for line-termination (equivalent
to :MACOS), the keyword :UNICODE denotes that the stream uses Unicode
#\Line_Separator characters to terminate lines, and the keywords :CRLF,
:CP/M, :MSDOS, :DOS, and :WINDOWS all indicate that lines are terminated
via a #\Return #\Linefeed sequence.
In some contexts (when specifying EXTERNAL-FORMATs), the keyword :DEFAULT
can also be used; in this case, it's equivalent to specifying the value
of the variable CCL:*DEFAULT-LINE-TERMINATION*. The initial value of
this variable is :UNIX.
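For example (assuming the variable hasn't been rebound):
? ccl:*default-line-termination*
:UNIX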
Note that the set of keywords used to denote CHARACTER-ENCODINGs and
the set of keywords used to denote line-termination conventions is
disjoint: a keyword denotes at most a character encoding or a line
termination convention, but never both.
External-formats.
EXTERNAL-FORMATs are also objects (structures) with two read-only
fields that can be accessed via the functions EXTERNAL-FORMAT-LINE-TERMINATION
and EXTERNAL-FORMAT-CHARACTER-ENCODING; the values of these fields are
line-termination-convention-names and character-encoding names as described
above.
An EXTERNAL-FORMAT object can be created via the function MAKE-EXTERNAL-FORMAT:
MAKE-EXTERNAL-FORMAT &key domain character-encoding line-termination
(Despite the function's name, it doesn't necessarily create a new,
unique EXTERNAL-FORMAT object: two calls to MAKE-EXTERNAL-FORMAT
with the same arguments made in the same dynamic environment will
return the same (eq) object.)
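A small usage sketch (assuming, as with the variables mentioned
elsewhere in these notes, that these functions live in the CCL package):
? (defvar *ef* (ccl:make-external-format :character-encoding :utf-8
                                         :line-termination :unix))
*EF*
? (ccl:external-format-character-encoding *ef*)
:UTF-8
? (ccl:external-format-line-termination *ef*)
:UNIX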
Both the :LINE-TERMINATION and :CHARACTER-ENCODING arguments default
to :DEFAULT; if :LINE-TERMINATION is specified as or defaults to
:DEFAULT, the value of CCL:*DEFAULT-LINE-TERMINATION* is used to
provide a concrete value.
When the :CHARACTER-ENCODING argument is specified as/defaults to
:DEFAULT, the concrete character encoding name that's actually used
depends on the value of the :DOMAIN argument to MAKE-EXTERNAL-FORMAT.
The :DOMAIN argument's value can be practically anything; when it's
the keyword :FILE and the :CHARACTER-ENCODING argument's value is
:DEFAULT, the concrete character encoding name that's used will be
the value of the variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING*; the
initial value of this variable is NIL (which is an alias for :ISO-8859-1).
If the value of the :DOMAIN argument is :SOCKET and the :CHARACTER-ENCODING
argument's value is :DEFAULT, the value of
CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* is used as a concrete character
encoding name. The initial value of CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING*
is NIL, again denoting the :ISO-8859-1 encoding.
If the value of the :DOMAIN argument is anything else, :ISO-8859-1 is
also used (but there's no way to override this.)
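Concretely (a hedged sketch; the encoding might instead be reported
via its NIL alias):
? (ccl:external-format-character-encoding
   (ccl:make-external-format :domain :socket))
:ISO-8859-1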
The result of a call to MAKE-EXTERNAL-FORMAT can be used as the value
of the :EXTERNAL-FORMAT argument to OPEN, LOAD, COMPILE-FILE, and
MAKE-SOCKET; it's also possible to use a few shorthand constructs
in these contexts:
* if ARG is unspecified or specified as :DEFAULT, the value of the
variable CCL:*DEFAULT-EXTERNAL-FORMAT* is used. Since the value
of this variable has historically been used to name a default
line-termination convention, this case effectively falls into
the next one:
* if ARG is a keyword which names a concrete line-termination convention,
an EXTERNAL-FORMAT equivalent to the result of calling
(MAKE-EXTERNAL-FORMAT :line-termination ARG)
will be used
* if ARG is a keyword which names a character encoding, an EXTERNAL-FORMAT
equivalent to the result of calling
(MAKE-EXTERNAL-FORMAT :character-encoding ARG)
will be used
* if ARG is a list, the result of (APPLY #'MAKE-EXTERNAL-FORMAT ARG)
will be used
(When MAKE-EXTERNAL-FORMAT is called to create an EXTERNAL-FORMAT
object from one of these shorthand designators, the value of the
:DOMAIN keyword argument is :FILE for OPEN, LOAD, and COMPILE-FILE
and :SOCKET for MAKE-SOCKET.)
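For instance, the following calls should be equivalent ways of asking
for a UTF-8-encoded, CRLF-terminated input file stream (the filename
is hypothetical):
(open "foo.txt" :direction :input
      :external-format '(:character-encoding :utf-8 :line-termination :crlf))
(open "foo.txt" :direction :input
      :external-format (ccl:make-external-format
                        :domain :file
                        :character-encoding :utf-8
                        :line-termination :crlf))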
STREAM-EXTERNAL-FORMAT.
The CL function STREAM-EXTERNAL-FORMAT - which is portably defined
on FILE-STREAMs - can be applied to any open stream in this release
and will return an EXTERNAL-FORMAT object when applied to an open
CHARACTER-STREAM. For open CHARACTER-STREAMs (other than STRING-STREAMs),
SETF can be used with STREAM-EXTERNAL-FORMAT to change the stream's
character encoding, line-termination, or both.
If a "shorthand" external-format designator is used in a call to
(SETF STREAM-EXTERNAL-FORMAT), the "domain" used to construct an
EXTERNAL-FORMAT is derived from the class of the stream in the
obvious way (:FILE for FILE-STREAMs, :SOCKET for ... well, for
sockets ...)
Note that the effect of doing something like:
(let* ((s (open "foo" ... :external-format :utf-8)))
...
(unread-char ch s)
(setf (stream-external-format s) :us-ascii)
(read-char s))
might or might not be what was intended. The current behavior is
that the call to READ-CHAR will return the previously unread character
CH, which might surprise any code which assumes that the READ-CHAR
will return something encodable in 7 or 8 bits. Since functions
like READ may call UNREAD-CHAR "behind your back", it may or may
not be obvious that this has even occurred; the best approach to
dealing with this issue might be to avoid using READ or explicit
calls to UNREAD-CHAR when processing content encoded in multiple
external formats.
There's a similar issue with "bivalent" streams (sockets) which
can do both character and binary I/O with an :ELEMENT-TYPE of
(UNSIGNED-BYTE 8). Historically, the sequence:
(unread-char ch s)
(read-byte s)
caused the READ-BYTE to return (CHAR-CODE CH); that made sense
when everything was implicitly encoded as :ISO-8859-1, but may not
make any sense anymore. (The only thing that seems to make sense
in that case is to clear the unread character and read the next
octet; that's implemented in some cases but I don't think that
things are always handled consistently.)
Command-line argument for specifying the character encoding to
be used for *TERMINAL-IO*.
Shortly after a saved lisp image starts up, it creates the standard
CL streams (like *STANDARD-OUTPUT*, *TERMINAL-IO*, *QUERY-IO*, etc.);
most of these streams are usually SYNONYM-STREAMS which reference
the TWO-WAY-STREAM *TERMINAL-IO*, which is itself composed of
a pair of CHARACTER-STREAMs. The character encoding used for
any CHARACTER-STREAMs created during this process is the one
named by the value of the variable CCL:*TERMINAL-CHARACTER-ENCODING-NAME*;
this value is initially NIL.
The -K or --terminal-encoding command-line argument can be used to
set the value of this variable (the argument is processed before the
standard streams are created.) The string which is the value of
the -K/--terminal-encoding argument is uppercased and interned in
the KEYWORD package; if an encoding named by that keyword exists,
CCL:*TERMINAL-CHARACTER-ENCODING-NAME* is set to the name of that
encoding. For example:
shell> openmcl -K utf-8
will have the effect of making the standard CL streams use :UTF-8
as their character encoding.
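Once the image is up, the setting should be visible in the variable
(a quick sketch):
? ccl:*terminal-character-encoding-name*
:UTF-8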
(It's probably possible - but a bit awkward - to use (SETF STREAM-EXTERNAL-FORMAT)
from one's init file or --eval arguments or similar to change existing
streams' character encodings; the hard/awkward parts of doing so include
the difficulty of determining which standard streams are "real" character
streams and which are aliases/composite streams.)