[Openmcl-devel] *default-character-encoding* should be :utf-8

Pascal J. Bourguignon pjb at informatimago.com
Mon Mar 5 14:36:59 UTC 2012


Antony <lisp.linux at gmail.com> writes:

> On 3/4/2012 5:53 PM, Gary Byers wrote:
>>
>> In reading some of the messages in this thread, I'm not sure that
>> some of the
>> people arguing in favor of this change are even aware of the fact
>> that there
>> are some costs associated with it, and may assume that my reluctance
>> to make
>>
> Don't know where this should fit in -
> shouldn't there be separate specials for controlling "source file"
> (for a set of functions such as load-file)
> default encoding versus the general external encoding.
> I'd think then making the source encoding utf-8 would be fine
> (ostensibly the user will compile the code in dev and fix it).
> Taking source file encoding from environment in these days of free
> source libs seems rather silly.

Well, I'd still argue to take it from the environment _by default_.


Otherwise I agree that one may do something smarter.  I'd consider that
in 99% of the cases, the encodings used are us-ascii, iso-8859-1 or
utf-8. Furthermore, most of the iso-8859-1 files won't be valid utf-8
files.


That is, most of the bytes with the high bit set in iso-8859-1 files are
accented letters, and they don't make valid utf-8 sequences:

    (ext:convert-string-from-bytes 
     (ext:convert-string-to-bytes "Il était gentil." 
                                  charset:iso-8859-1)
     charset:utf-8)

    *** - EXT:CONVERT-STRING-FROM-BYTES: Invalid byte
          sequence #xE9 #x74 #x61 in CHARSET:UTF-8 conversion
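
The error above follows from the structure of utf-8: a lead byte
announces how many continuation bytes (of the form #b10xxxxxx) must
follow it.  A minimal sketch in portable Common Lisp, where the function
names are made up for this illustration and belong to no implementation:

```lisp
;; Hypothetical helpers, for illustration only.

(defun utf-8-continuation-count (byte)
  "Number of continuation bytes announced by BYTE, or NIL if BYTE
cannot start a utf-8 sequence."
  (cond ((< byte #x80) 0)          ; 0xxxxxxx: plain us-ascii
        ((<= #xC2 byte #xDF) 1)    ; 110xxxxx
        ((<= #xE0 byte #xEF) 2)    ; 1110xxxx
        ((<= #xF0 byte #xF4) 3)    ; 11110xxx
        (t nil)))                  ; continuation byte or invalid lead

(defun utf-8-continuation-byte-p (byte)
  "True if BYTE has the form 10xxxxxx."
  (= (logand byte #xC0) #x80))

;; "était" in iso-8859-1 starts with the bytes #xE9 #x74 #x61.
(utf-8-continuation-count #xE9)    ; => 2
(utf-8-continuation-byte-p #x74)   ; => NIL
```

So #xE9 (é in iso-8859-1) promises two continuation bytes, but the next
byte, #x74 (t), is an ordinary us-ascii letter, and decoding fails.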


Therefore I'd propose to read the first line.  If it contains one null
every two bytes, we have utf-16be or utf-16le.  If it contains an emacs
file local variable -*- coding -*-, then use that to decode the rest of
the file.  (Optionally, you may also skip to the end of the file (last
512 bytes) and see if there's a Local Variables block).

Otherwise read the file as US-ASCII. Once a byte with the high bit set
is found, see if we can decode the file as utf-8 (in which case go on
with utf-8); otherwise use iso-8859-1.
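
The heuristic can be sketched in portable Common Lisp as follows.  This
is only an illustration under stated assumptions: all function names are
hypothetical, the file is assumed to be already loaded as a vector of
octets, the utf-8 check is structural only, and the Local Variables
block at the end of the file is not handled.

```lisp
;; All names below are hypothetical; this is a sketch, not CCL code.

(defun utf-16-variant (octets)
  "One null every two bytes: guess utf-16be or utf-16le, else NIL."
  (flet ((nulls-at-p (start)
           (loop for i from start below (length octets) by 2
                 always (zerop (aref octets i)))))
    (when (>= (length octets) 2)
      (cond ((nulls-at-p 0) :utf-16be)
            ((nulls-at-p 1) :utf-16le)))))

(defun octets-to-ascii (octets)
  "Map OCTETS to a string, replacing non-ascii bytes with ?."
  (map 'string (lambda (b) (if (< b #x80) (code-char b) #\?)) octets))

(defun coding-cookie (line)
  "Value of a -*- coding: ... -*- variable on LINE as a keyword, or NIL."
  (let* ((open (search "-*-" line))
         (key (and open
                   (search "coding:" line :start2 open :test #'char-equal))))
    (when key
      (flet ((blank-p (c) (member c '(#\Space #\Tab)))
             (delimiter-p (c) (member c '(#\Space #\Tab #\;))))
        (let ((start (position-if-not #'blank-p line
                                      :start (+ key (length "coding:")))))
          (when start
            (let ((end (min (or (position-if #'delimiter-p line :start start)
                                (length line))
                            (or (search "-*-" line :start2 start)
                                (length line)))))
              (when (< start end)
                (intern (string-upcase (subseq line start end))
                        :keyword)))))))))

(defun valid-utf-8-p (octets)
  "Structural check only: each lead byte is followed by the right number
of continuation bytes.  Overlong forms and surrogates are not rejected."
  (let ((i 0) (n (length octets)))
    (loop
      (when (>= i n) (return t))
      (let* ((b (aref octets i))
             (k (cond ((< b #x80) 0)
                      ((<= #xC2 b #xDF) 1)
                      ((<= #xE0 b #xEF) 2)
                      ((<= #xF0 b #xF4) 3))))
        (unless (and k
                     (< (+ i k) n)
                     (loop for j from (1+ i) to (+ i k)
                           always (= (logand (aref octets j) #xC0) #x80)))
          (return nil))
        (incf i (1+ k))))))

(defun guess-encoding (octets)
  "Guess the encoding of OCTETS, the whole file as a vector of bytes."
  (let* ((eol (or (position 10 octets) (length octets)))
         (first-line (subseq octets 0 eol)))
    (cond ((utf-16-variant first-line))
          ((coding-cookie (octets-to-ascii first-line)))
          ((every (lambda (b) (< b #x80)) octets) :us-ascii)
          ((valid-utf-8-p octets) :utf-8)
          (t :iso-8859-1))))

;; "Il était" encoded in iso-8859-1, then in utf-8:
(guess-encoding #(#x49 #x6C #x20 #xE9 #x74 #x61 #x69 #x74))
;; => :ISO-8859-1
(guess-encoding #(#x49 #x6C #x20 #xC3 #xA9 #x74 #x61 #x69 #x74))
;; => :UTF-8
(guess-encoding (map 'vector #'char-code ";;; -*- coding: iso-8859-15 -*-"))
;; => :ISO-8859-15
```

The sketch validates the whole buffer in one go for simplicity; a real
reader would rather switch decoders at the first byte with the high bit
set, as described above.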

All this, only by default, if no explicit encoding has been given with
*default-file-character-encoding* or :external-format.


(Of course people using iso-8859-7, or iso-8859-15, or KOI-8 won't be
happy, but they're aware of encoding problems and can set their
environment variables or *default-source-character-encoding* or -*-
coding -*-).


> Any other data should be user problem even if there is a helpful
> default encoding that is taken from environment.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.



