[Openmcl-devel] string encoding problem

Ron Garret ron at flownet.com
Fri Aug 7 08:29:42 PDT 2015


You’re confused about several things:

1.  There is no such thing as a UTF-8 string.  There is only a UTF-8 encoding of a string.  Not the same thing.  A string is a sequence of characters.  An encoding is a sequence of bytes.  (Note that these may or may not be 8-bit bytes.  CCL, for example, encodes strings internally as UTF-32, which uses 32-bit bytes.)

2.  There is no such thing as iso-8659-1.  (Well, there is, but it’s a standard for testing plastic valves :-)  You almost certainly meant iso-8859-1, a.k.a. latin-1.

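Latin-1 is an 8-bit encoding that uses a single byte per character, so it can only represent the 256 characters with code points U+0000 through U+00FF, a tiny subset of all unicode characters.  Cyrillic is not part of that subset, but fortunately for you every character in your garbled string is (see the character list in your message), so encoding it as latin-1 gives back the original bytes and there’s some hope.  Let’s try it: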

? (setf s "Економічні реформи УкраїнÐ")
"Економічні реформи УкраїнÐ"
? (encode-string-to-octets s :external-format :latin-1)
#(208 149 208 186 208 190 208 189 208 190 208 188 209 150 209 135 208 189 209 150 32 209 128 208 181 209 132 208 190 209 128 208 188 208 184 32 208 163 208 186 209 128 208 176 209 151 208 189 208)
49
? (decode-string-from-octets * :external-format :utf-8)
"Економічні реформи Україн"


Is that the text you were after?
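
If there are many such strings, the same round trip can be wrapped in a function.  What follows is only a minimal sketch: REPAIR-MOJIBAKE and *GARBLED* are made-up names, not part of CCL, and it assumes every character in the garbled string has a code below 256.  The test string is built from the 50 character codes listed in your message rather than typed in, because some of them (U+0095, for instance) are invisible control characters that do not survive a mail message.  The transcript above shows only 49 octets and stops one letter short, so the final character seems to have been lost along the way.

```lisp
;; Sketch only: REPAIR-MOJIBAKE is a hypothetical helper, not a CCL function.
;; Assumes every character of GARBLED has a code below 256.
(defun repair-mojibake (garbled)
  ;; Encoding as latin-1 turns each character back into the single
  ;; byte it was read from; those bytes are the original UTF-8 data.
  (let ((octets (ccl:encode-string-to-octets garbled
                                             :external-format :latin-1)))
    ;; VALUES drops any extra return values and keeps the string.
    (values (ccl:decode-string-from-octets octets
                                           :external-format :utf-8))))

;; The garbled string, rebuilt from the character codes in the
;; original message (50 characters in all).
(defparameter *garbled*
  (map 'string #'code-char
       '(208 149 208 186 208 190 208 189 208 190 208 188 209 150 209 135
         208 189 209 150 32 209 128 208 181 209 132 208 190 209 128 208
         188 208 184 32 208 163 208 186 209 128 208 176 209 151 208 189
         208 184)))
```

With the full string, (repair-mojibake *garbled*) should return "Економічні реформи України".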

BTW, how did you end up “accidentally” writing a string as latin-1?  All of the defaults in recent versions of CCL should be set to utf-8.
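
One way to avoid a repeat is to name the external format explicitly whenever the file is opened, so the defaults no longer matter.  A minimal sketch, assuming the data is a list of strings written one per line; SAVE-LINES-AS-UTF-8 and LOAD-LINES-AS-UTF-8 are made-up names for illustration:

```lisp
;; Sketch only: both function names are hypothetical helpers.
(defun save-lines-as-utf-8 (lines path)
  ;; Write each string in LINES on its own line, encoded as UTF-8.
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format :utf-8)
    (dolist (line lines)
      (write-line line out))))

(defun load-lines-as-utf-8 (path)
  ;; Read the file back with the same encoding it was written in.
  (with-open-file (in path :external-format :utf-8)
    (loop for line = (read-line in nil)
          while line
          collect line)))
```

As long as both sides name the same encoding, what is read back should match what was written.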

rg

On Aug 7, 2015, at 7:38 AM, Mark Klein <m_klein at MIT.EDU> wrote:

> 
> I had a set of UTF-8 strings (ukrainian text) that I mistakenly wrote to disk as iso-8659-1, and then read back into clozure as UTF-8, so my strings get messed up e.g. into
> 
> Економічні реформи України
> 
> (#\Latin_Capital_Letter_Eth #\U+0095 #\Latin_Capital_Letter_Eth #\Masculine_Ordinal_Indicator #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_Three_Quarters #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_One_Half #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_Three_Quarters #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_One_Quarter #\Latin_Capital_Letter_N_With_Tilde #\U+0096 #\Latin_Capital_Letter_N_With_Tilde #\U+0087 #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_One_Half #\Latin_Capital_Letter_N_With_Tilde #\U+0096 #\  #\Latin_Capital_Letter_N_With_Tilde #\U+0080 #\Latin_Capital_Letter_Eth #\Micro_Sign #\Latin_Capital_Letter_N_With_Tilde #\U+0084 #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_Three_Quarters #\Latin_Capital_Letter_N_With_Tilde #\U+0080 #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_One_Quarter #\Latin_Capital_Letter_Eth #\Cedilla #\  #\Latin_Capital_Letter_Eth #\Pound_Sign #\Latin_Capital_Letter_Eth #\Masculine_Ordinal_Indicator #\Latin_Capital_Letter_N_With_Tilde #\U+0080 #\Latin_Capital_Letter_Eth #\Degree_Sign #\Latin_Capital_Letter_N_With_Tilde #\U+0097 #\Latin_Capital_Letter_Eth #\Vulgar_Fraction_One_Half #\Latin_Capital_Letter_Eth #\Cedilla)
> 
> Is there some way to recover the original Ukrainian text?
> 
>   Thanks!
> 
> 	Mark
> 
> -------------------------------
> Mark Klein
> http://cci.mit.edu/klein
> 
