This article gives a brief overview of Unicode encodings (UTF-32 and UTF-8).

1. Unicode

cb-unicode

2. UTF-8

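UTF-8 is a variable-length encoding of Unicode characters. It was invented by Ken Thompson and Rob Pike, two of the creators of Go, and is now a Unicode standard. UTF-8 uses 1 to 4 bytes to encode each character: ASCII characters take 1 byte, most other characters in common use take 2 or 3 bytes, and only a small minority need 4 bytes. The high-order bits of the first byte indicate how many bytes the character occupies. A high-order 0 marks 7-bit ASCII, which takes a single byte; a high-order 110 marks a 2-byte character whose second byte begins with 10; longer encodings follow the same pattern: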

0xxxxxxx                                runes 0−127       (ASCII)

110xxxxx 10xxxxxx                       128−2047          (values <128 unused)

1110xxxx 10xxxxxx 10xxxxxx              2048−65535        (values <2048 unused)

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx     65536−0x10ffff    (other values unused)
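
The bit layouts in the table above can be sketched in Go as a hand-written encoder. This is only an illustration: `encodeRune` is a hypothetical helper name, it assumes a valid non-negative code point, and it skips the checks that the standard library performs (surrogates, values above 0x10ffff). The `main` function prints its result next to `utf8.EncodeRune` from the standard library for comparison.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRune is a hypothetical helper that packs a code point into UTF-8
// by hand, following the four bit layouts in the table.
// Assumption: r is a valid code point; surrogates and out-of-range values
// are not rejected here.
func encodeRune(r rune) []byte {
	switch {
	case r < 0x80: // 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r < 0x10000: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte(r>>12)&0x3F,
			0x80 | byte(r>>6)&0x3F,
			0x80 | byte(r)&0x3F,
		}
	}
}

func main() {
	buf := make([]byte, utf8.UTFMax)
	// One sample code point for each encoded length (1 to 4 bytes).
	for _, r := range []rune{'A', 0x05D0, 0x4E16, 0x1F600} {
		n := utf8.EncodeRune(buf, r)
		fmt.Printf("%U  manual=% x  stdlib=% x\n", r, encodeRune(r), buf[:n])
	}
}
```

In real code the `unicode/utf8` package is the better choice; the manual version is only meant to make the bit patterns visible.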

The following description of UTF-8 is also a useful reference:

UTF-8 represents each Unicode character using a variable number of bytes. For instance, it represents A with one byte, 65; it represents the Hebrew character Aleph, which has code 1488 in Unicode, with the two-byte sequence 215–144. UTF-8 represents all characters in the ASCII range as in ASCII, that is, with a single byte smaller than 128. It represents all other characters using sequences of bytes where the first byte is in the range [194,244] and the continuation bytes are in the range [128,191]. More specifically, the range of the starting bytes for two-byte sequences is [194,223]; for three-byte sequences, the range is [224,239]; and for four-byte sequences, it is [240,244]. None of those ranges overlap. This property ensures that the code sequence of any character never appears as part of the code sequence of any other character. In particular, a byte smaller than 128 never appears in a multibyte sequence; it always represents its corresponding ASCII character.
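
The byte ranges in the passage above can be turned into a small classifier. This is a rough sketch; `byteKind` and its label strings are hypothetical names chosen for illustration. It looks at one byte at a time and does not validate whole sequences.

```go
package main

import "fmt"

// byteKind is a hypothetical helper that classifies a single byte using
// the ranges quoted in the passage above.
func byteKind(b byte) string {
	switch {
	case b < 128:
		return "ascii"
	case b <= 191:
		return "continuation" // 128-191
	case b >= 194 && b <= 223:
		return "lead2" // first byte of a two-byte sequence
	case b >= 224 && b <= 239:
		return "lead3" // first byte of a three-byte sequence
	case b >= 240 && b <= 244:
		return "lead4" // first byte of a four-byte sequence
	default:
		return "invalid" // 192, 193 and 245-255 fall outside every range
	}
}

func main() {
	// "A" followed by the Hebrew letter Aleph (U+05D0).
	for _, b := range []byte("A\u05d0") {
		fmt.Printf("%3d %s\n", b, byteKind(b))
	}
}
```

Because the ranges do not overlap, a decoder can tell from any single byte whether it starts a character or continues one.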

For example, the character 国 has the 16-bit Unicode escape \u56fd; encoded as UTF-8 it occupies three bytes, e5 9b bd:

cb-utf8-guo
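
The same bytes can be checked with a short Go program. This is a minimal sketch; `hexBytes` is a hypothetical helper name, and the rest uses the standard `fmt` and `unicode/utf8` packages.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// hexBytes is a hypothetical helper that renders the raw bytes of s
// as space-separated hex.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	s := "\u56fd" // the character 国

	fmt.Println(hexBytes(s)) // e5 9b bd

	// Show the bit pattern of each byte: the lead byte starts with 1110,
	// the two continuation bytes start with 10.
	fmt.Printf("%08b\n", []byte(s))

	// Decode the bytes back into a code point.
	r, size := utf8.DecodeRuneInString(s)
	fmt.Printf("%U %d\n", r, size) // U+56FD 3
}
```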

Note: the 16-bit escape form is \uhhhh and the 32-bit escape form is \Uhhhhhhhh (note the uppercase U).

The following literals all denote the same string:
"世界"
"\xe4\xb8\x96\xe7\x95\x8c"
"\u4e16\u754c"
"\U00004e16\U0000754c"

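As a quick check, the four literals above can be compared in Go. This is a small sketch; `literals` and `allEqual` are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// literals holds the four spellings of the same string listed above.
var literals = []string{
	"世界",
	"\xe4\xb8\x96\xe7\x95\x8c",
	"\u4e16\u754c",
	"\U00004e16\U0000754c",
}

// allEqual is a hypothetical helper that reports whether every string
// in ss equals the first one.
func allEqual(ss []string) bool {
	for _, s := range ss {
		if s != ss[0] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allEqual(literals)) // true

	s := literals[0]
	// len counts bytes; RuneCountInString counts Unicode characters.
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 2
}
```

The escape forms differ only in the source code; after compilation each literal holds the same six bytes.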


See also:

  1. Viewing character encodings (UTF-8)

  2. The Go Programming Language (p. 68)