
1. Unicode


2. UTF-8

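UTF-8 is a variable-length encoding of Unicode characters. It was invented by Ken Thompson and Rob Pike, who are also among the creators of Go, and it is now a Unicode standard. UTF-8 uses 1 to 4 bytes to encode each character: ASCII characters take 1 byte, most other commonly used characters take 2 or 3 bytes, and only a small fraction need 4. The high-order bits of the first byte indicate how long the sequence is. A high-order 0 marks a 7-bit ASCII character that takes a single byte; a high-order 110 marks a 2-byte character whose second byte begins with 10; longer sequences follow the same pattern: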

0xxxxxxx                                runes 0−127       (ASCII)

110xxxxx 10xxxxxx                       128−2047          (values <128 unused)

1110xxxx 10xxxxxx 10xxxxxx              2048−65535        (values <2048 unused)

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx     65536−0x10ffff    (other values unused)
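
The bit layouts in the table can be sketched directly in Go. The helper `encodeRune` below is hypothetical and written only for illustration: it assumes a valid code point and, unlike the standard library's `utf8.EncodeRune`, does not reject surrogates or values above 0x10ffff. The `main` function prints both results side by side so they can be compared.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRune is a hypothetical helper that packs a code point into UTF-8
// by following the four bit layouts in the table. It assumes r is a valid
// code point and performs no validation.
func encodeRune(r rune) []byte {
	switch {
	case r < 0x80: // 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | byte(r)&0x3F,
		}
	case r < 0x10000: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte(r>>6)&0x3F,
			0x80 | byte(r)&0x3F,
		}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte(r>>12)&0x3F,
			0x80 | byte(r>>6)&0x3F,
			0x80 | byte(r)&0x3F,
		}
	}
}

func main() {
	// One sample rune for each row of the table.
	for _, r := range []rune{'A', 'é', '世', '😀'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)
		fmt.Printf("%U manual=% x stdlib=% x\n", r, encodeRune(r), buf[:n])
	}
}
```

In real code, `utf8.EncodeRune` or a plain `string(r)` conversion is the safer choice; the manual version only makes the bit shuffling visible.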


UTF-8 represents each Unicode character using a variable number of bytes. For instance, it represents A with one byte, 65; it represents the Hebrew character Aleph, which has code 1488 in Unicode, with the two-byte sequence 215, 144. UTF-8 represents all characters in the ASCII range as in ASCII, that is, with a single byte smaller than 128. It represents all other characters using sequences of bytes where the first byte is in the range [194,244] and the continuation bytes are in the range [128,191]. More specifically, the range of the starting bytes for two-byte sequences is [194,223]; for three-byte sequences, the range is [224,239]; and for four-byte sequences, it is [240,244]. None of those ranges overlap. This property ensures that the code sequence of any character never appears as part of the code sequence of any other character. In particular, a byte smaller than 128 never appears in a multibyte sequence; it always represents its corresponding ASCII character.
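
Because the first-byte ranges do not overlap, a single byte is enough to tell where a character starts and how long it is. The sketch below illustrates this with a hypothetical helper, `seqLen`, whose ranges are taken from the paragraph above; it is not part of any library, and it inspects only one byte, so it does not validate whole sequences.

```go
package main

import "fmt"

// seqLen is a hypothetical helper: given one byte, it returns the length
// of the UTF-8 sequence that byte would start, or 0 if the byte cannot
// start a sequence (a continuation byte or an invalid lead byte).
func seqLen(b byte) int {
	switch {
	case b < 128:
		return 1 // ASCII
	case b >= 194 && b <= 223:
		return 2
	case b >= 224 && b <= 239:
		return 3
	case b >= 240 && b <= 244:
		return 4
	default:
		return 0 // continuation byte [128,191] or an unused value
	}
}

func main() {
	s := "A\u05d0" // the letter A followed by Hebrew Aleph (code 1488)
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: %d (seqLen %d)\n", i, s[i], seqLen(s[i]))
	}
}
```

Indexing a Go string with `s[i]` yields raw bytes, which is why the loop sees 65, 215, and 144 rather than two characters.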



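Note: in Go string and rune literals, a 16-bit Unicode escape is written \uhhhh and a 32-bit escape is written \Uhhhhhhhh (note the uppercase U), where each h is a hexadecimal digit.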



  1. Viewing character encodings (UTF-8)

  2. The Go Programming Language (p. 68)