How can I find out the encoding of this corrupted Chinese text, which an online tool fixes correctly?
I have a text in Simplified Chinese which, when read as UTF-8, begins with ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼. The online tool from MandarinTools (first search result for Repair Corrupted Chinese Email) fixes it to the correct 从很久以前开始, but it's not clear how it does that. From using the online tool and a hex editor I know that each character takes up a fixed 32 bits in the file:
c2b4 c393 从
c2ba c39c 很
c2be c383 久
c392 c394 以
c387 c2b0 前
c2bf c2aa 开
c38a c2bc 始
This also shows that a character is encoded as two 16-bit words in the c2**-c3** range. With UTF-16 the first 16-bit word is always 0 for these characters. UTF-8 only uses 24 bits per character for these and Codepage 936 only uses 16 bits per character here. Which method can I use to determine the correct encoding conversion?
utf-8 representation:
e4bb 8e 从
e5be 88 很
e4b9 85 久
e4bb a5 以
e589 8d 前
e5bc 80 开
e5a7 8b 始
cp936 representation:
b4d3 从
badc 很
bec3 久
d2d4 以
c7b0 前
bfaa 开
cabc 始
1 Answer
The corrupted text ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼ is 14 characters long. Since the correct Simplified Chinese text 从很久以前开始 is 7 characters long, that immediately suggests that each Simplified Chinese character might correspond to two characters in the corrupted text.
The characters in the corrupted text have the following hex code point values (their UTF-16 values without the leading 00 byte). Taken in pairs, they match the cp936 bytes shown in the OP:
´ => b4
Ó => d3
º => ba
Ü => dc
¾ => be
Ã => c3
Ò => d2
Ô => d4
Ç => c7
° => b0
¿ => bf
ª => aa
Ê => ca
¼ => bc
I did that translation using a trivial Java program, but there are online sites that can do the same thing.
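The table above can also be reproduced with a few lines of Python. This is a minimal sketch, not the Java program mentioned above, and the helper name `hex_values` is made up for illustration. It prints the Unicode code point of each character in the corrupted string, which for these characters fits in a single byte.

```python
def hex_values(text: str) -> list:
    """Return the code point of each character as a lowercase hex string."""
    return [f"{ord(ch):02x}" for ch in text]


# The corrupted sample from the question (14 characters).
corrupted = "´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼"

for ch, value in zip(corrupted, hex_values(corrupted)):
    print(f"{ch} => {value}")
```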
So all the Mandarin Tool needs to do is combine the hex values of the first two corrupted characters to get the first Simplified Chinese character using CP 936, and so on:
´ + Ó => b4 + d3 => b4d3 => 从
º + Ü => ba + dc => badc => 很
¾ + Ã => be + c3 => bec3 => 久
Ò + Ô => d2 + d4 => d2d4 => 以
Ç + ° => c7 + b0 => c7b0 => 前
¿ + ª => bf + aa => bfaa => 开
Ê + ¼ => ca + bc => cabc => 始
Presumably the Mandarin Tool verifies that the transformation of the corrupted text really does result in valid Simplified Chinese text.
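The pairing step can be sketched in Python as follows. It assumes the corruption came from decoding cp936 bytes as ISO-8859-1 (Latin-1); that fits this sample, but it is an assumption, and text containing bytes in the 0x80-0x9F range may need `cp1252` instead. The names `repair` and `looks_chinese` are hypothetical, and the plausibility check is only a guess at the kind of verification such a tool might do.

```python
def repair(corrupted: str, wrong: str = "latin-1", right: str = "gbk") -> str:
    """Undo a mis-decoding: recover the original bytes, then decode them properly.

    Raises UnicodeError if the text does not fit the assumed encodings.
    """
    raw = corrupted.encode(wrong)  # each corrupted character becomes one byte again
    return raw.decode(right)       # byte pairs such as b4 d3 decode to one hanzi each


def looks_chinese(text: str) -> bool:
    """Rough plausibility check: every character is a CJK Unified Ideograph."""
    return all("\u4e00" <= ch <= "\u9fff" for ch in text)


fixed = repair("´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼")
print(fixed, looks_chinese(fixed))
```

Because both steps use strict error handling, a wrong guess about either encoding tends to fail with an exception instead of silently producing more garbage.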
Each Simplified Chinese cp936 value can then be mapped to its Unicode code point. For example, 从 = 0xB4D3 = code point 0x4ECE. And once you have the Unicode code point you can translate to any encoding you wish (cp936, GB 18030, UTF-16, etc).
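As a small illustration of that mapping, the Python sketch below decodes the cp936 bytes b4 d3, shows the resulting code point, and re-encodes it in a few target encodings. The helper name `reencode` is made up for this example.

```python
def reencode(cp936_hex: str, target: str) -> str:
    """Decode cp936 bytes given as hex, then return the target encoding as hex."""
    ch = bytes.fromhex(cp936_hex).decode("cp936")
    return ch.encode(target).hex()


ch = bytes.fromhex("b4d3").decode("cp936")  # cp936 bytes -> Unicode character
print(ch, hex(ord(ch)))                      # the character and its code point

# Re-encode the same code point in several encodings.
for name in ("cp936", "gb18030", "utf-8", "utf-16-be"):
    print(name, reencode("b4d3", name))
```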
One point I am unclear on in your question is the first listing, showing the 32-bit representations of each Simplified Chinese character (e.g. c2b4 c393 从). That doesn't look right, since the code point of a character (e.g. 0x4ECE for 从) and its 32-bit (UTF-32) representation are the same value. Or am I misunderstanding something?
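One possible explanation for that listing, offered as an assumption and not a confirmed diagnosis: if the cp936 bytes were decoded as ISO-8859-1 and the resulting text was then saved as UTF-8, every original byte becomes a two-byte UTF-8 sequence, so each hanzi ends up occupying four bytes. The sketch below (with a hypothetical helper `corrupt`) replays that chain; `bytes.hex` with a separator needs Python 3.8 or later.

```python
def corrupt(text: str) -> bytes:
    """Replay the suspected corruption: cp936 bytes read as Latin-1, saved as UTF-8."""
    cp936_bytes = text.encode("cp936")        # e.g. b4 d3 for the first character
    mojibake = cp936_bytes.decode("latin-1")  # one accented character per byte
    return mojibake.encode("utf-8")           # each of those takes two bytes in UTF-8


for ch in "从很久以前开始":
    print(corrupt(ch).hex(" ", 2), ch)
```

For the first character this prints `c2b4 c393 从`, matching the first line of the question's hex listing.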