At least one related question on SO proved useful when trying to decode unicode sequences. I am preprocessing a large amount of text of different genres. Some texts are economic, some are technical, and so on. One of the caveats is converting unicode escape sequences:

    'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.

Such a string needs to be converted to actual characters:

    'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.

This can be done like so:

    s = "'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek."
    s = s.encode('utf-8').decode('unicode-escape')

(At least, this works when s is an input line read from a UTF-8 encoded text file. I can't seem to get it to work on an online service such as REPL.it, where the output is encoded/decoded differently.)

In most cases this works fine. However, when a directory path appears in the input string (which is often the case for the technical documents in my data set), a UnicodeDecodeError is raised.

Given the following data in unicode.txt:

    'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
    Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).

With bytestring representation:

    b"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\\u0115ch \\u010camek, Financial Director and Director of Controlling.\r\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows))."

The following script fails when decoding the second line of the input file:

    with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
        lines = ''.join(fin.readlines())
        lines = lines.encode('utf-8').decode('unicode-escape')
        fout.write(lines)

With trace:

    Traceback (most recent call last):
      File "C:/Python/files/fast_aligning/unicode-encoding.py", line 3, in <module>
        lines = lines.encode('utf-8').decode('unicode-escape')
    UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 275-278: truncated \uXXXX escape

    Process finished with exit code 1

How can I make sure that the first sentence is still "translated" correctly, as shown before, while the second sentence is left untouched? The expected output for the two lines given is therefore the first line with its escape sequences converted and the second line unchanged.
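One possible direction, offered as a minimal sketch rather than a definitive answer: instead of running the whole text through the `unicode-escape` codec, substitute only those sequences that are a backslash, a `u`, and exactly four hexadecimal digits. The names `decode_unicode_escapes` and `_UNICODE_ESCAPE` are hypothetical and introduced here purely for illustration. The sketch assumes that path fragments such as `\udfs` never happen to be followed by four hex digits, and it does not recombine surrogate pairs.

```python
import re

# A backslash, the letter 'u', and exactly four hexadecimal digits.
# (Hypothetical helper names; not part of the original script.)
_UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_unicode_escapes(text):
    # Replace each well-formed escape with the character it names and
    # leave everything else, including Windows paths, untouched.
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw strings keep the backslashes literal, as they would be when the
# text is read from a file.
sample = (
    r"says Vojt\u0115ch \u010camek." "\n"
    r"d:\udfs\math.dll (op Windows)"
)
print(decode_unicode_escapes(sample))
```

In the script from the question, the line calling `.encode('utf-8').decode('unicode-escape')` could then be swapped for `lines = decode_unicode_escapes(lines)`. A path component that really does start with `u` plus four hex digits (for example a folder named `udfa1`) would still be converted, so this is a heuristic and not a guarantee.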