
Python third-party libraries: detecting string encodings with chardet (a classic Python 3 programming example)

Editor: Python

1. Introduction to chardet

chardet is an easy-to-use third-party library for detecting character encodings. It supports Chinese, Japanese, Korean, and many other languages.

String encoding has always been a headache, especially when dealing with non-standard third-party web pages. Python represents Unicode text with the str type and raw binary data with the bytes type, and converts between them with the encode() and decode() methods. However, if you do not know which encoding a bytes object uses, you cannot reliably call decode() on it.

To convert bytes of unknown encoding into str, you first have to "guess" the encoding. The guess works by collecting the characteristic byte patterns of each encoding in advance and then checking the input against them, which gives a high probability of guessing correctly.
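To make the problem concrete, here is a minimal sketch (standard library only) of what happens when bytes are decoded with the wrong codec. The helper name try_decode is hypothetical and only used for illustration.

```python
from typing import Optional

# Bytes whose encoding we pretend not to know (they are actually GBK).
data = "中文".encode("gbk")

def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

print(try_decode(data, "utf-8"))    # None: the bytes are not valid UTF-8
print(try_decode(data, "gbk"))      # 中文: the correct codec
print(try_decode(data, "latin-1"))  # decodes without error, but yields mojibake
```

A wrong guess either raises an error or, worse, silently produces garbage, which is why a statistical detector is useful.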

Official documentation: https://chardet.readthedocs.io/en/latest/

GitHub repository: https://github.com/chardet/chardet

Installation: pip3 install chardet

Encodings that chardet can currently detect:

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR, Johab (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

2. Using chardet

2.1 Detecting ASCII

Once we have a bytes object, we can detect its encoding. With chardet, this takes a single line of code:

import chardet

print(chardet.detect(b'Hello, world!'))
# Output:
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
# The detected encoding is ascii. The confidence field gives the
# probability that the detection is correct, here 1.0 (100%).

2.2 Detecting GBK

import chardet

# Chinese sample text; an ASCII-only string would simply be detected as ascii.
data = '真相只有一个'.encode('gbk')
print(chardet.detect(data))
# Output:
# {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
# The detected encoding is GB2312. GBK is a superset of GB2312, so the two
# belong to the same encoding family. The confidence is 0.99 (99%), and the
# language field reports 'Chinese'.
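Because GBK is a superset of GB2312, a result of GB2312 does not guarantee that the strict gb2312 codec can decode the data. The sketch below uses only the standard library; the helper decodes_with is hypothetical, and the sample character is one that exists in GBK but not in GB2312.

```python
# A traditional character that GBK covers but GB2312 does not.
text = "龍"
raw = text.encode("gbk")

def decodes_with(raw: bytes, encoding: str) -> bool:
    """Return True if raw can be decoded with the given codec."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_with(raw, "gb2312"))   # False: outside the GB2312 repertoire
print(decodes_with(raw, "gbk"))      # True
print(decodes_with(raw, "gb18030"))  # True: GB18030 extends GBK
```

In practice it is therefore usually safer to decode with gbk or gb18030 when chardet reports GB2312.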

2.3 Detecting UTF-8

import chardet

# Chinese sample text; an ASCII-only string would simply be detected as ascii.
data = '真相只有一个'.encode('utf-8')
print(chardet.detect(data))
# Output:
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

2.4 Detecting Japanese

import chardet

data = '真実はいつもひとつ'.encode('euc-jp')
print(chardet.detect(data))
# Output:
# {'encoding': 'EUC-JP', 'confidence': 1.0, 'language': 'Japanese'}

As these examples show, detecting an encoding with chardet is simple. Once the encoding is known, decode the bytes into str for convenient further processing.
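The detect-then-decode step can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function decode_unknown, its parameters (detect, min_confidence, fallback), and the SUPERSETS table are all hypothetical and not part of chardet. The only chardet call it relies on is chardet.detect, which returns a dict with the 'encoding' and 'confidence' keys shown above.

```python
from typing import Callable, Optional

# Hypothetical mapping (not part of chardet): decode with a superset codec
# so that characters outside the detected, narrower encoding still decode.
SUPERSETS = {"gb2312": "gb18030", "ascii": "utf-8"}

def decode_unknown(
    data: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    min_confidence: float = 0.5,
    fallback: str = "utf-8",
) -> str:
    """Guess the encoding of data and decode it to str."""
    if detect is None:
        import chardet  # third-party: pip3 install chardet
        detect = chardet.detect
    guess = detect(data)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # Use the fallback when there is no guess or the guess is too weak.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    encoding = SUPERSETS.get(encoding.lower(), encoding)
    # errors="replace" avoids an exception if the guess turns out to be wrong.
    return data.decode(encoding, errors="replace")
```

Calling decode_unknown(data) with no detect argument uses chardet.detect. The detect parameter exists only so that another detector, or a stub in tests, can be substituted. The 0.5 threshold is an arbitrary illustrative value and should be tuned for the data at hand.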