
Python third-party libraries: using chardet to detect string encodings (a classic Python 3 programming example)

Editor: Python

1. Introduction to chardet

chardet is an easy-to-use third-party library. It can detect the encoding of text in Chinese, Japanese, Korean, and many other languages.

String encoding has always been a headache, especially when dealing with non-standard third-party web pages. Python provides two data types, str (Unicode text) and bytes, and converts between them with the encode() and decode() methods. Without knowing the encoding, however, calling decode() on bytes is hard to get right.
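To make the problem concrete, here is a minimal sketch using only the standard library. The sample string and the choice of codecs are illustrative assumptions, not taken from the original article. It shows a clean round trip, a decode that fails loudly, and a decode that succeeds but silently produces garbage.

```python
# Illustrative sample text (an assumption, chosen for this sketch).
text = "真相只有一个"

# str -> bytes with a known encoding, then back again.
data = text.encode("gbk")
print(data.decode("gbk") == text)  # round trip with the right codec

# Decoding with the wrong codec can fail outright...
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# ...or, worse, succeed and return mojibake. latin-1 maps every byte to
# a character, so it never raises, but the result is not the original text.
garbled = data.decode("latin-1")
print(garbled == text)
```

The second failure mode is the more dangerous one, because nothing in the program signals that the text is wrong.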

To convert bytes of unknown encoding into str, you first have to "guess" the encoding. The guess works by collecting the characteristic byte patterns of each encoding and judging the data against them, which gives a high probability of guessing correctly.
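For contrast, the sketch below shows a much cruder way of guessing: try a fixed list of candidate encodings and keep the first one that decodes without error. This is not how chardet works internally; the function name `guess_by_trial` and the candidate list are hypothetical, introduced only for illustration. Its weakness shows why judging by characteristic patterns matters.

```python
from typing import Iterable, Optional


def guess_by_trial(
    data: bytes,
    candidates: Iterable[str] = ("ascii", "utf-8", "gbk", "euc-jp"),
) -> Optional[str]:
    """Return the first candidate codec that decodes data without error.

    Hypothetical helper for illustration; not part of chardet.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_by_trial(b"Hello, world!"))                 # ascii
print(guess_by_trial("真相只有一个".encode("utf-8")))      # utf-8
print(guess_by_trial("真相只有一个".encode("gbk")))        # gbk

# EUC-JP and GBK use overlapping byte ranges, so Japanese EUC-JP bytes
# also decode "successfully" as GBK and the trial approach picks the
# wrong answer, simply because gbk comes first in the candidate list.
print(guess_by_trial("真実はいつもひとつ".encode("euc-jp")))  # gbk
```

Because several legacy encodings accept the same byte sequences, "decodes without error" is not enough evidence on its own. A detector has to weigh how plausible the decoded characters are, which is the kind of judgment described above.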

Official documentation: https://chardet.readthedocs.io/en/latest/

GitHub repository: https://github.com/chardet/chardet

Installation: pip3 install chardet

As of this writing, chardet can detect the following encodings:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR, Johab (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

2. Using chardet

2.1 Detecting ASCII

Once you have a bytes object, you can detect its encoding. With chardet, detection takes a single line of code:

import chardet

print(chardet.detect(b'Hello, world!'))
# Output:
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
# The detected encoding is ascii. Note the confidence field:
# 1.0 means the detector is 100% confident in the result.

2.2 Detecting GBK

import chardet

# Chinese sample text: "There is only one truth"
data = '真相只有一个'.encode('gbk')
print(chardet.detect(data))
# Output:
# {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
# The detected encoding is GB2312. GBK is a superset of GB2312, so the two
# belong to the same encoding family. The confidence is 0.99 (99%), and the
# language field reports 'Chinese'.
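Because GBK is a superset of GB2312, it can be safer to decode with the gbk codec when the detector reports GB2312. The sketch below illustrates this with the standard library only. The alias table and the helper name `safer_codec` are hypothetical, and the extra character 龍 is an assumed example of a character that exists in GBK but not in GB2312.

```python
# Hypothetical mapping from a detected name to a broader codec of the
# same family; extend it as needed for your own data.
CODEC_UPGRADES = {"gb2312": "gbk"}


def safer_codec(detected: str) -> str:
    """Return a superset codec for the detected name, if one is listed."""
    name = detected.lower()
    return CODEC_UPGRADES.get(name, name)


# 龍 (a traditional character) is encodable in GBK but not in GB2312.
text = "真相只有一个龍"
data = text.encode("gbk")

try:
    data.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)                                        # strict gb2312 decode
print(data.decode(safer_codec("GB2312")) == text)       # gbk decode
```

Decoding with the broader codec costs nothing for pure GB2312 data and avoids errors when the text contains characters outside the GB2312 repertoire.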

2.3 Detecting UTF-8

import chardet

# Chinese sample text: "There is only one truth"
data = '真相只有一个'.encode('utf-8')
print(chardet.detect(data))
# Output:
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

2.4 Detecting Japanese

import chardet

# Japanese sample text: "There is always only one truth"
data = '真実はいつもひとつ'.encode('euc-jp')
print(chardet.detect(data))
# Output:
# {'encoding': 'EUC-JP', 'confidence': 1.0, 'language': 'Japanese'}

As these examples show, detecting an encoding with chardet is simple. Once you have the encoding, decode the bytes into a str for further processing.
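The detect-then-decode step can be wrapped in a small helper. The following is a minimal sketch rather than a recommended implementation. The function name `decode_unknown`, the 0.5 confidence threshold, and the UTF-8 fallback are all assumptions made for illustration. The only chardet call it relies on is `chardet.detect`, which returns a dict with `encoding`, `confidence`, and `language` keys, as in the examples above.

```python
from typing import Callable, Optional


def decode_unknown(
    data: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    min_confidence: float = 0.5,   # assumed threshold, tune for your data
    fallback: str = "utf-8",       # assumed fallback codec
) -> str:
    """Decode bytes of unknown encoding using a detector's best guess."""
    if detect is None:
        # Imported lazily so that a different detector can be injected.
        import chardet
        detect = chardet.detect

    guess = detect(data)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    # The detector may report None or a low-confidence guess.
    if not encoding or confidence < min_confidence:
        encoding = fallback

    try:
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising.
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec Python knows about.
        return data.decode(fallback, errors="replace")
```

Calling `decode_unknown(data)` with no `detect` argument uses `chardet.detect`. Passing a different callable makes the helper easy to test or to pair with another detector.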

