
Python - string encoding and decoding


Contents

    • About encoding and decoding
      • Types of encoding
    • Encoding and decoding in code
      • Converting ordinary strings to bytes
      • Decoding byte-style strings
      • URL encoding and decoding
      • Appending bytes


About encoding and decoding

Encoding and decoding are essentially a mapping.
The character A has the ASCII code 65, which is stored in the computer as 01000001.
A must be encoded to 01000001 before the computer can use it.
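
The mapping can be checked directly in Python with the built-in ord, chr, and format functions. A minimal sketch (the variable names are only illustrative):

```python
# Look up the ASCII code of a character and its 8-bit binary form.
code_point = ord('A')               # character -> number
bits = format(code_point, '08b')    # number -> 8-bit binary string
restored = chr(int(bits, 2))        # binary string -> character

print(code_point)  # 65
print(bits)        # 01000001
print(restored)    # A
```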

Encode: the mapping from real characters to a binary string (real characters → binary string).
Decode: the mapping from a binary string back to real characters (binary string → real characters).

For example:
UTF-8 bytes --> decode --> Unicode
Unicode --> encode --> GBK / UTF-8, etc.
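
The two arrows above can be chained to convert between byte encodings, with a Unicode str as the intermediate form. A small sketch, assuming the input bytes really are UTF-8:

```python
# Transcode UTF-8 bytes to GBK by going through a Unicode str.
utf8_bytes = b'\xe4\xbd\xa0\xe5\xa5\xbd'   # '你好' encoded as UTF-8
text = utf8_bytes.decode('utf-8')           # bytes -> str (Unicode)
gbk_bytes = text.encode('gbk')              # str -> bytes (GBK)

print(text)       # 你好
print(gbk_bytes)  # b'\xc4\xe3\xba\xc3'
```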


Types of encoding

  • ASCII: 1 byte per character; supports English only.
  • GB2312: 2 bytes per Chinese character; supports 6,700+ Chinese characters.
  • GBK: an upgrade of GB2312; supports 21,000+ Chinese characters; a Chinese character takes 2 bytes.
  • Unicode: a character set rather than a byte format; a character takes 2 to 4 bytes in UTF-16 and UTF-32; contains 136,690 characters (as of Unicode 10.0).
  • UTF-8: uses 1, 2, 3, or 4 bytes per character.
    It uses 1 byte where possible and adds a byte whenever that is not enough, up to a maximum of 4 bytes.
    English takes 1 byte, most European scripts 2 bytes, East Asian scripts 3 bytes, and other or special characters 4 bytes; common Chinese characters take 3 bytes.
  • UTF-16: uses 2 or 4 bytes per character.
    It uses 2 bytes where possible and 4 bytes otherwise.
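
The byte counts in the list above can be verified by encoding single characters and measuring the result. A short sketch; byte_length is a hypothetical helper introduced here only for illustration:

```python
def byte_length(ch: str, encoding: str) -> int:
    """Number of bytes that one character occupies in the given encoding."""
    return len(ch.encode(encoding))

# 'utf-16-le' is used so that the 2-byte byte-order mark is not counted.
for ch in ['a', '\u00e9', '你', '\U0001F600']:
    print(ch, byte_length(ch, 'utf-8'), byte_length(ch, 'utf-16-le'))
# a 1 2
# é 2 2
# 你 3 2
# 😀 4 4

print(byte_length('你', 'gbk'))   # 2
```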

ASCII uses 1 byte (8 bits) per character, and the first bit is always 0, so it can represent only 128 characters, which is clearly not enough.

Unicode is designed to represent every language. To avoid wasting storage (for example on the part that corresponds to ASCII, which fits in one byte), it is usually stored with a variable-length encoding. Variable length makes decoding harder, though: without extra information, a decoder cannot tell how many bytes make up one character.

UTF-8 is a prefix-based variable-length encoding designed for Unicode: the prefix bits of the leading byte tell the decoder how many bytes make up the character.
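
The prefix rule can be sketched as a small function that inspects only the leading byte. This is a simplified illustration, not a full validator: utf8_sequence_length is a hypothetical name, and the function does not reject leading bytes that are well-formed by prefix but never valid in UTF-8 (such as 0xC0).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with first_byte has."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: 1 byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: 4 bytes
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character.
    raise ValueError('not a UTF-8 leading byte')

data = 'a你'.encode('utf-8')            # b'a\xe4\xbd\xa0'
print(utf8_sequence_length(data[0]))    # 1
print(utf8_sequence_length(data[1]))    # 3
```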


Default encoding in Python

  • Python 2: the default encoding is ASCII.
  • Python 3: str is Unicode text, and the default encoding is UTF-8.
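
On Python 3 the default can be inspected at runtime. A brief sketch, assuming a Python 3 interpreter:

```python
import sys

default_encoding = sys.getdefaultencoding()
text = '你好'
encoded = text.encode()            # no argument: uses UTF-8

print(default_encoding)   # utf-8
print(type(text))         # <class 'str'>
print(len(text))          # 2  (characters)
print(len(encoded))       # 6  (bytes)
```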

Encoding and decoding in code


Converting ordinary strings to bytes

# Each sample string is followed by the bytes that encode() returns for it.
# Only the last assignment to s is actually encoded below.
s = '你好'  # UTF-8: b'\xe4\xbd\xa0\xe5\xa5\xbd'  GBK: b'\xc4\xe3\xba\xc3'
s = 'abc'  # b'abc'
s = 'นั่ง'  # UTF-8: b'\xe0\xb8\x99\xe0\xb8\xb1\xe0\xb9\x88\xe0\xb8\x87'
s = 'นั่'  # UTF-8: b'\xe0\xb8\x99\xe0\xb8\xb1\xe0\xb9\x88'
# s = 2  # AttributeError: 'int' object has no attribute 'encode'
s = '*'  # b'*'
a = s.encode('utf-8')
print(a)  # b'*'
a = s.encode('gbk')
print(a)  # b'*'
print(type(a))  # <class 'bytes'>
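
Going the other way, bytes.decode only recovers the text when it is given the same encoding that produced the bytes. The following sketch shows a round trip and what happens on a mismatch; the variable names are illustrative.

```python
text = '你好'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Decoding with the matching codec restores the original string.
print(utf8_bytes.decode('utf-8') == text)   # True
print(gbk_bytes.decode('gbk') == text)      # True

# Decoding with the wrong codec raises UnicodeDecodeError.
mismatch_error = None
try:
    gbk_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    mismatch_error = exc
    print('cannot decode GBK bytes as UTF-8')

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
lossy = gbk_bytes.decode('utf-8', errors='replace')
```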

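Decoding byte-style strings

The key is to encode with raw_unicode_escape, which turns each character below U+0100 back into the single byte with the same value.

s = '\xe5\x90\x8d\xe7\xa7\xb0'  # UTF-8 bytes that ended up stored as characters
s_bytes = s.encode('raw_unicode_escape')  # b'\xe5\x90\x8d\xe7\xa7\xb0'
s_origin = s_bytes.decode('utf-8')  # '名称' (name)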
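
An alternative that gives the same bytes for this kind of input is the latin-1 codec, which maps code points 0 to 255 one-to-one onto bytes. A sketch under that assumption; repair_byte_style_string is a hypothetical helper, not part of any library:

```python
def repair_byte_style_string(garbled: str, real_encoding: str = 'utf-8') -> str:
    """Reinterpret code points 0-255 as raw bytes, then decode them properly."""
    raw = garbled.encode('latin-1')   # each character U+0000..U+00FF -> one byte
    return raw.decode(real_encoding)

garbled = '\xe5\x90\x8d\xe7\xa7\xb0'
repaired = repair_byte_style_string(garbled)
print(repaired)  # 名称

# For characters below U+0100 the two codecs produce identical bytes.
print(garbled.encode('latin-1') == garbled.encode('raw_unicode_escape'))  # True
```

The difference shows up with characters above U+00FF: latin-1 raises UnicodeEncodeError, while raw_unicode_escape writes them as \uXXXX escape sequences, so latin-1 fails loudly when the input is not a pure byte-style string.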

URL encoding and decoding

Use the urllib library.
Reference: https://www.cnblogs.com/miaoxiaochao/p/13705936.html


import urllib.parse

s = '你好'
a = urllib.parse.quote(s)
print(a)  # %E4%BD%A0%E5%A5%BD
b = urllib.parse.unquote(a)
print(b)  # 你好
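
quote and unquote also accept safe and encoding arguments, and quote_plus is the variant used for form data. A short sketch of how these options change the output (the sample strings are arbitrary):

```python
from urllib.parse import quote, quote_plus, unquote

path_quoted = quote('a b/c')             # '/' is left alone by default
full_quoted = quote('a b/c', safe='')    # nothing is treated as safe
form_quoted = quote_plus('a b/c')        # spaces become '+'
print(path_quoted)   # a%20b/c
print(full_quoted)   # a%20b%2Fc
print(form_quoted)   # a+b%2Fc

# The percent-escapes are built from bytes, so the encoding matters.
gbk_quoted = quote('你好', encoding='gbk')
print(gbk_quoted)                           # %C4%E3%BA%C3
print(unquote(gbk_quoted, encoding='gbk'))  # 你好
```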

Appending bytes

b = b''
b += b'a'
b += b' b'
print(b)  # b'a b'
print(b.decode('utf-8'))  # a b
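
When many pieces are appended, for example chunks read from a file or socket, a mutable bytearray or b''.join is a common alternative to repeated += on bytes. The sketch below also shows why decoding is best done after the pieces are joined: a chunk boundary can fall in the middle of a multi-byte character. The chunk split is made up for illustration.

```python
payload = '你好'.encode('utf-8')         # 6 bytes
chunks = [payload[:2], payload[2:]]      # the split falls inside the first character

buffer = bytearray()                     # mutable, so += extends it in place
for chunk in chunks:
    buffer += chunk

joined = b''.join(chunks)
print(bytes(buffer) == joined)   # True
print(joined.decode('utf-8'))    # 你好

# Decoding an incomplete chunk on its own fails.
partial_failed = False
try:
    chunks[0].decode('utf-8')
except UnicodeDecodeError:
    partial_failed = True
print(partial_failed)            # True
```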

Yizhi, 2022-06-24 (Friday)

