This article explains the difference between bytes and str in Python.
Updated: 2022/6/16
1. Python has two types that can represent character sequences
bytes: contains raw data, that is, unsigned 8-bit values (usually displayed using the ASCII encoding standard).
a = b'h\x6511o'
print(list(a))
# [104, 101, 49, 49, 111]
print(a)
# b'he11o'
str: contains Unicode code points, which correspond to the text characters of human languages.
a = 'a\\u300 propos'
print(list(a))
# ['a', '\\', 'u', '3', '0', '0', ' ', 'p', 'r', 'o', 'p', 'o', 's']
print(a)
# a\u300 propos
Note that the backslash in this literal is escaped, so \u300 is stored as separate literal characters rather than as a single code point.
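To make the distinction between code points and raw bytes more concrete, here is a minimal sketch that uses a real combining character. The sample text and the variable names are chosen only for illustration and are not taken from the article.

```python
# A str holds Unicode code points; a bytes object holds raw 8-bit values.
text = 'a\u0300 propos'      # 'a' followed by U+0300 COMBINING GRAVE ACCENT
raw = text.encode('utf-8')   # the same text encoded as UTF-8 bytes

code_points = [hex(ord(ch)) for ch in text]
byte_values = list(raw)

print(len(text))         # 9 code points
print(len(raw))          # 10 bytes, because U+0300 needs two bytes in UTF-8
print(code_points[:2])   # ['0x61', '0x300']
print(byte_values[:3])   # [97, 204, 128]
```

The lengths differ because one code point can occupy more than one byte once it is encoded.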
2. Converting between Unicode data and binary data
To convert Unicode data into binary data, call the encode method of str (encoding):
FileContent = 'This is file content.'
print(FileContent)
# This is file content.
print(type(FileContent))
# <class 'str'>
FileContent = FileContent.encode(encoding='utf-8')
print(FileContent)
# b'This is file content.'
print(type(FileContent))
# <class 'bytes'>
To convert binary data into Unicode data, call the decode method of bytes (decoding):
FileContent = b'This is file content.'
print(FileContent)
# b'This is file content.'
print(type(FileContent))
# <class 'bytes'>
FileContent = FileContent.decode(encoding='utf-8')
print(FileContent)
# This is file content.
print(type(FileContent))
# <class 'str'>
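A common way to apply these two conversions is to normalize input at the edges of a program. The sketch below assumes two hypothetical helpers, to_str and to_bytes, which are illustrative names and not part of the standard library. Each accepts either type and returns the one it is named after.

```python
def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is str (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(to_str(b'caf\xc3\xa9') == 'caf\u00e9')   # True
print(to_bytes('caf\u00e9'))                   # b'caf\xc3\xa9'
print(to_str('already text'))                  # already text
```

With helpers like these, the core of a program can work with str only, and bytes appear only where data enters or leaves.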
When calling these methods, you can specify the character encoding explicitly or accept the default, which is usually UTF-8.
To check the default encoding of the current operating system, run this one line of Python from the command line:
python3 -c 'import locale; print(locale.getpreferredencoding())'
# UTF-8
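The same check can be done inside a script. The sketch below is only illustrative: it reads the default used by encode and decode and the encoding preferred by the current locale, which is what text-mode file handling normally falls back to. The variable names are assumptions made for this example.

```python
import locale
import sys

# Default used by str.encode() and bytes.decode() when no encoding is passed.
codec_default = sys.getdefaultencoding()

# Encoding preferred by the current locale; the value depends on the platform.
preferred = locale.getpreferredencoding(False)

print(codec_default)   # utf-8 on Python 3
print(preferred)       # platform dependent, so no fixed output is shown here
```

If the two values differ on a given machine, passing encoding explicitly avoids surprises.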
3. Working with raw 8-bit values and Unicode strings
There are two problems to watch for when working with raw 8-bit values and Unicode strings (in other words, when working with bytes and str).
The first problem is that bytes and str are not compatible with each other.
The + operator:
# bytes + bytes
print(b'a' + b'1')
# b'a1'
# str + str
print('b' + '2')
# b2
# str + bytes
print('c' + b'2')
# TypeError: can only concatenate str (not "bytes") to str
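One way around the TypeError is to convert one operand so that both sides have the same type. This is a minimal sketch. The ASCII encoding is an assumption that happens to fit this sample data.

```python
name = 'c'      # str
suffix = b'2'   # bytes

# Decode the bytes operand to get str + str ...
as_text = name + suffix.decode('ascii')
# ... or encode the str operand to get bytes + bytes.
as_bytes = name.encode('ascii') + suffix

print(as_text)    # c2
print(as_bytes)   # b'c2'
```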
Comparison operators can likewise only compare instances of the same type:
# bytes and bytes
assert b'c' > b'a'
assert b'c' < b'a'
# AssertionError
print(b'a' == b'a')
# True
# str and str
assert 'c' > 'a'
assert 'c' < 'a'
# AssertionError
print('a' == 'a')
# True
# bytes and str
assert b'c' > 'a'
# TypeError: '>' not supported between instances of 'bytes' and 'str'
print('a' == b'a')
# False
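The equality case is easy to miss because it returns False without raising an error. The sketch below assumes a hypothetical helper, same_text, that decodes any bytes operand before comparing. The helper and its default encoding are illustrative choices, not something the article prescribes.

```python
def same_text(left, right, encoding='utf-8'):
    """Compare two values as text, decoding bytes operands first (hypothetical helper)."""
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left == right


naive = ('a' == b'a')               # mixed types never compare equal
normalized = same_text('a', b'a')   # both sides are str before comparing

print(naive)        # False
print(normalized)   # True
```

Running the interpreter with the -b option makes such mixed comparisons emit a BytesWarning, which can help locate them.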
%s in format strings:
Instances of both types can appear on the right side of the % operator, where they replace the %s in the format string on the left. But if the format string is of type bytes, a str instance cannot replace %s, because Python does not know which character encoding to use for that str.
# bytes % str
print(b'red %s' % 'blue')
# TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
# str % bytes
print('red %s' % b'blue')
# red b'blue'
In the second case, the system calls the __repr__ method of the bytes instance and uses the result to replace %s in the format string, so the program outputs b'blue' rather than blue.
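Both formatting cases work as intended once the value is converted to the type of the format string. This is a minimal sketch. The UTF-8 encoding is an assumption made for the example.

```python
color = 'blue'        # str
raw_color = b'blue'   # bytes

# bytes format string: encode the str value first.
bytes_result = b'red %s' % color.encode('utf-8')
# str format string: decode the bytes value first.
str_result = 'red %s' % raw_color.decode('utf-8')

print(bytes_result)   # b'red blue'
print(str_result)     # red blue
```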
The second problem is that operations which expect Unicode strings cannot be given raw bytes.
The w mode writes in text mode, so writing binary data to the file raises an error:
# Write binary data
with open('test.txt', "w+") as f:
    f.write(b"\xf1\xf2")
# TypeError: write() argument must be str, not bytes
The wb mode writes binary data normally:
# Write binary data
with open('test.txt', "wb") as f:
    f.write(b"\xf1\xf2")
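To see what text mode does on the way out, the sketch below writes a str in text mode and then reads the file back in binary mode. The temporary directory, the file name and the explicit UTF-8 encoding are assumptions made so the example is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name

    # Text mode: the file object encodes the str before writing it.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\xf1\xf2')   # two characters: ñ and ò

    # Binary mode: read back exactly what is stored on disk.
    with open(path, 'rb') as f:
        on_disk = f.read()

print(on_disk)   # b'\xc3\xb1\xc3\xb2'
```

Two characters became four bytes, which shows that text mode stores an encoded form of the string rather than the code points themselves.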
The r mode reads in text mode, so reading binary data from a file raises an error:
# Read binary data
with open('test.txt', "r+") as f:
    f.read()
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte
When a file handle is used in text mode, the system applies the default text encoding to the binary data: data being read is decoded into str with bytes.decode, and data being written is encoded into binary values with str.encode. On most systems the default text encoding is UTF-8, so the system tries to decode b'\xf1\xf2' as a UTF-8 string, which produces the error above.
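The failure can be reproduced in a controlled way. The sketch below passes encoding='utf-8' explicitly so that the result does not depend on the system default, and it catches the exception instead of letting it propagate. The file name and the variable names are illustrative assumptions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')   # hypothetical file name
    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2')

    error = None
    try:
        # Text mode decodes the file content, here explicitly as UTF-8.
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
    except UnicodeDecodeError as exc:
        error = exc

# Calling bytes.decode directly fails in the same way.
direct_fails = False
try:
    b'\xf1\xf2'.decode('utf-8')
except UnicodeDecodeError:
    direct_fails = True

print(type(error).__name__)   # UnicodeDecodeError
print(error.start)            # 0, the position of the offending byte
print(direct_fails)           # True
```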
The rb mode reads binary data normally:
# Read binary data
with open('test.txt', "rb") as f:
    print(b"\xf1\xf2" == f.read())
# True
Another fix is to set the encoding parameter to specify the string encoding:
with open('test.txt', "r", encoding="cp1252") as f:
    print(f.read())
# ñò
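As a check on the encoding parameter, the sketch below reads the same two bytes in two ways: in text mode with encoding="cp1252", and in binary mode followed by a manual decode. The temporary file and the variable names are assumptions made so the example runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')   # hypothetical file name
    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2')

    # Text mode with an explicit encoding decodes while reading.
    with open(path, 'r', encoding='cp1252') as f:
        from_file = f.read()

    # Binary mode plus bytes.decode gives the same result.
    with open(path, 'rb') as f:
        manual = f.read().decode('cp1252')

print(from_file == '\xf1\xf2')   # True, the text is ñò
print(from_file == manual)       # True
```

Both paths produce the same str, so the encoding parameter is a shorthand for decoding the raw bytes yourself.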