Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Unicode in Python: Working With Character Encodings
Handling character encodings in Python or any other language can at times seem painful. Places such as Stack Overflow have thousands of questions stemming from confusion over exceptions like UnicodeDecodeError and UnicodeEncodeError. This tutorial is designed to clear the Exception fog and illustrate that working with text and binary data in Python 3 can be a smooth experience. Python's Unicode support is strong and robust, but it takes some time to master.
This tutorial is different because it's not language-agnostic but instead deliberately Python-centric. You'll still get a language-agnostic primer, but you'll then dive into illustrations in Python, with text-heavy paragraphs kept to a minimum. You'll see how to use concepts of character encodings in live Python code.
By the end of this tutorial, you'll:

- Get a conceptual overview of character encodings and numbering systems
- Understand how encoding comes into play with Python's str and bytes
- Know about support in Python for numbering systems through its various forms of int literals
- Be familiar with Python's built-in functions related to character encodings and numbering systems
Character encodings and numbering systems are so closely connected that they need to be covered in the same tutorial, or else the treatment of either would be totally inadequate.
What's a Character Encoding?
There are tens if not hundreds of character encodings. The best way to start understanding what they are is to cover one of the simplest character encodings, ASCII.
Whether you're self-taught or have a formal computer science background, chances are you've seen an ASCII table once or twice. ASCII is a good place to start learning about character encoding because it is a small and contained encoding. (Too small, as it turns out.)
It encompasses the following:
- Lowercase English letters: a through z
- Uppercase English letters: A through Z
- Some punctuation and symbols: "$" and "!", to name a couple
- Whitespace characters: an actual space (" "), as well as a newline, carriage return, horizontal tab, vertical tab, and a few others
- Some non-printable characters: characters such as backspace, "\b", that can't be printed literally in the way that the letter A can
So what is a more formal definition of a character encoding?
At a very high level, it's a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) to integers and ultimately to bits. Each character can be encoded to a unique sequence of bits. Don't worry if you're shaky on the concept of bits, because we'll get to them soon.
The various categories outlined represent groups of characters. Each single character has a corresponding code point, which you can think of as just an integer. Characters are segmented into different ranges within the ASCII table:
| Code Indicate Range | Class |
|---|---|
| 0 through 31 | Control/not-printable characters |
| 32 through 64 | Punctuation, symbols, numbers, and space |
| 65 through ninety | Uppercase English language alphabet letters |
| 91 through 96 | Additional graphemes, such as [ and \ |
| 97 through 122 | Lowercase English alphabet letters |
| 123 through 126 | Additional graphemes, such every bit { and | |
| 127 | Control/non-printable character (DEL) |
The entire ASCII table contains 128 characters. This table captures the complete character set that ASCII permits. If you don't see a character here, then you simply can't express it as printed text under the ASCII encoding scheme.
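As a quick sanity check, you can confirm a few of these ranges yourself with ord() and chr(). This is a small illustrative sketch, not part of the table above:

```python
# Spot-check the ASCII code point ranges with ord() and chr()
assert ord("A") == 65 and ord("Z") == 90   # uppercase letters
assert ord("a") == 97 and ord("z") == 122  # lowercase letters
assert chr(36) == "$" and chr(33) == "!"   # punctuation and symbols
# Code points 0 through 31 are the control/non-printable characters
assert all(not chr(i).isprintable() for i in range(32))
```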
| Code Point | Character (Name) | Code Point | Character (Name) |
|---|---|---|---|
| 0 | NUL (Null) | 64 | @ |
| 1 | SOH (Start of Heading) | 65 | A |
| 2 | STX (Start of Text) | 66 | B |
| 3 | ETX (End of Text) | 67 | C |
| 4 | EOT (End of Transmission) | 68 | D |
| 5 | ENQ (Enquiry) | 69 | E |
| 6 | ACK (Acknowledgment) | 70 | F |
| 7 | BEL (Bell) | 71 | G |
| 8 | BS (Backspace) | 72 | H |
| 9 | HT (Horizontal Tab) | 73 | I |
| 10 | LF (Line Feed) | 74 | J |
| 11 | VT (Vertical Tab) | 75 | K |
| 12 | FF (Form Feed) | 76 | L |
| 13 | CR (Carriage Return) | 77 | M |
| 14 | SO (Shift Out) | 78 | N |
| 15 | SI (Shift In) | 79 | O |
| 16 | DLE (Data Link Escape) | 80 | P |
| 17 | DC1 (Device Control 1) | 81 | Q |
| 18 | DC2 (Device Control 2) | 82 | R |
| 19 | DC3 (Device Control 3) | 83 | S |
| 20 | DC4 (Device Control 4) | 84 | T |
| 21 | NAK (Negative Acknowledgment) | 85 | U |
| 22 | SYN (Synchronous Idle) | 86 | V |
| 23 | ETB (End of Transmission Block) | 87 | W |
| 24 | CAN (Cancel) | 88 | X |
| 25 | EM (End of Medium) | 89 | Y |
| 26 | SUB (Substitute) | 90 | Z |
| 27 | ESC (Escape) | 91 | [ |
| 28 | FS (File Separator) | 92 | \ |
| 29 | GS (Group Separator) | 93 | ] |
| 30 | RS (Record Separator) | 94 | ^ |
| 31 | US (Unit Separator) | 95 | _ |
| 32 | SP (Space) | 96 | ` |
| 33 | ! | 97 | a |
| 34 | " | 98 | b |
| 35 | # | 99 | c |
| 36 | $ | 100 | d |
| 37 | % | 101 | e |
| 38 | & | 102 | f |
| 39 | ' | 103 | g |
| 40 | ( | 104 | h |
| 41 | ) | 105 | i |
| 42 | * | 106 | j |
| 43 | + | 107 | k |
| 44 | , | 108 | l |
| 45 | - | 109 | m |
| 46 | . | 110 | n |
| 47 | / | 111 | o |
| 48 | 0 | 112 | p |
| 49 | 1 | 113 | q |
| 50 | 2 | 114 | r |
| 51 | 3 | 115 | s |
| 52 | 4 | 116 | t |
| 53 | 5 | 117 | u |
| 54 | 6 | 118 | v |
| 55 | 7 | 119 | w |
| 56 | 8 | 120 | x |
| 57 | 9 | 121 | y |
| 58 | : | 122 | z |
| 59 | ; | 123 | { |
| 60 | < | 124 | | |
| 61 | = | 125 | } |
| 62 | > | 126 | ~ |
| 63 | ? | 127 | DEL (Delete) |
The string Module
Python's string module is a convenient one-stop-shop for string constants that fall in ASCII's character set.
Here's the core of the module in all its glory:

```python
# From lib/python3.7/string.py

whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace
```

Most of these constants should be self-documenting in their identifier name. We'll cover what hexdigits and octdigits are shortly.
You can use these constants for everyday string manipulation:

```python
>>> import string
>>> s = "What's wrong with ASCII?!?!?"
>>> s.rstrip(string.punctuation)
"What's wrong with ASCII"
```

A Bit of a Refresher
Now is a good time for a short refresher on the bit, the most fundamental unit of information that a computer knows.
A bit is a signal that has only two possible states. There are different ways of symbolically representing a bit that all mean the same thing:

- 0 or 1
- "yes" or "no"
- True or False
- "on" or "off"

Our ASCII table from the previous section uses what you and I would just call numbers (0 through 127), but what are more precisely called numbers in base 10 (decimal).
You can also express each of these base-10 numbers with a sequence of bits (base 2). Here are the binary versions of 0 through 10 in decimal:
| Decimal | Binary (Compact) | Binary (Padded Form) |
|---|---|---|
| 0 | 0 | 00000000 |
| 1 | 1 | 00000001 |
| 2 | 10 | 00000010 |
| 3 | 11 | 00000011 |
| 4 | 100 | 00000100 |
| 5 | 101 | 00000101 |
| 6 | 110 | 00000110 |
| 7 | 111 | 00000111 |
| 8 | 1000 | 00001000 |
| 9 | 1001 | 00001001 |
| 10 | 1010 | 00001010 |
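You can reproduce both binary columns with the built-in format() function (a small illustrative sketch):

```python
# Print the compact and zero-padded binary forms of 0 through 10.
# "b" gives the compact form; "08b" pads it to a width of 8 with zeros.
for n in range(11):
    print(n, format(n, "b"), format(n, "08b"))
```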
Notice that as the decimal number n increases, you need more significant bits to represent the character set up to and including that number.
Here's a handy way to represent ASCII strings as sequences of bits in Python. Each character from the ASCII string gets pseudo-encoded into 8 bits, with spaces in between the 8-bit sequences that each represent a single character:

```python
>>> def make_bitseq(s: str) -> str:
...     if not s.isascii():
...         raise ValueError("ASCII only allowed")
...     return " ".join(f"{ord(i):08b}" for i in s)

>>> make_bitseq("bits")
'01100010 01101001 01110100 01110011'

>>> make_bitseq("CAPS")
'01000011 01000001 01010000 01010011'

>>> make_bitseq("$25.43")
'00100100 00110010 00110101 00101110 00110100 00110011'

>>> make_bitseq("~5")
'01111110 00110101'
```

The f-string f"{ord(i):08b}" uses Python's Format Specification Mini-Language, which is a way of specifying formatting for replacement fields in format strings:

- The left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output. Using the Python ord() function gives you the base-10 code point for a single str character.
- The right hand side of the colon is the format specifier. 08 means width 8, 0-padded, and the b functions as a sign to output the resulting number in base 2 (binary).

This trick is mainly just for fun, and it will fail very badly for any character that you don't see present in the ASCII table. We'll discuss how other encodings fix this problem later.
We Need More Bits!
There's a critically important formula that's related to the definition of a bit. Given a number of bits, n, the number of distinct possible values that can be represented in n bits is 2**n:

```python
def n_possible_values(nbits: int) -> int:
    return 2 ** nbits
```

Here's what that means:

- 1 bit will let you express 2**1 == 2 possible values.
- 8 bits will let you express 2**8 == 256 possible values.
- 64 bits will let you express 2**64 == 18,446,744,073,709,551,616 possible values.

There's a corollary to this formula: given a range of distinct possible values, how can we find the number of bits, n, that is required for the range to be fully represented? What you're trying to solve for is n in the equation 2**n = x (where you already know x).
Here's what that works out to:
```python
>>> from math import ceil, log

>>> def n_bits_required(nvalues: int) -> int:
...     return ceil(log(nvalues) / log(2))

>>> n_bits_required(256)
8
```

The reason that you need to use a ceiling in n_bits_required() is to account for values that are not clean powers of 2. Say you need to store a character set of 110 characters total. Naively, this should take log(110) / log(2) == 6.781 bits, but there's no such thing as 0.781 bits. 110 values will require 7 bits, not 6, with the final slots being unneeded:

```python
>>> n_bits_required(110)
7
```

All of this serves to prove one concept: ASCII is, strictly speaking, a 7-bit code. The ASCII table that you saw above contains 128 code points and characters, 0 through 127 inclusive. This requires 7 bits:

```python
>>> n_bits_required(128)  # 0 through 127
7
>>> n_possible_values(7)
128
```

The issue with this is that modern computers don't store much of anything in 7-bit slots. They traffic in units of 8 bits, conventionally known as a byte.
This means that the storage space used by ASCII is half-empty. If it's not clear why this is, think back to the decimal-to-binary table from above. You can express the numbers 0 and 1 with just 1 bit, or you can use 8 bits to express them as 00000000 and 00000001, respectively.
You can express the numbers 0 through 3 with just 2 bits, or 00 through 11, or you can use 8 bits to express them as 00000000, 00000001, 00000010, and 00000011, respectively. The highest ASCII code point, 127, requires only 7 significant bits.
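A quick way to check the number of significant bits a code point needs is int.bit_length(), shown here as a short illustrative aside:

```python
# int.bit_length() reports the number of significant bits,
# confirming that the highest ASCII code point fits in 7 bits
assert (127).bit_length() == 7
assert (128).bit_length() == 8  # one more value would need an 8th bit
```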
Knowing this, you can see that make_bitseq() converts ASCII strings into a str representation of bytes, where every character consumes one byte:

```python
>>> make_bitseq("bits")
'01100010 01101001 01110100 01110011'
```

ASCII's underutilization of the 8-bit bytes offered by modern computers led to a family of conflicting, informalized encodings that each specified additional characters to be used with the remaining 128 available code points allowed in an 8-bit character encoding scheme.
Not only did these different encodings clash with each other, but each one of them was by itself still a grossly incomplete representation of the world's characters, regardless of the fact that they made use of one additional bit.
Over the years, one character encoding mega-scheme came to rule them all. However, before we get there, let's talk for a minute about numbering systems, which are a fundamental underpinning of character encoding schemes.
Covering All the Bases: Other Number Systems
In the discussion of ASCII above, you saw that each character maps to an integer in the range 0 through 127.
This range of numbers is expressed in decimal (base 10). It's the way that you, me, and the rest of us humans are used to counting, for no reason more complicated than that we have 10 fingers.
But there are other numbering systems as well that are especially prevalent throughout the CPython source code. While the "underlying number" is the same, all numbering systems are just different ways of expressing the same number.
If I asked you what number the string "11" represents, you'd be right to give me a strange look before answering that it represents eleven.
However, this string representation can express different underlying numbers in different numbering systems. In addition to decimal, the alternatives include the following common numbering systems:
- Binary: base 2
- Octal: base 8
- Hexadecimal (hex): base 16

But what does it mean for us to say that, in a certain numbering system, numbers are represented in base N?
Here is the best way that I know of to articulate what this means: it's the number of fingers that you'd count on in that system.
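One way to make "base N" concrete is to expand a digit string positionally, which mirrors what int(s, base) does under the hood. The helper name here is made up for illustration:

```python
# Interpret a digit string in a given base by positional expansion:
# each digit multiplies the running value by the base before being added
def to_decimal(digits: str, base: int) -> int:
    value = 0
    for d in digits:
        value = value * base + int(d)
    return value

# The same string "11" means different numbers in different bases
assert to_decimal("11", 10) == 11
assert to_decimal("11", 2) == 3
assert to_decimal("11", 8) == 9
assert to_decimal("11", 16) == 17
```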
If you want a much fuller but still gentle introduction to numbering systems, Charles Petzold's Code is an incredibly cool book that explores the foundations of computer code in detail.
One way to demonstrate how different numbering systems interpret the same thing is with Python's int() constructor. If you pass a str to int(), Python will assume by default that the string expresses a number in base 10 unless you tell it otherwise:

```python
>>> int('11')
11

>>> int('11', base=10)  # 10 is already the default
11

>>> int('11', base=2)  # Binary
3

>>> int('11', base=8)  # Octal
9

>>> int('11', base=16)  # Hex
17
```

There's a more common way of telling Python that your integer is typed in a base other than 10. Python accepts literal forms of each of the three alternative numbering systems above:
| Type of Literal | Prefix | Example |
|---|---|---|
| n/a | n/a | 11 |
| Binary literal | 0b or 0B | 0b11 |
| Octal literal | 0o or 0O | 0o11 |
| Hex literal | 0x or 0X | 0x11 |
All of these are sub-forms of integer literals. You can see that these produce the same results, respectively, as the calls to int() with non-default base values. They're all just int to Python:

```python
>>> 11
11

>>> 0b11  # Binary literal
3

>>> 0o11  # Octal literal
9

>>> 0x11  # Hex literal
17
```

Here's how you could type the binary, octal, and hexadecimal equivalents of the decimal numbers 0 through 20. Any of these are perfectly valid in a Python interpreter shell or source code, and all work out to be of type int:
| Decimal | Binary | Octal | Hex |
|---|---|---|---|
| 0 | 0b0 | 0o0 | 0x0 |
| 1 | 0b1 | 0o1 | 0x1 |
| 2 | 0b10 | 0o2 | 0x2 |
| 3 | 0b11 | 0o3 | 0x3 |
| 4 | 0b100 | 0o4 | 0x4 |
| 5 | 0b101 | 0o5 | 0x5 |
| 6 | 0b110 | 0o6 | 0x6 |
| 7 | 0b111 | 0o7 | 0x7 |
| 8 | 0b1000 | 0o10 | 0x8 |
| 9 | 0b1001 | 0o11 | 0x9 |
| 10 | 0b1010 | 0o12 | 0xa |
| 11 | 0b1011 | 0o13 | 0xb |
| 12 | 0b1100 | 0o14 | 0xc |
| 13 | 0b1101 | 0o15 | 0xd |
| 14 | 0b1110 | 0o16 | 0xe |
| 15 | 0b1111 | 0o17 | 0xf |
| 16 | 0b10000 | 0o20 | 0x10 |
| 17 | 0b10001 | 0o21 | 0x11 |
| 18 | 0b10010 | 0o22 | 0x12 |
| 19 | 0b10011 | 0o23 | 0x13 |
| 20 | 0b10100 | 0o24 | 0x14 |
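You can generate this whole table yourself with the built-in bin(), oct(), and hex() functions (a quick illustrative sketch):

```python
# Print the binary, octal, and hex literal forms of 0 through 20,
# matching the rows of the table above
for n in range(21):
    print(n, bin(n), oct(n), hex(n))
```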
It's amazing just how prevalent these expressions are in the Python Standard Library. If you want to see for yourself, navigate to wherever your lib/python3.7/ directory sits, and check out the use of hex literals like this:

```shell
$ grep -nri --include "*\.py" -e "\b0x" lib/python3.7
```

This should work on any Unix system that has grep. You could use "\b0o" to search for octal literals or "\b0b" to search for binary literals.
What's the argument for using these alternate int literal syntaxes? In short, it's because 2, 8, and 16 are all powers of 2, while 10 is not. These three alternate number systems occasionally offer a way for expressing values in a computer-friendly manner. For example, the number 65536, or 2**16, is just 10000 in hexadecimal, or 0x10000 as a Python hexadecimal literal.
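That equivalence is easy to check in the interpreter:

```python
# 65536 is a power of 2, so it has a compact hexadecimal spelling
assert 65536 == 2 ** 16 == 0x10000
assert hex(65536) == '0x10000'
```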
Enter Unicode
As you saw, the problem with ASCII is that it's not nearly a big enough set of characters to accommodate the world's set of languages, dialects, symbols, and glyphs. (It's not even big enough for English alone.)
Unicode fundamentally serves the same purpose as ASCII, but it just encompasses a way, way, way bigger set of code points. There are a handful of encodings that emerged chronologically between ASCII and Unicode, but they are not really worth mentioning just yet because Unicode and one of its encoding schemes, UTF-8, has become so predominantly used.
Think of Unicode as a massive version of the ASCII table, one that has 1,114,112 possible code points. That's 0 through 1,114,111, or 0 through 17 * (2**16) - 1, or 0x10ffff hexadecimal. In fact, ASCII is a perfect subset of Unicode. The first 128 characters in the Unicode table correspond precisely to the ASCII characters that you'd reasonably expect them to.
In the interest of being technically exacting, Unicode itself is not an encoding. Rather, Unicode is implemented by different character encodings, which you'll see soon. Unicode is better thought of as a map (something like a dict) or a two-column database table. It maps characters (like "a", "¢", or even "ቈ") to distinct, positive integers. A character encoding needs to offer a bit more.
Unicode contains practically every character that you can imagine, including additional non-printable ones too. One of my favorites is the pesky right-to-left mark, which has code point 8207 and is used in text with both left-to-right and right-to-left language scripts, such as an article containing both English and Arabic paragraphs.
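You can inspect characters like this with the standard library's unicodedata module (an illustrative aside):

```python
import unicodedata

# The right-to-left mark at code point 8207 is invisible,
# but the Unicode database still gives it a name
rtl = chr(8207)
print(unicodedata.name(rtl))  # RIGHT-TO-LEFT MARK
assert ord(rtl) == 8207
```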
Unicode vs UTF-8
It didn't take long for people to realize that all of the world's characters could not be packed into one byte each. It's evident from this that modern, more comprehensive encodings would need to use multiple bytes to encode some characters.
You also saw above that Unicode is not technically a full-blown character encoding. Why is that?
There is one thing that Unicode doesn't tell you: it doesn't tell you how to get actual bits from text, just code points. It doesn't tell you enough about how to convert text to binary data and vice versa.
Unicode is an abstract encoding standard, not an encoding. That's where UTF-8 and other encoding schemes come into play. The Unicode standard (a map of characters to code points) defines several different encodings from its single character set.
UTF-8 as well as its lesser-used cousins, UTF-16 and UTF-32, are encoding formats for representing Unicode characters as binary data of one or more bytes per character. We'll discuss UTF-16 and UTF-32 in a moment, but UTF-8 has taken the largest share of the pie by far.
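To see the three formats side by side, you can encode the same character each way (a quick sketch; note that Python's "utf-16" and "utf-32" codecs prepend a byte order mark by default):

```python
# The same character under three encodings, three different byte counts
char = "a"
print(char.encode("utf-8"))   # 1 byte
print(char.encode("utf-16"))  # 2-byte BOM + 2 bytes per character
print(char.encode("utf-32"))  # 4-byte BOM + 4 bytes per character
```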
That brings us to a definition that is long overdue. What does it mean, formally, to encode and decode?
Encoding and Decoding in Python 3
Python 3's str type is meant to represent human-readable text and can contain any Unicode character.
The bytes type, conversely, represents binary data, or sequences of raw bytes, that do not intrinsically have an encoding attached to them.
Encoding and decoding is the process of going from one to the other:
In .encode() and .decode(), the encoding parameter is "utf-8" by default, though it's generally safer and more unambiguous to specify it:
```python
>>> "résumé".encode("utf-8")
b'r\xc3\xa9sum\xc3\xa9'
>>> "El Niño".encode("utf-8")
b'El Ni\xc3\xb1o'

>>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
'résumé'
>>> b"El Ni\xc3\xb1o".decode("utf-8")
'El Niño'
```

The result of str.encode() is a bytes object. Both bytes literals (such as b"r\xc3\xa9sum\xc3\xa9") and the representations of bytes permit only ASCII characters.
This is why, when calling "El Niño".encode("utf-8"), the ASCII-compatible "El" is allowed to be represented as it is, but the n with tilde is escaped to "\xc3\xb1". That messy-looking sequence represents two bytes, 0xc3 and 0xb1 in hex:
```python
>>> " ".join(f"{i:08b}" for i in (0xc3, 0xb1))
'11000011 10110001'
```

That is, the character ñ requires two bytes for its binary representation under UTF-8.
Python 3: All-In on Unicode
Python 3 is all-in on Unicode and UTF-8 specifically. Here's what that means:

- Python 3 source code is assumed to be UTF-8 by default. This means that you don't need # -*- coding: UTF-8 -*- at the top of .py files in Python 3.
- All text (str) is Unicode by default. Encoded Unicode text is represented as binary data (bytes). The str type can contain any literal Unicode character, such as "Δv / Δt", all of which will be stored as Unicode.
- Python 3 accepts many Unicode code points in identifiers, meaning résumé = "~/Documents/resume.pdf" is valid if this strikes your fancy.
- Python's re module defaults to the re.UNICODE flag rather than re.ASCII. This means, for instance, that r"\w" matches Unicode word characters, not just ASCII letters.
- The default encoding in str.encode() and bytes.decode() is UTF-8.
There is one other property that is more nuanced, which is that the default encoding to the built-in open() is platform-dependent and depends on the value of locale.getpreferredencoding():

```python
>>> # Mac OS X High Sierra
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

>>> # Windows Server 2012; other Windows builds may use UTF-16
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
```

Again, the lesson here is to be careful about making assumptions when it comes to the universality of UTF-8, even if it is the predominant encoding. It never hurts to be explicit in your code.
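Being explicit is as simple as always passing the encoding parameter to open(). This sketch writes and reads a temporary file that way; the filename is made up for the example:

```python
import os
import tempfile

# Passing encoding= explicitly means the platform default never matters
path = os.path.join(tempfile.gettempdir(), "unicode_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("El Niño")
with open(path, "r", encoding="utf-8") as f:
    content = f.read()
os.remove(path)
assert content == "El Niño"
```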
One Byte, Two Bytes, Three Bytes, Four
A crucial feature is that UTF-8 is a variable-length encoding. It's tempting to gloss over what this means, but it's worth delving into.
Think back to the section on ASCII. Everything in extended-ASCII-land demands at most one byte of space. You can quickly prove this with the following generator expression:

```python
>>> all(len(chr(i).encode("ascii")) == 1 for i in range(128))
True
```

UTF-8 is quite different. A given Unicode character can occupy anywhere from one to four bytes. Here's an example of a single Unicode character taking up four bytes:

```python
>>> ibrow = "🤨"
>>> len(ibrow)
1
>>> ibrow.encode("utf-8")
b'\xf0\x9f\xa4\xa8'
>>> len(ibrow.encode("utf-8"))
4

>>> # Calling list() on a bytes object gives you
>>> # the decimal value for each byte
>>> list(b'\xf0\x9f\xa4\xa8')
[240, 159, 164, 168]
```

This is a subtle but important feature of len():
- The length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies.
- The length of the same character encoded to bytes will be anywhere between 1 and 4.

The table below summarizes what general types of characters fit into each byte-length bucket:
| Decimal Range | Hex Range | What's Included | Examples |
|---|---|---|---|
| 0 to 127 | "\u0000" to "\u007F" | U.S. ASCII | "A", "\n", "7", "&" |
| 128 to 2047 | "\u0080" to "\u07FF" | Most Latinic alphabets* | "ę", "±", "ƌ", "ñ" |
| 2048 to 65535 | "\u0800" to "\uFFFF" | Additional parts of the Basic Multilingual Plane (BMP)** | "ത", "ᄇ", "ᮈ", "‰" |
| 65536 to 1114111 | "\U00010000" to "\U0010FFFF" | Other*** | "𝕂", "𐀀", "😓", "🂲" |
*Such as English, Arabic, Greek, and Irish
**A huge array of languages and symbols, mostly Chinese, Japanese, and Korean by volume (also ASCII and Latin alphabets)
***Additional Chinese, Japanese, Korean, and Vietnamese characters, plus more symbols and emojis
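You can verify one example from each bucket yourself (a quick sketch):

```python
# One character from each UTF-8 byte-length bucket in the table above
samples = {"A": 1, "ñ": 2, "ത": 3, "🤨": 4}
for char, nbytes in samples.items():
    assert len(char.encode("utf-8")) == nbytes
```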
What About UTF-16 and UTF-32?
Let's get back to two other encoding variants, UTF-16 and UTF-32.
The difference between these and UTF-8 is substantial in practice. Here's an example of how major the difference is with a round-trip conversion:

```python
>>> letters = "αβγδ"
>>> rawdata = letters.encode("utf-8")
>>> rawdata.decode("utf-8")
'αβγδ'
>>> rawdata.decode("utf-16")  # 😧
'뇎닎돎듎'
```

In this case, encoding four Greek letters with UTF-8 and then decoding back to text in UTF-16 would produce a text str that is in a completely different language (Korean).
Glaringly wrong results like this are possible when the same encoding isn't used bidirectionally. Two variations of decoding the same bytes object may produce results that aren't even in the same language.
This table summarizes the range or number of bytes under UTF-8, UTF-16, and UTF-32:
| Encoding | Bytes Per Character (Inclusive Range) | Variable Length |
|---|---|---|
| UTF-8 | 1 to 4 | Yes |
| UTF-16 | 2 to 4 | Yes |
| UTF-32 | 4 | No |
One other curious aspect of the UTF family is that UTF-8 will not always take up less space than UTF-16. That may seem mathematically counterintuitive, but it's quite possible:

```python
>>> text = "記者 鄭啟源 羅智堅"
>>> len(text.encode("utf-8"))
26
>>> len(text.encode("utf-16"))
22
```

The reason for this is that the code points in the range U+0800 through U+FFFF (2048 through 65535 in decimal) take up three bytes in UTF-8 versus only two in UTF-16.
I'm not by any means recommending that you jump aboard the UTF-16 train, regardless of whether or not you operate in a language whose characters are commonly in this range. Among other reasons, one of the strong arguments for using UTF-8 is that, in the world of encoding, it's a great idea to blend in with the crowd.
Not to mention, it's 2019: computer memory is cheap, so saving 4 bytes by going out of your way to use UTF-16 is arguably not worth it.
Python's Built-In Functions
You've made it through the hard part. Time to use what you've seen thus far in Python.
Python has a group of built-in functions that relate in some way to numbering systems and character encoding:
- ascii()
- bin()
- bytes()
- chr()
- hex()
- int()
- oct()
- ord()
- str()
These can be logically grouped together based on their purpose:
- ascii(), bin(), hex(), and oct() are for obtaining a different representation of an input. Each one produces a str. The first, ascii(), produces an ASCII-only representation of an object, with non-ASCII characters escaped. The remaining three give binary, hexadecimal, and octal representations of an integer, respectively. These are only representations, not a fundamental change in the input.
- bytes(), str(), and int() are class constructors for their respective types, bytes, str, and int. They each offer ways of coercing the input into the desired type. For example, as you saw earlier, while int(11.0) is probably more common, you might also see int('11', base=16).
- ord() and chr() are inverses of each other in that the Python ord() function converts a str character to its base-10 code point, while chr() does the opposite.
Here's a more detailed look at each of these nine functions:
| Function | Signature | Accepts | Return Type | Purpose |
|---|---|---|---|---|
| ascii() | ascii(obj) | Varies | str | ASCII-only representation of an object, with non-ASCII characters escaped |
| bin() | bin(number) | number: int | str | Binary representation of an integer, with the prefix "0b" |
| bytes() | bytes(iterable_of_ints) | Varies | bytes | Coerce (convert) the input to bytes, raw binary data |
| chr() | chr(i) | i: int | str | Convert an integer code point to a single Unicode character |
| hex() | hex(number) | number: int | str | Hexadecimal representation of an integer, with the prefix "0x" |
| int() | int([x]) | Varies | int | Coerce (convert) the input to int |
| oct() | oct(number) | number: int | str | Octal representation of an integer, with the prefix "0o" |
| ord() | ord(c) | c: str | int | Convert a single Unicode character to its integer code point |
| str() | str(object='') | Varies | str | Coerce (convert) the input to str, text |
You can expand the section below to see some examples of each function.
ascii() gives you an ASCII-only representation of an object, with non-ASCII characters escaped:

```python
>>> ascii("abcdefg")
"'abcdefg'"

>>> ascii("jalepeño")
"'jalepe\\xf1o'"

>>> ascii((1, 2, 3))
'(1, 2, 3)'

>>> ascii(0xc0ffee)  # Hex literal (int)
'12648430'
```

bin() gives you a binary representation of an integer, with the prefix "0b":
```python
>>> bin(0)
'0b0'

>>> bin(400)
'0b110010000'

>>> bin(0xc0ffee)  # Hex literal (int)
'0b110000001111111111101110'

>>> [bin(i) for i in [1, 2, 4, 8, 16]]  # `int` + list comprehension
['0b1', '0b10', '0b100', '0b1000', '0b10000']
```

bytes() coerces the input to bytes, representing raw binary data:

```python
>>> # Iterable of ints
>>> bytes((104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100))
b'hello world'

>>> bytes(range(97, 123))  # Iterable of ints
b'abcdefghijklmnopqrstuvwxyz'

>>> bytes("real 🐍", "utf-8")  # String + encoding
b'real \xf0\x9f\x90\x8d'

>>> bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

>>> bytes.fromhex('c0 ff ee')
b'\xc0\xff\xee'

>>> bytes.fromhex("72 65 61 6c 70 79 74 68 6f 6e")
b'realpython'
```

chr() converts an integer code point to a single Unicode character:
```python
>>> chr(97)
'a'

>>> chr(7048)
'ᮈ'

>>> chr(1114111)
'\U0010ffff'

>>> chr(0x10FFFF)  # Hex literal (int)
'\U0010ffff'

>>> chr(0b01100100)  # Binary literal (int)
'd'
```

hex() gives the hexadecimal representation of an integer, with the prefix "0x":

```python
>>> hex(100)
'0x64'

>>> [hex(i) for i in [1, 2, 4, 8, 16]]
['0x1', '0x2', '0x4', '0x8', '0x10']

>>> [hex(i) for i in range(16)]
['0x0', '0x1', '0x2', '0x3', '0x4', '0x5', '0x6', '0x7',
 '0x8', '0x9', '0xa', '0xb', '0xc', '0xd', '0xe', '0xf']
```

int() coerces the input to int, optionally interpreting the input in a given base:
>>>
>>> int ( 11.0 ) 11 >>> int ( 'xi' ) 11 >>> int ( '11' , base of operations = 2 ) three >>> int ( '11' , base = 8 ) ix >>> int ( '11' , base = 16 ) 17 >>> int ( 0xc0ffee - 1.0 ) 12648429 >>> int . from_bytes ( b " \x0f " , "footling" ) fifteen >>> int . from_bytes ( b ' \xc0\xff\xee ' , "big" ) 12648430 The Python ord() function converts a single Unicode character to its integer code point:
>>>
>>> ord ( "a" ) 97 >>> ord ( "ę" ) 281 >>> ord ( "ᮈ" ) 7048 >>> [ ord ( i ) for i in "hello globe" ] [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100] str() coerces the input to str, representing text:
>>> str("str of string")
'str of string'
>>> str(5)
'5'
>>> str([1, 2, 3, 4])  # Like [1, 2, 3, 4].__str__(), but use str()
'[1, 2, 3, 4]'
>>> str(b"\xc2\xbc cup of flour", "utf-8")
'¼ cup of flour'
>>> str(0xc0ffee)
'12648430'

Python String Literals: Ways to Skin a Cat
Rather than using the str() constructor, it's commonplace to type a str literally:
>>> meal = "shrimp and grits"

That may seem easy enough. But the interesting side of things is that, because Python 3 is Unicode-centric through and through, you can "type" Unicode characters that you probably won't even find on your keyboard. You can copy and paste this right into a Python 3 interpreter shell:
>>> alphabet = 'αβγδεζηθικλμνξοπρςστυφχψ'
>>> print(alphabet)
αβγδεζηθικλμνξοπρςστυφχψ

Besides placing the actual, unescaped Unicode characters in the console, there are other ways to type Unicode strings as well.
One of the densest sections of Python's documentation is the portion on lexical analysis, specifically the section on string and bytes literals. Personally, I had to read this section about one, two, or maybe nine times for it to really sink in.
Part of what it says is that there are up to six ways that Python will allow you to type the same Unicode character.
The first and most common way is to type the character itself literally, as you've already seen. The tough part with this method is finding the actual keystrokes. That's where the other methods for getting and representing characters come into play. Here's the full list:
| Escape Sequence | Meaning | How To Express "a" |
|---|---|---|
| "\ooo" | Character with octal value ooo | "\141" |
| "\xhh" | Character with hex value hh | "\x61" |
| "\N{name}" | Character named name in the Unicode database | "\N{LATIN SMALL LETTER A}" |
| "\uxxxx" | Character with 16-bit (2-byte) hex value xxxx | "\u0061" |
| "\Uxxxxxxxx" | Character with 32-bit (4-byte) hex value xxxxxxxx | "\U00000061" |
Here's some proof and validation of the above:
>>> (
...     "a" ==
...     "\x61" ==
...     "\N{LATIN SMALL LETTER A}" ==
...     "\u0061" ==
...     "\U00000061"
... )
True

Now, there are two main caveats:
- Not all of these forms work for all characters. The hex representation of the integer 300 is 0x012c, which simply isn't going to fit into the two-hex-digit escape code "\xhh". The highest code point that you can squeeze into this escape sequence is "\xff" ("ÿ"). Similarly, "\ooo" will only work up to "\777" ("ǿ").

- For \xhh, \uxxxx, and \Uxxxxxxxx, exactly as many digits are required as are shown in these examples. This can throw you for a loop because of the way that Unicode tables conventionally display the codes for characters, with a leading U+ and a variable number of hex characters. The key is that Unicode tables most often do not zero-pad these codes.
For example, if you consult unicode-table.com for information on the Gothic letter faihu (or fehu), "𐍆", you'll see that it is listed as having the code U+10346.
How do you put this into "\uxxxx" or "\Uxxxxxxxx"? Well, you can't fit it in "\uxxxx" because it's a 4-byte character, and to use "\Uxxxxxxxx" to represent this character, you'll need to left-pad the sequence:
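Zero-padding U+10346 out to the eight hex digits that "\U" requires looks like this:

```python
# U+10346 left-padded with zeros to eight hex digits
gothic_faihu = "\U00010346"
assert gothic_faihu == chr(0x10346)
print(gothic_faihu)
```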
This also means that the "\Uxxxxxxxx" form is the only escape sequence that is capable of holding any Unicode character.
Other Encodings Available in Python
So far, you've seen four character encodings:
- ASCII
- UTF-8
- UTF-16
- UTF-32
There are a ton of other ones out there.
One example is Latin-1 (also called ISO-8859-1), which is technically the default for the Hypertext Transfer Protocol (HTTP), per RFC 2616. Windows has its own Latin-1 variant called cp1252.
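The two aren't interchangeable: cp1252 assigns printable characters to the 0x80–0x9F range that Latin-1 reserves for control codes. A quick sketch:

```python
raw = b"\x80"

# Latin-1 maps byte 0x80 to the C1 control character U+0080 ...
assert raw.decode("latin-1") == "\x80"

# ... while cp1252 maps the very same byte to the euro sign.
assert raw.decode("cp1252") == "€"
```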
The complete list of accepted encodings is buried way down in the documentation for the codecs module, which is part of Python's Standard Library.
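Each codec in that list also has a handful of aliases, and codecs.lookup() will resolve any alias to its canonical name:

```python
import codecs

# lookup() accepts an alias and returns a CodecInfo object
# whose .name attribute is the canonical codec name.
print(codecs.lookup("latin-1").name)  # iso8859-1
print(codecs.lookup("u8").name)       # utf-8
```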
There's one more useful recognized encoding to be aware of, which is "unicode-escape". If you have a decoded str and want to quickly get a representation of its escaped Unicode literal, then you can specify this encoding in .encode():
>>> alef = chr(1575)  # Or "\u0627"
>>> alef_hamza = chr(1571)  # Or "\u0623"
>>> alef, alef_hamza
('ا', 'أ')
>>> alef.encode("unicode-escape")
b'\\u0627'
>>> alef_hamza.encode("unicode-escape")
b'\\u0623'

You Know What They Say About Assumptions…
Just because Python makes the assumption of UTF-8 encoding for files and code that you generate doesn't mean that you, the developer, should operate with the same assumption for external data.
Let's say that again because it's a rule to live by: when you receive binary data (bytes) from a third-party source, whether it be from a file or over a network, the best practice is to check that the data specifies an encoding. If it doesn't, then it's on you to ask.
All I/O happens in bytes, not text, and bytes are just ones and zeros to a computer until you tell it otherwise by informing it of an encoding.
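One concrete habit that follows from this: pass an explicit encoding= whenever you read or write a text file, rather than leaning on the platform default. A minimal sketch, using a throwaway temp file:

```python
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "recipe.txt"

# Be explicit about the encoding when writing text.
path.write_text("¼ cup of flour", encoding="utf-8")

# On disk there are only bytes ...
raw = path.read_bytes()
assert raw == b"\xc2\xbc cup of flour"

# ... and decoding with the matching encoding recovers the text.
assert path.read_text(encoding="utf-8") == "¼ cup of flour"
```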
Here's an example of where things can go wrong. You're subscribed to an API that sends you a recipe of the day, which you receive in bytes and have always decoded using .decode("utf-8") with no problem. On this particular day, part of the recipe looks like this:
>>> data = b"\xbc cup of flour"

It looks as if the recipe calls for some flour, but we don't know how much:
>>> data.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte

Uh oh. There's that pesky UnicodeDecodeError that can bite you when you make assumptions about encoding. You check with the API host. Lo and behold, the data is actually sent over encoded in Latin-1:
>>> data.decode("latin-1")
'¼ cup of flour'

There we go. In Latin-1, every character fits into a single byte, whereas the "¼" character takes up two bytes in UTF-8 ("\xc2\xbc").
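You can verify those byte counts directly by encoding the same character under each encoding:

```python
quarter = "¼"

# One byte in Latin-1, two bytes in UTF-8.
assert quarter.encode("latin-1") == b"\xbc"
assert quarter.encode("utf-8") == b"\xc2\xbc"
```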
The lesson here is that it can be dangerous to assume the encoding of any data that is handed off to you. It's usually UTF-8 these days, but it's the small percentage of cases where it's not that will blow things up.
If you really do need to abandon ship and guess an encoding, then have a look at the chardet library, which uses methodology from Mozilla to make an educated guess about ambiguously encoded text. That said, a tool like chardet should be your last resort, not your first.
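Short of guessing, you can sketch a simple fallback chain with nothing but the standard library. The helper below, decode_with_fallback, is a hypothetical name for illustration, not part of any published API; because Latin-1 maps every byte value to some character, putting it last means the function always returns something, which may or may not be the text the sender intended:

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each candidate encoding in order, returning the first success."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

print(decode_with_fallback(b"\xbc cup of flour"))  # falls through to latin-1
```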
Odds and Ends: unicodedata
We would be remiss not to mention unicodedata from the Python Standard Library, which lets you interact with and do lookups on the Unicode Character Database (UCD):
>>> import unicodedata
>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'

Wrapping Up
In this article, you've decoded the wide and imposing subject of character encoding in Python.
You've covered a lot of ground here:
- Fundamental concepts of character encodings and numbering systems
- Integer, binary, octal, hex, str, and bytes literals in Python
- Python's built-in functions related to character encoding and numbering systems
- Python 3's treatment of text versus binary data
Now, go forth and encode!
Resources
For even more detail about the topics covered here, check out these resources:
- Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- David Zentgraf: What every programmer absolutely, positively needs to know about encodings and character sets to work with text
- Mozilla: A composite approach to language/encoding detection
- Wikipedia: UTF-8
- Jon Skeet: Unicode and .NET
- Charles Petzold: Code: The Hidden Language of Computer Hardware and Software
- Network Working Group, RFC 3629: UTF-8, a transformation format of ISO 10646
- Unicode Technical Standard #18: Unicode Regular Expressions
The Python docs have two pages on the subject:
- What's New in Python 3.0
- Unicode HOWTO
Source: https://realpython.com/python-encodings-guide/