The UTF-8 standard is a variable-length encoding format in which the first 128 characters, each encoded as a single octet, are the original ASCII character set: basic Latin letters, digits and simple punctuation, with no support for other languages or special characters. Every character in the global Unicode set can be encoded using one to four 8-bit bytes (octets). UTF-8 is the dominant character encoding used on the Web, in email and in XML/HTML.
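As a quick illustration of that one-to-four-byte range, here is a minimal Python sketch (not tied to any particular product) that prints how many octets a few sample characters require:

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Plain ASCII ("A") stays at one byte, while characters further up the Unicode range grow to two, three or four bytes.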
UTF-8 (Unicode Transformation Format, 8-bit) is documented in ISO/IEC 10646:2017. Its four-byte sequences can represent up to 2,097,152 code points (2^21), more than enough to cover the 1,112,064 valid Unicode code points (1,114,112 values in the code space, minus the 2,048 surrogates reserved for UTF-16).
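The arithmetic behind those numbers can be checked in a couple of lines of Python (shown here purely as an illustration):

```python
# A 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21 payload bits.
print(2 ** 21)           # 2097152 representable values
# Unicode code space U+0000..U+10FFFF, minus 2,048 surrogate values.
print(0x110000 - 0x800)  # 1112064 valid code points
```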
The UTF-8 standard ensures that there are no conflicts between how characters display in different applications and geographic locations. Without this standardization across systems, text can appear corrupted. That is more than a cosmetic problem: meaning is lost for the reader, and the copy also becomes inaccessible to search and indexing applications.
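To see why a mismatch is more than cosmetic, consider this small Python sketch of what happens when UTF-8 bytes are interpreted under a different encoding (Latin-1 is used here only as an example of a mismatched assumption):

```python
# The accented word "café" encoded as UTF-8...
data = "café".encode("utf-8")    # b'caf\xc3\xa9'
# ...displays as gibberish if a receiving system assumes Latin-1:
print(data.decode("latin-1"))    # cafÃ©  (corrupted)
# Decoded with the correct encoding, the text survives intact:
print(data.decode("utf-8"))      # café
```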
UTF-8 is a powerful encoding system. It starts with the standard ASCII codes (#0-127), each stored in a single byte. Byte values 128-191 are continuation bytes: they never stand alone, but carry the low-order bits of a multi-byte character introduced by a lead byte. For example, lead bytes 208 and 209 shift you into the Cyrillic range: 208 followed by 175 encodes code point 1071, the Cyrillic Я.
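The Я example can be verified directly. This short Python snippet (illustrative only) shows lead byte 208 selecting the range and continuation byte 175 supplying the remaining bits:

```python
# Cyrillic Я is Unicode code point 1071 (U+042F).
ch = chr(1071)
encoded = ch.encode("utf-8")
print(list(encoded))   # [208, 175] -> lead byte 208, continuation byte 175
# Reassemble the code point: the lead byte contributes its low 5 bits,
# the continuation byte its low 6 bits.
code_point = ((encoded[0] & 0b00011111) << 6) | (encoded[1] & 0b00111111)
print(code_point)      # 1071
```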
Unicode is the great equalizer when processing data across languages and locales. Text is pulled out of databases, spreadsheets and other repositories. Often, there is a need to extract information and feed it into a markup or formatting system before presenting it to end users. Text is also used to generate keywords, abstracts and excerpts for HTML-based systems, content management systems and search/indexing applications. Unicode ensures that this text always displays correctly and represents the information as intended.
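As one hedged sketch of that pipeline (generic Python, with "export.txt" and "excerpt.html" as placeholder file names, not product artifacts), the key discipline is to decode incoming bytes explicitly as UTF-8 and to declare UTF-8 again when the extracted text is written out as HTML:

```python
import html

# Read raw bytes from a repository export and decode them explicitly as UTF-8.
with open("export.txt", "rb") as f:
    text = f.read().decode("utf-8")

# Build a small HTML excerpt, declaring the same encoding so browsers,
# CMSes and indexers all interpret the bytes the way they were written.
page = f'<meta charset="utf-8">\n<p>{html.escape(text[:200])}</p>'
with open("excerpt.html", "w", encoding="utf-8") as f:
    f.write(page)
```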
The PDF Conversion SDK and PDF Conversion Server are designed to extract text from PDF files with full Unicode support, including the UTF-8 encoding.
Learn More
Unicode, UTF8 & Character Sets: The Ultimate Guide