Unicode

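Unicode is the universal standard for identifying and encoding text characters. It removes the limitations and conflicts of traditional encodings, where the same byte value could stand for different characters in different character sets. With 137,929 characters defined as of version 12.1, and room for more than a million code points, it has the capacity to cover the world’s current and historic writing systems. It also contains symbols and special characters such as emojis.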

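Unicode text is stored and transmitted using encoding forms such as UTF-8, UTF-16, and UTF-32. The standard defines each character as a code point (an abstract number) rather than as a glyph (the shape drawn on screen or paper). Keeping a character's identity separate from both its appearance and its byte representation is what makes the standard robust and flexible.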
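To make the difference between a code point and its encodings concrete, here is a minimal Python sketch. It takes one character, the euro sign, and shows its single code point next to the bytes produced by three encodings. The character and the big-endian variants of UTF-16 and UTF-32 are chosen only for illustration.

```python
# One character has exactly one code point, but its bytes depend on the encoding.
ch = "\u20ac"  # EURO SIGN, picked only as an example

code_point = ord(ch)             # the abstract number Unicode assigns
utf8 = ch.encode("utf-8")        # three bytes
utf16 = ch.encode("utf-16-be")   # two bytes (big-endian, no byte order mark)
utf32 = ch.encode("utf-32-be")   # four bytes (big-endian, no byte order mark)

print(f"code point: U+{code_point:04X}")
print(f"UTF-8:  {utf8.hex()}")
print(f"UTF-16: {utf16.hex()}")
print(f"UTF-32: {utf32.hex()}")
```

The code point stays the same in every case. Only the byte sequence changes with the encoding.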

UTF-8 is a variable-length encoding in which the first 128 code points, each stored in a single octet, are the original ASCII character set: bare-bones letters, numbers, and simple punctuation, with no support for accented letters, non-Latin scripts, or special characters. Every character in the Unicode character set is encoded using one to four 8-bit bytes (octets). UTF-8 is the dominant character encoding used on the Web, in email, and with XML/HTML.
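The one-to-four-byte range can be checked with a short Python sketch. The four sample characters below are arbitrary picks, one for each possible length. The final check confirms that all 128 ASCII values decode identically as ASCII and as UTF-8.

```python
# Sample characters chosen to need 1, 2, 3, and 4 bytes in UTF-8.
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, grinning face emoji
]

# Map each character to the number of bytes UTF-8 uses for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s): {ch.encode('utf-8').hex()}")

# Every single-byte value 0..127 means the same thing in ASCII and in UTF-8.
ascii_compatible = all(
    bytes([i]).decode("utf-8") == bytes([i]).decode("ascii") for i in range(128)
)
print("ASCII-compatible:", ascii_compatible)
```

This backward compatibility is why plain ASCII files are also valid UTF-8 files without any conversion.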

Unicode characters are at the heart of everything you read. Visual effects like typeface, font size and color embellish the characters. Line and paragraph spacing, tables and graphics make reading easier. Without these added features, text looks like simple typewriter characters.

Databases and spreadsheets output text. This data passes through systems where it’s formatted as reports and statements. In addition, text is extracted from databases to generate keywords, abstracts and excerpts. The Unicode format ensures that there are no conflicts in these operations. For example, a content management system that stores everything as Unicode avoids the conflicts that arise when the character sets of different languages assign the same byte values to different characters.

The PDF Conversion SDK and PDF Conversion Server are designed to extract text from PDF files with full Unicode support.

Resources

More on Unicode

Visit the Unicode Organization

Technical Page

Fun Fact – Emojis

Adopt an emoji through the Unicode Consortium.

Everyone uses emojis. The Unicode Consortium approves and manages these popular images. They represent things like faces, weather, emotions, animals and languages. They also express love, thanks and congratulations. New emojis are added regularly.

Here’s the Fun Fact – The Unicode Consortium solicits proposals from the public for new emojis. Suggest one to add to the standard, or adopt a character and help them in their mission.