ISO-8859 Encoding

ISO-8859 encoding is an extension to the basic ASCII character set to accommodate non-English languages. It dates back to 1983 and was last updated in 1998. ASCII used only the first 7 bits of the 8-bit octet (128 characters). ISO-8859 uses the remaining bit to add another 128 positions for extended, language-specific characters. Usage of the extra bit has been standardized in a series of regional variations, numbered 1 through 16 (part 12 was never published), such as Latin-1 for Western Europe, Latin-2 for Central and Eastern Europe, Latin/Cyrillic and so on. Because the definitions were regional and each was limited to 256 characters, a variation could not always handle all the additional characters needed for a specific language, so users were forced to adjust how they wrote their language on computers. Dutch, for example, has a letter that is a combination of I and J (“IJ”). Latin-1 did not have space for this character, so users in the Netherlands adapted to writing “IJ” as two separate characters.
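The coverage limits described above can be checked with a short Python sketch. The helper name `can_encode` is hypothetical and chosen only for this illustration; the codec names `ascii` and `iso-8859-1` are the ones Python's standard library ships with.

```python
# Illustrative sketch: which characters fit in 7-bit ASCII and 8-bit Latin-1.

def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

e_acute = "\u00e9"      # é, one of the extended Latin-1 characters
ij_ligature = "\u0132"  # Ĳ, the single-character form of the Dutch IJ

# é does not fit in ASCII but fits in Latin-1 as one byte above 127.
print(can_encode(e_acute, "ascii"))
print(can_encode(e_acute, "iso-8859-1"))
print(e_acute.encode("iso-8859-1"))

# The IJ ligature has no slot in Latin-1, so it is written as two letters.
print(can_encode(ij_ligature, "iso-8859-1"))
print("IJ".encode("iso-8859-1"))
```

The two-letter fallback works because plain "I" and "J" are ordinary ASCII characters, which every ISO-8859 variation keeps in its lower 128 positions.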

Background on the ISO-8859 Encoding

ISO-8859 encoding was a great solution if you operated in one region and could cover everything you needed in 256 characters. With increasing globalization, and companies sending emails across borders and languages, a more comprehensive solution was needed.

[Image: ISO-8859-1 character table]

The bottom half of the table (values 128 to 255) contains the extended Latin-1 characters. Each of the other ISO 8859 variations has a table like this, with a different set of extended characters in the bottom half. The Arabic version, for example, includes the Arabic letters. ISO 8859 definitely helped with international character support, but as more and more documents and emails were being exchanged, text would look corrupted if the reader had a different character set on their PC.
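The corruption described above can be reproduced in a few lines of Python. This is a minimal sketch that assumes the sender used the Cyrillic variation (ISO-8859-5) and the reader's PC assumed Latin-1 (ISO-8859-1); the sample word and variable names are only illustrative.

```python
# Illustrative sketch: the same bytes read with two different ISO-8859 tables.

original = "Привет"  # "Hello" in Russian

# The sender stores the text using the Cyrillic table: one byte per letter.
data = original.encode("iso-8859-5")

# The reader's PC assumes Latin-1, so each byte becomes a different character.
garbled = data.decode("iso-8859-1")
print(garbled)

# Nothing was lost: the bytes are intact, only the table was wrong.
recovered = garbled.encode("iso-8859-1").decode("iso-8859-5")
print(recovered)
```

Because the bytes themselves carry no record of which table was used, the reader has to know or guess the sender's variation to display the text correctly.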

In 1991, the first version of Unicode was published, using ISO-8859-1 (Latin-1) as its first 256 code points. Now it was possible to contain all of the world’s characters in one table!
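The relationship between Latin-1 and the first 256 Unicode code points can be verified with Python's built-in `iso-8859-1` codec. This is a small sketch; the variable names are chosen only for the illustration.

```python
# Illustrative sketch: Latin-1 byte values equal the Unicode code point numbers.

all_bytes = bytes(range(256))           # every possible 8-bit value
text = all_bytes.decode("iso-8859-1")   # one character per byte

# Each byte value matches the code point number of the decoded character.
same_numbers = all(ord(ch) == b for ch, b in zip(text, all_bytes))
print(same_numbers)

# Characters outside Latin-1 simply get larger code point numbers.
ij_ligature = "\u0132"
print(ord(ij_ligature) > 255)
```

Note that this equality is about code point numbers, not about stored bytes: a Unicode encoding such as UTF-8 stores é (code point 233) as two bytes, whereas Latin-1 stores it as one.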

Extracting Text from PDF Documents

The PDF Conversion SDK and PDF Conversion Server are designed to extract text from PDF files with full Unicode support, including the ISO 8859 encoding.

Resources

Wikipedia Stub on ISO-8859

The differences between ASCII, ISO 8859, and Unicode