ISO-8859 was proposed as an extension to the basic ASCII character set to accommodate non-English languages. ASCII uses only the lower 7 bits of each 8-bit octet; ISO-8859 uses the remaining bit to add extended, language-specific characters. Usage of the extra bit is defined in 16 different variations, such as Latin-1 for Western Europe, Latin-2 for Eastern Europe, Latin/Cyrillic and so on. Because the definitions were regional, a variation could not always supply every additional character that a specific language needed, so users were forced to adapt how they wrote their language on computers. Dutch, for example, has a letter that combines I and J (“IJ”). Latin-1 had no room for this character, so users in the Netherlands adapted by writing “IJ” as two separate characters.
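To make the role of the eighth bit concrete, here is a minimal Python sketch using only the standard library codecs. It decodes one byte value under three ISO-8859 parts and checks whether the single-character IJ ligature (U+0132) fits in Latin-1. The helper name `fits_latin1` is made up for this illustration.

```python
# One byte value, three ISO-8859 parts, three different characters.
raw = bytes([0xE6])  # high bit set, so this byte is outside 7-bit ASCII

decoded = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-2", "iso-8859-5")
}
for name, ch in decoded.items():
    print(f"{name}: {ch!r}")

# Plain ASCII text only ever produces byte values 0..127.
assert all(b < 128 for b in "IJ".encode("ascii"))


def fits_latin1(text: str) -> bool:
    """Return True if every character in text has a Latin-1 byte value."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("IJ"))      # the two-letter workaround
print(fits_latin1("\u0132"))  # the single-character IJ ligature
```

The same byte comes out as a Western European, a Central European and a Cyrillic letter depending on which part is assumed, which is why text in these encodings is only readable when the reader knows which variation was used.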
All of the visual effects added by today’s word processing programs, such as typefaces, font sizes, colors, line and paragraph spacing, tables and graphics, are unavailable in text formats. Characters are equally spaced, and their positioning is the only formatting that’s available. Text is pulled out of databases, spreadsheets and other repositories. Often, there is a need to extract information and feed it into a mark-up or formatting system so that it can eventually be shown to end-users in a presentable format. Text is also used to generate keywords, abstracts and excerpts for HTML-based systems, content management systems and search/indexing applications.
Frequently Asked Questions
When using one of the text extraction tools, whether from the command line or through the API, you can choose to:
- completely strip the text of white space, non-printing characters, etc.
- extract text while preserving the placement of all characters on a page
- generate excerpts or abstracts
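As a rough illustration of the first and third options, the following Python sketch strips non-printing characters, collapses white space and shortens the result to an excerpt. It is a generic standard-library example, not the Visual Integrity command line or API; the function names and the sample `page` string are invented for this sketch.

```python
import re
import textwrap


def strip_text(text: str) -> str:
    """Drop non-printing characters and collapse runs of white space."""
    kept = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", kept).strip()


def make_excerpt(text: str, width: int = 40) -> str:
    """Shorten stripped text to at most `width` characters on a word boundary."""
    return textwrap.shorten(strip_text(text), width=width, placeholder="...")


# Sample input with a control character (\x07), tabs and blank lines.
page = "Invoice   No.\x07 1042\n\n   Total:\t  99.50 EUR\n"

print(strip_text(page))
print(make_excerpt(page, width=20))
```

The second option, preserving placement, is the opposite choice: the extracted characters are left exactly where they fall on the page, so no stripping or collapsing step is applied.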
It depends on what you mean by “formatted”. With ASCII text, “formatted” means that the characters are in certain positions on a page. A few examples would be:
- when text is printed on a check, the text must be in specific areas for the check to print accurately
- when spreadsheets are saved as text, it’s important to see what’s in each column
- if reports are converted to ASCII, the data should be in the correct tables
- if a form is converted to text, the descriptions must align with corresponding fields for data
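The idea that position is the formatting can be sketched in a few lines of Python. The column layout below (field names, start columns and widths) is hypothetical and chosen only for illustration, as is the sample record; the point is that a value can be written to, and read back from, a fixed range of character positions.

```python
# Hypothetical layout: (field name, start column, width).
LAYOUT = [("payee", 0, 20), ("date", 20, 12), ("amount", 32, 10)]
LINE_WIDTH = LAYOUT[-1][1] + LAYOUT[-1][2]


def place_fields(record: dict[str, str]) -> str:
    """Write each value at its fixed column, padding the rest with spaces."""
    line = [" "] * LINE_WIDTH
    for name, start, width in LAYOUT:
        value = record[name][:width]  # truncate rather than spill into the next column
        line[start:start + len(value)] = value
    return "".join(line)


def read_fields(line: str) -> dict[str, str]:
    """Recover the values by slicing the same column ranges."""
    return {
        name: line[start:start + width].strip()
        for name, start, width in LAYOUT
    }


record = {"payee": "A. Jansen", "date": "2024-01-31", "amount": "99.50"}
row = place_fields(record)
print(repr(row))
print(read_fields(row))
```

If even one character shifts during extraction, the slices in `read_fields` no longer line up with the data, which is why exact placement matters for checks, reports and forms.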
With the text extraction tools from Visual Integrity, you can count on precise placement of each character. Because plain text does not support attributes such as bold, underline or italic, these are all lost in the conversion process.