UTF-8 is a variable-length encoding in which the first 128 code points, each stored in a single octet, are the original ASCII character set: bare-bones text, numbers and simple punctuation, without any support for foreign languages or special characters. All characters in the global Unicode character set can be encoded using one to four 8-bit bytes (octets). UTF-8 is the dominant character encoding used on the Web, in email and with XML/HTML.
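To make the one-to-four-byte range concrete, here is a minimal Python sketch that encodes a few sample characters and counts the bytes each one needs. The helper name `utf8_length` and the sample characters are illustrative choices, not part of any extraction tool.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# Illustrative samples, written as escapes so the source file stays pure ASCII:
# Latin capital A, e with acute accent, euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    # Print the code point rather than the character itself, so the output
    # does not depend on what the console can display.
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")

# Plain ASCII text produces exactly the same bytes in ASCII and in UTF-8.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

The final assertion is the practical consequence of the ASCII-compatible first 128 code points: a file containing only ASCII characters is already valid UTF-8.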
All of the visual effects added by today’s word processing programs, such as typefaces, font sizes, colors, line and paragraph spacing, tables and graphics, are unavailable in this simple yet important format. Characters are equally spaced, and their positioning is the only formatting that’s available. Text is pulled out of databases, spreadsheets and other repositories. Often, there is a need to extract information and feed it into a mark-up or formatting system to eventually present it to end-users in a presentable format. Text is also used to generate keywords, abstracts and excerpts for HTML-based systems, content management systems and search/indexing applications.
Frequently Asked Questions
When using one of the text extraction tools, either via the command line or the API, you can choose to:
- completely strip the text of white space, non-printing characters, etc.
- extract text while preserving the placement of all characters on a page
- generate excerpts or abstracts
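As a rough illustration of the first and last options, the sketch below strips non-printing characters and extra white space from a string, then cuts a short excerpt at a word boundary. It is not the API of the extraction tools; the function names `strip_text` and `make_excerpt` and the `max_chars` parameter are hypothetical, chosen only for this example.

```python
import unicodedata


def strip_text(raw: str) -> str:
    """Hypothetical helper: drop non-printing characters, collapse white space."""
    # Keep white space for now so that words separated only by a tab or a
    # newline do not run together; drop every other control/format character
    # (Unicode general categories starting with "C").
    kept = "".join(
        ch for ch in raw
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Collapse every run of white space into a single space and trim the ends.
    return " ".join(kept.split())


def make_excerpt(text: str, max_chars: int = 60) -> str:
    """Hypothetical helper: return a cleaned excerpt of at most max_chars
    characters (plus an ellipsis), cut between words where possible."""
    clean = strip_text(text)
    if len(clean) <= max_chars:
        return clean
    cut = clean[:max_chars]
    # If the cut falls inside a word, back up to the previous space.
    if clean[max_chars] != " ":
        cut = cut.rsplit(" ", 1)[0]
    return cut + "..."


print(strip_text("Hello,\t\x00 world\x07!\n\n"))
print(make_excerpt("one two three four", max_chars=9))
```

Position-preserving extraction is the opposite choice: white space is kept because it carries the layout.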
It depends on what you mean by “formatted”. With ASCII text, “formatted” means that the characters are in certain positions on a page. A few examples would be:
- when text is printed on a check, the text must be in specific areas for the check to print accurately
- when spreadsheets are saved as text, it’s important to see what’s in each column
- if reports are converted to ASCII, the data should be in the correct tables
- if a form is converted to text, the descriptions must align with corresponding fields for data
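The idea of formatting by position can be sketched in a few lines of Python: each piece of text is placed at a row and column on a grid of equally spaced characters, much like the fields on a check or form. The function `render_layout`, its parameters and the sample data are assumptions made up for this illustration and do not describe how any particular extraction tool works.

```python
def render_layout(items, width=40, height=3):
    """Hypothetical helper: place (row, col, text) items on a character grid."""
    # Start with a blank page of equally spaced character cells.
    grid = [[" "] * width for _ in range(height)]
    for row, col, text in items:
        for offset, ch in enumerate(text):
            # Silently ignore anything that would fall outside the page.
            if 0 <= row < height and 0 <= col + offset < width:
                grid[row][col + offset] = ch
    # Trailing spaces carry no information, so trim them from each line.
    return ["".join(line).rstrip() for line in grid]


# Made-up sample: labels start in column 0, values in column 10.
check = [
    (0, 0, "Pay to:"), (0, 10, "Jane Doe"),
    (1, 0, "Amount:"), (1, 10, "125.00"),
    (2, 0, "Date:"), (2, 10, "2024-01-31"),
]

lines = render_layout(check)
print("\n".join(lines))
```

Because every value starts in the same column, the labels and their data stay aligned when the result is viewed in a monospaced font, which is the only kind of formatting plain text can carry.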
With the text extraction tools from Visual Integrity, you can count on precision placement of each character. Since the format does not support attributes such as bold, underline or italic, these will all be sacrificed in the conversion process.