Unicode is today’s standard for defining characters on computing platforms. It was developed to transcend the limits of traditional encodings and contains more than 120,000 characters, enough to cover the world’s languages, both current and historic. It also contains many symbols and special character sets. Individual characters are defined as code points rather than glyphs, making the standard both robust and flexible. The current version of Unicode is v8, and it is predominantly implemented using the UTF-8 encoding, in which the first 128 characters are plain ASCII.
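The relationship between code points, UTF-8 and ASCII can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `utf8_length` is made up for illustration.

```python
import unicodedata

def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode one character."""
    return len(ch.encode("utf-8"))

ascii_text = "Hello"
# Text made only of ASCII characters produces identical bytes in both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters outside the ASCII range take two to four bytes in UTF-8.
samples = {
    "A": "A",                   # U+0041, inside the ASCII range
    "e-acute": "\u00e9",        # U+00E9
    "euro": "\u20ac",           # U+20AC
    "g-clef": "\U0001d11e",     # U+1D11E, outside the Basic Multilingual Plane
}
lengths = {label: utf8_length(ch) for label, ch in samples.items()}

# A code point names an abstract character; how it is drawn is left to the font.
code_point = f"U+{ord(samples['euro']):04X}"
name = unicodedata.name(samples["euro"])

print(same_bytes, lengths, code_point, name)
```

Because ASCII text is already valid UTF-8, files that contain only the first 128 characters need no conversion at all.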
All of the visual effects added by today’s word processing programs, such as typefaces, font sizes, colors, line and paragraph spacing, tables and graphics, are unavailable in text formats. Characters are equally spaced, and their positioning is the only formatting that’s available. Text is pulled out of databases, spreadsheets and other repositories. Often, there is a need to extract information and feed it into a mark-up or formatting system to eventually present it to end-users in a presentable format. Text is also used to generate keywords, abstracts and excerpts for HTML-based systems, content management systems and search/indexing applications.
Frequently Asked Questions
When using one of the text extraction tools, either via the command line or the API, you can choose to:
- completely strip the text of white space, non-printing characters, etc.
- extract text while preserving the placement of all characters on a page
- generate excerpts or abstracts
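To make the stripping and excerpt options above concrete, here is a minimal Python sketch. It is not the Visual Integrity API: the functions `strip_text` and `make_excerpt` are hypothetical, and the sketch assumes that "stripping" means dropping non-printing characters and collapsing runs of white space into single spaces.

```python
import unicodedata

def strip_text(text: str) -> str:
    """Drop non-printing characters and collapse runs of white space.

    Hypothetical helper for illustration only.
    """
    kept = []
    for ch in text:
        if ch.isspace():
            # Tabs, newlines and similar become plain separators.
            kept.append(" ")
        elif not unicodedata.category(ch).startswith("C"):
            # Unicode categories starting with "C" cover control and format characters.
            kept.append(ch)
    return " ".join("".join(kept).split())

def make_excerpt(text: str, max_chars: int = 40) -> str:
    """Return cleaned text cut at a word boundary, marked with an ellipsis."""
    cleaned = strip_text(text)
    if len(cleaned) <= max_chars:
        return cleaned
    cut = cleaned[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."

raw = "Invoice\t\tNo.\x00 42\n\n  Total:\x07 19.99  "
print(strip_text(raw))
print(make_excerpt("The quick brown fox jumps over the lazy dog", 20))
```

Cutting at a word boundary keeps excerpts readable when they are reused as abstracts or search snippets.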
It depends on what you mean by “formatted”. With ASCII text, “formatted” means that the characters are in certain positions on a page. A few examples would be:
- when text is printed on a check, the text must be in specific areas for the check to print accurately
- when spreadsheets are saved as text, it’s important to see what’s in each column
- if reports are converted to ASCII, the data should be in the correct tables
- if a form is converted to text, the descriptions must align with corresponding fields for data
With the text extraction tools from Visual Integrity, you can count on precise placement of each character. Since the text format does not support attributes such as bold, underline or italic, these will all be sacrificed in the conversion process.
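The idea of formatting by character position can be sketched in Python as a fixed-width character grid. This is an illustration under stated assumptions, not how the Visual Integrity tools work internally: the function `place_on_grid` and the `(row, column, text)` fragment format are hypothetical.

```python
def place_on_grid(items, width=40):
    """Render (row, column, text) fragments onto a fixed-width character grid.

    Hypothetical helper: each fragment starts at its given column, and
    characters that would fall past the right edge are dropped.
    """
    height = max(row for row, _, _ in items) + 1
    grid = [[" "] * width for _ in range(height)]
    for row, col, text in items:
        for offset, ch in enumerate(text):
            if col + offset < width:
                grid[row][col + offset] = ch
    # Trailing blanks carry no information, so trim them from each line.
    return ["".join(line).rstrip() for line in grid]

# Fragments as they might come from a spreadsheet saved as text.
fragments = [
    (0, 0, "Item"), (0, 12, "Qty"), (0, 18, "Price"),
    (1, 0, "Widget"), (1, 12, "2"), (1, 18, "9.50"),
]
lines = place_on_grid(fragments)
print("\n".join(lines))
```

Each value starts in the same column on every row, so the columns stay aligned in any monospaced display even though no bold, underline or italic attributes survive.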