
Accurately extracting text and image data from documents with unpredictable formats and layouts (PDF, Word, Excel, web pages, emails) that lack an underlying technical structure such as XML or field identifiers has always been a challenge for conventional technologies, including other RPA (Robotic Process Automation) platforms. As a result, people must read each document and re-enter the data, which increases processing cost, time and errors.

Instaknow’s patented Artificial Intelligence processes millions of complex documents, eliminating manual processing for Fortune 500 clients in Banking, Supply Chain, Healthcare, Utilities, Pharmaceuticals, Law, Insurance and Government. All required data is accurately extracted and converted to XML for conventional processing.

Using human-eyeball-like scanning of each document’s layout, Instaknow correctly decides which text is a header or a label in that document, WITHOUT needing underlying structure such as XML, field identifiers or Machine Learning examples. Data can be laid out DIFFERENTLY in different documents. Instaknow can even accurately read checkboxes and radio buttons. If a human eyeball can find and isolate the data of interest, Instaknow can do it too, regardless of variations. Documents do NOT need to be in specific technical formats. They can be text documents or image/scan documents, with one or multiple pages. Sections within documents can appear in any order, and columns in tables can also appear in an unpredictable sequence!

For example, in the following scanned tax returns, the top return has space for three Officers while the bottom return can list up to four. The column widths are also very different. These documents came in as scanned images and have no underlying XML, technical ids or predictable string sequences that would allow conventional data processing such as RPA (Robotic Process Automation). Only a person can detect the actual data layout and content, and that person has to manually re-enter it in another computer system or file for further processing. But manual processing of thousands of documents is expensive, slow and error-prone!

[Image “Insta-Intelligence_1”: two scanned tax returns with differently laid out officer sections]

Instaknow can do the same processing automatically. Using its human-like Artificial Intelligence, Instaknow can be told to extract “Officer Name and address” from the “Information about officers” section.

That instruction allows it to do the following user-like steps:
  • Read each document. If the document is an image, it is converted to text using Optical Character Recognition (OCR).
  • Decide whether the document is relevant for this data extraction (i.e. is it a tax return?).
  • Find the appropriate page for this data extraction. The required data may be on different pages in different tax returns. Within the page, the data of interest may be in different vertical and horizontal locations in different documents.
  • Find the section header “Information about Officers”. Alternates can be provided for headers and labels to handle different wording that means the same thing.
  • In the proximity of the dynamically found header, look for the label “Name and address”. Then look to the left and right of the label to decide how far the “visual scope” of the label extends (i.e. which data below the label belongs to that column). This use of “white space” to decide label scope requires artificial intelligence and is beyond the capabilities of conventional technologies.
  • Decide the vertical scope of the data part of that section using white space gaps and font prominence (e.g. bold or bigger fonts are more likely to be headers and labels than data).
  • After isolating the data rectangle in this way, extract the data and save it as XML for further automated downstream processing. Tables/grids of data from the document are correctly extracted as XML nodes, retaining the original data relationships.
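
To make the steps above concrete, here is a deliberately simplified sketch in Python. It is not Instaknow's patented method: it assumes the OCR output is already clean, fixed-width text, it treats gaps of two or more spaces as the white space that separates labels, and it uses a blank line (not font prominence) to end the section. All function names and sample data are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def find_line(lines, alternates, start=0):
    """Index of the first line at or after `start` that contains any alternate text."""
    wanted = [a.lower() for a in alternates]
    for i in range(start, len(lines)):
        if any(a in lines[i].lower() for a in wanted):
            return i
    return None


def label_spans(label_line):
    """Split a label row on gaps of 2+ spaces into (text, start_col, end_col) tuples.

    A label's horizontal scope runs from its first character to the start of
    the next label (end_col is None for the last label on the row).
    """
    found = list(re.finditer(r"\S+(?: \S+)*", label_line))
    spans = []
    for k, m in enumerate(found):
        end = found[k + 1].start() if k + 1 < len(found) else None
        spans.append((m.group(), m.start(), end))
    return spans


def extract_column(lines, header_alts, label_alts):
    """Return the cells under a label found below a section header, or None."""
    header_at = find_line(lines, header_alts)
    if header_at is None:
        return None  # document not relevant, or section missing
    label_at = find_line(lines, label_alts, header_at + 1)
    if label_at is None:
        return None
    wanted = [a.lower() for a in label_alts]
    span = next((s for s in label_spans(lines[label_at])
                 if any(a in s[0].lower() for a in wanted)), None)
    if span is None:
        return None
    _, start, end = span
    cells = []
    for row in lines[label_at + 1:]:
        if not row.strip():  # a blank line closes the vertical scope
            break
        cell = row[start:end].strip()
        if cell:
            cells.append(cell)
    return cells


def to_xml(cells, root_tag="Officers", item_tag="NameAndAddress"):
    """Serialize extracted cells as XML for downstream processing."""
    root = ET.Element(root_tag)
    for cell in cells:
        ET.SubElement(root, item_tag).text = cell
    return ET.tostring(root, encoding="unicode")


def aligned(widths, *cells):
    """Build one fixed-width sample line (a stand-in for OCR output)."""
    return "".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()


# Alternates for the header and label, as described in the steps above.
HEADERS = ["Information about Officers", "Officer Information"]
LABELS = ["Name and address", "Name & address"]

# Two made-up returns: different header wording, column order and widths.
DOC_A = [
    "Sample return A",
    "",
    "Information about Officers",
    aligned((26, 13, 6), "Name and address", "Title", "Share"),
    aligned((26, 13, 6), "Ann Lee, 1 Main St", "President", "60%"),
    aligned((26, 13, 6), "Bob Roy, 9 Oak Ave", "Secretary", "40%"),
    "",
    "Signature",
]
DOC_B = [
    "Officer Information",
    aligned((12, 22, 6), "Title", "Name & address", "Share"),
    aligned((12, 22, 6), "CEO", "Cy Dow, 4 Elm Rd", "50%"),
    aligned((12, 22, 6), "CFO", "Di Fox, 7 Bay Ct", "30%"),
    aligned((12, 22, 6), "COO", "Ed Gray, 2 Hill Ln", "20%"),
]

if __name__ == "__main__":
    for doc in (DOC_A, DOC_B):
        print(to_xml(extract_column(doc, HEADERS, LABELS)))
```

The same single instruction (header alternates plus label alternates) pulls the name-and-address column out of both sample layouts, even though the column sits in a different position and has a different width in each. Real scanned documents add noise, skew and proportional fonts that this toy does not attempt to handle.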

Multiple sets of data can be extracted from the same document page together, separately or conditionally (e.g. “Extract Balance Sheet details from another page only if total revenue reported on the first page of the Tax Return is more than $100,000”). All exceptions (e.g. expected pages missing from the document) are routed to specified users for review.
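
The conditional rule and exception routing described above could be sketched as follows. This is only an illustration under assumed names: the `pages` dictionary shape, the `ExtractionResult` class and the `process_return` function are hypothetical and do not represent Instaknow's rule syntax.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    extracted: dict = field(default_factory=dict)
    exceptions: list = field(default_factory=list)  # items to route to a reviewer


def process_return(pages, revenue_threshold=100_000):
    """Apply the example rule to one tax return.

    `pages` maps a page name to fields already extracted from that page
    (a hypothetical shape chosen for this sketch).
    """
    result = ExtractionResult()
    first = pages.get("first_page")
    if first is None or "total_revenue" not in first:
        result.exceptions.append("first page or total revenue missing")
        return result
    result.extracted["total_revenue"] = first["total_revenue"]

    # Conditional extraction: only look at the balance sheet for larger returns.
    if first["total_revenue"] > revenue_threshold:
        balance_sheet = pages.get("balance_sheet")
        if balance_sheet is None:
            result.exceptions.append("balance sheet page missing")
        else:
            result.extracted["balance_sheet"] = balance_sheet
    return result


if __name__ == "__main__":
    big = {"first_page": {"total_revenue": 250_000},
           "balance_sheet": {"total_assets": 900_000}}
    print(process_return(big))
```

A return below the threshold yields only the revenue figure and no exceptions, while a return above the threshold with no balance sheet page produces an exception entry instead of failing silently.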

Below is an example of accurate data extraction in spite of layout variations in HTML Web pages.

As can be seen, the same fields are present in different physical locations or HTML hierarchies on different websites. A conventional extraction attempt based on the HTML hierarchy (e.g. “Screen Scraping”) will fail, because the “Previous Close” numeric value is in different branches of the HTML on the two sites. Instaknow DYNAMICALLY decides where the labels describing the data of interest are and, from a single instruction, accurately extracts the related data using human-vision AI, regardless of unknown location or position variations.
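
The difference between hierarchy-based scraping and label-driven extraction can be illustrated with a small Python sketch using only the standard library. It is a much simpler stand-in for visual layout analysis: it ignores the tag hierarchy, flattens each page to its visible text in document order, and takes the first number found near the label. The two HTML snippets, the three-token search window and the function names are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


class TextTokens(HTMLParser):
    """Collect visible text fragments in document order, ignoring tag structure."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def value_after_label(html, label, window=3):
    """Return the first number in or shortly after the text fragment holding `label`."""
    parser = TextTokens()
    parser.feed(html)
    parser.close()
    needle = label.lower()
    for i, token in enumerate(parser.tokens):
        pos = token.lower().find(needle)
        if pos == -1:
            continue
        # The value may share the label's fragment or sit in one of the next few.
        rest = token[pos + len(label):]
        for candidate in [rest] + parser.tokens[i + 1:i + 1 + window]:
            match = NUMBER.search(candidate)
            if match:
                return match.group()
    return None


# Made-up pages: the same field sits in a table on one site and a definition list on the other.
SITE_A = ("<table><tr><td>Previous Close</td><td>101.25</td></tr>"
          "<tr><td>Open</td><td>102.00</td></tr></table>")
SITE_B = ("<div><span class='k'>Open</span><b>102.00</b></div>"
          "<div><dl><dt>Previous close:</dt><dd><span>101.25</span></dd></dl></div>")

if __name__ == "__main__":
    for site in (SITE_A, SITE_B):
        print(value_after_label(site, "Previous Close"))
```

A fixed path such as "second cell of the first table row" would work on the first snippet and fail on the second, whereas the label-driven lookup uses one instruction for both. Document order is only a rough proxy for visual proximity, so pages that position elements with CSS would need real layout information.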

Automation Excellence

HUMAN INTELLIGENCE AUTOMATION

Learn how Instaknow’s patented Human Intelligence Automation® with vision-like artificial intelligence is vastly more intelligent and flexible than RPA bots and can reliably handle infinite variability.