Document Processing

Last Updated: Nov 20, 2020

Document Processing is the conversion of paper-based and electronic documents into digital information using a combination of Intelligent Character Recognition (ICR), Optical Character Recognition (OCR) and manual intervention where needed.

In Jiffy, document processing is achieved in four phases:

Create a Doc Table with the required columns for the fields to be extracted from the document. Refer to Create Doc Table to know more.
Design the task using the Doc Reader node to extract the fields from the document. Refer to Doc Reader to know more.
Execute the task to extract the fields.
Train the Document Processing engine: familiarize the document, then save and approve the extracted fields.

Design the task

Following are the steps to extract data from a document using the Doc Reader node:

Drag the Doc Reader node from the Document processing category in the Nodes pane to the Design canvas and connect it to the Start and End nodes.
Refer to Create a task, Task Design Canvas and Doc Reader Node to know more.
Double-click on the Doc Reader node. The Attributes Pane opens.

Setting the Configurations of Doc Reader

Click on the New Configuration radio button.
Provide a Configuration Name and then click on the Create button.
The Configuration gets created and listed under Existing Configuration.

Setting the Properties of Doc Reader

Click on the Properties tab. Set the values for the following properties:
1. Name: Provide the name e.g. Invoice Reader
2. Description: Provide a description e.g. To process the Invoice received from Travel Agents.
3. Document Table: All the doc tables in the App will be listed here. Select the Doc Table created earlier where you wish to save the extracted values. Refer to the Create Doc Table to know more.
4. Continue on Failure: ON/OFF.
  When the Continue on Failure field is ON, even if the node fails, the execution will continue to the next node.
  When the Continue on Failure field is OFF, if the node fails, the execution will not continue to the next node.
5. Post-Processing Task: Optionally select a task to be executed after the current working task completes.
  Validations on the fields extracted from the document are performed in the post-processing task. The UUID of the record being processed is passed to the task. For example, validate whether the extracted Product Number is present in the Product master table. If the product number is present, update the status to ‘Validation Successful’. If the product number is not present in the master table, update the status to ‘Validation Failed’. The drop-down field lists all the tasks created for the App, from which the user can select the required task.
  
  The task selected as the post-processing task cannot:
  a. have a doc processing node
  b. have a task node
  c. be the current working task.
6. Mark run Failure on Node Fail: ON/OFF.
  When the Mark run Failure on Node Fail field is ON, if the node execution fails, the task execution is marked as failed.
  When the Mark run Failure on Node Fail field is OFF, even if the node execution fails, the task execution is marked as passed.
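
The validation example given for the Post-Processing Task property can be sketched in Python as follows. This is only an illustration of the logic, not Jiffy's API: the `PRODUCT_MASTER` set, the `doc_table` dictionary, the `product_number` field name and the `validate_record` function are all hypothetical stand-ins for the Product master table, the Doc Table and the post-processing task.

```python
# Hypothetical stand-in for the Product master table.
PRODUCT_MASTER = {"P-1001", "P-1002", "P-1003"}

# Hypothetical stand-in for the Doc Table, keyed by record UUID.
doc_table = {
    "uuid-1": {"product_number": "P-1002", "status": "Extracted Successfully"},
    "uuid-2": {"product_number": "P-9999", "status": "Extracted Successfully"},
}


def validate_record(record_uuid, table, product_master):
    """Update the record's status depending on whether its product number
    exists in the master table, and return the new status."""
    record = table[record_uuid]
    if record["product_number"] in product_master:
        record["status"] = "Validation Successful"
    else:
        record["status"] = "Validation Failed"
    return record["status"]


# The post-processing task receives the UUID of the record being processed.
print(validate_record("uuid-1", doc_table, PRODUCT_MASTER))
print(validate_record("uuid-2", doc_table, PRODUCT_MASTER))
```

Because only the UUID is passed in, the sketch looks the record up first and then writes the result back to the same record.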

Mapping the Inputs to Doc Reader

Map all the required inputs to the Doc Reader node, such as the path of the document to be processed, the password if the document is password protected, and the category if you want to define it.

Double-click on the mapping line between the Start and Doc Reader nodes.
In the Edit Mapping Data window, select the tag location from the RHS panel (1).
Click on the Element Map button (2). The mapping is listed at the bottom with the mapping type Constant (3).
Under the What to get? field (4), type the path of the document. Example: C:\Documents\Travel_Invoice.pdf
If the document is password protected, select the tag password from the RHS panel and click on the Element Map button. Under the What to get? field, type the password.
If variables need to be mapped instead of constants, select the preceding node in the LHS panel and map the required tags under it to location, password or category as needed.
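
As a rough mental model of the mapping, the three tags can be pictured as one required value and two optional ones. The Python sketch below is an assumption made for illustration: `DocReaderInputs` and `mapped_tags` are hypothetical names, and only the tag names location, password and category come from the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocReaderInputs:
    """Hypothetical container for the tags mapped to the Doc Reader node."""
    location: str                   # path of the document to be processed
    password: Optional[str] = None  # only for password-protected documents
    category: Optional[str] = None  # only if you want to define the category

    def mapped_tags(self):
        """Return only the tags that would actually be mapped."""
        tags = {"location": self.location}
        if self.password is not None:
            tags["password"] = self.password
        if self.category is not None:
            tags["category"] = self.category
        return tags


# A raw string keeps the backslashes of the Windows path intact.
inputs = DocReaderInputs(location=r"C:\Documents\Travel_Invoice.pdf", category="Invoice")
print(inputs.mapped_tags())
```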

Categorizing the Document

The category of the document can be identified in two ways:
1. Manually feed the category to the task: Select the tag category from the RHS panel and click on the Element Map button. Under the What to get? field, type the category for the document e.g. Invoice.
2. Machine Learning picks the category for the document: With this option, ML assigns a unique category to a document with a new template. The category need not be mapped in the task.
  Click on the Delete icon beside the category mapping.
  Click on the Close button to close the mapping screen.
  
  When a similar document is processed again, ML assigns it the same category that it assigned earlier, so all documents with a similar template share the same category.
  If the template of the document being processed is new, ML assigns a unique category.
Click on the Close icon to close the Edit Mapping Data window.
Click on the Save icon.
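
The two ways of categorizing a document can be sketched as a small Python model. This is not how the ML engine works internally: the engine decides whether two documents share a template, whereas the hypothetical `CategoryRegistry` below uses an exact `template_key` string as a stand-in for that decision. The generated names such as `category-1` are also invented for the example.

```python
class CategoryRegistry:
    """Toy model: one category per template; a new template gets a new,
    unique category. The template_key is a stand-in for the ML engine's
    own judgement of template similarity."""

    def __init__(self):
        self._by_template = {}

    def categorize(self, template_key, manual_category=None):
        # Option 1: the category was fed manually through the mapping.
        if manual_category is not None:
            return manual_category
        # Option 2, known template: reuse the category assigned earlier.
        if template_key in self._by_template:
            return self._by_template[template_key]
        # Option 2, new template: assign a unique category and remember it.
        category = f"category-{len(self._by_template) + 1}"
        self._by_template[template_key] = category
        return category


registry = CategoryRegistry()
first = registry.categorize("vendor-a-layout")
second = registry.categorize("vendor-b-layout")
again = registry.categorize("vendor-a-layout")
print(first, second, again)
```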

Execute the task to extract the fields

When the task executes, the fields are extracted and populated in the Doc table that was selected in the Properties of Doc Reader. Refer to the Create Doc Table to know more.

Click on the Trial run icon to execute the task.
After the task is executed successfully, click on the Doc Reader node.
Click on the ≡ icon to display the Result of execution window.
1. In the ResultText tab: The Document is displayed.
2. In the Output tab: The table path and document UUID are displayed along with the processed document in XML format.
3. In the Input tab: The input variables that have been mapped to the node are displayed.
4. Navigate to Dataset.
5. Open the Doc Table which you have created. Refer to the Doc Table to know more.
6. The default columns will be auto populated for the processed Document.
  1. PDF: This field is populated with the filename of the document that has been processed.
  2. Category: Category helps in identifying the document type.
    Whenever a document of a new template is processed through a doc reader node, it is identified with a unique value by the ML engine.
    Category can also be manually fed to the Bot.
  3. Status: This field can have the values New Type, Extracted Successfully, Manual Intervention and Technical error.
    - New Type: Assigned when a document of a new template is processed for the first time.
    - Extracted Successfully: Assigned when the document category is predicted and the field values are captured.
    - Manual Intervention: Assigned when any of the fields to be extracted from the document is not recognised by the ML engine.
    - Technical error: Assigned when an error is encountered in the Jiffy server during extraction.
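
The four Status values can be summarized as a small decision function. The sketch below is a reading of the descriptions above, not Jiffy's implementation: the function name, its parameters and especially the order in which the conditions are checked are assumptions.

```python
def document_status(is_new_template, extracted, required_fields, server_error=False):
    """Return one of the four Status values.

    The order of the checks is an assumption made for this sketch.
    """
    if server_error:
        return "Technical error"
    if is_new_template:
        return "New Type"
    # A field counts as not recognised when it is absent or empty.
    if any(not extracted.get(field) for field in required_fields):
        return "Manual Intervention"
    return "Extracted Successfully"


fields = ["Invoice number", "Invoice date", "Total amount"]
partial = {"Invoice number": "INV-7", "Invoice date": "2020-11-20"}
print(document_status(False, partial, fields))
```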

Familiarize the Document

After the fields are extracted, familiarize the document if any of the extracted fields need correction or if any fields were not extracted.

Click on the filename under PDF. The Jiffy PDF window opens, where you can familiarize the document.
The PDF is displayed on the left side. The Fields tab, which lists the columns that need to be extracted, is displayed on the right.
Select the identifier for a column, e.g. “Invoice number”, by clicking on the keyword Invoice number in the displayed PDF.
Then select the value for that identifier by clicking on the value in the PDF.
The fields familiarized in the PDF will be displayed in the right panel.
Note: The columns and values that the ML engine has identified will be auto populated.
Verify the values and change them if they are incorrect by clicking on the actual values in the PDF on the left side.
Familiarize the remaining columns and values, e.g. Invoice date, Total amount and PO number. The right panel displays all of the familiarized fields.

Familiarize the Inline table

To familiarize the inline table, click on the headers of the columns that you want to extract, e.g. Flight number, PNR number and Amount.
The familiarized column names are populated as column headings of the Inline table in the RHS panel.
Click on the Settings icon and select the End of the table. This will train the bot to understand how to locate the end of the table.
Click on any text below the table that identifies the end of the table.
Click on Get Data. All the table data is extracted from the PDF and populated in the table in the right panel.
Click on the SAVE AND APPROVE button to save the changes and approve the category of the PDF processed. The PDF gets approved successfully.
Click on the back-arrow button to navigate back to the Doc table.
The status of the document is updated to Extracted Successfully, and the values extracted from the document are populated in the Doc table, e.g. Invoice No, Invoice date, Total amount and PO number.
To view the inline table data, click on the View button.
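
The idea behind selecting column headers and an End of the table marker can be illustrated with a simplified Python sketch. It assumes the page text is already available as lines with cells separated by `|`, which is an invented format for the example. The real engine works on the PDF layout, and `extract_inline_table` and the sample data are hypothetical.

```python
def extract_inline_table(lines, headers, end_marker):
    """Collect the selected columns from the rows that lie between the
    header line and the first line containing the end-of-table marker."""
    rows = []
    positions = []
    in_table = False
    for line in lines:
        cells = [cell.strip() for cell in line.split("|")]
        if not in_table:
            # The header line is the one containing every selected header.
            if all(header in cells for header in headers):
                positions = [cells.index(header) for header in headers]
                in_table = True
            continue
        if end_marker in line:
            break  # text below the table marks its end
        if len(cells) > max(positions):
            rows.append({h: cells[p] for h, p in zip(headers, positions)})
    return rows


# Illustrative page text, not taken from a real invoice.
page_text = [
    "Travel Invoice",
    "Flight number | PNR number | Passenger | Amount",
    "AI-202 | X1Y2Z3 | A. Kumar | 450.00",
    "BA-117 | Q9W8E7 | B. Singh | 620.50",
    "Total amount: 1070.50",
    "Thank you for your business",
]
rows = extract_inline_table(page_text, ["Flight number", "PNR number", "Amount"], "Total amount")
for row in rows:
    print(row)
```

Without the end marker, the sketch would keep reading lines below the table as rows, which is why the end of the table has to be identified during familiarization.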