Smart OCR (Optical Character Recognition) allows intelligent data search and extraction within a document without using fixed zones.
The search is driven by rules. Each rule specifies the format of a pattern to look for in the document text (for example, VAT Nr.) and the format of the value to extract, which is searched for at a given position relative to where the pattern was found. All of this relies on the OCR engine and on information about the document structure, text location and word coordinates.
The module creates custom variables during processing. Check the Variables list for more details.
The left-hand menu shows the available settings sections. The settings displayed depend on the selected section.
General
Test document
Browse to or enter a document to use for testing the configured rules.
Rules
The table shows a list of all configured Smart Rules.
The table contains the following columns:
• Variable
The variable assigned to the rule which will contain the output result of the recognition.
• Position
The position of the value relative to its pattern.
• Result
The result of the rule, populated when the rules are tested with the Test button and the Test document.
• Id
The id associated with the rule. It is unique and corresponds to the id of the rule in the DB.
• Status
Indicates whether the rule is enabled; a disabled rule is not used. A new rule is automatically enabled when it is created. To disable it, click the indicator in the rules list. The rule is then greyed out, its indicator turns grey and is set to off, and the rule is skipped during Smart OCR processing.
Press the New button to create a new rule. The same dialog is displayed when editing an existing rule from the list.
Rule
Active
Enable or disable the current rule.
Variable
Enter the name of the variable which will contain the recognition result of this rule. If the variable already exists in the current Workflow, an error message is shown and the rule cannot be saved.
Position
Select where to look for the value relative to the pattern. Available options are:
- Any
- Bottom
- Left
- Right
- Top
When Any is selected as the position, only the Regular Expression field is enabled and the Value field is disabled. Any is a special position that searches for the pattern and the value together in a single regular expression.
Regular Expression
Enter the Regular Expression used to search for the pattern (or for the pattern and the value together, if Any is used as the position), or click the Variables button on the right to select a variable that contains the expression. The expression is matched against each word of the text until a match is found.
When the Any position is used, the Regular Expression must be written as two groups, the first matching the pattern and the second matching the value, in the form:
(GROUP1)(GROUP2)
If the two groups are missing, the results with the Any position are unpredictable.
For more information on regular expressions please check the Regular Expressions Appendix.
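The two-group form used with the Any position can be tried outside the product. The following is a minimal sketch in Python, not the module's own implementation: the expression, the sample text and the function name extract_any are illustrative assumptions, and Python's regular expression syntax may differ in details from the one used by the module.

```python
import re

# Hypothetical Any-position expression: group 1 matches the pattern (the label),
# group 2 matches the value to extract.
ANY_RULE = re.compile(r"(VAT\s*Nr\.?\s*:?\s*)([A-Z]{2}[0-9]{8,12})")


def extract_any(text):
    """Return the value group (group 2) of the first match, or None."""
    match = ANY_RULE.search(text)
    return match.group(2) if match else None


# Sample text invented for illustration only.
print(extract_any("Supplier VAT Nr.: IT01234567890"))
```

Only the second group is returned, which mirrors the description above: the first group locates the label, the second one carries the value.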
Value
Enter the Regular Expression used to search for the value once the pattern has been found with the Regular Expression above, or click the Variables button on the right to select a variable that contains the expression. The expression is matched against each word of the text located at the selected Position relative to the matched pattern, until a match is found. This field is not used with the Any position.
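The interplay between pattern, Position and Value can be pictured with a small sketch in Python. It is an illustration under stated assumptions, not the module's actual algorithm: the Word structure, the coordinate system (origin at the top left, y growing downward), the overlap test and the "nearest word first" ordering are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge; y grows downward (assumed)
    w: float  # width
    h: float  # height


def _overlap(a_start, a_end, b_start, b_end):
    """True when two 1-D intervals share some extent."""
    return a_start < b_end and b_start < a_end


def candidates(anchor, words, position):
    """Words lying in the given direction from the anchor, nearest first."""
    def same_row(w):
        return _overlap(w.y, w.y + w.h, anchor.y, anchor.y + anchor.h)

    def same_col(w):
        return _overlap(w.x, w.x + w.w, anchor.x, anchor.x + anchor.w)

    if position == "Right":
        hits = [w for w in words if w.x >= anchor.x + anchor.w and same_row(w)]
        return sorted(hits, key=lambda w: w.x)
    if position == "Left":
        hits = [w for w in words if w.x + w.w <= anchor.x and same_row(w)]
        return sorted(hits, key=lambda w: -w.x)
    if position == "Bottom":
        hits = [w for w in words if w.y >= anchor.y + anchor.h and same_col(w)]
        return sorted(hits, key=lambda w: w.y)
    if position == "Top":
        hits = [w for w in words if w.y + w.h <= anchor.y and same_col(w)]
        return sorted(hits, key=lambda w: -w.y)
    raise ValueError(f"unsupported position: {position}")


def apply_rule(words, pattern, value, position):
    """Find a word matching the pattern, then the nearest value match."""
    for anchor in words:
        if re.search(pattern, anchor.text):
            for cand in candidates(anchor, words, position):
                match = re.search(value, cand.text)
                if match:
                    return match.group(0)
    return None


# Invented word boxes for illustration only.
WORDS = [
    Word("Date", 10, 10, 30, 12),
    Word("2024-01-31", 10, 30, 70, 12),
    Word("TOTAL", 300, 500, 50, 12),
    Word("$", 360, 500, 8, 12),
    Word("154.06", 372, 500, 45, 12),
]

print(apply_rule(WORDS, r"TOTAL", r"[0-9]+\.[0-9]{2}", "Right"))
print(apply_rule(WORDS, r"Date", r"\d{4}-\d{2}-\d{2}", "Bottom"))
```

In this sketch the "$" word is skipped because it does not match the value expression, so the search continues to the next word on the right.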
Redact
If enabled, the matched value is automatically redacted with a black, non-removable box. The redacted text is exactly what matched the Value expression, or the value group of the Regular Expression when the Any position is used.
Highlight
If enabled, the matched value is automatically highlighted with a yellow, non-removable, semi-transparent box. The highlighted text is exactly what matched the Value expression, or the value group of the Regular Expression when the Any position is used.
All pages
If selected, recognition and, when enabled, Redact or Highlight run automatically on all pages. With this option enabled, the current variable always contains the result of the last page, and an additional variable is automatically created for every page, in the form:
VARIABLENAME_PX
The current variable name, either automatically generated or customized, is suffixed with _PX, where X is the number of the page.
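The naming scheme can be sketched as follows. This is a minimal illustration in Python, assuming that page numbers start at 1 and using bare variable names; the function name page_variables and the sample values are hypothetical.

```python
def page_variables(name, page_results):
    """Map per-page results to NAME_P1, NAME_P2, ... plus NAME for the last page."""
    variables = {}
    for page_number, result in enumerate(page_results, start=1):
        variables[f"{name}_P{page_number}"] = result
    if page_results:
        # The base variable holds the result of the last page.
        variables[name] = page_results[-1]
    return variables


# Invented results for a three-page document.
print(page_variables("TOTAL", ["10.00", "", "154.06"]))
```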
Only on page
Enter the number of the page on which the current rule is to be processed and, if enabled, redacted or highlighted.
Last page only
If selected, the rule runs only on the last page of the document, regardless of how many pages the document has.
Engine
Engine
Select the OCR engine used for recognition in this module. The available engines depend on the current license:
- Default (Tesseract OCR)
Languages
Select the language to use during the OCR recognition process. Multiple languages can be selected by holding the CTRL key while selecting the languages.
Please refer to the OCR Appendix chapter for the supported OCR languages.
Example
As an example, consider the following sample document:
We will add two rules:
- Invoice Number
- Invoice Total Amount
Invoice Number
We configure this rule with the following settings:
- %INVOICE_NUMBER% as Variable
- Right as Position
- (PO[ ]*#[ ]*) as Reg. Exp.
- [A-Z]{3}[0-9]+ as Value
- Redact enabled
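The two expressions of this rule can be checked against a line of text before entering them in the dialog. The sketch below uses Python's re module as a stand-in for the module's regular expression engine, which may differ in details; the sample line and the function name invoice_number are invented for illustration, and the search simply continues in the text after the pattern match to imitate the Right position.

```python
import re

PATTERN = re.compile(r"(PO[ ]*#[ ]*)")  # Regular Expression field of the rule
VALUE = re.compile(r"[A-Z]{3}[0-9]+")   # Value field of the rule


def invoice_number(line):
    """Return the value found after the pattern on the same line, or None."""
    found = PATTERN.search(line)
    if not found:
        return None
    # Search only the text that follows the matched pattern.
    value = VALUE.search(line, found.end())
    return value.group(0) if value else None


# Hypothetical line of OCR text.
print(invoice_number("PO # ABC12345"))
```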
Invoice Total Amount
We configure this rule with the following settings:
- %TOTAL% as Variable
- Right as Position
- (TOTAL[ ]*\$*) as Reg. Exp.
- ([0-9]+.[0-9]{2}) as Value
- Redact disabled
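The expressions of this rule can be tried in the same way. The sketch below again uses Python's re module as a stand-in for the module's engine, with an invented sample line and a hypothetical function name. It also shows one property of the Value expression as written: the unescaped dot matches any character, so a stricter variant with an escaped dot (an assumption, not part of the configured rule) is included for comparison.

```python
import re

PATTERN = re.compile(r"(TOTAL[ ]*\$*)")       # Regular Expression field of the rule
VALUE = re.compile(r"([0-9]+.[0-9]{2})")      # Value field of the rule, as configured
STRICT_VALUE = re.compile(r"([0-9]+\.[0-9]{2})")  # hypothetical stricter variant


def total_amount(line, value_re=VALUE):
    """Return the amount found after the pattern on the same line, or None."""
    found = PATTERN.search(line)
    if not found:
        return None
    value = value_re.search(line, found.end())
    return value.group(1) if value else None


# Hypothetical lines of OCR text.
print(total_amount("TOTAL $ 154.06"))
print(total_amount("TOTAL $ 15406"))                # the loose dot also accepts this
print(total_amount("TOTAL $ 15406", STRICT_VALUE))  # the escaped dot does not
```

Whether the loose or the strict form is preferable depends on the documents being processed; the unescaped dot also accepts a comma as decimal separator, for example.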
Pressing the Test button populates the Result column with the values extracted from the sample document above.