Smart OCR

Smart OCR (Optical Character Recognition) allows intelligent data search and extraction within a document without using fixed zones.

The search is driven by rules. Each rule specifies the format of a pattern to search for in the document text (for example, VAT Nr.) and the format of the value to extract at a given position relative to where the pattern is found. All of this relies on the OCR engine and on information about the document structure, text location and word coordinates.

The module creates custom variables during processing. Check the Variables list for more details.

The left-hand menu lists the available settings sections. Settings are displayed according to the selected section.

General

processing_settings_smartocr1

Test document
Browse to or enter a document to use for testing the configured rules.

Info

Browse uploads the sample document to the server; this operation stores the document file in the local Scanshare data folder.

Warning

For security and privacy reasons the filename always refers to the Scanshare data folder, not to a full absolute path. Do not enter absolute paths; they generate an error.
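
The rule above can be pictured with a small Python sketch. It is only an illustration of the described behavior, not the product's implementation: the data folder location and the function name resolve_test_document are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical data folder location; the real one depends on the installation.
DATA_FOLDER = PurePosixPath("/scanshare/data")


def resolve_test_document(filename: str) -> PurePosixPath:
    """Return the document path inside the data folder, rejecting absolute paths."""
    # Check both POSIX-style and Windows-style absolute paths.
    if PurePosixPath(filename).is_absolute() or PureWindowsPath(filename).is_absolute():
        raise ValueError(f"absolute paths are not allowed: {filename!r}")
    return DATA_FOLDER / filename
```

With these assumptions, "invoice.pdf" resolves to a file inside the data folder, while "C:\docs\invoice.pdf" or "/tmp/invoice.pdf" raises an error.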

Rules

The table shows a list of all configured Smart Rules.

The table has the following columns:

• Variable
The variable assigned to the rule; it will contain the recognition result.

• Position
The position of the value relative to its pattern.

• Result
The result field, populated when the rules are tested with the Test button and the Test document.

• Id
The unique id of the rule, as stored in the database.

• Status
Whether the rule is enabled. A disabled rule is not used. A new rule is enabled automatically when it is created. In the rules table you can disable it by clicking the indicator: the rule is then greyed out, the indicator turns gray and shows off, and the rule is skipped during Smart OCR processing.

Press the New button to create a new Rule. The same dialog is displayed when editing an existing Rule from the list.

Rule

Active
Enable or disable the current rule.

Variable
Enter the name of the variable which will contain the recognition result of this rule. If the variable already exists in the current Workflow, an error message prevents the rule from being saved.

Position
Select where to look for the value relative to the pattern. Available options are:

  • Any
  • Bottom
  • Left
  • Right
  • Top

When Any is used as the position, only the Regular Expression field is enabled and the Value field is disabled. This special position looks for the pattern and the value together in a single regular expression.

Regular Expression
Enter the Regular Expression used to search for the format of the pattern (or of the pattern and value together, if Any is used as the position), or click the Variables button on the right to select a variable that contains the expression. The expression is matched against every text word until a match is found.

When using the Any position, the Regular Expression must be written as two groups, the first group containing the pattern match and the second group containing the value match, in the form of:

(GROUP1)(GROUP2)

If the two groups are missing, the results with the Any position are unpredictable.
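
As an illustration of the two-group form, the following sketch uses Python's re module. It assumes the product's regular expression syntax behaves like Python's for these constructs; the VAT expression, the sample text and the function name match_any are hypothetical.

```python
import re

# Hypothetical rule for the Any position:
# group 1 matches the pattern (the label), group 2 matches the value.
ANY_RULE = re.compile(r"(VAT\s*Nr\.?\s*:?\s*)([A-Z]{2}[0-9]{8,12})")


def match_any(text: str):
    """Return the value (group 2) of the first match, or None if nothing matches."""
    m = ANY_RULE.search(text)
    return m.group(2) if m else None


print(match_any("VAT Nr. IT12345678901"))
```

Only the second group is returned as the result, which is why an expression without two groups cannot yield a meaningful value.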

For more information on regular expressions please check the Regular Expressions Appendix.

Value
Enter the Regular Expression used to search for the format of the value once the pattern has been found with the Regular Expression above, or click the Variables button on the right to select a variable that contains the expression. The expression is matched against every text word located at the selected Position relative to the matched pattern, until a match is found. This field is not used with the Any position.
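
The interplay of pattern, Position and Value can be sketched as follows. This is a simplified model under stated assumptions, not the engine's actual algorithm: the Word structure, the geometry tests in _is_at and the function find_value are hypothetical, and candidates are tried in list order rather than by distance.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge (values grow downward)
    w: float  # width
    h: float  # height


def _is_at(position: str, anchor: Word, cand: Word) -> bool:
    """True if cand lies on the given side of anchor (simplified geometry)."""
    same_row = abs(cand.y - anchor.y) < anchor.h
    same_col = cand.x < anchor.x + anchor.w and anchor.x < cand.x + cand.w
    if position == "Right":
        return same_row and cand.x >= anchor.x + anchor.w
    if position == "Left":
        return same_row and cand.x + cand.w <= anchor.x
    if position == "Bottom":
        return same_col and cand.y >= anchor.y + anchor.h
    if position == "Top":
        return same_col and cand.y + cand.h <= anchor.y
    raise ValueError(f"unsupported position: {position}")


def find_value(words: List[Word], pattern: str, value: str, position: str) -> Optional[str]:
    """Find a word matching pattern, then a value match on the requested side."""
    for anchor in words:
        if not re.search(pattern, anchor.text):
            continue
        for cand in words:
            if cand is not anchor and _is_at(position, anchor, cand):
                m = re.search(value, cand.text)
                if m:
                    return m.group(0)
    return None
```

For example, with a word "PO#" followed on the same row by "ABC123", calling find_value with Right as position and [A-Z]{3}[0-9]+ as value returns "ABC123".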

Redact
If enabled, the matched value is automatically redacted with a black, non-removable box. The redacted text is exactly what matched the Value expression, or the value group of the Regular Expression when the Any position is used.

Highlight
If enabled, the matched value is automatically highlighted with a yellow, non-removable, semi-transparent box. The highlighted text is exactly what matched the Value expression, or the value group of the Regular Expression when the Any position is used.

All pages
If selected, recognition (and Redact or Highlight, when enabled) runs automatically on all pages. When this option is enabled, the current variable always contains the result from the last page, and an automatic variable is created for every page in the form of:

VARIABLENAME_PX

The current variable name, either automatically generated or customized, is suffixed with _PX, where X is the page number.
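
The naming scheme can be illustrated with a short Python sketch. It assumes that page numbers start at 1 and uses bare variable names; the function page_variables is hypothetical and only models the described result.

```python
def page_variables(name: str, page_results: list) -> dict:
    """Map per-page results to NAME_P1, NAME_P2, ... and NAME to the last page result."""
    variables = {
        f"{name}_P{page}": result
        for page, result in enumerate(page_results, start=1)  # assumption: pages start at 1
    }
    if page_results:
        # The main variable always holds the result from the last page.
        variables[name] = page_results[-1]
    return variables


print(page_variables("TOTAL", ["10.00", "25.50"]))
```

For a two-page document this yields TOTAL_P1, TOTAL_P2 and TOTAL, where TOTAL holds the same result as TOTAL_P2.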

Only on page
Enter the number of the page on which the current rule is processed and, when enabled, redacted or highlighted.

Last page only
If selected, the process runs only on the last page of the document, regardless of how many pages the document has.
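
The three page options can be summarized in a small sketch. It assumes the options are mutually exclusive and that pages are numbered from 1; the mode names and the function pages_to_process are hypothetical.

```python
from typing import List, Optional


def pages_to_process(page_count: int, mode: str, only_on_page: Optional[int] = None) -> List[int]:
    """Return the 1-based page numbers a rule would run on for the selected option."""
    if page_count < 1:
        return []
    if mode == "all":    # All pages
        return list(range(1, page_count + 1))
    if mode == "last":   # Last page only
        return [page_count]
    if mode == "only":   # Only on page
        if only_on_page is not None and 1 <= only_on_page <= page_count:
            return [only_on_page]
        return []        # assumption: a page outside the document matches nothing
    raise ValueError(f"unknown mode: {mode}")
```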

Engine

processing_settings_smartocr5

Engine
Select the OCR engine used for recognition in this module. Depending on the current license, the available engines are:

  • Nuance OmniPage
  • Abbyy FineReader

Languages
Select the language to use during the OCR recognition process. Multiple languages can be selected by holding the CTRL key while selecting the languages.

Please refer to the OCR Appendix chapter for the supported OCR languages.

Abbyy

Enhance local contrast
If enabled, the engine increases the local contrast of the image during preprocessing. This option may improve recognition quality.

Info

The option is meaningful for color and gray images only.

The images for which this preprocessing method is effective include:

  • Photos or scans of documents with texture or pictures in the background. With the normal binarization procedure, the characters that coincide with darker areas of background may be lost or recognized unreliably. If you apply this method before recognition, such areas are detected, and contrast is increased, with the result that after binarization the characters stand out more distinctly.
  • Photos or scans of documents with highly colorful background or text highlighting.

Remove noise
If enabled, the engine reduces the noise of the image. Available working options are:

  • White noise: this mode may be useful, for example, for uncompressed images with ISO less than 800, or for reduced images.
  • Correlated noise: this mode may be useful, for example, for JPEG photos with high compression settings.

Example

As an example, assume the following sample document:

processing_settings_smartocr4

We will add two rules:

  • Invoice Number
  • Invoice Total Amount

Invoice Number
We configure this rule with the following settings:

  • %INVOICE_NUMBER% as Variable
  • Right as Position
  • (PO[ ]*#[ ]*) as Regular Expression
  • [A-Z]{3}[0-9]+ as Value
  • Redact enabled

Invoice Total Amount
We configure this rule with the following settings:

  • %TOTAL% as Variable
  • Right as Position
  • (TOTAL[ ]*\$*) as Regular Expression
  • ([0-9]+\.[0-9]{2}) as Value
  • Redact disabled

Using the Test button, the Result column is populated with the values extracted from the sample document above.

processing_settings_smartocr6
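
The two example rules can also be tried outside the product with a short Python sketch. It is only an approximation: the text lines are hypothetical stand-ins for the sample document, the Right position is reduced to "the text following the pattern on the same line", and the dot in the Value expression for the total is escaped so that it matches only a literal decimal point.

```python
import re

# The two example rules as (Regular Expression, Value) pairs.
RULES = {
    "INVOICE_NUMBER": (r"(PO[ ]*#[ ]*)", r"[A-Z]{3}[0-9]+"),
    "TOTAL": (r"(TOTAL[ ]*\$*)", r"([0-9]+\.[0-9]{2})"),
}

# Hypothetical text lines; the real sample document is not reproduced here.
SAMPLE_LINES = ["PO # ABC12345", "TOTAL $ 1250.00"]


def extract_right(line: str, pattern: str, value: str):
    """Find the pattern, then search for the value in the text to its right."""
    anchor = re.search(pattern, line)
    if anchor is None:
        return None
    found = re.search(value, line[anchor.end():])
    return found.group(0) if found else None


results = {}
for name, (pattern, value) in RULES.items():
    for line in SAMPLE_LINES:
        hit = extract_right(line, pattern, value)
        if hit is not None:
            results[name] = hit
            break

print(results)
```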