Default OCR encoding profile results in sub-par recognition

Description

The default encoding profile used to extract images from presentation streams for OCR applies a box blur filter.

OCR run on images that have passed through the box blur filter produces incoherent text. The recognized text is much better when the box blur filter is not applied.

A few pictures are attached below, together with their OCR'd text.

Note that the tests were run on presentation streams of PPT/PDF documents. Assuming the presenter uses PPT/PDF documents in which the foreground text contrasts well with its background, the box blur filter should, in my opinion, not be applied.
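
To illustrate why blurring can hurt recognition of already high-contrast slides, here is a minimal sketch in plain Python. It is not the filter implementation used by the encoding profile. The `box_blur` and `max_edge_contrast` helpers and the tiny `slide` grid are hypothetical, made up for this example. It applies a 3x3 box blur to a one-pixel-wide dark stroke on a white background and compares the edge contrast before and after.

```python
def box_blur(image, radius=1):
    """Return a box-blurred copy of a 2D grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            # Average the neighbourhood around (x, y), clipped at the border.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(window) // len(window))
        blurred.append(row)
    return blurred


def max_edge_contrast(image):
    """Largest intensity step between horizontally adjacent pixels."""
    return max(
        abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)
    )


# Hypothetical slide crop: a thin black stroke (0) on a white background (255).
slide = [[255, 255, 255, 0, 255, 255, 255] for _ in range(5)]
blurred = box_blur(slide, radius=1)

print("contrast before blur:", max_edge_contrast(slide))
print("contrast after blur: ", max_edge_contrast(blurred))
print("darkest stroke pixel after blur:", min(min(row) for row in blurred))
```

In this toy case the black stroke is smeared into light gray, so a binarization step with a mid-range threshold would no longer separate it from the background. Real OCR pipelines and real slides are more complex, so treat this only as intuition for the behaviour reported above.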

Steps to reproduce

None

Status

Assignee

Duncan Smith

Reporter

Duncan Smith

Criticality

Low

Tags (folksonomy)

None

Components

Fix versions

Affects versions

4.2

Priority