Default OCR encoding profile results in sub-par recognition

Description

The default encoding profile used to extract images from presentation streams (for OCR) applies a box blur filter.

Running OCR on images that have passed through the box blur filter produces incoherent text. The recognized text is much better when the box blur filter is not applied.

There are a few pictures attached below, with OCR'd text.

Note that the tests were run on presentation streams of PPT/PDF documents. Assuming the presenter uses PPT/PDF slides in which the foreground text contrasts well with its background, the box blur filter should, in my opinion, not be applied.
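
As a rough illustration of why blurring can hurt recognition of thin, high-contrast text, here is a minimal sketch of a one-dimensional box blur applied to a single row of pixels. It is not the filter implementation used by the encoding profile. The function names, the radius, and the sample scanline are all hypothetical and chosen only for this example.

```python
def box_blur_row(pixels, radius=1):
    """1-D box blur: each output pixel is the mean of the pixels within `radius`."""
    n = len(pixels)
    blurred = []
    for i in range(n):
        lo = max(0, i - radius)          # clamp the window at the left edge
        hi = min(n, i + radius + 1)      # clamp the window at the right edge
        window = pixels[lo:hi]
        blurred.append(sum(window) / len(window))
    return blurred


def contrast(pixels):
    """Difference between the brightest and darkest pixel in the row."""
    return max(pixels) - min(pixels)


# Hypothetical scanline: white background (255) crossed by two one-pixel-wide
# dark strokes (0) with a single white pixel between them.
row = [255, 255, 0, 255, 0, 255, 255]
blurred = box_blur_row(row, radius=1)

print("original:", row, "contrast:", contrast(row))
print("blurred: ", blurred, "contrast:", contrast(blurred))
```

In this toy case the blur lowers the overall contrast and makes the gap between the two strokes darker than the strokes themselves, so the strokes run together. That is consistent with the garbled OCR output reported here for clean, well-contrasted slides, though the real filter operates in two dimensions and on real frames.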

Activity

duncan smith
April 8, 2018, 6:27 AM

Text without filter:

Web Versions of Open Textbooks

Portab!e: 64 GB is

  • 64 Encyc!opedia Britannica (text)

  • 1 English Wikipedia (text)

  • 10,000 4007page math textbooks (w,"
    images)

Ubiquitous: laptop, tab!et, or phone
Upeto—Date: correct, and refresh, at w!!!
Accurate: crowd~sourced proofereading
Open: never outeofeprint

InteHectually Honest:

no pressure to satisfy market segments

FREE!!!!!

duncan smith
April 8, 2018, 6:26 AM
Edited

Text from boxblur filter:

Web Versions of Open Textbooks

Pmmh‘v' (14(IBls

. H11[V1<.x‘upr(1l.4[’XHKIIHHIJ\th
I l [H.‘flhlv \‘pr: (lm Hv -I)
. 11mm 4111‘ in, Ivmtlw fr 'Ithw‘Lw \

mug; \W

Ulilqtmum Lipiup [ANN m phunz
Up In [)‘Ilf' (mm-<1 Jud unvsh ‘41 \'.\H
N ( 4|

It Adm;
Opt-l1 nun mm m pun:

Int: HM HLIHx How \1

I10 pH-ssHlE' to \Illsh,’ Hulk-I stvglm-nis

FREE”!!!

Assignee

duncan smith

Reporter

duncan smith

Criticality

Low