Default OCR encoding profile results in sub-par recognition


The default encoding profile used to extract images from presentation streams (for OCR) applies a box blur filter.

Running OCR on images that have passed through the box blur filter produces incoherent text. The recognized text is much better when the box blur filter is not applied.
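
To illustrate why the blur is likely to hurt recognition, here is a minimal sketch in plain Python. It is not the actual encoding profile or OCR pipeline: `box_blur`, `contrast`, and the tiny 7x7 test image are hypothetical names and data made up for this example. It only shows how a 3x3 box blur smears a one-pixel-wide dark stroke on a white background and cuts the contrast that an OCR engine depends on.

```python
def box_blur(img, radius=1):
    """Box-blur a 2D grayscale image given as a list of rows (values 0-255).

    Each output pixel is the integer mean of the (2*radius+1)^2 window around
    it. Coordinates outside the image are clamped to the nearest edge pixel.
    """
    h, w = len(img), len(img[0])
    size = (2 * radius + 1) ** 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    total += img[yy][xx]
            row.append(total // size)
        out.append(row)
    return out


def contrast(img):
    """Difference between the brightest and darkest pixel."""
    return max(max(r) for r in img) - min(min(r) for r in img)


# Hypothetical slide fragment: white page (255) with a black (0),
# one-pixel-wide vertical stroke in column 3, like the stem of an "l".
page = [[0 if x == 3 else 255 for x in range(7)] for _ in range(7)]
blurred = box_blur(page, radius=1)

print(contrast(page))     # 255
print(contrast(blurred))  # 85
print(blurred[3])         # [255, 255, 170, 170, 170, 255, 255]
```

In this toy case the stroke goes from pure black to light grey (170) and spreads over three columns, so thin glyph features become faint and merge with their neighbours. Slides with sharp, high-contrast text have little noise for the blur to remove, so the filter mostly costs detail.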

A few pictures are attached below, along with the OCR'd text.

Note that the tests were run on presentation streams of PPT/PDF documents. Assuming the presenter uses PPT/PDF slides where the foreground text contrasts well with its background, the box blur filter should not, in my opinion, be applied.


duncan smith
April 8, 2018, 6:27 AM

Text without filter:

Web Versions of Open Textbooks

Portab!e: 64 GB is

  • 64 Encyc!opedia Britannica (text)

  • 1 English Wikipedia (text)

  • 10,000 4007page math textbooks (w,"

Ubiquitous: laptop, tab!et, or phone
Upeto—Date: correct, and refresh, at w!!!
Accurate: crowd~sourced proofereading
Open: never outeofeprint

InteHectually Honest:

no pressure to satisfy market segments


duncan smith
April 8, 2018, 6:26 AM

Text from boxblur filter:

Web Versions of Open Textbooks

Pmmh‘v' (14(IBls

. H11[V1<.x‘upr(1l.4[’XHKIIHHIJ\th
I l [H.‘flhlv \‘pr: (lm Hv -I)
. 11mm 4111‘ in, Ivmtlw fr 'Ithw‘Lw \

mug; \W

Ulilqtmum Lipiup [ANN m phunz
Up In [)‘Ilf' (mm-<1 Jud unvsh ‘41 \'.\H
N ( 4|

It Adm;
Opt-l1 nun mm m pun:

Int: HM HLIHx How \1

I10 pH-ssHlE' to \Illsh,’ Hulk-I stvglm-nis


