Text extraction does not respect dictionary

Steps to reproduce:
1. Run text extraction on a mediapackage with slides
2. Some text is recognized and visible in Engage, along with a lot of garbage

Actual Results:
Everything makes it into the text segment, even non-text.

Expected Results:
Non-words should be left out.

Activity

A
May 11, 2011, 12:05 AM
Edited

As discussed, upgrading this to release blocker.

Christopher Brooks
May 11, 2011, 3:46 AM

The fix version doesn't indicate whether this should be considered a blocker for the 1.2 release, which is scheduled to start QA shortly. If it does affect 1.2, please add that to the Fix Version/s field so we don't release without it.

Christopher Brooks
May 12, 2011, 8:29 PM

Dropping to critical. This bug is significant and makes the feature questionable, but it shouldn't stop a release at this time.

Tobias Wunden
May 13, 2011, 2:48 PM

It looks like the problem is with our sample dictionaries. Here are some samples from the German one:

„THE
§§




¬
¬A
¬F
¬X
−2
−4



1870–1922
A

This basically means that the dictionary is worthless right now and needs to be cleaned up in order to provide what we need.

Tobias Wunden
May 13, 2011, 3:05 PM

We would need to make sure that the dictionary creation code throws away "words" that:

  • are less than two characters long

  • contain non-word-characters
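
The two rules above could be sketched roughly as follows. This is only an illustration in Python, not the actual dictionary creation code: the function names are hypothetical, "word character" is assumed to mean a Unicode letter (so digits and punctuation disqualify a token), and "Vorlesung" and "Übung" are made-up entries added to the samples quoted earlier to show what survives.

```python
def is_dictionary_word(token: str, min_length: int = 2) -> bool:
    """Return True if the token passes both cleanup rules."""
    # Rule 1: throw away tokens shorter than min_length characters.
    if len(token) < min_length:
        return False
    # Rule 2: throw away tokens containing anything that is not a letter.
    # Assumption: digits, punctuation and symbols all count as
    # non-word characters here.
    return token.isalpha()


def clean_dictionary(tokens):
    """Keep only the tokens that look like real words, preserving order."""
    return [t for t in tokens if is_dictionary_word(t)]


# Entries from the German sample dictionary, plus two hypothetical real words.
samples = ["„THE", "§§", "¬", "¬A", "¬F", "¬X", "−2", "−4",
           "1870–1922", "A", "Vorlesung", "Übung"]
print(clean_dictionary(samples))  # ['Vorlesung', 'Übung']
```

Using `str.isalpha()` rather than an ASCII-only check keeps umlauts and other non-ASCII letters, which matters for the German dictionary.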

Assignee

Tobias Wunden

Reporter

Tobias Wunden

Severity

Data Loss/Corruption

Tags (folksonomy)

None

Components

Fix versions

Affects versions

Priority

Critical