Steps to reproduce:
1. Run text extraction on a mediapackage with slides
2. Some text is recognized and visible in engage, along with lots of garbage
Everything makes it into the text segment, even non-text.
Non-words should be left out.
As discussed, upgrading this to release blocker.
The fix version doesn't indicate whether this should be considered a blocker for the 1.2 release, which is scheduled to start QA shortly. If it does affect 1.2, please add that to the Fix Version/s field so we don't release without it.
Dropping to critical. This bug is significant and makes this feature questionable, but it shouldn't stop a release at this time.
It looks like the problem is with our sample dictionaries. Here are some samples from the German one:
This basically means that the dictionary is worthless right now and needs to be cleaned up in order to provide what we need.
We would need to make sure that the dictionary creation code throws away "words" that:
are less than two characters long