This document will help you understand Matterhorn's text analysis capabilities.
Text analysis extracts the textual content of still images taken from a slide presentation or a movie; in the context of lecture capture, these images are the preview images produced by video segmentation.
Automatically extracting metadata (text) from slide previews that were previously extracted from a movie by video segmentation enables in-video search. Search results can then link directly to the relevant point in the movie rather than just to its beginning, so learners no longer have to locate the interesting spot themselves.
The actual task of text analysis on an image comes down to running optical character recognition, so Matterhorn makes use of Tesseract, which provides the language support and backend functionality.
Depending on the installation of Tesseract, various image formats can be used. Please refer to the Tesseract documentation to learn what is supported. Matterhorn relies on the JPEG format, using the composer service to create still images from the original movies.
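As a rough sketch, the recognition step amounts to invoking the Tesseract command line on each still image. The helper below is purely illustrative (it is not part of Matterhorn's API, and the file names are hypothetical); the actual invocation is left commented out since it requires Tesseract to be installed:

```python
import subprocess

def recognize_text(image_path, out_base, language="eng"):
    """Build the Tesseract command for one still image.

    Hypothetical helper for illustration; Tesseract writes its
    result to <out_base>.txt.
    """
    cmd = ["tesseract", image_path, out_base, "-l", language]
    # subprocess.run(cmd, check=True)  # uncomment where tesseract is installed
    return cmd

print(recognize_text("segment-1.jpg", "segment-1"))
```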
Resizing images usually introduces noise and most often results in slightly blurred text. While this does not prevent text analysis from working, it does make it harder. Matterhorn therefore runs text analysis on images that have the same resolution as the original movie.
Tesseract is run on every image provided by the video segmenter, yielding a set of recognized words. The quality of these words depends heavily on the font used, the quality of the image, the contrast between text and background color, and other factors.
To compensate for recognition errors, the resulting words are matched against a huge dictionary (based on Wikipedia). This allows stop words as well as invalid character combinations to be filtered out, which improves the overall quality of the analysis result.
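The filtering step can be sketched as follows. This is a minimal illustration, not Matterhorn's implementation: the word lists are made up for the example, whereas the real dictionary is derived from Wikipedia.

```python
# Illustrative word lists only; Matterhorn's dictionary is far larger.
DICTIONARY = {"matterhorn", "lecture", "analysis", "segmentation", "video"}
STOP_WORDS = {"the", "a", "of", "and"}

def filter_ocr_words(words):
    """Keep only dictionary words, dropping stop words and OCR noise."""
    kept = []
    for word in words:
        w = word.lower()
        if w in STOP_WORDS:
            continue  # common stop word, not useful for search
        if w not in DICTIONARY:
            continue  # invalid character combination, likely an OCR error
        kept.append(w)
    return kept

print(filter_ocr_words(["The", "Lecture", "Vldeo", "Analys1s", "segmentation"]))
# → ['lecture', 'segmentation']
```

Garbled words such as "Vldeo" simply fail the dictionary lookup and are dropped, which is how the overall result quality increases even when individual recognitions fail.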
Due to the software that Matterhorn's text analysis service builds on, please be aware of the following limitations:
- Tesseract is known to work best on images that feature dark text on a light background. Right now, there is no algorithm in place that inspects the image's histogram and inverts the image should dark pixels outnumber light ones.
- Text analysis will not work well on slides that use uncommon fonts, handwritten notes, etc.
- Proper language support is needed for Tesseract to do its magic. Make sure that you have language packs installed for the languages that you are dealing with.
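The first limitation could in principle be worked around by inverting dark-background images before OCR. A minimal sketch of such a histogram check on 8-bit grayscale pixel values (this is an illustration of the idea, not code that exists in Matterhorn):

```python
def should_invert(pixels):
    """Return True if dark pixels dominate, i.e. likely light text on dark."""
    dark = sum(1 for p in pixels if p < 128)
    return dark > len(pixels) / 2

def invert(pixels):
    """Invert 8-bit grayscale pixel values."""
    return [255 - p for p in pixels]

slide = [20, 30, 25, 240, 22, 18]  # mostly dark: probably an inverted slide
if should_invert(slide):
    slide = invert(slide)  # now dark text on a light background
print(slide)
```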