The dictionary in Matterhorn stores words and stop words, extracted from Wikipedia exports in multiple languages, in a relational database table. This is a fairly heavyweight solution. Given enough memory and a more selective word list (Wikipedia contains a lot of oddities and typos), these dictionaries can be held in memory instead.
Implement a simple in-memory dictionary containing only valid words obtained from more reliable sources. Stop words are not (yet) used anywhere in the system, so they do not need to be included.
Attached is a patch that replaces the JPA dictionary implementation with an English-only in-memory implementation. The word list was generated from a combination of:
The ispell dictionary extract available from http://wordlist.sourceforge.net
The male, female, and last name list from http://www.census.gov/genealogy/names/names_files.html
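For illustration only, a minimal sketch of what such an in-memory dictionary could look like. This is not the code from the attached patch: the class name InMemoryDictionary and its methods are hypothetical, and the sketch assumes each source has already been reduced to plain text with one word per line (the raw downloads may need preprocessing first). Words are lower-cased so that capitalized entries such as names match regardless of case.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of an English-only in-memory dictionary; not the patch's class. */
final class InMemoryDictionary {

  /** All known words, trimmed and lower-cased. */
  private final Set<String> words;

  private InMemoryDictionary(Set<String> words) {
    this.words = Collections.unmodifiableSet(words);
  }

  /**
   * Merges several word lists into one dictionary.
   * Assumption: every source contains one word per line.
   * Each reader is closed after it has been consumed.
   */
  static InMemoryDictionary load(Reader... sources) {
    Set<String> merged = new HashSet<>();
    for (Reader source : sources) {
      try (BufferedReader in = new BufferedReader(source)) {
        String line;
        while ((line = in.readLine()) != null) {
          String word = normalize(line);
          if (!word.isEmpty()) {
            merged.add(word); // duplicates across sources collapse in the set
          }
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return new InMemoryDictionary(merged);
  }

  /** Returns true if the text is a known word, ignoring case and surrounding whitespace. */
  boolean isWord(String text) {
    return text != null && words.contains(normalize(text));
  }

  /** Number of distinct words held in memory. */
  int size() {
    return words.size();
  }

  private static String normalize(String s) {
    return s.trim().toLowerCase(Locale.ENGLISH);
  }
}
```

A caller would pass one Reader per word list, for example one for the ispell extract and one for the name lists, and then use isWord for lookups. A HashSet gives constant-time lookups on average at the cost of keeping every word on the heap, which is the memory trade-off described above.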
Per http://opencast.3480289.n2.nabble.com/JIRA-Ticket-Cleanup-proposal-td7475080.html, this has been bulk resolved as won't fix. If this is still important to you, please reopen it and we can triage as appropriate.