The dictionary in Matterhorn stores words and stop words extracted from multiple Wikipedia language exports in a relational database table. This is a fairly heavyweight solution. Given enough memory and a more selective word list (Wikipedia contains many oddities and typos), these dictionaries can be held in memory instead.
Implement a simple in-memory dictionary containing only valid words obtained from more reliable sources. Stop words are not (yet) used in the system, so they do not need to be included.
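
As a starting point for discussion, an in-memory dictionary could look roughly like the sketch below. It is not the existing Matterhorn dictionary API: the class name `InMemoryDictionary` and the methods `addWords`, `isWord` and `getLanguages` are hypothetical names chosen for illustration. The sketch assumes that words are compared case-insensitively and that each language is identified by a short code such as "en".

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: one in-memory set of valid words per language. */
class InMemoryDictionary {

  /** Language code (assumed, e.g. "en") mapped to the set of normalized words. */
  private final Map<String, Set<String>> wordsByLanguage = new HashMap<>();

  /** Adds words for a language; blank entries are skipped. */
  void addWords(String language, Collection<String> words) {
    Set<String> target = wordsByLanguage.computeIfAbsent(language, k -> new HashSet<>());
    for (String raw : words) {
      String word = normalize(raw);
      if (!word.isEmpty()) {
        target.add(word);
      }
    }
  }

  /** Returns true if the text is a known word in the given language. */
  boolean isWord(String text, String language) {
    Set<String> words = wordsByLanguage.getOrDefault(language, Collections.emptySet());
    return words.contains(normalize(text));
  }

  /** Returns the sorted codes of all languages that contain the text as a word. */
  Set<String> getLanguages(String text) {
    String word = normalize(text);
    Set<String> languages = new TreeSet<>();
    for (Map.Entry<String, Set<String>> entry : wordsByLanguage.entrySet()) {
      if (entry.getValue().contains(word)) {
        languages.add(entry.getKey());
      }
    }
    return languages;
  }

  /** Assumption: lookups ignore case and surrounding whitespace. */
  private static String normalize(String text) {
    return text.trim().toLowerCase(Locale.ROOT);
  }
}
```

A word list file with one word per line could be read at startup and passed to `addWords`. Hash sets give constant-time lookups on average, at the cost of keeping every word in the heap, which is the memory trade-off mentioned above.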