Spell checkers are used to determine the correctness of the spelling of a word and notify users about misspellings. Though spell checkers are somewhat commonplace programs, their operation can be cryptic and far from intuitive to the uninitiated.
Many word processors, search engines and web browsers provide spell-checking features.
They can be deployed as standalone programs or included as part of a larger application, like a search engine.
Typically, once a wrongly spelled word is detected, spell checkers provide correction facilities by attempting to suggest one or more alternative words as the right spelling.
This article provides a simple but workable Clojure implementation with an appreciable degree of efficiency by using the Damerau–Levenshtein distance algorithm for spell correction. In the spirit of simplicity, this implementation limits itself to single-word spell-checking and correction. Consequently, cases like grammatical and context-sensitive spell checks are not considered.
How a Spell Checker Works
Most spell checkers work by maintaining a dictionary of valid words in a language. This reduces the process of determining the validity of a word to simply checking if the given word is part of the list of words in the dictionary.
Typographical errors can generally be categorized under four different types, as follows:
- Deletion: refers to the case where one or more characters are deleted from the correct word (e.g. using rnt for rant).
- Insertion: refers to the situation where one or more characters are added to the beginning, end or between two characters of a correct word (e.g. using beautifpul for beautiful).
- Substitution: This is were a character is replaced with another (e.g. using beleeve for believe).
- Transposition: This is the case where two adjacent letters are swapped (e.g. using giong for going).
There are quite a number of spell correction algorithms. These include rule-based techniques, edit distance, probabilistic techniques, neural networks, etc. In our case, we shall employ the edit distance approach—specifically, the Damerau–Levenshtein algorithm, which works by calculating the distance between two words. Damerau–Levenshtein distance (also known as the minimum edit distance) is the minimum number of editing operations (i.e. deletions, insertions,substitutions and transpositions) required to convert one string to another.
Getting Down to the Drudgery
Unix and Linux platforms provide a list of words located at /usr/share/dict/words.
Non-Unix and Linux users can use: https://raw.githubusercontent.com/Arkatgit/spell_check/master/src/spell_checker/dictionary.txt
We shall use any of these as our dictionary list of words (depending on your platform).
The first order of business is to open and read the complete file content into memory so we can easily and efficiently work with it. Let’s write a function that takes a file path, and reads and returns its content as a string.
(defn read-file [ path ] (slurp path) )
Well, that was easy. We just had to call the slurp function and hand the file over to it. Below is a snapshot of the file’s content on my Linux/Ubuntu computer.
As we can see, each line contains a single word. So the string returned by read-file will be a sequence of newline(\n) delimetered words. The idea is to convert this into one of Clojure’s collections that will be easy to work with. Let’s write a function that takes a string and converts it to a set of words. Our use of a set is meant to anticipate and remove duplicates.
(require '[clojure.string :as str ] ) (defn to-words [ wordstring ] (set (map #(str/trim %) (str/split-lines wordstring) ) ) )
This function calls split-lines (located in the clojure.string namespace), which splits on newlines(\n) and return feeds(‘\r’) and returns a vector of the word that was split.
Next, we map the trim function (using an anonymous function) to get rid of the white spaces surrounding the individual words.
Now let’s get our dictionary of words by putting the two functions to use.
(def words (-> "/usr/share/dict/words" (read-file) (to-words) ))
Smooth! All we had to do was employ the thread-first macro.
Checking for Correction
The next thing to do is to write a function that verifies whether or not a given word is contained in our dictionary.
(defn correct-word? [word ] (contains? words word) )
Pretty easy, right? correct-word? is a predicate function that returns true if the given word is a
dictionary word, and false if otherwise.
Offering a Correction
In a case where the correct-word? predicate function returns false, our program ought to be able to offer a suggestion.
Damerau–Levenshtein distance, as already stated, allows four editing operations. The Demerau-Levenstein distance between “Something” and “Somoething” is one which is obtained by deleting the letter “o.”
Our Damerau–Levenshtein distance uses a dynamic programming solution.
Below is a section of the code.
(defn damerau-levenshtein [fstr sstr] (let [fstr-len (count fstr) sstr-len (count sstr) ............... .............
The full implementation (entire repo) can be found here:
The idea is to find a word in our dictionary that has the smallest distance to the misspelled word the user typed in.
Let’s define a function to do that for us.
defn minimun-distance [ word ] ( apply min-key (partial damerau-levenshtein word) words ) )
Okay—This looks a bit complex, but it’s understandable. This is what the Clojure
documentation had to say about min-key.
clojure.core/min-key ([k x] [k x y] [k x y & more]) Returns the x for which (k x), a number, is least.
It returns x for which (k x) is the minimum for all k applied to the rest of the arguments. In this case, our k is Damerau–Levenshtein which unfortunately takes two arguments.
Clojure allows us to circumvent this by partially applying Damerau–Levenshtein to the given word, and before we finally apply it to all the dictionary words using the apply function. (Sweet!)
Tying It All Together
Finally, we can now implement the main function.
(defn -main [& args] (time ( let [ _ (println "Enter word possibly misspelled :" ) word (read-line) ] ( if ( correct-word? word) (println word "Is a correct word" ) (do (println word "Is wrong word ") (println "Did you mean -> " (minimun-distance word) ) ) ) )))
This basically reads a word, checks for correctness, and offers a suggestion if the word is misspelled. The full code can be found here:
Here is a sample run on my screen.
Enter word possibly misspelled : mngo mngo Is wrong word Did you mean -> mango "Elapsed time: 79084.58919 msecs"
Though our implementation doesn’t perform too badly, other alternatives to spell correction exist—some of which give better and more efficient results. With this gentle introduction to the world of spell checkers, you can delve deeper. Watch out for them doing their thing when typing.