Implementing a Spell Checker in Clojure

142 VIEWS

Spell checkers are used to determine the correctness of the spelling of a word and notify users about misspellings. Though spell checkers are somewhat commonplace programs, their operation can be cryptic and far from intuitive to the uninitiated.

Introduction

Many word processors, search engines and web browsers provide spell-checking features.

They can be deployed as standalone programs or included as part of a larger application, like a search engine.

Typically, once a wrongly spelled word is detected, spell checkers provide correction facilities by attempting to suggest one or more alternative words as the right spelling.

This article provides a simple but workable Clojure implementation with an appreciable degree of efficiency by using the Damerau–Levenshtein distance algorithm for spell correction. In the spirit of simplicity, this implementation limits itself to single-word spell-checking and correction. Consequently, cases like grammatical and context-sensitive spell checks are not considered.

How a Spell Checker Works

Most spell checkers work by maintaining a dictionary of valid words in a language. This reduces the process of determining the validity of a word to simply checking if the given word is part of the list of words in the dictionary.

Typographical errors can generally be categorized under four different types, as follows:

  1. Deletion: refers to the case where one or more characters are deleted from the correct word (e.g. using rnt for rant).
  2. Insertion: refers to the situation where one or more characters are added to the beginning, end or between two characters of a correct word (e.g. using beautifpul for beautiful).
  3. Substitution: This is were a character is replaced with another (e.g. using beleeve for believe).
  4. Transposition: This is the case where two adjacent letters are swapped (e.g. using giong for going).

There are quite a number of spell correction algorithms. These include rule-based techniques, edit distance, probabilistic techniques, neural networks, etc. In our case, we shall employ the edit distance approach—specifically, the Damerau–Levenshtein algorithm, which works by calculating the distance between two words. Damerau–Levenshtein distance (also known as the minimum edit distance) is the minimum number of editing operations (i.e. deletions, insertions,substitutions and transpositions) required to convert one string to another.

Getting Down to the Drudgery

Unix and Linux platforms provide a list of words located at /usr/share/dict/words.
Non-Unix and Linux users can use: https://raw.githubusercontent.com/Arkatgit/spell_check/master/src/spell_checker/dictionary.txt

We shall use any of these as our dictionary list of words (depending on your platform).
The first order of business is to open and read the complete file content into memory so we can easily and efficiently work with it. Let’s write a function that takes a file path, and reads and returns its content as a string.

Well, that was easy. We just had to call the slurp function and hand the file over to it. Below is a snapshot of the file’s content on my Linux/Ubuntu computer.

As we can see, each line contains a single word. So the string returned by read-file will be a sequence of newline(\n) delimetered words. The idea is to convert this into one of Clojure’s collections that will be easy to work with. Let’s write a function that takes a string and converts it to a set of words. Our use of a set is meant to anticipate and remove duplicates.

This function calls split-lines (located in the clojure.string namespace), which splits on newlines(\n) and return feeds(‘\r’) and returns a vector of the word that was split.

Next, we map the trim function (using an anonymous function) to get rid of the white spaces surrounding the individual words.

Now let’s get our dictionary of words by putting the two functions to use.

Smooth! All we had to do was employ the thread-first macro.

Checking for Correction

The next thing to do is to write a function that verifies whether or not a given word is contained in our dictionary.

Pretty easy, right? correct-word? is a predicate function that returns true if the given word is a
dictionary word, and false if otherwise.

Offering a Correction

In a case where the correct-word? predicate function returns false, our program ought to be able to offer a suggestion.

Damerau–Levenshtein distance, as already stated, allows four editing operations. The Demerau-Levenstein distance between “Something” and “Somoething” is one which is obtained by deleting the letter “o.”

Our Damerau–Levenshtein distance uses a dynamic programming solution.
Below is a section of the code.

The full implementation (entire repo) can be found here:

https://github.com/Arkatgit/spell_check/blob/master/src/spell_checker/core.clj

The idea is to find a word in our dictionary that has the smallest distance to the misspelled word the user typed in.

Let’s define a function to do that for us.

Okay—This looks a bit complex, but it’s understandable. This is what the Clojure
documentation had to say about min-key.

It returns x for which (k x) is the minimum for all k applied to the rest of the arguments. In this case, our k is Damerau–Levenshtein which unfortunately takes two arguments.
Clojure allows us to circumvent this by partially applying Damerau–Levenshtein to the given word, and before we finally apply it to all the dictionary words using the apply function. (Sweet!)

Tying It All Together

Finally, we can now implement the main function.

This basically reads a word, checks for correctness, and offers a suggestion if the word is misspelled. The full code can be found here:

https://github.com/Arkatgit/spell_check/blob/master/src/spell_checker/core.clj

Here is a sample run on my screen.

Conclusion

Though our implementation doesn’t perform too badly, other alternatives to spell correction exist—some of which give better and more efficient results. With this gentle introduction to the world of spell checkers, you can delve deeper. Watch out for them doing their thing when typing.

Do you think you can beat this Sweet post?

If so, you may have what it takes to become a Sweetcode contributor... Learn More.

Robert Antwi is currently with Data and Management Systems Limited (DMSL) Ghana where he works with a team of data scientists to provide creative and innovative learning-based solutions.


Discussion

Click on a tab to select how you'd like to leave your comment

Leave a Comment

Your email address will not be published. Required fields are marked *

Menu