With the vast ecosystem of tools and plugins around Visual Studio, the existence of a C# spell check shouldn’t come as a surprise. But maybe it does. It’s entirely possible that this sort of spell checking never occurred to you as something worth doing. Or maybe it just never occurred to you at all.
Personally, I think spell checking your code is most definitely worth doing. I won’t belabor the point here, since I’ve made this case in the past. Suffice it to say that since you can have it so effortlessly, you might as well get your spelling right.
Today, I’d like to talk instead about the problem of building a C# spell checker (or a general purpose spell checker to use inside an IDE). In order to make things easy on you, the spell checker has to do some pretty sophisticated stuff and wrestle with subtle problems.
Spell check in a word processor is relatively easy from an implementation perspective. You assume the person is typing natural language and check all words against a dictionary. But in code? Not so much. Let’s take a look at some of the reasons for that.
C# Spell Check: the Basic Challenge
Written English (or any natural language) has very specific rules around spelling and grammar. Spaces demarcate words, and periods do the same for sentences. If you spell words incorrectly or have sentence fragments, you have nonsense, as far as the language is concerned.
This is, of course, also true of source code. Spaces separate tokens, and semicolons (or braces) demarcate statements and blocks. But whereas precision around English syntax and semantics optimizes for human understanding, these properties in code optimize for compiler understanding. You have to spell and case reserved keywords correctly so that the compiler can identify them. But non-reserved language tokens? The compiler has no opinion or preference.
This creates a rule vacuum of sorts. As I type this sentence, English rules dictate that I capitalize the word “English.” If I wrote a method named GetEnglishEquivalent, the compiler would care not a whit whether I capitalized any or all of those letters. We programmers are left to figure this out by convention.
So the core challenge of C# spell check is to join these two worlds and navigate the subtleties of rules versus conventions.
The Broader Context Challenge
Visual Studio hosts more than just C#, so a spell checker plugin has to understand and switch between many different potential contexts. As an example of the complexity this adds, think of a string literal. To find one, the checker has to understand the rules of the particular language in question, which might mean looking for a single or a double quotation mark.
Or think of the interesting challenge posed by comments. Depending on the language, you start comments with different syntax. The checker has to be smart enough to understand all of those different syntaxes. It must then also recognize that comments should be treated as simple written English, rather than token names that cannot contain spaces.
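To make the context problem concrete, here is a minimal sketch of a scanner that labels regions of C#-style source as comment, string, or identifier. It is written in Python for brevity and is not how any particular tool works. The names TOKEN_RE and classify are made up for illustration, and the pattern deliberately ignores verbatim strings, character literals, and every language other than C#.

```python
import re

# Hypothetical pattern: comments and strings are tried before identifiers,
# so text inside them is never mistaken for a token name.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)          # line comment or block comment
  | (?P<string>"(?:\\.|[^"\\\n])*")          # regular double-quoted string literal
  | (?P<identifier>[A-Za-z_][A-Za-z0-9_]*)   # keyword or identifier
    """,
    re.VERBOSE | re.DOTALL,
)


def classify(source):
    """Yield (context, text) pairs for each region a spell checker might inspect."""
    for match in TOKEN_RE.finditer(source):
        # lastgroup is the name of the alternative that matched
        yield match.lastgroup, match.group()


if __name__ == "__main__":
    sample = 'int count = 0; // totl items\nvar msg = "Helo wrld";'
    for context, text in classify(sample):
        print(context, repr(text))
```

Once a region carries a label, the checker can treat comments as prose and identifiers as token names, which is the switch in context described above.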
Managing Different Dictionaries
Given the convention-driven nature of code that I’ve just mentioned, dictionary management becomes a more difficult task for a C# spell check tool. It contains, of course, an English dictionary the way that any spell check tool contains a stock dictionary. But it must also account for the fact that users will take many more liberties with that built-in dictionary than they would in a word processor.
With word processing, you add words to the dictionary at your own peril. I say this because you’re going directly against the English language. So if you get tired of the red squiggly under “asdf” and you decide to add it to the dictionary, you’ve (wrongly) told the spell checker that “asdf” is a valid English word. But in the world of code, that’s much murkier. While it may not be an English word, “asdf” makes a perfectly valid variable name, as far as the compiler is concerned.
Now, imagine inheriting a legacy codebase and turning on spell check. If you have variables named “asdf” everywhere, you might want to stop flagging it as a spelling mistake and add it to the dictionary. Heck, it might even have taken on some kind of coded significance in your codebase.
The point is that the dictionaries for C# spell check will prove much noisier than their word processing counterparts. And users will also likely want the ability to have user-level and solution-level preferences. So the C# spell checker has to maintain an awful lot more intelligence around managing its dictionary (dictionaries).
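One way to picture user-level and solution-level preferences is as layers sitting on top of the stock dictionary. The sketch below, in Python, assumes three layers and a case-insensitive lookup. The class name LayeredDictionary and its methods are hypothetical, not taken from any real tool.

```python
class LayeredDictionary:
    """Hypothetical layered word list: stock words plus user and solution additions."""

    def __init__(self, stock, user=(), solution=()):
        # Each layer is an independent set, so a solution-level word
        # never leaks into the user's personal dictionary.
        self.layers = {
            "solution": {w.lower() for w in solution},
            "user": {w.lower() for w in user},
            "stock": {w.lower() for w in stock},
        }

    def is_known(self, word):
        """Return True if any layer accepts the word (case-insensitive)."""
        lowered = word.lower()
        return any(lowered in layer for layer in self.layers.values())

    def add(self, word, scope="user"):
        """Add a word to the user or solution layer; the stock layer stays read-only."""
        if scope not in ("user", "solution"):
            raise ValueError("scope must be 'user' or 'solution'")
        self.layers[scope].add(word.lower())


if __name__ == "__main__":
    words = LayeredDictionary(stock={"get", "english", "equivalent"})
    words.add("asdf", scope="solution")  # tolerated in this codebase only
    print(words.is_known("asdf"), words.is_known("qwer"))
```

Keeping the layers separate means that accepting “asdf” for one legacy solution does not silently teach the checker that “asdf” is fine everywhere.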
Token Parsing: Casing Paradigms and Abbreviations
I’ve alluded to this in the general sense already, but let’s look specifically at the nuts-and-bolts challenges for a spell checker when it comes to code. You might not think of these things right off the top.
First of all, you have comments. Once you’ve identified something as a comment, you can more or less approach it as plain English, right? Or can you? Think of XML method header comments. Those are going to have method and parameter names in them that require special treatment. And what about string literals? If they’re something like a human readable error message, then you can spell check them normally. But what about email addresses or URLs? What if you’re using reflection?
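As a rough illustration of the string literal problem, the following Python sketch strips anything that looks like a URL or an email address before pulling out words to check. The patterns are simplified assumptions and will miss plenty of real-world cases, and the name checkable_words is invented for this example.

```python
import re

# Simplified, assumed patterns; real URLs and addresses are messier than this.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WORD_RE = re.compile(r"[A-Za-z']+")


def checkable_words(literal):
    """Return the words in a string literal that are worth spell checking."""
    text = URL_RE.sub(" ", literal)   # drop URLs first
    text = EMAIL_RE.sub(" ", text)    # then drop email addresses
    return WORD_RE.findall(text)


if __name__ == "__main__":
    message = "Contact support@example.com or see https://example.com/help for detials"
    print(checkable_words(message))
```

The misspelled “detials” still reaches the checker, while the address and the URL never generate a false positive.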
Of course, then there’s the challenge with the tokens themselves. With both Pascal and camel casing, what you really have is a different separation paradigm than the conventional space in English. So the spell check tool has to recognize these casing schemes and parse the tokens into separate words to spell check. This is on top of also using the conventional space character and then other spacing schemes, such as dashes or underscores.
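Splitting on casing boundaries can be sketched with a single regular expression. This Python version is one possible approach under simple assumptions (ASCII letters only, runs of capitals treated as one acronym). The name split_identifier is hypothetical.

```python
import re

# Three alternatives, tried in order at each position:
#   1. a run of capitals not followed by a lowercase letter (an acronym such as "HTML")
#   2. an optional capital followed by lowercase letters (a Pascal or camel cased word)
#   3. a run of digits
# Underscores and dashes match nothing, so they simply act as separators.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(name):
    """Break an identifier into the words a spell checker should look up."""
    return WORD_RE.findall(name)


if __name__ == "__main__":
    for name in ("GetEnglishEquivalent", "parseHTMLDocument", "max_retry_count"):
        print(name, split_identifier(name))
```

Each resulting word can then be lowercased and looked up like an ordinary English word.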
And what about abbreviations? Do you really want a spelling error flagged every time you use a loop counter called “i” or “j”? The tool has to recognize that developers tend to use one- or two-letter variables, particularly in tight scopes.
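The simplest version of this heuristic is a length cutoff, sketched here in Python. The threshold of three letters is an assumption chosen for illustration, and a real tool would presumably weigh scope and other signals as well.

```python
MIN_CHECKED_LENGTH = 3  # assumed threshold, not taken from any real tool


def words_to_check(words, min_length=MIN_CHECKED_LENGTH):
    """Drop very short names such as loop counters before spell checking."""
    return [word for word in words if len(word) >= min_length]


if __name__ == "__main__":
    print(words_to_check(["i", "j", "db", "indx"]))
```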
All of this complexity is also going to come at a cost. In your browser or in Word, you get used to seeing red squiggles appear immediately upon mistyping a word. That’s no small feat when you’re checking against a dictionary containing all words in the English language.
So think of taking that already impressive situation and adding all of the aforementioned complexity: context detection, managing multiple dictionaries, parsing tokens, and applying tons of heuristics. Now, take that and add to it the complexity of Visual Studio as an environment and writing code as an activity. At any given time, you have tons of plugins running as you manage many files simultaneously and perform operations like refactoring.
It’s a lot to ask of the C# spell check tool. So such a tool has to work cleverly in the background, doing the heavy lifting while you pause, and throttling back when you’re typing quickly or performing refactorings.
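Waiting for a pause in typing is commonly done with a debounce. The Python sketch below shows the idea with an explicit clock value passed in, which keeps it easy to reason about. The class name Debouncer and the half-second quiet period are assumptions for illustration, and a real extension would hook into the editor's own idle or background scheduling instead.

```python
class Debouncer:
    """Hypothetical helper: signal a spell check only after typing has paused."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self.last_edit = None
        self.pending = False

    def on_edit(self, now):
        """Record a keystroke; every new edit restarts the quiet period."""
        self.last_edit = now
        self.pending = True

    def should_check(self, now):
        """Return True once per burst of edits, after the quiet period has elapsed."""
        if self.pending and now - self.last_edit >= self.quiet_seconds:
            self.pending = False
            return True
        return False


if __name__ == "__main__":
    debouncer = Debouncer(quiet_seconds=0.5)
    debouncer.on_edit(now=0.0)
    debouncer.on_edit(now=0.2)                # still typing
    print(debouncer.should_check(now=0.4))    # too soon after the last edit
    print(debouncer.should_check(now=0.8))    # typing has paused
```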
An Interesting Problem to Solve
I don’t necessarily have any grand overarching wisdom to impart here. I just thought it would be interesting to walk through the challenges one faces in solving this problem. If you’re anything like me, it’s fun to contemplate such things and to get your brain working on how you would go about tackling them, should you need to.
A spell checker seems so conceptually simple. And when you see that you can spell check your code with a manageable false positive rate, you probably think, “Oh, that’s handy.” But if you really stop and think about it, it’s also pretty impressive.
Learn more about GhostDoc's source code spell checker and eliminate embarrassing typos in your apps and documentation before you ship them.