Norvig spelling corrector
4/11/2023

Dirty data leads to bad model quality. In real-world NLP problems we often meet texts with a lot of typos. As a result, we are unable to reach the best score. As painful as it may be, data should be cleaned before fitting. We need an automatic spelling corrector which can fix words with typos and, at the same time, not break correct spellings. Let's start with Norvig's spelling corrector and iteratively increase its capabilities.

Peter Norvig (director of research at Google) described the following approach to spelling correction. Let's take a word and brute-force all possible edits, such as delete, insert, transpose, replace and split. For the word "abc" possible candidates will be: ab ac bc bac cba acb a_bc ab_c aabc abbc acbc adbc aebc etc. Every such word is added to a candidate list. We repeat this procedure for every candidate a second time to get candidates with a bigger edit distance (for cases with two errors). Each candidate is estimated with a unigram language model: for each vocabulary word a frequency is pre-calculated, based on some big text collections. The candidate word with the highest frequency is taken as the answer.

Adding some context

First improvement - adding an n-gram language model (3-grams). Let's pre-calculate frequencies not only for single words, but for a word together with a small context (the 3 nearest words). Let's estimate the probability of a text fragment as a product of the probabilities of all its n-grams up to size 3: for each position we multiply the unigram, bigram and trigram probabilities. Now we can use our extended language model to estimate candidates with context. Sentence probability can be calculated like this:

    def predict(self, sentence):
        result = 0
        for i in range(0, len(sentence) - 2):
            # combine trigram, bigram and unigram probabilities at each position
            p3 = self.getGram3Prob(sentence[i], sentence[i + 1], sentence[i + 2])
            p2 = self.getGram2Prob(sentence[i], sentence[i + 1])
            p1 = self.getGram1Prob(sentence[i])
            result += math.log(p3) + math.log(p2) + math.log(p1)
        return result

And n-gram probabilities like this:

    def getGram1Prob(self, wordID):
        # add-K smoothed unigram probability
        wordCounts = self.gram1.get(wordID, 0) + SimpleLangModel.K
        vocabSize = len(self.gram1)
        return float(wordCounts) / (self.totalWords + vocabSize)

    def getGram2Prob(self, wordID1, wordID2):
        countsWord1 = self.gram1.get(wordID1, 0) + self.totalWords
        countsBigram = self.gram2.get((wordID1, wordID2), 0) + SimpleLangModel.K
        return float(countsBigram) / countsWord1

    def getGram3Prob(self, wordID1, wordID2, wordID3):
        countsGram2 = self.gram2.get((wordID1, wordID2), 0) + self.totalWords
        countsGram3 = self.gram3.get((wordID1, wordID2, wordID3), 0) + SimpleLangModel.K
        return float(countsGram3) / countsGram2

However, the model becomes really huge, and everything works slowly. To get the best possible accuracy we need a big dataset (at least a few gigabytes). Training the n-gram model on a 600 MB file leads to significant memory consumption (25 GB). Half of that size is used by the language model, and the other half by the SymSpell index. One reason for such high memory usage is that we don't store plain text; instead we store frequencies. For example, for the following text of 5 words: "a b c a b" we store the following frequencies:

    a => 2
    b => 2
    c => 1
    a b => 2
    b c => 1
    c a => 1
    a b c => 1
    b c a => 1
    c a b => 1

Another reason is the high memory overhead of the hash table data structure (a hash table is used inside Python's dict or C++'s unordered_map). To compress our n-gram model, let's use an approach described in the "Efficient Minimal Perfect Hash Language Models" paper.
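
To make the compression idea more concrete, here is a minimal, hypothetical sketch of the fingerprint-plus-value-array scheme: the n-gram keys themselves are not stored at all, only a small fingerprint and a count per slot. It is not the article's actual implementation; the names (CompressedCounts, _hash) are invented for illustration, and a plain dict stands in for a real minimal perfect hash function so the example stays self-contained.

    import hashlib

    def _hash(key: bytes, salt: bytes) -> int:
        # keyed 64-bit hash; used both for slot lookup and for fingerprints
        return int.from_bytes(hashlib.blake2b(key, key=salt, digest_size=8).digest(), "little")

    class CompressedCounts:
        FP_BITS = 12  # fingerprint width; roughly a 2^-12 false-positive rate

        def __init__(self, counts):
            # counts: dict mapping n-gram (bytes) -> frequency.
            # Stand-in for a minimal perfect hash: key hash -> dense slot index.
            self._slot = {_hash(k, b"mph"): i for i, k in enumerate(counts)}
            self._fingerprints = [0] * len(counts)
            self._values = [0] * len(counts)
            for k, c in counts.items():
                i = self._slot[_hash(k, b"mph")]
                self._fingerprints[i] = _hash(k, b"fp") & ((1 << self.FP_BITS) - 1)
                self._values[i] = c  # a real model would also quantize the count

        def get(self, key: bytes, default=0):
            i = self._slot.get(_hash(key, b"mph"))
            if i is None:        # a true MPH maps every key somewhere; this branch
                return default   # only exists because a dict stands in for it here
            if self._fingerprints[i] != _hash(key, b"fp") & ((1 << self.FP_BITS) - 1):
                return default   # fingerprint mismatch: treat the n-gram as unseen
            return self._values[i]

    # usage with the "a b c a b" counts from the text above
    counts = {b"a": 2, b"b": 2, b"c": 1, b"a b": 2, b"b c": 1, b"c a": 1,
              b"a b c": 1, b"b c a": 1, b"c a b": 1}
    model = CompressedCounts(counts)
    print(model.get(b"a b"))  # 2
    print(model.get(b"x y"))  # 0 almost always; a small false-positive chance remains

Unseen n-grams occasionally collide with a stored fingerprint and return a wrong count; that is the accepted trade-off for dropping the keys entirely.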
Programmers constantly underestimate just how freaking fast modern processors are. I wrote a simple offline dictionary program for Windows Phone using Wiktionary (about 400K words) as the source of correctness; the actual data is just in a SQL database. When a word being typed isn't found, I do a "LIKE" type of SQL search with no particular optimization. Even on slow machines, it easily keeps up with the user's keystrokes. Similarly, I've got a Unicode table lookup in my calculator program. It can do a linear, non-optimized scan of the entire Unicode character tables in the blink of an eye, even on slow devices. (I've done the Levenshtein distance thing, too, for a demo. Even with having to kick off two processes in order to run the Levenshtein algorithm and read in an ASCII dictionary, I could still get results in a tenth of a second, on a machine that's now ten years old.) For the app (not server) world, once you hit blink-of-an-eye, you're done. The best way to do that is never depend on any data from the internet :-)
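
For reference, a minimal sketch of the kind of unoptimized LIKE fallback the commenter describes, assuming a SQLite database file and a table/column named words/word (both assumed for the example, not taken from the comment):

    import sqlite3

    conn = sqlite3.connect("dictionary.db")

    def lookup(prefix: str, limit: int = 20):
        # plain LIKE scan when an exact match fails; no index tricks, no ranking
        cur = conn.execute(
            "SELECT word FROM words WHERE word LIKE ? ORDER BY word LIMIT ?",
            (prefix + "%", limit),
        )
        return [row[0] for row in cur]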