<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DZone Snippets: linguistics code</title>
    <link>http://snippets.dzone.com/posts</link>
    <pubDate>Sat, 11 Oct 2008 18:43:20 GMT</pubDate>
    <description>DZone Snippets: linguistics code</description>
    <item>
      <title>Spelling correction using the Python Natural Language Toolkit (nltk)</title>
      <link>http://snippets.dzone.com/posts/show/3395</link>
      <description>A Google-style "Did you mean" spelling suggester. More here:&lt;br /&gt;&lt;a href="http://www.biais.org/blog/index.php/2007/01/31/25-spelling-correction-using-the-python-natural-language-toolkit-nltk"&gt;http://www.biais.org/blog/index.php/2007/01/31/25-spelling-correction-using-the-python-natural-language-toolkit-nltk&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Outputs:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;birdd - Did you mean "birds" ? (or "bird")&lt;br /&gt;oklaoma - Did you mean "oklahoma" ?&lt;br /&gt;emphasise - Did you mean "emphasize" ? (or "emphasizes", "emphasizing")&lt;br /&gt;bird - This word seems OK&lt;br /&gt;carot - I can't find a similar word in my learned db&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Here is the class:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;from nltk_lite.stem.porter import Porter&lt;br /&gt;from nltk_lite.corpora import brown&lt;br /&gt; &lt;br /&gt;from collections import defaultdict&lt;br /&gt;import operator&lt;br /&gt; &lt;br /&gt;def sortby(nlist, n, reverse=False):&lt;br /&gt;    # Sort a list of tuples in place on field n&lt;br /&gt;    nlist.sort(key=operator.itemgetter(n), reverse=reverse)&lt;br /&gt; &lt;br /&gt;class mydict(dict):&lt;br /&gt;    # Word counter: unseen words count as 0&lt;br /&gt;    def __missing__(self, key):&lt;br /&gt;        return 0&lt;br /&gt; &lt;br /&gt;class DidYouMean:&lt;br /&gt;    def __init__(self):&lt;br /&gt;        self.stemmer = Porter()&lt;br /&gt;        self.learned = {}  # filled in by learn()&lt;br /&gt; &lt;br /&gt;    def specialhash(self, s):&lt;br /&gt;        # Map a word to a key shared by its likely misspellings&lt;br /&gt;        s = s.lower()&lt;br /&gt;        s = s.replace("z", "s")&lt;br /&gt;        s = s.replace("h", "")&lt;br /&gt;        # Collapse doubled letters ("birdd" becomes "bird")&lt;br /&gt;        for i in [chr(ord("a") + i) for i in range(26)]:&lt;br /&gt;            s = s.replace(i+i, i)&lt;br /&gt;        s = self.stemmer.stem(s)&lt;br /&gt;        return s&lt;br /&gt; &lt;br /&gt;    def test(self, token):&lt;br /&gt;        hashed = self.specialhash(token)&lt;br /&gt;        if hashed in self.learned:&lt;br /&gt;            # (word, count) pairs sharing this hash, as a sortable list&lt;br /&gt;            words = list(self.learned[hashed].items())&lt;br /&gt; 
           sortby(words, 1, reverse=True)&lt;br /&gt;            if token in [i[0] for i in words]:&lt;br /&gt;                return 'This word seems OK'&lt;br /&gt;            else:&lt;br /&gt;                if len(words) == 1:&lt;br /&gt;                    return 'Did you mean "%s" ?' % words[0][0]&lt;br /&gt;                else:&lt;br /&gt;                    return 'Did you mean "%s" ? (or %s)' \&lt;br /&gt;                           % (words[0][0], ", ".join(['"'+i[0]+'"' \&lt;br /&gt;                                                      for i in words[1:]]))&lt;br /&gt;        return "I can't find a similar word in my learned db"&lt;br /&gt; &lt;br /&gt;    def learn(self, listofsentences=None, n=2000):&lt;br /&gt;        self.learned = defaultdict(mydict)&lt;br /&gt;        if not listofsentences:&lt;br /&gt;            # No sentences given: fall back to the Brown corpus&lt;br /&gt;            listofsentences = brown.raw()&lt;br /&gt;        for i, sent in enumerate(listofsentences):&lt;br /&gt;            if i &gt;= n: # Limit to the first n sentences of the corpus&lt;br /&gt;                break&lt;br /&gt;            for word in sent:&lt;br /&gt;                self.learned[self.specialhash(word)][word.lower()] += 1&lt;br /&gt; &lt;br /&gt;def demo():&lt;br /&gt;    d = DidYouMean()&lt;br /&gt;    d.learn()&lt;br /&gt;    # Test words chosen to be relevant to the Brown corpus&lt;br /&gt;    for i in "birdd, oklaoma, emphasise, bird, carot".split(", "):&lt;br /&gt;        print("%s - %s" % (i, d.test(i)))&lt;br /&gt; &lt;br /&gt;if __name__ == "__main__":&lt;br /&gt;    demo()&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Wed, 31 Jan 2007 17:51:02 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/3395</guid>
      <author>maxme (Maxime Biais)</author>
    </item>
  </channel>
</rss>
