<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DZone Snippets: Williscool's Code Snippets</title>
    <link>http://snippets.dzone.com/posts</link>
    <pubDate>Thu, 24 Jul 2008 23:24:59 GMT</pubDate>
    <description>DZone Snippets: Williscool's Code Snippets</description>
    <item>
      <title>Python String Breaking and Beginners Reg Ex All-in-One !</title>
      <link>http://snippets.dzone.com/posts/show/2731</link>
      <description>http://mail.python.org/pipermail/python-list/2002-October/125367.html&lt;br /&gt;&lt;br /&gt;// I found this on a Python mailing list site&lt;br /&gt;&lt;br /&gt;Ken wrote:&lt;br /&gt;&gt; "Padraig Brady" &lt;Padraig@Linux.ie&gt; wrote in message&lt;br /&gt;&gt; news:3D9AFA69.2020804@Linux.ie...&lt;br /&gt;&gt; &lt;br /&gt;&gt;&gt;Ken wrote:&lt;br /&gt;&gt;&gt;&lt;br /&gt;&gt;&gt;&gt;Hi all, I am trying to do a simple word search engine. Is there an easy way&lt;br /&gt;&gt;&gt;&gt;to break up a sentence into individual words so that I can use it to compare&lt;br /&gt;&gt;&gt;&gt;without traversing through every character?&lt;br /&gt;&gt;&gt;&gt;&lt;br /&gt;&gt;&gt;&gt;E.g., something like this:&lt;br /&gt;&gt;&gt;&gt;from: "This is an example"&lt;br /&gt;&gt;&gt;&gt;to: ["This", "is", "an", "example"]&lt;br /&gt;&gt;&gt;&lt;br /&gt;&gt;&gt;You can use&lt;br /&gt;&lt;code&gt;&lt;br /&gt;"This is an example".split()&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&gt;&gt;but that will not deal with punctuation.&lt;br /&gt;&gt;&gt;For that you will need re.&lt;br /&gt;&gt;&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;import re&lt;br /&gt;re.split(r'\W+', "This is an, example.")&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&gt;&gt;&lt;br /&gt;&gt;&gt;This, however, will create an empty list item&lt;br /&gt;&gt;&gt;for the last '.'&lt;br /&gt;&gt;&gt;&lt;br /&gt;&gt;&gt;You can get around this by using the converse:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;re.findall(r'\w+', "This is an, example.")&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&gt;&gt;&lt;br /&gt;&gt;&gt;P&#225;draig.&lt;br /&gt;&gt; &lt;br /&gt;&gt; Does string.rstrip() get rid of the commas, full stops etc.?&lt;br /&gt;&lt;br /&gt;Don't get mixed up between strip and split.&lt;br /&gt;The help for rstrip says "returns a string with trailing whitespace removed".
&lt;br /&gt;I.e. "123  " -&gt; "123", whereas "123   ." doesn't change.&lt;br /&gt;&lt;br /&gt;&gt; Also, can you explain what the parameters "re.findall('\w+', "This is an,&lt;br /&gt;&gt; example.")" mean?&lt;br /&gt;&lt;br /&gt;The \w+ is a regular expression that matches one or more&lt;br /&gt;characters in the set [a-zA-Z0-9_], i.e. it matches words.&lt;br /&gt;(In Python 3, \w on str patterns also matches Unicode letters and digits unless re.ASCII is given.)&lt;br /&gt;Anything else (like spaces, punctuation) is not returned.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;import re&lt;br /&gt;mystring = "This is an, example."&lt;br /&gt;re.findall(r'\w+', mystring)      # all words&lt;br /&gt;re.findall(r'\w*i\w*', mystring)  # all words containing the letter i&lt;br /&gt;re.findall(r'\w{4,}', mystring)   # all words of 4 or more characters&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;P&#225;draig.&lt;br /&gt;</description>
      <pubDate>Fri, 29 Sep 2006 19:09:51 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/2731</guid>
      <author>Williscool (William Harris)</author>
    </item>
  </channel>
</rss>
