About this user

William Harris www.upscalews.com/harris

Python String Breaking and Beginners Reg Ex All-in-One !

http://mail.python.org/pipermail/python-list/2002-October/125367.html

// I found this on a Python mailing list site

Ken wrote:
> "Padraig Brady" <Padraig@Linux.ie> wrote in message
> news:3D9AFA69.2020804@Linux.ie...
>
>>Ken wrote:
>>
>>>Hi all, I am trying to do a simple word search engine. Is there an easy way
>>>to break up a sentence into individual words so that I can use it to compare
>>>without traversing through every character?
>>>
>>>Eg, something like this:
>>>from: "This is an example"
>>>to: ["This", "is", "an", "example"]
>>
>>You can use

   "This is an example".split()

>>but that will not
>>deal with punctuation. For that you will
>>need re.
>>
   import re
   re.split(r'\W+', "This is an, example.")

>>
>>This, however, leaves an empty string at the end of the list,
>>after the final '.'
>>
>>You can get around this by using the converse:

   re.findall(r'\w+', "This is an, example.")

>>
>>Pádraig.
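
To see the three approaches side by side, here is a minimal sketch run on the sentence from the thread. The variable names are only illustrative, and the results in the comments assume Python 3.

```python
import re

sentence = "This is an, example."

# str.split() splits on whitespace only, so punctuation stays attached.
plain = sentence.split()
# ['This', 'is', 'an,', 'example.']

# re.split() on runs of non-word characters leaves a trailing empty string,
# because the final '.' is a separator with nothing after it.
pieces = re.split(r'\W+', sentence)
# ['This', 'is', 'an', 'example', '']

# re.findall() collects the words themselves, so no empty string appears.
words = re.findall(r'\w+', sentence)
# ['This', 'is', 'an', 'example']

print(plain)
print(pieces)
print(words)
```

If re.split() is preferred for some reason, filtering with [p for p in pieces if p] drops the empty string.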
>
> Does string.rstrip() get rid of the commas, full stops etc.?

Don't get mixed up between strip and split.
The help for rstrip says "returns a string with trailing whitespace removed".
I.e. "123 " -> "123"; however, "123 ." doesn't change.
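
The difference between strip and split can be sketched as follows. Note that in current Python, rstrip() also accepts an optional string of characters to remove. That goes beyond the help text quoted above, and it still only touches the end of the string, so it is not a substitute for re when the punctuation is in the middle of a sentence.

```python
text = "This is an, example. "

# rstrip() with no argument removes trailing whitespace only.
no_space = text.rstrip()
# 'This is an, example.'

# rstrip(chars) removes any trailing characters that appear in chars.
# The comma in the middle of the sentence is left alone.
no_punct = text.rstrip(" .,")
# 'This is an, example'

# split() does something different: it returns a list of pieces.
parts = text.split()
# ['This', 'is', 'an,', 'example.']

# The case from the message: the trailing '.' is not whitespace.
unchanged = "123 .".rstrip()
# '123 .'

print(repr(no_space), repr(no_punct), parts, repr(unchanged))
```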

> Also, can you explain what the parameters "re.findall('\w+', "This is an,
> example.")" mean?

The \w+ is a regular expression that matches one or more
characters in the set [a-zA-Z0-9_], i.e. it matches words.
(In Python 3, \w also matches non-ASCII letters and digits
unless the re.ASCII flag is given.)
Anything else (like spaces, punctuation) is not returned.
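
A small sketch of what \w+ does and does not treat as part of a word. The sample strings are made up for illustration, and the results assume Python 3 str patterns.

```python
import re

sample = "snake_case, R2-D2, don't stop!"

# Underscores and digits count as word characters; hyphens and
# apostrophes do not, so they break words apart.
tokens = re.findall(r'\w+', sample)
# ['snake_case', 'R2', 'D2', 'don', 't', 'stop']

# By default a str pattern also matches non-ASCII letters.
name = "P\u00e1draig"
unicode_tokens = re.findall(r'\w+', name)
# ['Pádraig']

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented
# letter acts as a separator.
ascii_tokens = re.findall(r'\w+', name, re.ASCII)
# ['P', 'draig']

print(tokens)
print(unicode_tokens)
print(ascii_tokens)
```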

   import re
   mystring = "This is an, example."
   re.findall(r'\w+', mystring)      # all words
   re.findall(r'\w*i\w*', mystring)  # all words containing the letter i
   re.findall(r'\w{4,}', mystring)   # all words of 4 or more characters


Pádraig.
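
Coming back to the original question about a simple word search, the findall approach could be wrapped up roughly like this. It is only a sketch: tokenize and contains_word are hypothetical helper names, not something from the thread, and lowercasing everything is an assumption about how the search should behave.

```python
import re

def tokenize(sentence):
    """Return the lowercase words in sentence, ignoring punctuation."""
    return re.findall(r'\w+', sentence.lower())

def contains_word(sentence, word):
    """True if word occurs as a whole word in sentence (case-insensitive)."""
    return word.lower() in tokenize(sentence)

print(tokenize("This is an, example."))
# ['this', 'is', 'an', 'example']
print(contains_word("This is an, example.", "Example"))
# True
print(contains_word("This is an, example.", "exam"))
# False, because only whole words are compared
```

Comparing whole tokens avoids the false hits that a plain substring test ("exam" in sentence) would give.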