Pythonic String Testing

For my Information Retrieval class I have to make an index of terms from a group of files. Valid terms contain an alphabetical character, so to test I just made a simple function a

Solution 1:

You can start by simplifying content_test():

def content_test(term):
    return any(c.isalpha() for c in term)

In fact, that's simple enough that you don't really need a separate function for it anymore.

What I'd do in this case is write a generator that yields only valid terms from the file. Then just convert that to a list using the list() constructor. This way you can read just a line at a time, which will save you a good bit of memory if the files are large.

def read_valid_terms(filename):
    with open(filename) as f:
        for line in f:
            for term in line.split():
                if any(c.isalpha() for c in term):
                    yield term

terms = list(read_valid_terms("terms.txt"))

Or if you are just going to iterate over the terms anyway, and only once, then just do that directly rather than making a list:

for term in read_valid_terms("terms.txt"):
    print term,
print
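To try the generator without a real corpus, here is a minimal sketch that writes a few sample lines to a temporary file and reads the valid terms back. The sample text and the temporary file are assumptions made for illustration; read_valid_terms is the function from this answer.

```python
import os
import tempfile


def read_valid_terms(filename):
    """Yield each whitespace-separated term that contains a letter."""
    with open(filename) as f:
        for line in f:
            for term in line.split():
                if any(c.isalpha() for c in term):
                    yield term


# Hypothetical sample data, written to a temporary file for the demo.
sample = "hello 123 w0rld\n--- 42 foo-bar\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    terms = list(read_valid_terms(path))
finally:
    os.remove(path)  # clean up the temporary file

print(terms)
```

Terms made up only of digits or punctuation ("123", "---", "42") are skipped, while mixed terms such as "w0rld" are kept.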

Solution 2:

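In Python, string objects already have an isalpha() method. Note that it returns True only when every character is alphabetic, so it is stricter than testing for at least one letter: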

>>> "abc".isalpha()
True
>>> "abc22".isalpha()
False
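For the question's requirement (a term is valid if it contains a letter), the whole-string check and a per-character check can disagree. A minimal sketch of the difference; has_alpha and the sample terms are illustrative names, not from the original post:

```python
def has_alpha(term):
    """True if at least one character in term is alphabetic."""
    return any(c.isalpha() for c in term)


samples = ["abc", "abc22", "2024", ""]  # illustrative terms only

for term in samples:
    # Columns: the term, whole-string isalpha(), per-character test.
    print(repr(term), term.isalpha(), has_alpha(term))
```

Only "abc22" separates the two: the whole-string test rejects it, the per-character test accepts it.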

Solution 3:

While you could use a regular expression, a pythonic way would be to use any:

import string
def content_test(term):
    return any((c in string.ascii_lowercase) for c in term)

If you also want to allow upper-case and locale-dependent characters, you can use str.isalpha.
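A rough comparison of the two character tests, as a sketch rather than a recommendation. The helper names and sample strings are made up for illustration. On Python 3, str is Unicode and isalpha() follows Unicode character properties; on Python 2, isalpha() on a plain byte string is locale-dependent.

```python
import string


def has_ascii_lowercase(term):
    """True if term contains at least one of a-z."""
    return any(c in string.ascii_lowercase for c in term)


def has_alpha(term):
    """True if term contains at least one alphabetic character."""
    return any(c.isalpha() for c in term)


# Illustrative inputs: upper-case ASCII, a non-ASCII letter, and digits.
for term in [u"ABC", u"\u00e9", u"123"]:
    print(repr(term), has_ascii_lowercase(term), has_alpha(term))
```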

A couple of additional notes:

  • FileRead should inherit from object, to make sure it's a new-style class.
  • Instead of writing if content_test(term) is False:, you can simply write if not content_test(term):.
  • clean can be written a lot, ahem, cleaner, by using filter:

def clean(self):
    self.terms = filter(content_test, self.terms)

  • You're not closing the file f, and may therefore leak the handle. Use the with statement to automatically close it, like this:

with open(filename, 'r') as f:
    content = f.read()
    self.terms = content.split()
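Putting the notes above together might look roughly like the sketch below. The question's own FileRead class is not shown here, so this class body is a hypothetical reconstruction, and the sample file content is made up. Wrapping filter() in list() is an addition for Python 3, where filter() returns a lazy iterator instead of a list.

```python
import os
import tempfile


def content_test(term):
    """True if term contains at least one alphabetic character."""
    return any(c.isalpha() for c in term)


class FileRead(object):  # explicit object base: new-style class on Python 2
    """Hypothetical stand-in; the question's own class is not shown."""

    def __init__(self, filename):
        # The with statement closes the file even if read() raises.
        with open(filename, 'r') as f:
            self.terms = f.read().split()

    def clean(self):
        # list() keeps self.terms a list on Python 3, where filter() is lazy.
        self.terms = list(filter(content_test, self.terms))


# Demo on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("alpha 123 b2 ... gamma\n")
    path = tmp.name

try:
    reader = FileRead(path)
    reader.clean()
finally:
    os.remove(path)  # clean up the temporary file

print(reader.terms)
```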

Solution 4:

Using regular expressions:

import re

# Match any run of non-whitespace characters that contains at least one ASCII letter.
# The raw string keeps the backslashes in \S from being treated as string escapes.
terms = re.findall(r'\S*[a-zA-Z]\S*', content)
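A small self-contained run of this pattern, as a sketch with made-up input. One difference from the isalpha() approaches is worth noting: the character class [a-zA-Z] only covers ASCII letters, so a term made entirely of non-ASCII letters is not matched. TERM_RE and the sample content are illustrative names.

```python
import re

# Same pattern as above, compiled once for reuse.
TERM_RE = re.compile(r'\S*[a-zA-Z]\S*')

# Made-up input; the last term consists only of non-ASCII letters.
content = u"hello 123 w0rld --- 42 foo-bar \u00e9\u00e9"

regex_terms = TERM_RE.findall(content)

# For comparison: the per-character isalpha() test on the same input.
isalpha_terms = [t for t in content.split() if any(c.isalpha() for c in t)]

print(regex_terms)
print(isalpha_terms)
```

If non-ASCII terms matter for the index, the isalpha()-based test is the safer choice of the two.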
