Word Frequency In Text Using Python But Disregard Stop Words
Solution 1:
You can download lists of stopwords as files in various formats, e.g. from here -- all Python needs to do is read the file (these are in CSV format, easily read with the csv module), build a set, and use membership in that set (probably with some normalization, e.g. lowercasing) to exclude words from the count.
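As a sketch of that idea, assuming a downloaded one-word-per-row CSV file (the filename stopwords.csv is hypothetical; substitute whatever file you downloaded):

```python
import csv

def load_stopwords(path):
    # Read a one-column CSV of stopwords into a lowercased set.
    # The file layout (one word per row) is an assumption about the download.
    with open(path, newline="") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

# Usage (hypothetical filename):
# stop_words = load_stopwords("stopwords.csv")
# kept = [w for w in words if w.lower() not in stop_words]
```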
Solution 2:
There's an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):
import re
from collections import defaultdict

stopWords = set(['a', 'an', 'the', ...])
fullWords = re.findall(r'\w+', allText)
d = defaultdict(int)
for word in fullWords:
    if word not in stopWords:
        d[word] += 1
finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)  # d.items() in Python 3
self.response.out.write(finalFreq)
This approach builds the sorted list in two steps: while counting, it skips any words in your list of "stop words" (converted to a set for efficient membership tests), and then it sorts the remaining entries by frequency.
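In Python 3 the same idea can be written more compactly with collections.Counter; here is a self-contained sketch (the sample text and stopword set are made up for illustration):

```python
import re
from collections import Counter

# Sample input standing in for the question's allText.
allText = "The cat sat on the mat and the cat slept"

stop_words = {'a', 'an', 'and', 'on', 'the'}
words = re.findall(r'\w+', allText.lower())

# Count only the non-stopwords.
freq = Counter(w for w in words if w not in stop_words)

# most_common() returns (word, count) pairs sorted by descending count.
final_freq = freq.most_common()
```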
Solution 3:
I know that NLTK has a corpus package that includes stopword lists for many languages, including English; see here for more information. NLTK also has a word frequency counter; it's a nice module for natural language processing that you should consider using.
Solution 4:
import operator

stopwords = set(['an', 'a', 'the'])  # etc...
finalFreq = sorted(((k, v) for k, v in d.iteritems() if k not in stopwords),
                   key=operator.itemgetter(1), reverse=True)
This will filter out any keys which are in the stopwords set.
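A self-contained version of the same filter-then-sort expression, using .items() (the Python 3 spelling of iteritems()); the dict d here is a made-up stand-in for the word counts from the question:

```python
import operator

# Stand-in for the word-count dict built from the text.
d = {'the': 5, 'fox': 2, 'dog': 1, 'a': 3}
stopwords = set(['an', 'a', 'the'])

# Filter stopwords out in a generator expression, then sort by count, descending.
final_freq = sorted(((k, v) for k, v in d.items() if k not in stopwords),
                    key=operator.itemgetter(1), reverse=True)
# final_freq == [('fox', 2), ('dog', 1)]
```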