How To Treat Numbers With Decimals Or Commas As One Word In CountVectorizer
Solution 1:
The default regex that the tokenizer uses for the token_pattern parameter is:
token_pattern='(?u)\\b\\w\\w+\\b'
So a word is defined by a \b word boundary at the beginning and the end, with \w\w+ (one alphanumeric character followed by one or more alphanumeric characters) between the boundaries. Because the regex is written inside a regular Python string, each backslash has to be escaped as \\.
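To see what this means for the numbers in question, the default pattern can be tried directly with Python's re module (a minimal check outside of CountVectorizer, using the same regex):

import re

default_pattern = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern
# "2.5" yields no token at all (each digit is a single character),
# and "10,000x" is split at the comma into two tokens.
print(re.findall(default_pattern, "2.5 10,000x"))  # ['10', '000x']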
So you could change the token pattern to:
token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'
Explanation: [\\.,]? allows for the optional appearance of a . or ,. The regex for the first alphanumeric character, \w, has to be extended to \w+ so that numbers with more than one digit before the punctuation are also matched.
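As a quick sanity check of the adjusted pattern on its own, again with just the re module:

import re

custom_pattern = r"\b(\w+[\.,]?\w+)\b"
# Both numbers are now kept together as single tokens.
print(re.findall(custom_pattern, "2.5 10,000x"))  # ['2.5', '10,000x']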
For your slightly adjusted example:
corpus=["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names()
print(pd.DataFrame(result, columns = cols))
Output:
   10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0        1    1   1    1   1          1     1   1   1        1      1       1      1       1

Alternatively, you could modify your input text, e.g. by replacing the decimal point . with an underscore _ and removing commas standing between digits:
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]

for i in range(len(corpus)):
    # Replace the decimal point between digits with an underscore: 2.5 -> 2_5
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])
    # Remove commas standing between digits: 10,000x -> 10000x
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])

# The default token pattern is now sufficient.
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names()
print(pd.DataFrame(result, columns=cols))
Output:
   10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0       1    1   1    1   1          1     1   1   1        1      1       1      1       1
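Note that on newer scikit-learn releases (1.0 and later) get_feature_names() is deprecated in favour of get_feature_names_out() and has since been removed; if the snippets above raise an error on your version, swap that single call and everything else stays the same:

cols = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0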