Unable To Detect Gibberish Names Using Python
I am trying to build a Python model that could classify account names as either legitimate or gibberish. Capitalization is not important in this particular case as some legitimate ac
Solution 1:
For the 1st characteristic, you can train a character-based n-gram language model, and treat all names with low average per-character probability as suspicious.
A quick-and-dirty example of such a language model is below. It is a mixture of 1-gram, 2-gram and 3-gram character models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of the names of all actors).
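from collections import Counter

import numpy as np
from nltk.corpus import brown  # needs the corpus: run nltk.download('brown') once

# One long string: words joined by spaces, sentences separated by '\n '.
text = '\n '.join(' '.join(s) for s in brown.sents())

# Character n-gram counts; the range bounds include the final n-gram.
unigrams = Counter(text)
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-1))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-2))

# Mixture weights for the 1-gram, 2-gram and 3-gram models.
weights = [0.001, 0.01, 0.989]

def strangeness(text):
    """Average negative log-probability per character; higher means stranger."""
    r = 0
    # Pad like the training text: a name starts after ' ' and ends with '\n'.
    text = ' ' + text + '\n'
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        # Weighted counts of char following its 0-, 1- and 2-character contexts.
        num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
        den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
        r -= np.log(num / den)
    return r / (len(text) - 2)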
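To try more relevant training data, as suggested above, the same mixture can be wrapped in a function that builds the counts from any string. This is only a sketch: `build_scorer`, the `floor` argument and the short list of names are hypothetical and only there for illustration. The floor keeps the logarithm finite when a character never occurs in the training text, which the Brown-based version does not guard against.

```python
import math
from collections import Counter


def build_scorer(training_text, weights=(0.001, 0.01, 0.989), floor=1e-9):
    """Return a strangeness function backed by character 1/2/3-gram counts."""
    unigrams = Counter(training_text)
    bigrams = Counter(training_text[i:i + 2] for i in range(len(training_text) - 1))
    trigrams = Counter(training_text[i:i + 3] for i in range(len(training_text) - 2))
    total = sum(unigrams.values())

    def strangeness(name):
        # Same padding as the training text: ' ' before a name, '\n' after it.
        padded = ' ' + name + '\n'
        neg_log_prob = 0.0
        for i in range(2, len(padded)):
            char, ctx1, ctx2 = padded[i], padded[i - 1:i], padded[i - 2:i]
            num = (unigrams[char] * weights[0]
                   + bigrams[ctx1 + char] * weights[1]
                   + trigrams[ctx2 + char] * weights[2])
            den = (total * weights[0]
                   + unigrams[ctx1] * weights[1]
                   + bigrams[ctx2] * weights[2])
            # Assumed safeguard: clamp so unseen characters do not give log(0).
            neg_log_prob -= math.log(max(num / den, floor))
        # An empty name has no scored characters; return 0.0 instead of dividing by zero.
        return neg_log_prob / max(len(padded) - 2, 1)

    return strangeness


# Hypothetical toy training data; a real list of names would be far longer.
names = ['michael', 'sara', 'jose', 'dimitar', 'rafael', 'morgan',
         'eduardo', 'medina', 'luis', 'mendez', 'maria', 'martin']
training_text = ' ' + '\n '.join(names) + '\n'

score = build_scorer(training_text)
for candidate in ['maria', 'xqzvk']:
    print(candidate, round(score(candidate), 3))
```

With such a tiny training list the absolute scores are not comparable to the Brown-based ones, so any threshold would have to be re-tuned on the new data.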
Now you can apply this strangeness measure to your examples.
# t1: gibberish examples, t2: legitimate names
t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
for t in t1 + t2:
    print('{:20} -> {:9.5}'.format(t, strangeness(t)))
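In the output below, gibberish names are in most cases "stranger" than normal ones. You could use, for example, a threshold of 3.9 here and treat everything above it as suspicious. The only legitimate name this misclassifies is SELENIA, most likely because all-caps text is rare in the Brown corpus.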
128                  ->    5.5528
127                  ->    5.6572
h4rugz4sx383a6n64hpo ->    5.9016
tt                   ->    4.9392
t66                  ->    6.9673
t65                  ->    6.8501
asdfds               ->    3.9776
Michael              ->    3.3598
sara                 ->    3.8171
jose colmenares      ->    2.9539
Dimitar              ->    3.4602
Jose Rafael          ->    3.4604
Morgan               ->    3.3628
Eduardo Medina       ->    3.2586
Luis R. Mendez       ->     3.566
Hikaru               ->    3.8936
SELENIA              ->    6.1829
Zhang Ming           ->    3.4809
Xuting Liu           ->    3.7161
Chen Zheng           ->    3.6212
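The 3.9 threshold can be turned into a yes/no decision with a small helper. The sketch below is hypothetical: `is_gibberish` and its `lowercase` option are names made up for illustration, and the scoring function is passed in as an argument so that any strangeness measure can be plugged in. To keep the example runnable without the Brown corpus, the stand-in scorer simply replays four scores from the output above.

```python
def is_gibberish(name, score_fn, threshold=3.9, lowercase=False):
    """Return True if the name's strangeness score is above the threshold."""
    if lowercase:
        # Optional normalization, since capitalization is said not to matter.
        name = name.lower()
    return score_fn(name) > threshold


# Stand-in scorer: four scores copied from the output above.
table_scores = {'Michael': 3.3598, 'asdfds': 3.9776,
                't66': 6.9673, 'SELENIA': 6.1829}

for name in table_scores:
    print('{:10} gibberish: {}'.format(name, is_gibberish(name, table_scores.get)))
```

Only Michael comes out as not gibberish here; SELENIA is flagged along with the real gibberish. Passing `lowercase=True` together with the real `strangeness` function is one option that would probably bring all-caps names closer to the rest, although that is untested here and the threshold may need adjusting.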
Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
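A minimal sketch of the lookup idea, assuming you already have a set of known names: `looks_legitimate` and the tiny `known_names` set are hypothetical placeholders, and skipping single-letter initials is just one possible design choice.

```python
def looks_legitimate(account_name, known_names):
    """Return True if every token, ignoring single-letter initials, is a known name."""
    tokens = [t.strip('.').lower() for t in account_name.split()]
    tokens = [t for t in tokens if len(t) > 1]  # drop initials such as "R."
    # A name with no usable tokens is not treated as legitimate.
    return bool(tokens) and all(t in known_names for t in tokens)


# Hypothetical placeholder; in practice this would be loaded from name lists.
known_names = {'michael', 'sara', 'jose', 'rafael', 'luis', 'mendez'}

for name in ['Jose Rafael', 'Luis R. Mendez', 'h4rugz4sx383a6n64hpo', 'tt']:
    print('{:22} {}'.format(name, looks_legitimate(name, known_names)))
```

The obvious trade-off is coverage: any legitimate name that is missing from the list gets rejected, so the list has to be large enough for your target languages.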