How To Stop BERT From Breaking Apart Specific Words Into Word-Pieces
Solution 1:
You are free to add new tokens to the existing pretrained tokenizer, but then you need to resize the model's token embeddings to match and train (or fine-tune) the model with the extended tokenizer so the new embeddings are learned.
Example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size before adding tokens
tokenizer.add_tokens(['whatever', 'underdog'])  # returns the number of tokens actually added
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size after adding tokens
If a token already exists in the vocabulary, like 'whatever', it will not be added again.
Output:
30522
30523
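To see the effect directly, you can compare how a word is tokenized before and after adding it. This is a minimal sketch; the exact word-piece split before adding ('under', '##dog') is only indicative and may differ:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('underdog'))  # split into word-pieces, e.g. ['under', '##dog']
tokenizer.add_tokens(['underdog'])
print(tokenizer.tokenize('underdog'))  # the added token is kept whole: ['underdog']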
Solution 2:
Based on the discussion here, one way to use your own additional vocabulary of specific words is to overwrite some of the first ~1000 entries of the vocab.txt file (the [unused] lines) with those words. For example, I replaced '[unused1]' with 'metastasis' in vocab.txt, and after tokenizing with the modified vocab.txt I got this output:
tokens = tokenizer.tokenize("metastasis")
Output: ['metastasis']
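If you go this route, you can point the tokenizer at your edited vocabulary file instead of the hub checkpoint. A minimal sketch, assuming a local copy of bert-base-uncased's vocab.txt in which '[unused1]' has been replaced by 'metastasis':
from transformers import BertTokenizer
# load the tokenizer from the locally modified vocabulary file
tokenizer = BertTokenizer(vocab_file='vocab.txt', do_lower_case=True)
print(tokenizer.tokenize('metastasis'))  # expected: ['metastasis']
Because the vocabulary size stays at 30522, the model's embedding matrix does not need to be resized; however, the [unused] embeddings were never trained, so the model still needs fine-tuning for the new words to be useful.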
Solution 3:
I think if I use that solution, like
tokenizer.add_tokens(['whatever', 'underdog'])
the vocab_size changes. Does this mean I can no longer use the pretrained model from transformers, because the embedding size no longer matches?
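The pretrained model is still usable: after calling add_tokens, resize the model's embedding matrix so it has one row per token. The pretrained rows are preserved and only the newly added rows are freshly initialized (which is why some fine-tuning is still needed). A minimal sketch, assuming bert-base-uncased:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added = tokenizer.add_tokens(['whatever', 'underdog'])  # returns how many tokens were actually new
if num_added > 0:
    # grow the embedding matrix to len(tokenizer); existing rows keep their
    # pretrained weights, new rows are randomly initialized
    model.resize_token_embeddings(len(tokenizer))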