
How To Stop Bert From Breaking Apart Specific Words Into Word-piece

I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words, and I don't want the BERT tokenizer to break them into word-pieces.
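For illustration, this is roughly what the default behaviour looks like; the word-pieces shown in the comment are only illustrative, since the exact split depends on the bert-base-uncased vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# A domain-specific word that is not in the vocabulary gets split into
# word-pieces, e.g. something like ['meta', '##sta', '##sis'] (illustrative).
print(tokenizer.tokenize("metastasis"))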

Solution 1:

You are free to add new tokens to the existing pretrained tokenizer, but then you need to resize the model's embeddings and train (fine-tune) your model with the improved tokenizer so it learns the extra tokens.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
v = tokenizer.get_vocab()
print(len(v))  # original vocabulary size

tokenizer.add_tokens(['whatever', 'underdog'])

v = tokenizer.get_vocab()
print(len(v))  # vocabulary size after adding the new tokens

If a token already exists in the vocabulary, like 'whatever', it will not be added; that is why the size only grows by one in the example above.

Output:

30522
30523
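Because the vocabulary has grown, the model's token-embedding matrix must be resized before the model can be used or fine-tuned with the new tokenizer. A minimal sketch using the Hugging Face transformers API, with the same bert-base-uncased checkpoint as above:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['whatever', 'underdog'])

model = BertModel.from_pretrained('bert-base-uncased')
# Grow the embedding matrix to the new vocabulary size; the rows for the
# added tokens are freshly initialised and should be learned by fine-tuning.
model.resize_token_embeddings(len(tokenizer))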

Solution 2:

Based on the discussion here, one way to use my own additional vocabulary dictionary containing the specific words is to modify the first ~1000 lines of the vocab.txt file (the [unused] lines), replacing them with the specific words. For example, I replaced '[unused1]' with 'metastasis' in vocab.txt, and after tokenizing with the modified vocab.txt I got this output:

tokens = tokenizer.tokenize("metastasis")
Output: ['metastasis']
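A minimal sketch of how a tokenizer can be built from the edited vocabulary file; the file name modified_vocab.txt is a hypothetical local copy of the original vocab.txt with the [unused] entries replaced:

from transformers import BertTokenizer

# 'modified_vocab.txt' is a hypothetical local copy of bert-base-uncased's
# vocab.txt in which '[unused1]' has been replaced with 'metastasis'.
tokenizer = BertTokenizer(vocab_file='modified_vocab.txt', do_lower_case=True)
print(tokenizer.tokenize("metastasis"))  # expected: ['metastasis']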

Solution 3:

I think if I use that solution, like

tokenizer.add_tokens(['whatever', 'underdog'])

then the vocab_size changes. Does this mean I can no longer use the pretrained model from transformers, because the embedding size no longer matches?
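If the embedding size is the concern, the standard transformers remedy is resize_token_embeddings, as in the sketch under Solution 1; a quick look at the embedding shape illustrates the mismatch and the fix:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

print(model.get_input_embeddings().weight.shape)  # torch.Size([30522, 768])

tokenizer.add_tokens(['whatever', 'underdog'])    # vocabulary grows to 30523
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # torch.Size([30523, 768])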
