Extract Companies' Register Number In Python By Getting The Next Word
I am trying to get the German Handelsregisternummer (companies' register number) which usually is directly written behind the word HRB. However there are exceptions which I would l
Solution 1:
You could extend the character class and move the word boundary so that it sits directly before the digits.
\bHRB[.,: \w-]*\b(\d+)
See the updated regex
Or a bit more precise match:
\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)
\bHRB - a word boundary, then the literal HRB
[,:]? - optionally match , or :
(?: - start a non-capturing group
[- ](?:Nr|Nummer)[.:]* - match a space or -, then Nr or Nummer, then zero or more . or : characters
)? - close the group and make it optional
(\d+) preceded by a space - match a space, then capture one or more digits in group 1
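As a rough sketch, the two patterns from this answer can be compared in Python as follows. The helper name first_number and the sample strings (including 'HRB ab 99') are made up for illustration and are not from the original answer.

```python
import re

# The two patterns from Solution 1, compiled once for reuse.
BROAD = re.compile(r'\bHRB[.,: \w-]*\b(\d+)')
PRECISE = re.compile(r'\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)')

def first_number(pattern, text):
    """Return the digits captured by the first match, or None if nothing matches.

    Hypothetical helper, not part of the original answer.
    """
    match = pattern.search(text)
    return match.group(1) if match else None

# Illustrative inputs; 'HRB ab 99' shows where the two patterns disagree.
samples = ['HRB 21156', 'HRB-Nr. 12345', 'HRB Nr.: 21156', 'HRB ab 99', 'no number here']
for s in samples:
    print(s, '=>', first_number(BROAD, s), first_number(PRECISE, s))
```

The broad pattern accepts any run of word characters between HRB and the digits, so it also picks up the number in 'HRB ab 99', while the precise pattern only allows Nr or Nummer in that position and returns nothing for that input.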
Solution 2:
You may use
\bHRB\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)
See the regex demo
Details
\bHRB\b - the whole word HRB
(?:[-\s]N(?:umme)?r)? - an optional group matching - or a whitespace character, followed by Nr or Nummer
[,.:\s]* - zero or more commas, dots, colons or whitespace characters
(\d+) - Group 1: one or more digits.
See a Python demo:
import re

strings = ['HRB 21156','HRB, 1234','HRB: 99887','HRB-Nummer 21156','HRB-Nr. 12345','HRB-Nr: 21156','HRB Nr. 21156','HRB Nr: 21156','HRB Nr.: 21156','HRB Nummer 21156', 'no number here']

def get_company_register_number(string, keyword):
    # Return every number that follows the keyword, allowing an optional Nr/Nummer in between
    return re.findall(fr'\b{keyword}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)', string)

for s in strings:
    print(s, '=>', get_company_register_number(s, 'HRB'))
Output:
HRB 21156 => ['21156']
HRB, 1234 => ['1234']
HRB: 99887 => ['99887']
HRB-Nummer 21156 => ['21156']
HRB-Nr. 12345 => ['12345']
HRB-Nr: 21156 => ['21156']
HRB Nr. 21156 => ['21156']
HRB Nr: 21156 => ['21156']
HRB Nr.: 21156 => ['21156']
HRB Nummer 21156 => ['21156']
no number here => []
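If only the first number is needed, a variant along these lines may be more convenient. It is a sketch rather than part of the original answer: the function name first_register_number, the use of re.escape on the keyword, and the case-insensitive flag are all assumptions added here.

```python
import re

def first_register_number(text, keyword='HRB'):
    """Return the first number following `keyword`, or None if there is none.

    Hypothetical variant of get_company_register_number.
    """
    # re.escape keeps a keyword containing regex metacharacters from altering the pattern.
    pattern = rf'\b{re.escape(keyword)}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)'
    # Assumption: case-insensitive matching is acceptable for this use case.
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(1) if match else None

print(first_register_number('Amtsgericht München, HRB 21156'))
print(first_register_number('hrb-nummer: 4711'))
print(first_register_number('HRA 555', keyword='HRA'))
print(first_register_number('no number here'))
```

Returning None instead of an empty list makes the "not found" case explicit, at the cost of ignoring any further numbers in the same string.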