Extract Companies' Register Number In Python By Getting The Next Word

January 29, 2023

I am trying to get the German Handelsregisternummer (companies' register number), which is usually written directly after the word HRB. However, there are exceptions which I would l

Solution 1:

You could extend the character class so that it also covers the separators and words that may follow HRB, and move the word boundary so that it sits directly before the digits.

\bHRB[.,: \w-]*\b(\d+)

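
As a quick check of the permissive pattern above, here is a minimal Python sketch. The sample strings and the name PERMISSIVE are made up for illustration. Note that the character class is greedy and also matches word characters, so when several numbers follow HRB in one run of words, the last one is captured; that may be a reason to prefer a stricter pattern.

```python
import re

# Solution 1, first pattern: a permissive character class between HRB and the digits.
# PERMISSIVE is an illustrative name, not part of the original answer.
PERMISSIVE = re.compile(r'\bHRB[.,: \w-]*\b(\d+)')

samples = [
    'HRB 21156',
    'HRB-Nr. 12345',
    'HRB Nr.: 21156',
    'HRB 21156 Amtsgericht 80333',  # made-up input with a second number
]

for s in samples:
    m = PERMISSIVE.search(s)
    print(s, '=>', m.group(1) if m else None)

# The greedy class runs across "21156 Amtsgericht " and backtracks only as far
# as the last word boundary that is followed by digits, so 80333 is captured.
```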

Or a bit more precise match:

\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)

\bHRB A word boundary, then match HRB
[,:]? Optionally match , or :
(?: Non-capturing group
  [- ](?:Nr|Nummer)[.:]* Match a space or -, then Nr or Nummer, then . or : zero or more times
)? Close the group and make it optional
 (\d+) Match a literal space, then capture 1 or more digits in group 1
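
The breakdown above can be tried out with a small Python sketch. The helper name first_number and the inputs are assumptions for illustration only. Because this pattern requires exactly one space directly before the digits, inputs such as HRB21156 do not match.

```python
import re

# Solution 1, stricter pattern: optional , or :, optional "Nr"/"Nummer" part,
# then a single space followed by the digits.
PRECISE = re.compile(r'\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)')

def first_number(text):
    """Return the digits of the first match, or None (hypothetical helper)."""
    m = PRECISE.search(text)
    return m.group(1) if m else None

for s in ['HRB 21156', 'HRB: 99887', 'HRB-Nr. 12345', 'HRB Nr.: 21156',
          'HRB 21156 Amtsgericht 80333', 'HRB21156']:
    print(s, '=>', first_number(s))
```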

Solution 2:

You may use

\bHRB\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)

Details

\bHRB\b - a whole word HRB
(?:[-\s]N(?:umme)?r)? - an optional group matching - or whitespace and then Nr or Nummer
[,.:\s]* - 0 or more commas, dots, colons or whitespaces
(\d+) - Group 1: one or more digits.

See a Python demo:

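import re

strings = ['HRB 21156', 'HRB, 1234', 'HRB: 99887', 'HRB-Nummer 21156', 'HRB-Nr. 12345',
           'HRB-Nr: 21156', 'HRB Nr. 21156', 'HRB Nr: 21156', 'HRB Nr.: 21156',
           'HRB Nummer 21156', 'no number here']

def get_company_register_number(string, keyword):
    # re.escape keeps a keyword containing regex metacharacters from altering the pattern
    pattern = fr'\b{re.escape(keyword)}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)'
    # findall returns the captured digits of every match, or [] if there is none
    return re.findall(pattern, string)

for s in strings:
    print(s, '=>', get_company_register_number(s, 'HRB'))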

Output:

HRB 21156 => ['21156']
HRB, 1234 => ['1234']
HRB: 99887 => ['99887']
HRB-Nummer 21156 => ['21156']
HRB-Nr. 12345 => ['12345']
HRB-Nr: 21156 => ['21156']
HRB Nr. 21156 => ['21156']
HRB Nr: 21156 => ['21156']
HRB Nr.: 21156 => ['21156']
HRB Nummer 21156 => ['21156']
no number here => []
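
If only a single number per text is needed, re.search can be used instead of re.findall. The following is a minimal sketch under that assumption; the function name first_register_number and the sample sentence are made up for illustration.

```python
import re
from typing import Optional

# Solution 2's pattern, compiled once; the keyword HRB is hard-coded here.
HRB_PATTERN = re.compile(r'\bHRB\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)')

def first_register_number(text: str) -> Optional[str]:
    """Return the first number following HRB in text, or None if there is none."""
    match = HRB_PATTERN.search(text)
    return match.group(1) if match else None

# Made-up sample input: the number is embedded in a longer sentence.
print(first_register_number('Amtsgericht München, HRB-Nr.: 21156, Sitz München'))
print(first_register_number('no number here'))
```

Returning None rather than an empty list makes the "not found" case explicit for callers that expect exactly one register number.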
