Extract Companies' Register Number In Python By Getting The Next Word

I am trying to get the German Handelsregisternummer (companies' register number) which usually is directly written behind the word HRB. However there are exceptions which I would l

Solution 1:

You could extend the character class and move the word boundary so that it sits directly before the digits.

\bHRB[.,: \w-]*\b(\d+)
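
A minimal sketch of how this lenient pattern behaves in Python. The sample strings and the names `LENIENT` and `lenient_number` are assumptions made up for illustration, not part of the original answer. Because the character class includes `\w`, the pattern also skips over arbitrary words between HRB and a later number, which is worth knowing before relying on it:

```python
import re

# Lenient pattern: any run of ".,: -" or word characters may sit
# between HRB and the captured digits.
LENIENT = re.compile(r'\bHRB[.,: \w-]*\b(\d+)')

def lenient_number(text):
    """Return group 1 of the first match, or None if nothing matches."""
    match = LENIENT.search(text)
    return match.group(1) if match else None

print(lenient_number('HRB 21156'))                    # 21156
print(lenient_number('HRB-Nr. 12345'))                # 12345
print(lenient_number('HRB is mentioned on page 12'))  # 12 (probably unwanted)
print(lenient_number('no number here'))               # None
```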


Or a bit more precise match:

\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)
  • \bHRB Word boundary, then match HRB
  • [,:]? Optionally match , or :
  • (?: Non capture group
    • [- ](?:Nr|Nummer)[.:]* Match space or -, then Nr or Nummer and 0+ times a . or :
  • )? Close the group and make it optional
  • (\d+) preceded by a literal space: match the space, then capture 1 or more digits in group 1
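
As a rough sketch, the stricter pattern can be tried in Python like this. The names `STRICT` and `strict_number` and the sample strings are hypothetical, chosen here only for illustration:

```python
import re

# Stricter pattern: only an optional "," or ":", an optional "Nr"/"Nummer"
# part, and exactly one space are allowed between HRB and the digits.
STRICT = re.compile(r'\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)')

def strict_number(text):
    """Return group 1 of the first match, or None if nothing matches."""
    match = STRICT.search(text)
    return match.group(1) if match else None

samples = [
    'HRB 21156',
    'HRB, 1234',
    'HRB-Nr. 12345',
    'HRB Nr.: 21156',
    'HRB Nummer 21156',
    'HRB is mentioned on page 12',  # unrelated words: no match
    'HRB:99887',                    # no space before the digits: no match
]
for s in samples:
    print(s, '=>', strict_number(s))
```

The last two samples return None: the pattern does not skip unrelated words, and it requires exactly one space directly before the digits.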



Solution 2:

You may use

\bHRB\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)


Details

  • \bHRB\b - a whole word HRB
  • (?:[-\s]N(?:umme)?r)? - an optional group matching - or whitespace and then Nr or Nummer
  • [,.:\s]* - 0 or more commas, dots, colons or whitespaces
  • (\d+) - Group 1: one or more digits.

See a Python demo:

import re

strings = ['HRB 21156', 'HRB, 1234', 'HRB: 99887', 'HRB-Nummer 21156', 'HRB-Nr. 12345', 'HRB-Nr: 21156', 'HRB Nr. 21156', 'HRB Nr: 21156', 'HRB Nr.: 21156', 'HRB Nummer 21156', 'no number here']

def get_company_register_number(string, keyword):
    # re.escape keeps regex metacharacters in the keyword from altering the pattern
    pattern = fr'\b{re.escape(keyword)}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)'
    # findall returns the contents of group 1 (the digits) for every match
    return re.findall(pattern, string)

for s in strings:
    print(s, '=>', get_company_register_number(s, 'HRB'))

Output:

HRB 21156 => ['21156']
HRB, 1234 => ['1234']
HRB: 99887 => ['99887']
HRB-Nummer 21156 => ['21156']
HRB-Nr. 12345 => ['12345']
HRB-Nr: 21156 => ['21156']
HRB Nr. 21156 => ['21156']
HRB Nr: 21156 => ['21156']
HRB Nr.: 21156 => ['21156']
HRB Nummer 21156 => ['21156']
no number here => []
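
Since `re.findall` returns every match, the function also copes with text that contains more than one register number. A small sketch: the sample sentence and the helper `first_or_none` are assumptions made up for illustration, and the `re.escape` call is a defensive addition rather than something the answer requires:

```python
import re

def get_company_register_number(string, keyword):
    # re.escape is a defensive addition in case the keyword contains
    # regex metacharacters.
    pattern = fr'\b{re.escape(keyword)}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)'
    return re.findall(pattern, string)

def first_or_none(string, keyword):
    """Hypothetical helper: return the first number found, or None."""
    numbers = get_company_register_number(string, keyword)
    return numbers[0] if numbers else None

text = 'Registered as HRB 123 (formerly HRB-Nr. 456)'
print(get_company_register_number(text, 'HRB'))  # ['123', '456']
print(first_or_none(text, 'HRB'))                # 123
print(first_or_none('no number here', 'HRB'))    # None
```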
