Python: Regex vs. BeautifulSoup to Remove <TYPE> from Text

February 28, 2024

I need to remove all sections of a text between the tags <TYPE>EX and </TYPE>, and between <TYPE>XML and </TYPE>. I was thinking of using a regex as follows: re.sub(r'(?is)

Solution 1:

You can use a regular expression (yes) to match the contained text:

soup.find_all('TYPE', string=re.compile(r'^\s*(?:EX|XML)', re.I))

This finds all tags named TYPE whose directly contained text starts with EX or XML (case-insensitively), allowing for whitespace between the opening tag and the text.
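To see which tag contents that pattern accepts, here is a minimal sketch that uses only the re module. The sample strings are invented for illustration and are not taken from the question.

```python
import re

# Same pattern as in the find_all() call: optional leading whitespace,
# then EX or XML at the start of the text, matched case-insensitively.
pattern = re.compile(r'^\s*(?:EX|XML)', re.I)

# Hypothetical tag contents, made up for this example.
samples = ['EX-10.1', '\n  xml', '10-K', 'ANNEX']

# Keep only the strings the pattern matches at their start.
matches = [s for s in samples if pattern.search(s)]
print(matches)  # ['EX-10.1', '\n  xml']
```

Note that the pattern is a prefix test, so any text that merely begins with EX or XML is matched, while text that contains EX later on (such as ANNEX) is not.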

You can then extract those tags to remove them:

for type_tag in soup.find_all('TYPE', string=re.compile(r'^\s*(?:EX|XML)', re.I)):
    type_tag.extract()

I am assuming you parsed the document as XML, with BeautifulSoup(text, 'xml'). The HTML parsers lowercase tag names while parsing, so with one of those you need to search for the lowercased name instead (e.g. find_all('type', ...)). You'll need to have lxml installed for BeautifulSoup to support XML parsing.
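As a rough end-to-end sketch of the HTML-parser case, the following uses Python's built-in html.parser, so lxml is not required, and therefore searches for the lowercased tag name. The input document is hypothetical, invented for illustration. The filter is passed as string=, which newer BeautifulSoup releases use in place of the older text= spelling.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input, made up for this example.
text = (
    '<DOCUMENT>'
    '<TYPE>10-K</TYPE>'
    '<TYPE> EX-10.1</TYPE>'
    '<TYPE>xml</TYPE>'
    '</DOCUMENT>'
)

# html.parser lowercases tag names, so TYPE becomes type.
soup = BeautifulSoup(text, 'html.parser')

pattern = re.compile(r'^\s*(?:EX|XML)', re.I)

# find_all() returns a list, so removing tags while looping over it is safe.
for type_tag in soup.find_all('type', string=pattern):
    type_tag.extract()

# Collect the text of the TYPE tags that survived.
remaining = [tag.get_text() for tag in soup.find_all('type')]
print(remaining)  # ['10-K']
```

Only the tag whose text does not start with EX or XML is left in the tree.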
