Python: Regex vs. BeautifulSoup To Remove From Text
I need to remove all sections of a text that sit inside TYPE tags and start with EX or XML. I was thinking of using a regex as follows: re.sub(r'(?is)
Solution 1:
You can use a regular expression (yes) to match the contained text:
soup.find_all('TYPE', text=re.compile(r'^\s*(?:EX|XML)', re.I))
This will find all tags with tagname TYPE whose directly contained text starts with EX or XML (case-insensitively), while allowing for whitespace between the opening tag and the text.
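To see what the pattern accepts on its own, here is a minimal check that uses only the re module. The sample strings are made up for illustration and are not taken from the question.

```python
import re

# Same pattern as above: optional leading whitespace, then EX or XML,
# matched case-insensitively and anchored at the start of the string.
pattern = re.compile(r'^\s*(?:EX|XML)', re.I)

# Hypothetical tag contents (assumed for this sketch).
samples = ['EX-101', '\n  xml', '10-K', 'GRAPHIC EX']

matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, True, False, False]
```

The last sample fails because the ^ anchor requires EX or XML to come first, after any whitespace.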
You can then extract those tags to remove them:
for type_tag in soup.find_all('TYPE', text=re.compile(r'^\s*(?:EX|XML)', re.I)):
    type_tag.extract()
I am assuming you parsed the document as XML, with BeautifulSoup(text, 'xml'); otherwise the HTML parsers convert tag names to lowercase and you need to search for the lowercase name (e.g. find_all('type', ...)). You'll need to have lxml installed for BeautifulSoup to support XML parsing.
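As a rough end-to-end sketch of the lowercase case, the snippet below uses the built-in html.parser, so lxml is not needed. The sample markup and the doc wrapper tag are assumptions made for illustration, and string= is the newer name for the text= argument (available since BeautifulSoup 4.4.0).

```python
import re
from bs4 import BeautifulSoup

# Made-up input; the real document layout is an assumption.
text = (
    '<doc>'
    '<TYPE>10-K report</TYPE>'
    '<TYPE> EX-101 exhibit</TYPE>'
    '<TYPE>xml data</TYPE>'
    '</doc>'
)

# html.parser lowercases tag names, so search for 'type', not 'TYPE'.
soup = BeautifulSoup(text, 'html.parser')

pattern = re.compile(r'^\s*(?:EX|XML)', re.I)
for type_tag in soup.find_all('type', string=pattern):
    type_tag.extract()  # detach the matching tag from the tree

remaining = [tag.get_text() for tag in soup.find_all('type')]
print(remaining)  # ['10-K report']
```

Only the tag whose text does not start with EX or XML is left in the tree.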