Python Scrape Value Between Static Html Tags Containing Static Text
This is my first post in this forum, and I believe this forum can answer my basic question. My requirement consists of two steps. In the first step, I need to ext
Solution 1:
import re
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""
# Non-greedy (.*?) stops at the first closing </SPAN> after each DOCUMENT-TYPE label.
pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*?)</SPAN>'
# Strip the surrounding asterisks from each captured value.
print([a.strip("*") for a in re.findall(pattern, data)])
Output:
['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
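One detail worth checking with this approach: when all the tags sit on a single line, a greedy (.*) runs to the last </SPAN> on that line, while a non-greedy (.*?) stops at the first one. A minimal sketch of the difference, using a shortened sample string; the names sample, prefix, greedy, and lazy are made up for illustration:

```python
import re

# Shortened, single-line sample (hypothetical, modeled on the data above).
sample = ('<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN>'
          '<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>')

prefix = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">'

# Greedy: the group swallows everything up to the LAST </SPAN> on the line.
greedy = re.findall(prefix + r'(.*)</SPAN>', sample)

# Non-greedy: the group ends at the FIRST </SPAN> after the label.
lazy = re.findall(prefix + r'(.*?)</SPAN>', sample)

print(greedy)
print(lazy)
```

The greedy result is a single string that still contains the PUBLICATION-TYPE markup, whereas the non-greedy result contains only the document type.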
Solution 2:
Code:
from bs4 import BeautifulSoup
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""
soup = BeautifulSoup(data,'lxml')
doc = soup.find('span',class_='c8')
print(doc.text)
Result:
DOCUMENT-TYPE:
Solution 3:
You can use the findall function from the re module with a regular expression.
Example:
import re
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""
data = data.replace('\n',' ')
res = re.findall("""<SPAN *CLASS="c8"> *([^:<]+): *</SPAN> *<SPAN *CLASS="c2">([^<]*)</SPAN>""",
data,
re.IGNORECASE
)
print(res)
print("\n".join(["%s: %s" % (item[0], item[1]) for item in res]))
Output:
[('DOCUMENT-TYPE', '**Paid Death Notice**'), ('PUBLICATION-TYPE', 'Newspaper'), ('DOCUMENT-TYPE', 'Paid Notice: Deaths THORNTON, ROBERT')]
DOCUMENT-TYPE: **Paid Death Notice**
PUBLICATION-TYPE: Newspaper
DOCUMENT-TYPE: Paid Notice: Deaths THORNTON, ROBERT
The res variable already holds every key and value as a list of tuples. To convert the result to a dictionary, you can use this code:
res_dict = dict(res)
print(res_dict)
In that case, however, the first 'DOCUMENT-TYPE' entry is overwritten by the last one:
{'DOCUMENT-TYPE': 'Paid Notice: Deaths THORNTON, ROBERT', 'PUBLICATION-TYPE': 'Newspaper'}
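If the repeated keys need to be kept, one option is to collect the values for each key in a list instead of calling dict(res). A minimal sketch, assuming res has the shape shown in the output above; the name grouped is hypothetical:

```python
from collections import defaultdict

# Same shape as the findall result shown above.
res = [
    ('DOCUMENT-TYPE', '**Paid Death Notice**'),
    ('PUBLICATION-TYPE', 'Newspaper'),
    ('DOCUMENT-TYPE', 'Paid Notice: Deaths THORNTON, ROBERT'),
]

# Map each key to a list so repeated keys accumulate instead of overwriting.
grouped = defaultdict(list)
for key, value in res:
    grouped[key].append(value.strip('*'))  # drop the surrounding asterisks

for key, values in grouped.items():
    print("%s: %s" % (key, values))
```

Both DOCUMENT-TYPE values then survive under a single key.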
Solution 4:
Do not mix regexes and BeautifulSoup. BS has enough methods to navigate the DOM tree on its own (doc and soup below are the objects defined in Solution 2):
if doc.text.startswith('DOCUMENT-TYPE'):
    print(doc.find_next_sibling().text)
# prints **Paid Death Notice**
You can also iterate over all tags with a particular property:
for tag in soup.find_all('span', class_='c8'):
    print(tag.text)
# DOCUMENT-TYPE:
# PUBLICATION-TYPE:
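If installing BeautifulSoup is not an option, the same label/value pairing can be approximated with the standard library's html.parser module. This is a different technique from the one above, sketched under the assumption that every c8 label span is followed by a c2 value span; the class name SpanPairParser and its attributes are invented for illustration:

```python
from html.parser import HTMLParser


class SpanPairParser(HTMLParser):
    """Collect (label, value) pairs from c8/c2 span sequences (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.current_class = None  # class of the span currently open, if any
        self.label = None          # text of the most recent c8 span
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == 'span':
            self.current_class = dict(attrs).get('class')

    def handle_endtag(self, tag):
        if tag == 'span':
            self.current_class = None

    def handle_data(self, data):
        if self.current_class == 'c8':
            # Remember the label without its trailing colon and spaces.
            self.label = data.strip().rstrip(':')
        elif self.current_class == 'c2' and self.label is not None:
            # Assumption: a c2 span right after a c8 span holds that label's value.
            self.pairs.append((self.label, data.strip('*')))
            self.label = None


data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""

parser = SpanPairParser()
parser.feed(data)
parser.close()
print(parser.pairs)
```

The pairs list can then be filtered for 'DOCUMENT-TYPE' entries in the same way as the regex result in Solution 3.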