
Counting HTML Images With Python

I need some feedback on how to count HTML images with Python 3.01 after extracting them; maybe my regular expressions are not used properly. Here is my code: import re, os import ur

Solution 1:

Using beautifulsoup4 (an HTML parser) rather than a regex:

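import urllib.request

import bs4  # beautifulsoup4

html = urllib.request.urlopen('http://www.imgur.com/').read()
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = bs4.BeautifulSoup(html, 'html.parser')
# find_all is the current name of the older findAll alias.
images = soup.find_all('img')
print(len(images))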
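If installing beautifulsoup4 is not an option, the same count can probably be had from the standard library's html.parser module. The following is only a sketch of that alternative, not part of the answer above: the ImageCounter class and the SAMPLE_HTML string are made up for illustration, and the sample is parsed from a literal string so it runs without network access.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Hypothetical helper: records the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == 'img':
            self.sources.append(dict(attrs).get('src'))


# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <img src="a.png" alt="first">
  <IMG class="wide" src="b.jpg"/>
  <p>No image here</p>
  <img alt="no source">
</body></html>
"""

counter = ImageCounter()
counter.feed(SAMPLE_HTML)
print(len(counter.sources))   # number of <img> tags found
print(counter.sources)        # their src values (None when src is missing)
```

For a live page, the bytes returned by urlopen(...).read() would need to be decoded to str before being passed to feed().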

Solution 2:

A couple of points about your code:

  1. It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
  2. You're overwriting your line variable in the loop.
  3. total will always be 0 with your current logic.
  4. There's no need to compile your RE, as the re module caches recently compiled patterns.
  5. You're discarding your exception, so there are no clues about what's going wrong in the code!
  6. There could be other attributes on the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.
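To illustrate point 6, here is a small sketch comparing re.search() with re.findall() on a single line that holds two image tags. The sample line and the IMG_TAG pattern are assumptions made for this example, and a regex like this can still be fooled by a ">" inside an attribute value or by commented-out markup.

```python
import re

# Hypothetical sample line: two image tags with extra attributes, plus a
# look-alike tag that should not be counted.
line = ('<p><img class="thumb" src="a.png" alt="A"> text '
        '<IMG src="b.png" width="10"></p> <imgx>')

# \b stops the pattern from matching tag names that merely start with "img";
# [^>]* allows any attributes before the closing bracket.
IMG_TAG = r'<img\b[^>]*>'

first_only = re.search(IMG_TAG, line, re.IGNORECASE)   # stops at the first match
all_tags = re.findall(IMG_TAG, line, re.IGNORECASE)    # returns every match

print(first_only.group(0))
print(len(all_tags))
```

Only findall() sees both tags, which is why it is the better fit for counting.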

Changing your code around a little, I get:

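import re
from urllib.request import urlopen

def get_image(url):
    total = 0
    page = urlopen(url).readlines()  # list of bytes objects, one per line

    for line in page:
        # Decode the bytes; calling str() on them would yield the "b'...'" repr.
        text = line.decode('utf-8', errors='replace')
        # findall returns every match, so several tags on one line all count.
        hit = re.findall(r'<img.*?>', text)
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")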
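Point 5 (not discarding the exception) could look roughly like the sketch below. It is an assumption-laden variant rather than the original poster's code: count_images, the opener parameter, and the two fake openers are hypothetical names introduced so the example can be run offline.

```python
import io
import re
from urllib.error import URLError
from urllib.request import urlopen


def count_images(url, opener=urlopen):
    """Return the number of <img> tags at url, or None if the fetch fails.

    opener is a hypothetical hook so the function can be tried without a network.
    """
    try:
        with opener(url) as response:
            html = response.read().decode('utf-8', errors='replace')
    except URLError as exc:
        # Report the failure instead of swallowing it silently.
        print('Could not fetch {0}: {1}'.format(url, exc.reason))
        return None
    return len(re.findall(r'<img\b[^>]*>', html, re.IGNORECASE))


# Offline demonstration with stand-in openers (no network needed).
def fake_ok(url):
    return io.BytesIO(b'<img src="a.png"><p><img src="b.png"></p>')


def fake_fail(url):
    raise URLError('simulated failure')


ok_count = count_images('http://example.invalid/', opener=fake_ok)
failed = count_images('http://example.invalid/', opener=fake_fail)
print(ok_count, failed)
```

Called with only a URL, the function falls back to urlopen and behaves like get_image above, except that it returns the count and prints the reason when the request fails.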
