furas.pl
# private notes - Python, Linux, Machine Learning, etc.

Scraping incomplete data with Selenium

Sometimes people scrape different values from a page separately

all_names  = driver.find_elements_by_xpath('.//h3/a')
all_prices = driver.find_elements_by_class_name('price_color')
all_others = driver.find_elements_by_class_name("other")

and later group them using zip()

for row in zip(all_names, all_prices, all_others):
    print(row)

but this causes problems when data is incomplete for some items (like other in this example). zip() pairs values by position, so values may shift from one item to another, and because zip() stops at the shortest list it can produce fewer results than we expect.
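A minimal sketch of the shifting problem, using plain lists instead of a browser. The book names and prices here are made-up stand-ins, assuming the second book has no price on the page:

```python
# Hypothetical data: "Book B" has no price, so only two prices were found.
names = ["Book A", "Book B", "Book C"]
prices = ["£10.00", "£30.00"]  # the second value really belongs to "Book C"

# zip() pairs values by position and stops at the shortest list.
rows = list(zip(names, prices))
print(rows)
# [('Book A', '£10.00'), ('Book B', '£30.00')]
```

"Book B" gets the price of "Book C", and "Book C" disappears from the results, with no error to signal it.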

For example, this gives no results because there is no class other on the page, so all_others is empty and zip(..., all_others) doesn't produce any items

import selenium.webdriver

driver = selenium.webdriver.Firefox()
driver.get('https://books.toscrape.com')

all_names  = driver.find_elements_by_xpath('.//h3/a')
all_prices = driver.find_elements_by_class_name('price_color')
all_others = driver.find_elements_by_class_name('other')

for name, price, other in zip(all_names, all_prices, all_others):
    row = [
        name.get_attribute('title'),
        price.text.strip(),
        other.text.strip()
    ]
    print(row)

It is better to find the object which groups all the information for a single item

all_items = driver.find_elements_by_class_name('product_pod')

and then search for values only inside this element - using item instead of driver - because then we can set default values for missing data - e.g. "NAN"

data = []

for item in all_items:
    try:
        name = item.find_element_by_xpath('.//h3/a').get_attribute('title')
    except Exception as ex:
        #print('[Exception] name:', ex)
        name = 'NAN'

    try: 
        price = item.find_element_by_class_name('price_color').text.strip()
    except Exception as ex:
        #print('[Exception] price:', ex)
        price = 'NAN'

    try: 
        other = item.find_element_by_class_name('other').text.strip()
    except Exception as ex:
        #print('[Exception] other:', ex)
        other = 'NAN'

    data.append([name, price, other])
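The three try/except blocks above repeat the same pattern, so they could be folded into one small helper. This is only a sketch: value_or_default is a hypothetical name, not part of Selenium, and a plain dict stands in for the scraped item so the example runs without a browser.

```python
def value_or_default(getter, default="NAN"):
    """Call getter() and return its result, or default if it raises."""
    try:
        return getter()
    except Exception:
        # Same broad catch as in the loop above. With Selenium, a missing
        # element normally raises NoSuchElementException.
        return default


# Stand-in for one scraped item (assumption: real code would use a WebElement).
item = {"name": "A Light in the Attic", "price": " £51.77 "}

row = [
    value_or_default(lambda: item["name"]),
    value_or_default(lambda: item["price"].strip()),
    value_or_default(lambda: item["other"].strip()),  # KeyError -> 'NAN'
]
print(row)
# ['A Light in the Attic', '£51.77', 'NAN']
```

With Selenium the getter would be something like lambda: item.find_element_by_class_name('price_color').text.strip(), which keeps one line per value in the loop.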

Full version:

import selenium.webdriver

driver = selenium.webdriver.Firefox()
driver.get('https://books.toscrape.com')

all_items = driver.find_elements_by_class_name('product_pod')

data = []

for item in all_items:
    try:
        name = item.find_element_by_xpath('.//h3/a').get_attribute('title')
    except Exception as ex:
        #print('[Exception] name:', ex)
        name = 'NAN'

    try: 
        price = item.find_element_by_class_name('price_color').text.strip()
    except Exception as ex:
        #print('[Exception] price:', ex)
        price = 'NAN'

    try: 
        other = item.find_element_by_class_name('other').text.strip()
    except Exception as ex:
        #print('[Exception] other:', ex)
        other = 'NAN'

    data.append([name, price, other])

for row in data:
    print(row)

Result:

['A Light in the Attic', '£51.77', 'NAN']
['Tipping the Velvet', '£53.74', 'NAN']
['Soumission', '£50.10', 'NAN']
['Sharp Objects', '£47.82', 'NAN']
['Sapiens: A Brief History of Humankind', '£54.23', 'NAN']
['The Requiem Red', '£22.65', 'NAN']
['The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'NAN']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', '£17.93', 'NAN']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', '£22.60', 'NAN']
['The Black Maria', '£52.15', 'NAN']
['Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99', 'NAN']
["Shakespeare's Sonnets", '£20.66', 'NAN']
['Set Me Free', '£17.46', 'NAN']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '£52.29', 'NAN']
['Rip it Up and Start Again', '£35.02', 'NAN']
['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', '£57.25', 'NAN']
['Olio', '£23.88', 'NAN']
['Mesaerion: The Best Science Fiction Stories 1800-1849', '£37.59', 'NAN']
['Libertarianism for Beginners', '£51.33', 'NAN']
["It's Only the Himalayas", '£45.17', 'NAN']
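A note for newer Selenium: the find_element_by_* and find_elements_by_* helpers were removed in Selenium 4.3, so the code above needs find_element(By..., ...) there. Below is a sketch of the same per-item idea in that style. It also swaps try/except for find_elements(), which returns an empty list when nothing matches. The names first_text and scrape_books are hypothetical, and I have not run this version against the page.

```python
def first_text(item, by, value, default="NAN"):
    """Return stripped text of the first match inside item, or default."""
    found = item.find_elements(by, value)  # empty list when nothing matches
    return found[0].text.strip() if found else default


def scrape_books(url="https://books.toscrape.com"):
    # Imported here so first_text() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        rows = []
        for item in driver.find_elements(By.CLASS_NAME, "product_pod"):
            # The title is an attribute, not text, so it is handled separately.
            links = item.find_elements(By.XPATH, ".//h3/a")
            name = links[0].get_attribute("title") if links else "NAN"
            price = first_text(item, By.CLASS_NAME, "price_color")
            other = first_text(item, By.CLASS_NAME, "other")
            rows.append([name, price, other])
        return rows
    finally:
        driver.quit()  # close the browser even if scraping fails
```

Usage would be: for row in scrape_books(): print(row)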

I used the page http://books.toscrape.com/, created by the authors of the Scrapy framework specifically for learning scraping. See also http://toscrape.com/ for more examples to scrape.
