Scraping incomplete data with Selenium
Sometimes people scrape different values from a page separately
from selenium.webdriver.common.by import By
all_names = driver.find_elements(By.XPATH, './/h3/a')
all_prices = driver.find_elements(By.CLASS_NAME, 'price_color')
all_others = driver.find_elements(By.CLASS_NAME, 'other')
and then group them using zip()
for row in zip(all_names, all_prices, all_others):
    print(row)
but this causes problems when the data is incomplete for some items (like other
in this example): values can shift from one item to another, and because zip() stops at the shortest list, we can get fewer results than we expect.
For example, the following script prints nothing: there is no class other on the page, so all_others is empty and zip(..., all_others) doesn't create any items.
import selenium.webdriver
from selenium.webdriver.common.by import By

driver = selenium.webdriver.Firefox()
driver.get('https://books.toscrape.com')

all_names = driver.find_elements(By.XPATH, './/h3/a')
all_prices = driver.find_elements(By.CLASS_NAME, 'price_color')
all_others = driver.find_elements(By.CLASS_NAME, 'other')  # empty list: no such class on the page

# zip() stops at the shortest list, so this loop body never runs
for name, price, other in zip(all_names, all_prices, all_others):
    row = [
        name.get_attribute('title'),
        price.text.strip(),
        other.text.strip(),
    ]
    print(row)
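The same effect can be reproduced without a browser. The sketch below uses plain Python lists with made-up values (the book names, prices, and the 'signed' label are assumptions for illustration, not data from the page) to show both problems: zip() stops at the shortest list, and the only other value ends up attached to the wrong item.

```python
# Stand-in data: three items, but only the second one has an "other" value.
# All values here are invented for illustration.
all_names = ['Book A', 'Book B', 'Book C']
all_prices = ['£10.00', '£20.00', '£30.00']
all_others = ['signed']  # in reality this belongs to 'Book B'

# zip() pairs values by position and stops at the shortest list.
rows = list(zip(all_names, all_prices, all_others))
print(rows)  # a single row, with 'signed' paired with 'Book A'

# With an empty list, zip() yields nothing at all.
empty_rows = list(zip(all_names, all_prices, []))
print(empty_rows)
```

Two of the three items are lost, and the one that remains carries a value that belongs to a different item.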
It is better to find the element which groups all the information for a single item
all_items = driver.find_elements(By.CLASS_NAME, 'product_pod')
and then search for values only inside this element, using item instead of driver, because then we can set a default value for missing data, e.g. "NAN"
data = []

for item in all_items:
    # Every lookup is scoped to the current item, so a missing value cannot shift rows
    try:
        name = item.find_element(By.XPATH, './/h3/a').get_attribute('title')
    except Exception as ex:
        #print('[Exception] name:', ex)
        name = 'NAN'

    try:
        price = item.find_element(By.CLASS_NAME, 'price_color').text.strip()
    except Exception as ex:
        #print('[Exception] price:', ex)
        price = 'NAN'

    try:
        other = item.find_element(By.CLASS_NAME, 'other').text.strip()
    except Exception as ex:
        #print('[Exception] other:', ex)
        other = 'NAN'

    data.append([name, price, other])
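The three try/except blocks repeat the same pattern, so they could be folded into one helper. This is only a sketch: safe_extract and the FakeItem stand-in are hypothetical names introduced here, not part of Selenium, and the stand-in (with invented values) exists only so the example runs without a browser.

```python
def safe_extract(getter, default='NAN'):
    """Call getter() and return its result, or `default` if it raises."""
    try:
        return getter()
    except Exception:
        return default


class FakeItem:
    """Hypothetical stand-in for a Selenium element (not a Selenium class)."""

    def __init__(self, **fields):
        self.fields = fields

    def find(self, key):
        # A KeyError here plays the role of a failed element lookup.
        return self.fields[key]


# Invented sample data: only the second item has an "other" value.
items = [
    FakeItem(name='Book A', price='£10.00'),
    FakeItem(name='Book B', price='£20.00', other='signed'),
]

data = []
for item in items:
    name = safe_extract(lambda: item.find('name'))
    price = safe_extract(lambda: item.find('price'))
    other = safe_extract(lambda: item.find('other'))
    data.append([name, price, other])

print(data)
```

With Selenium, each lambda would wrap the corresponding lookup from the loop above, including the .text.strip() call, so a missing element still falls back to the default.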
Full version:
import selenium.webdriver
from selenium.webdriver.common.by import By

driver = selenium.webdriver.Firefox()
driver.get('https://books.toscrape.com')

# One element per book; every lookup below is scoped to it
all_items = driver.find_elements(By.CLASS_NAME, 'product_pod')

data = []

for item in all_items:
    try:
        name = item.find_element(By.XPATH, './/h3/a').get_attribute('title')
    except Exception as ex:
        #print('[Exception] name:', ex)
        name = 'NAN'

    try:
        price = item.find_element(By.CLASS_NAME, 'price_color').text.strip()
    except Exception as ex:
        #print('[Exception] price:', ex)
        price = 'NAN'

    try:
        other = item.find_element(By.CLASS_NAME, 'other').text.strip()
    except Exception as ex:
        #print('[Exception] other:', ex)
        other = 'NAN'

    data.append([name, price, other])

for row in data:
    print(row)
Result:
['A Light in the Attic', '£51.77', 'NAN']
['Tipping the Velvet', '£53.74', 'NAN']
['Soumission', '£50.10', 'NAN']
['Sharp Objects', '£47.82', 'NAN']
['Sapiens: A Brief History of Humankind', '£54.23', 'NAN']
['The Requiem Red', '£22.65', 'NAN']
['The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'NAN']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', '£17.93', 'NAN']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', '£22.60', 'NAN']
['The Black Maria', '£52.15', 'NAN']
['Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99', 'NAN']
["Shakespeare's Sonnets", '£20.66', 'NAN']
['Set Me Free', '£17.46', 'NAN']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '£52.29', 'NAN']
['Rip it Up and Start Again', '£35.02', 'NAN']
['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', '£57.25', 'NAN']
['Olio', '£23.88', 'NAN']
['Mesaerion: The Best Science Fiction Stories 1800-1849', '£37.59', 'NAN']
['Libertarianism for Beginners', '£51.33', 'NAN']
["It's Only the Himalayas", '£45.17', 'NAN']
I used the page http://books.toscrape.com/, created by the authors of the Scrapy framework specifically for learning scraping. See also http://toscrape.com/ for more pages to practice on.