Bartłomiej 'furas' Burek
furas.pl
# private notes - Python, Linux, Machine Learning, etc.

Scraping: How to use a regular expression in BeautifulSoup to scrape Nobel laureates from a table on Wikipedia

I wanted to try using a regex to get the links to the laureates in the table on the page List of Nobel Memorial Prize laureates in Economics.

First I tried r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$' because the links look like

/wiki/Paul_Krugman

but this also matches links like

/wiki/United_States

and there are a few links with more than one _ and with non-ASCII characters (ö) percent-encoded as hex codes (e.g. %C3%B6), which the pattern does not match:

`/wiki/Bengt_R._Holmstr%C3%B6m` (`Bengt Holmström`)
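Both problems can be checked quickly without downloading the page. This is only a small sketch using the three example links from above; the variable names (`first_pattern`, `hrefs`, `matches`, `decoded`) are mine, and `urllib.parse.unquote` is used just to show what the hex codes stand for.

```python
import re
from urllib.parse import unquote

# the first pattern I tried: exactly two capitalized words joined by "_"
first_pattern = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')

# example links taken from the text above
hrefs = [
    '/wiki/Paul_Krugman',
    '/wiki/United_States',
    '/wiki/Bengt_R._Holmstr%C3%B6m',
]

# check which links the pattern accepts
matches = {href: bool(first_pattern.match(href)) for href in hrefs}
for href, matched in matches.items():
    print(href, '->', matched)

# decode percent-encoded characters back to the readable name
decoded = unquote('/wiki/Bengt_R._Holmstr%C3%B6m')
print(decoded)
```

The pattern accepts the country link and rejects the laureate with the longer, percent-encoded name, so it is both too loose and too strict for this table.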

I decided to find only the first table, work with every row separately, and get only the link from the third column. But there is a problem: the HTML uses rowspan to merge cells across two or three rows, so the link is in a different <td> in each row of the HTML code.

I decided to find the first link in each row that matches r'^/wiki/[^:]*$'. This way I skip the image link /wiki/File:.... Because I use find() instead of find_all(), I get only the link to the laureate and not the link to United States, which is in the next column.

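import re

import requests
from bs4 import BeautifulSoup as BS

r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics')
soup = BS(r.text, 'html.parser')

# the laureates are in the first table on the page
all_tables = soup.find_all('table')

# any /wiki/ link without ":" - this skips the /wiki/File:... image links
pattern = re.compile(r'^/wiki/[^:]*$')

for row in all_tables[0].find_all('tr'):
    # find() returns only the first matching link in the row
    item = row.find('a', {'href': pattern})
    if item:
        # .get() returns None instead of raising KeyError when there is no title
        print(item['href'], '|', item.get('title'))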
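
To see why find() with this pattern works even when the cells shift, here is a small offline sketch. The HTML fragment is made up for illustration (the names `Person_A`, `Person_B` and `Country_X` are hypothetical, not taken from Wikipedia); it only imitates a table where a cell in the first row spans two rows.

```python
import re
from bs4 import BeautifulSoup

# hypothetical fragment: the year cell spans two rows,
# so the laureate link sits in a different <td> in each row
html = '''
<table>
  <tr>
    <td rowspan="2">2008</td>
    <td><a href="/wiki/File:Person_A.jpg">image</a></td>
    <td><a href="/wiki/Person_A" title="Person A">Person A</a></td>
    <td><a href="/wiki/Country_X" title="Country X">Country X</a></td>
  </tr>
  <tr>
    <td><a href="/wiki/File:Person_B.jpg">image</a></td>
    <td><a href="/wiki/Person_B" title="Person B">Person B</a></td>
    <td><a href="/wiki/Country_X" title="Country X">Country X</a></td>
  </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'^/wiki/[^:]*$')

results = []
for row in soup.find('table').find_all('tr'):
    # first link in the row without ":" in href: skips File:, stops before the country
    item = row.find('a', {'href': pattern})
    if item:
        results.append((item['href'], item.get('title')))

print(results)
```

Because the search goes through the links in document order, the position of the <td> does not matter, only the order of the links inside the row.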