Scraping: How to use a regular expression in BeautifulSoup to scrape Nobel laureates from a table on Wikipedia
I wanted to use a regex to get the links to the laureates in the table on the Wikipedia page List of Nobel Memorial Prize laureates in Economics.
First I tried `r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$'` because the links look like `/wiki/Paul_Krugman`, but this also matches links like `/wiki/United_States`.
It also misses the few links that contain more than one `_`, and those whose non-ASCII characters are converted to hex codes (`ö` becomes `%C3%B6`), for example `/wiki/Bengt_R._Holmstr%C3%B6m` (`Bengt Holmström`).
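The behaviour of that first pattern can be checked offline with a few sample hrefs. This is only a small sketch using the three links mentioned above; the variable names are illustrative and not part of the original script.

```python
import re
from urllib.parse import unquote

# First attempt from the text: exactly two capitalised ASCII words joined by "_".
first_try = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')

hrefs = [
    '/wiki/Paul_Krugman',             # laureate: matches
    '/wiki/United_States',            # country: also matches (false positive)
    '/wiki/Bengt_R._Holmstr%C3%B6m',  # laureate: does not match (false negative)
]

matches = [href for href in hrefs if first_try.match(href)]
print(matches)

# Percent-encoded bytes decode back to the original characters (UTF-8 by default).
decoded = unquote('/wiki/Bengt_R._Holmstr%C3%B6m')
print(decoded)
```

The country link passes while a real laureate is rejected, which is why matching on the shape of the name alone does not work here.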
I decided to find only the first table, process every row separately, and take only the link from the third column.
The problem is that the HTML uses `rowspan` to merge cells across two or three rows, so the link sits in a different `<td>` in each row.
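The shifting cell position can be reproduced with a tiny hand-written table. The HTML below is an assumed stand-in for the Wikipedia markup, not a copy of it: a year cell is merged across two rows, so the second row has one `<td>` fewer.

```python
from bs4 import BeautifulSoup

# Hand-written sample rows (an assumption, not copied from the page):
# the year cell spans two rows, so the second <tr> has no year <td>.
html = """
<table>
  <tr><td rowspan="2">2016</td><td>img</td><td><a href="/wiki/Oliver_Hart_(economist)">Oliver Hart</a></td></tr>
  <tr><td>img</td><td><a href="/wiki/Bengt_R._Holmstr%C3%B6m">Bengt Holmström</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

# Number of <td> cells in each row.
cell_counts = [len(row.find_all('td')) for row in rows]

# Index of the first <td> that contains a link, per row.
link_positions = [
    next(i for i, td in enumerate(row.find_all('td')) if td.find('a'))
    for row in rows
]

print(cell_counts)
print(link_positions)
```

In this sample the link is in the third cell of the first row but in the second cell of the next one, so a fixed column index cannot be used.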
Instead, I decided to find the first link in each row that matches `r'^/wiki/[^:]*$'`.
This way I skip the image link `/wiki/File:...`.
Because I use `find()` instead of `find_all()`, I get only the link to the laureate and not the link to `/wiki/United_States`, which is in the next column.
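import re

import requests
from bs4 import BeautifulSoup as BS

r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics')
soup = BS(r.text, 'html.parser')

all_tables = soup.find_all('table')

# any /wiki/ link without a colon, which excludes /wiki/File:... image links
pattern = re.compile(r'^/wiki/[^:]*$')

# only the first table; find() returns the first matching link in each row
for row in all_tables[0].find_all('tr'):
    item = row.find('a', {'href': pattern})
    if item:
        # .get() avoids a KeyError if a link has no title attribute
        print(item['href'], '|', item.get('title'))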
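
The difference between `find()` and `find_all()` with this pattern can be tried without any network access. The row below is a hand-written imitation of the layout (image link, laureate, country) and is an assumption about the structure, not the page's actual markup.

```python
import re
from bs4 import BeautifulSoup

# Hand-written row imitating the layout: image link, laureate, country.
html = """
<table><tr>
  <td><a href="/wiki/File:Paul_Krugman.jpg">photo</a></td>
  <td><a href="/wiki/Paul_Krugman" title="Paul Krugman">Paul Krugman</a></td>
  <td><a href="/wiki/United_States" title="United States">United States</a></td>
</tr></table>
"""

pattern = re.compile(r'^/wiki/[^:]*$')
row = BeautifulSoup(html, 'html.parser').find('tr')

# find() stops at the first matching link; find_all() returns every match.
first = row.find('a', {'href': pattern})
every = row.find_all('a', {'href': pattern})

first_href = first['href']
all_hrefs = [a['href'] for a in every]
print(first_href)
print(all_hrefs)
```

The `File:` link is rejected because of the colon, and `find()` then returns the laureate before ever reaching the country link. This relies on the laureate link coming before the country link in every row.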