
Scraping: Python tools and modules for scraping (updated)

Last update: 2022.03.28

Get HTML from server


urllib

  • standard module, preinstalled with Python
  • some operations need more code than requests
  • it has urlretrieve() to download a file
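The bullets above describe urllib (urllib.request). A minimal sketch of urlopen() and urlretrieve() — it starts a throwaway local server so the example runs offline; with a real site you would pass its http(s) URL instead:

```python
import http.server
import threading
import urllib.request

# page served by the throwaway local server (stand-in for a real site)
HTML = b"<html><body><h1>Hello</h1></body></html>"

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(HTML)

    def log_message(self, format, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

# read the page into a string
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")
print(html)

# urlretrieve() downloads straight to a file
filename, headers = urllib.request.urlretrieve(url, "page.html")

server.shutdown()
```

With requests the first part would be just `requests.get(url).text`, which is why the bullets say urllib sometimes needs more code.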


Search data in HTML


BeautifulSoup

  • it uses CSS selectors or its own functions, which can use regex
  • it doesn't use XPath
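The bullets above describe BeautifulSoup. A minimal sketch (with made-up inline HTML) of both approaches — CSS selectors via select(), and its own find()/find_all() functions with a regex:

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div class="item"><a href="/page1">First</a></div>
<div class="item"><a href="/page2">Second</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors
links = soup.select("div.item a")
print([a.get_text() for a in links])   # ['First', 'Second']

# own functions (find/find_all), which accept a compiled regex
page2 = soup.find("a", href=re.compile(r"page2"))
print(page2.get_text())                # Second
```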


lxml

  • it uses XPath
  • it doesn't use CSS selectors directly
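The bullets above describe lxml. A minimal sketch (same made-up HTML) of searching with XPath only:

```python
from lxml import html  # pip install lxml

doc = html.fromstring("""
<div class="item"><a href="/page1">First</a></div>
<div class="item"><a href="/page2">Second</a></div>
""")

# XPath expressions, no CSS selectors
texts = doc.xpath('//div[@class="item"]/a/text()')
print(texts)   # ['First', 'Second']

hrefs = doc.xpath('//a/@href')
print(hrefs)   # ['/page1', '/page2']
```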


parsel

  • it uses CSS or XPath selectors, combined with regex
  • it is used by Scrapy


cssselect

  • it converts a CSS selector to XPath and uses lxml to search


pyquery

  • it uses selectors like jQuery
  • it supports pseudo-classes which don't exist in CSS, i.e. :first, :last, :even, :odd, :eq, :lt, :gt, :checked, :selected, :file

Scraping framework(s)


Scrapy

  • it uses XPath and CSS selectors
  • it can run many requests at the same time
  • it can use proxies
  • it can be used on servers
  • extension to work with Selenium: scrapy-selenium



MechanicalSoup

  • it combines Requests + BeautifulSoup




Work with JavaScript

If a page uses JavaScript, then you may need one of these modules, which can control a real web browser (e.g. Selenium).


See also



Other tools for scraping


Portia

  • a Docker image with a tool that allows for visual scraping
  • created by the authors of Scrapy

Other tools for help

httpbin.org

  • it can be used to test HTTP requests
  • it sends back all data which it gets, so you can check if your request sends the correct data

toscrape.com - a Web Scraping Sandbox with two fictional pages/portals which you can use to learn scraping.

There are examples with:

  • normal pagination
  • infinite scrolling pagination
  • JavaScript generated content
  • a table based messed-up layout
  • login with CSRF token (any user/passwd works)
  • ViewState (C# DotNet)

  • it can convert a curl command to code in Python (requests) or other languages
  • some (API) documentation shows examples as curl commands
  • some conversions may have mistakes
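As an illustration, here is a hypothetical curl command of the kind shown in API documentation, and roughly the requests code a converter would produce. Request.prepare() lets you inspect the result without actually sending it:

```python
import requests  # pip install requests

# hypothetical curl command from some API documentation:
#   curl -X POST https://httpbin.org/post \
#        -H "Content-Type: application/json" -d '{"q": "python"}'
# a converter turns it into roughly this:
req = requests.Request(
    "POST",
    "https://httpbin.org/post",
    headers={"Content-Type": "application/json"},
    data='{"q": "python"}',
)

prepared = req.prepare()  # build the request without sending it
print(prepared.method, prepared.url)  # POST https://httpbin.org/post
print(prepared.body)                  # {"q": "python"}

# to really send it: requests.Session().send(prepared)
# (or simply requests.post(url, headers=..., data=...))
```

This is also a cheap way to check whether a conversion made a mistake before firing it at a real API.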

Older link


  • curl2scrapy (converts to code for the module Scrapy)

  • the tools Postman and Insomnia can also generate code for Python

"DevTools" in Chrome and Firefox

  • tab: Inspector - to search for items in the HTML and get a CSS or XPath selector (but it gives a selector which doesn't use classes and ids, so it can be long and unreadable for a human)
  • tab: Network - to see all requests from the browser to the server, and to find the requests which JavaScript uses to get data
  • tab: Console - to test JavaScript code, or to use $("...") to test a CSS selector or $x("...") to test an XPath
  • extensions:

Extra doc for Firefox

Tools which can be used to test requests and APIs

They can also generate code in Python using urllib or requests

By The Way

pandas.read_html() can read HTML from a file or URL, scrape data from every standard <table>, and create a list of DataFrames
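A minimal sketch with an inline table (newer pandas versions want literal HTML wrapped in StringIO; a file path or URL works too):

```python
from io import StringIO
import pandas as pd  # pip install pandas lxml

html = """
<table>
  <tr><th>name</th><th>price</th></tr>
  <tr><td>milk</td><td>3</td></tr>
  <tr><td>eggs</td><td>5</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> in the HTML
dfs = pd.read_html(StringIO(html))
print(dfs[0])
```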

Example code for different pages and different tools is on my GitHub:
