Articles for tag: scraping

Scraping: Python tools and modules for scraping (updated)

Last update: 2022.03.28

Get HTML from server

urllib.request

standard module, preinstalled with Python
some operations need more code than with requests
it has urlretrieve() to download a file
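
A minimal sketch of both urllib.request styles mentioned above: reading a page with urlopen() and saving a file with urlretrieve(). The helper names fetch_html() and download() and the User-Agent value are my own assumptions for illustration. To keep the example runnable without a network connection it reads a temporary local file through a file:// URL; a normal https:// URL should work the same way.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Return the decoded body of `url` (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def download(url, target):
    """Save the resource at `url` to `target` and return the saved path."""
    path, _headers = urllib.request.urlretrieve(url, str(target))
    return Path(path)


# Offline demo: a local file stands in for a remote page.
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "page.html"
    source.write_text("<h1>Hello</h1>", encoding="utf-8")
    url = source.as_uri()

    html = fetch_html(url)
    copy = download(url, Path(folder) / "copy.html")
    copied_text = copy.read_text(encoding="utf-8")
```

With requests the first helper shrinks to roughly one call, which is the "more code" difference noted above.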

Requests

popular module which makes life easier
there are extensions and modifications for it

Search data in HTML

BeautifulSoup

it uses CSS selectors or its own functions, which can also use regex
it doesn't support XPath
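
A short sketch of the two search styles in BeautifulSoup: a CSS selector with select() and its own find_all() combined with a regex. The HTML snippet and the variable names are made up for illustration, and the example assumes the beautifulsoup4 package is installed.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML, only for this illustration.
html = """
<div id="menu">
  <a class="external" href="https://example.com">Example</a>
  <a href="/wiki/Paul_Krugman">Paul Krugman</a>
  <a href="/about">About</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: links with class "external" inside the element with id "menu".
css_links = [a["href"] for a in soup.select("#menu a.external")]

# Own function plus regex: links whose href starts with /wiki/.
wiki_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/wiki/"))]
```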

Parsel

it uses CSS selectors or XPath, and both can be combined with regex
it is used by Scrapy

cssselect

it converts CSS selectors to XPath and uses lxml to search

pyquery

it uses selectors like jQuery
it supports jQuery pseudo-classes, most of which don't exist in standard CSS, e.g. :first :last :even :odd :eq :lt :gt :checked :selected :file

Scraping framework(s)

Scrapy

it uses XPath and CSS selectors
it can process many requests at the same time
it can use proxies
it can be run on zyte.com servers
extension to work with Selenium: scrapy-selenium

Source: https://doc.scrapy.org/en/latest/topics/architecture.html

Work with JavaScript

If a page uses JavaScript then you may need one of these modules, which can control a real web browser.

Selenium

it controls a real web browser which can run JavaScript
it uses XPath and CSS selectors
drivers
- standard driver for Chrome: chromedriver
- standard driver for Firefox: geckodriver
- standard driver for Edge: msedgedriver
- special driver for Chrome: undetected-chromedriver
- module to automatically download driver: webdriver-manager

Other tools for scraping

Portia

Docker image with a tool for visual scraping
created by the authors of Scrapy

Other tools for help

httpbin.org

it can be used to test HTTP requests
it sends back all the data it receives, so you can check whether your request sends the correct data

ToScrape.com

Web scraping sandbox with two fictional sites which you can use to learn scraping.

There are examples with:

normal pagination
infinite scrolling pagination
JavaScript generated content
a table based messed-up layout
login with CSRF token (any user/passwd works)
ViewState (C# DotNet)

curlconverter.com

it can convert a curl command to Python code (requests) or to other languages
some (API) documentation shows examples as curl commands
some conversions may contain mistakes

Older link curl.trillworks.com

Similar:

reqbin.com
curl2scrapy (convert for module scrapy)
tools Postman and Insomnia can also generate code for Python

"DevTools" in Chrome and Firefox

tab: Inspector (Elements in Chrome) - to search items in HTML and get a CSS or XPath selector (but it gives a selector which doesn't use classes and ids, so it can be long and unreadable for humans)
tab: Network - to see all requests from browser to server and find the requests used by JavaScript to get data
tab: Console - to test JavaScript code, or use $("...") to test a CSS selector or $x("...") to test XPath
extensions:
- button to turn off JavaScript:
  - JavaScript Toggle On and Off
  - disable-javascript

Extra doc for Firefox

Tools which can be used to test requests and API

They can also generate Python code which uses urllib or requests

charlesproxy - local proxy server
mitmproxy - local proxy server (created in Python)

By The Way

pandas.read_html() can read HTML from a file or URL, scrape data from every standard <table>, and return a list of DataFrames

Example code for different pages and different tools is on my GitHub:

furas / python-examples / scraping

Python: How to convert text with hex values and \ to normal characters

Sometimes you can scrape a string with double \\ and hex values

\\x3Cstyle\\x3E\\x0A.mainDiv\\x0A\\x7B\\x0A\\x20\\x20width\\x3A1000px\\x3B\\x0A}\\x0A\\x3C/style\\x3E

But it should be

<style>
   .mainDiv
  {
  width:1000px;
  background-image:url
<style>

You need to encode it back to bytes with raw_unicode_escape and then decode it with unicode_escape …
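
A minimal sketch of that encode/decode round trip, using the string from above. The helper name unescape() and the max_passes parameter are my own assumptions. Because the backslashes are doubled, one pass only turns \\x3C into \x3C, so the sketch repeats the conversion until the text stops changing.

```python
def unescape(text, max_passes=3):
    """Repeatedly undo backslash escapes until the text stops changing."""
    for _ in range(max_passes):
        # str -> bytes (escapes kept as-is) -> str (escapes interpreted)
        decoded = text.encode('raw_unicode_escape').decode('unicode_escape')
        if decoded == text:
            break
        text = decoded
    return text


# Raw string: every \\ below is two real backslash characters.
raw = r'\\x3Cstyle\\x3E\\x0A.mainDiv\\x0A\\x7B\\x0A\\x20\\x20width\\x3A1000px\\x3B\\x0A}\\x0A\\x3C/style\\x3E'

result = unescape(raw)
print(result)
```

A string that legitimately ends with a single backslash would raise an error in decode(), so treat this as a sketch for strings like the one shown, not as a general-purpose unescaper.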

Scraping: How to download tgz file from eogauth.mines.edu.

This is a problem from Stack Overflow.

The main problem was the wrong URL used in the POST request.

A form often sends data to the same URL as the page that contains it, but this is not true for every page.

A form may send data to a different URL, which is defined as action in HTML <form action …
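
A hedged sketch of how the POST URL could be worked out from the form itself, using only the standard library. The class FormActionParser, the function form_post_url() and the example.com URLs are hypothetical names for illustration; this is not the code from the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Collect the action attribute of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action"))


def form_post_url(page_url, html):
    """Return the URL to which the first form on the page sends its data."""
    parser = FormActionParser()
    parser.feed(html)
    if not parser.actions:
        raise ValueError("no <form> found")
    action = parser.actions[0]
    # A missing or empty action means the form posts to the page's own URL.
    return urljoin(page_url, action) if action else page_url


page_url = "https://example.com/login/start"

post_url = form_post_url(
    page_url, '<form method="post" action="/auth/submit"><input name="user"></form>'
)
same_url = form_post_url(page_url, '<form method="post"><input name="user"></form>')
```

The value of post_url is then the address to pass to the POST call instead of the address of the page with the form.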

Python: How to find element next after (previous before) another element with BeautifulSoup.

BeautifulSoup has many functions to search elements - not only find() and find_all() but also

It can also search in other direction using

It has also attributes (for single element)

and iterators (for many elements)

which can work different …

Python: How to use Selenium with local HTML in string.

To run Selenium on local HTML which you have in string you can use

driver.get("data:text/html;charset=utf-8," + html)

Full example

html = '''
<ul>
  <li>Contains Enzymatically Active B-Vitamins
  </li>
  <li>Dietary Supplement
  </li>
  <li>Non-GMO LE Certified
  </li>
</ul>'''

import selenium.webdriver

driver = selenium.webdriver.Firefox()

driver.get("data …
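
One caveat with the data: URL approach: characters such as # or % inside the HTML can be read by the browser as URL syntax (# starts a fragment), so part of the page may be cut off. A minimal sketch that percent-encodes the HTML first with urllib.parse.quote(). The helper name html_to_data_url() and the sample HTML are my own assumptions, and the Selenium call is left as a comment so the snippet runs without a browser.

```python
from urllib.parse import quote, unquote


def html_to_data_url(html):
    """Build a data: URL with the HTML percent-encoded (hypothetical helper)."""
    return "data:text/html;charset=utf-8," + quote(html)


html = '<ul><li id="first">Dietary Supplement #1</li></ul>'
url = html_to_data_url(html)

# Round trip: decoding the payload gives back the original HTML.
decoded = unquote(url.split(",", 1)[1])

# With a running driver this would be used as:
# driver.get(url)
```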

How to use DevTools in Firefox to find JSON data in EpicGames.com

The video shows DevTools in Firefox: the Network tab with the XHR filter.

You can open DevTools from the Web Developer menu or with the F12 shortcut.

After clicking a link in DevTools, it also shows the Headers and Response tabs with the JSON data.

Using the context menu on a link (right mouse click) you can also use Open In …

Python: How to use Tor Network with requests to change IP?

The Tor network can be used to send requests from a different IP address.

If you have Tor installed, it should run all the time as a service, and you can use it as a proxy server at the address 127.0.0.1:9050 (localhost:9050).

In requests you can use it like this

proxy = {
    'http':  'socks5://127 …
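
A minimal sketch of a complete proxy dictionary for the address given above. The helper name tor_proxies() is my own assumption, and the request itself is left as a comment because it needs a running Tor service. As far as I know, requests also needs the SOCKS extra (pip install requests[socks]), and the socks5h scheme can be used instead of socks5 to resolve DNS through the proxy as well.

```python
def tor_proxies(host="127.0.0.1", port=9050, scheme="socks5"):
    """Return a proxies dict that points both HTTP and HTTPS at a local Tor."""
    address = f"{scheme}://{host}:{port}"
    return {"http": address, "https": address}


proxies = tor_proxies()

# With Tor running and requests[socks] installed:
# import requests
# response = requests.get("https://httpbin.org/ip", proxies=proxies)
# print(response.json())  # should show the exit node's IP, not yours
```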

Scraping: How to use regular expression in BeautifulSoup to scrape Nobel Laureates from table in Wikipedia

I wanted to try to use regex to get links to laureates in the table on the page List of Nobel Memorial Prize laureates in Economics

First I tried to use r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$' because the links look like

/wiki/Paul_Krugman

but this also matches links like

/wiki/United_States …
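
The problem can be reproduced with the re module alone. The list of links below is made up for illustration; only the first two entries come from the text above.

```python
import re

pattern = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')

# Sample hrefs; the last two are made-up extras for contrast.
links = [
    '/wiki/Paul_Krugman',
    '/wiki/United_States',
    '/wiki/Economics',
    '/wiki/File:Nobel.png',
]

# Both the laureate and the country have the shape Word_Word, so both match.
matched = [link for link in links if pattern.match(link)]
print(matched)
```

So the shape of the link alone cannot tell a person from a country; some extra condition is needed.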

Scraping: How to get data from interactive plot created with HighCharts

On the page https://www.worldometers.info/coronavirus/#countries you can see a Highcharts chart with "Total Coronavirus Death". I tried to get the data which it uses to display this chart. It doesn't use AJAX to load the data from another URL, so I couldn't read it directly. It also doesn't keep the data in a separate variable in …

Selenium: How to close alert created by JavaScript

JavaScript can create three standard popup alerts: alert(), confirm() or prompt().

all of them have button OK
confirm() and prompt() have button CANCEL
prompt() has text field

To press button OK

driver.switch_to.alert.accept()  # press OK

To press button CANCEL (only in confirm() and prompt())

driver.switch_to.alert.dismiss()  # press CANCEL …

Selenium: How to send clipboard to field in browser

When you find an input field on a page, you can send Ctrl+V to paste text from the clipboard into this field.

import selenium.webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = selenium.webdriver.Firefox()
driver.get('https://google.com')

# find_element_by_name() was removed in Selenium 4.3; use find_element() with By
item = driver.find_element(By.NAME, 'q')
item.send_keys(Keys.CONTROL + "v")
#item …
