Bartłomiej 'furas' Burek
furas.pl
# private notes - Python, Linux, Machine Learning, etc.

Scraping: How to get data from an interactive chart created with Highcharts

On the page https://www.worldometers.info/coronavirus/#countries there is a Highcharts chart with "Total Coronavirus Deaths". I wanted to get the data that was used to draw this chart.

The chart doesn't use AJAX to load the data from a separate URL, so I couldn't fetch it directly. It also doesn't keep the data in a separate JavaScript variable or in an HTML tag. It has all the data directly in the HTML, inside the JavaScript call Highcharts.chart(...), so I tried to extract it with several different methods.

Most of these methods require manually finding the elements in the data and building the right indexes or xpath to extract them, so it is not entirely straightforward.

The most pleasant to use was js2xml, which parses the JavaScript code and returns XML, which you can then search with xpath.

The hardest to use was pyjsparser, which also parses the JavaScript code but returns the data as a Python dictionary, which has no methods for searching for elements.

I also used ordinary string operations (split(), slicing [start:end]) to extract the data as JSON and convert it into Python objects with eval() or the json module (or dirtyjson when the data is not correctly formatted JSON).
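The split-and-parse idea in a nutshell, again on a made-up fragment:

```python
import json

snippet = "Highcharts.chart('id', { series: [{ data: [17,25,41] }] });"

# cut out everything between `data: [` and the next `]`
text = snippet.split('data: [', 1)[1].split(']', 1)[0]  # -> "17,25,41"

# wrap it back in brackets so it is valid JSON, then parse
values = json.loads('[' + text + ']')
print(values)  # [17, 25, 41]
```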

import requests
from bs4 import BeautifulSoup 
import json
#import dirtyjson
import js2xml
import pyjsparser

url = 'https://www.worldometers.info/coronavirus/#countries'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

all_scripts = soup.find_all('script')

script = all_scripts[24].text  # index found manually; it changes when the page changes
print(script)

#------------------------------------------------------------------------

print('\n--- eval ---\n')

data = script.split('data: [', 1)[1].split(']', 1)[0]
data = eval(data)  # it creates tuple
print(data)

# text values
data = script.split("title: { text: '", 1)[-1].split("'", 1)[0]
print(data)
data = script.split("title: { text: '", 3)[-1].split("'", 1)[0]
print(data)

#------------------------------------------------------------------------

print('\n--- json ---\n')

data = script.split('data: [', 1)[1].split(']', 1)[0]
data = '[' + data + ']' # create correct JSON data
data = json.loads(data) # this time it doesn't need `dirtyjson`
print(data)

# text values
data = script.split("title: { text: '", 1)[-1].split("'", 1)[0]
print(data)
data = script.split("title: { text: '", 3)[-1].split("'", 1)[0]
print(data)

#------------------------------------------------------------------------

print('\n--- js2xml ---\n')

data = js2xml.parse(script)
print(data.xpath('//property[@name="data"]//number/@value')) # nice and short xpath

# text values
#print(js2xml.pretty_print(data.xpath('//property[@name="title"]')[0]))
text = data.xpath('//property[@name="title"]//string/text()')
print(text[0])
print(text[1])

#------------------------------------------------------------------------

print('\n--- pyjsparser ---\n')

data = pyjsparser.parse(script) 
data = data['body'][0]['expression']['arguments'][1]['properties'][-2]['value']['elements'][0]['properties'][-1]['value']['elements'] # a lot of work to find it
#print(json.dumps(data, indent=2))
data = [x['value'] for x in data]
print(data)

# text values
# it needs work

Output

 Highcharts.chart('coronavirus-deaths-linear', { chart: { type: 'line' }, title: { text: 'Total Deaths' }, subtitle: { text: '(Linear Scale)' }, xAxis: { categories: ["Jan 22","Jan 23","Jan 24","Jan 25","Jan 26","Jan 27","Jan 28","Jan 29","Jan 30","Jan 31","Feb 01","Feb 02","Feb 03","Feb 04","Feb 05","Feb 06","Feb 07","Feb 08","Feb 09","Feb 10","Feb 11","Feb 12","Feb 13","Feb 14","Feb 15","Feb 16","Feb 17","Feb 18","Feb 19","Feb 20","Feb 21","Feb 22","Feb 23","Feb 24","Feb 25","Feb 26","Feb 27","Feb 28","Feb 29","Mar 01","Mar 02","Mar 03","Mar 04","Mar 05"] }, yAxis: { title: { text: 'Total Coronavirus Deaths' } }, legend: { layout: 'vertical', align: 'right', verticalAlign: 'middle' }, credits: { enabled: false }, series: [{ name: 'Deaths', color: '#FF9900', lineWidth: 5, data: [17,25,41,56,80,106,132,170,213,259,304,362,426,492,565,638,724,813,910,1018,1115,1261,1383,1526,1669,1775,1873,2009,2126,2247,2360,2460,2618,2699,2763,2800,2858,2923,2977,3050,3117,3202,3285,3387] }], responsive: { rules: [{ condition: { maxWidth: 800 }, chartOptions: { legend: { layout: 'horizontal', align: 'center', verticalAlign: 'bottom' } } }] } }); 

--- eval ---

(17, 25, 41, 56, 80, 106, 132, 170, 213, 259, 304, 362, 426, 492, 565, 638, 724, 813, 910, 1018, 1115, 1261, 1383, 1526, 1669, 1775, 1873, 2009, 2126, 2247, 2360, 2460, 2618, 2699, 2763, 2800, 2858, 2923, 2977, 3050, 3117, 3202, 3285, 3387)
Total Deaths
Total Coronavirus Deaths

--- json ---

[17, 25, 41, 56, 80, 106, 132, 170, 213, 259, 304, 362, 426, 492, 565, 638, 724, 813, 910, 1018, 1115, 1261, 1383, 1526, 1669, 1775, 1873, 2009, 2126, 2247, 2360, 2460, 2618, 2699, 2763, 2800, 2858, 2923, 2977, 3050, 3117, 3202, 3285, 3387]
Total Deaths
Total Coronavirus Deaths

--- js2xml ---

['17', '25', '41', '56', '80', '106', '132', '170', '213', '259', '304', '362', '426', '492', '565', '638', '724', '813', '910', '1018', '1115', '1261', '1383', '1526', '1669', '1775', '1873', '2009', '2126', '2247', '2360', '2460', '2618', '2699', '2763', '2800', '2858', '2923', '2977', '3050', '3117', '3202', '3285', '3387']
Total Deaths
Total Coronavirus Deaths

--- pyjsparser ---

[17.0, 25.0, 41.0, 56.0, 80.0, 106.0, 132.0, 170.0, 213.0, 259.0, 304.0, 362.0, 426.0, 492.0, 565.0, 638.0, 724.0, 813.0, 910.0, 1018.0, 1115.0, 1261.0, 1383.0, 1526.0, 1669.0, 1775.0, 1873.0, 2009.0, 2126.0, 2247.0, 2360.0, 2460.0, 2618.0, 2699.0, 2763.0, 2800.0, 2858.0, 2923.0, 2977.0, 3050.0, 3117.0, 3202.0, 3285.0, 3387.0]
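The hard-coded all_scripts[24] above breaks whenever the page changes; a more robust way (the same idea the later EDIT uses) is to pick the script by its content. A sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# made-up HTML with two <script> tags, only one containing the chart
html = """<script>var other = 1;</script>
<script>Highcharts.chart('id', { data: [1, 2] });</script>"""

soup = BeautifulSoup(html, "html.parser")

# instead of a hard-coded index, keep only scripts with the chart call
found = [s.text for s in soup.find_all('script') if 'Highcharts.chart' in s.text]
print(len(found))  # 1
```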

PS:

The same data is also available as a table at https://www.worldometers.info/coronavirus/coronavirus-death-toll/

Similar data about the coronavirus can be found on GitHub: https://github.com/CSSEGISandData/COVID-19


EDIT: 2020.05.06

The JavaScript on the page has a new structure, so I changed the code.

import requests
from bs4 import BeautifulSoup 
import json
#import dirtyjson
import js2xml
import pyjsparser

# --- functions ---

def test_eval(script):
    print('\n--- eval ---\n')

    # chart values
    text = script.split('data: [', 1)[1] # beginning
    text = text.split(']', 1)[0] # end
    values = eval(text)  # it creates tuple
    print(values)

    # title 
    # I split on `yAxis` because there is another `title` without text
    # I split the beginning in a few steps because the text may have
    # different indentation (different numbers of spaces)
    # (you could use regex to split in one step)
    text = script.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

    text = script.split("yAxis: {\n", 1)[1] # beginning
    text = text.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

def test_json(script):
    print('\n--- json ---\n')

    # chart values
    text = script.split('data: [', 1)[1] # beginning
    text = text.split(']', 1)[0] # end
    text = '[' + text + ']' # create correct JSON data
    values = json.loads(text) # this time it doesn't need `dirtyjson`
    print(values)

    # title
    # I split on `yAxis` because there is another `title` without text
    # I split the beginning in a few steps because the text may have
    # different indentation (different numbers of spaces)
    # (you could use regex to split in one step)
    text = script.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

    text = script.split("yAxis: {\n", 1)[1] # beginning
    text = text.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

def test_js2xml(script):
    print('\n--- js2xml ---\n')

    data = js2xml.parse(script)

    # chart values (short and nice path)
    values = data.xpath('//property[@name="data"]//number/@value')
    #values = [int(x) for x in values] # it may need to convert to int() or float()
    #values = [float(x) for x in values] # it may need to convert to int() or float()
    print(values)

    # title (short and nice path)
    #print(js2xml.pretty_print(data.xpath('//property[@name="title"]')[0]))
    #title = data.xpath('//property[@name="title"]//string/text()')
    #print(js2xml.pretty_print(data.xpath('//property[@name="yAxis"]//property[@name="title"]')[0]))

    title = data.xpath('//property[@name="title"]//string/text()')
    title = title[0]
    print('\ntitle:', title)

    title = data.xpath('//property[@name="yAxis"]//property[@name="title"]//string/text()')
    title = title[0]
    print('\ntitle:', title)

def test_pyjsparser(script):
    print('\n--- pyjsparser ---\n')

    data = pyjsparser.parse(script)

    print("number of elements in body:", len(data['body']))

    for number, body in enumerate(data['body']):
        # .get() avoids a KeyError for statements that are not `Highcharts.xxx(...)` calls
        if (body['type'] == 'ExpressionStatement'
            and body['expression'].get('callee', {}).get('object', {}).get('name') == 'Highcharts'
            and len(body['expression']['arguments']) > 1):

            arguments = body['expression']['arguments']
            #print(json.dumps(arguments, indent=2))
            for properties in arguments[1]['properties']:
                #print('name: >{}<'.format(properties['key']['name']))
                if properties['key']['name'] == 'series':
                    values = properties['value']['elements'][0]
                    values = values['properties'][-1]
                    values = values['value']['elements'] # a lot of work to find it
                    #print(json.dumps(values, indent=2))

                    values = [x['value'] for x in values]
                    print(values)

    # title (very complicated path) 
    # It needs more work to find the correct indexes to get the title,
    # so I skip this part as too complex.

# --- main ---

url = 'https://www.worldometers.info/coronavirus/#countries'

r = requests.get(url)
#print(r.text)
soup = BeautifulSoup(r.text, "html.parser")

all_scripts = soup.find_all('script')
print('number of scripts:', len(all_scripts))

for number, script in enumerate(all_scripts):

    #if 'data: [' in script.text:
    if 'Highcharts.chart' in script.text:
        print('\n=== script:', number, '===\n')
        test_eval(script.text)
        test_json(script.text)
        test_js2xml(script.text)
        test_pyjsparser(script.text)
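The comments in test_eval() and test_json() mention that a regex could do the split in one step; a minimal sketch of that idea, tested only against a made-up fragment with the same layout as the page script:

```python
import re

# made-up fragment with the same indentation pattern as the page script
snippet = """Highcharts.chart('x', {
    title: {
        text: 'Total Deaths'
    },
    yAxis: {
        title: {
            text: 'Total Coronavirus Deaths'
        }
    },
});"""

# \s* absorbs the newlines and any amount of indentation in one step
title = re.search(r"title:\s*\{\s*text:\s*'([^']*)'", snippet).group(1)
y_title = re.search(r"yAxis:\s*\{\s*title:\s*\{\s*text:\s*'([^']*)'", snippet).group(1)
print(title)    # Total Deaths
print(y_title)  # Total Coronavirus Deaths
```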