Tried Python BeautifulSoup and PhantomJS: Still Can't Scrape Websites

July 30, 2023

You may have seen my desperate frustrations over the past few weeks on here. I've been scraping some wait time data and am still unable to grab data from these two sites http://www

Solution 1:

The problem you're facing is that the elements are created by JS, and it might take some time to load them. You need a scraper which handles JS, and can wait until the required elements are created.
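
The "wait until the required elements are created" idea can be sketched independently of any browser library. The following is a minimal Python 3 illustration, not part of the original answer: `wait_until` and `FakePage` are hypothetical names, and `FakePage` merely stands in for a page whose content appears only after a delay. The digit strings are placeholders.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose elements are created late by JS."""

    def __init__(self, ready_after):
        self._ready_at = time.monotonic() + ready_after

    def find_digits(self):
        # Before the "script" has run, the elements simply do not exist yet.
        if time.monotonic() < self._ready_at:
            return []
        return ["21", "23"]  # placeholder values


page = FakePage(ready_after=0.3)
digits = wait_until(page.find_digits, timeout=2.0, interval=0.05)
print(digits)
```

A plain HTTP fetch corresponds to calling `find_digits()` once, immediately, which is why it comes back empty. A JS-capable scraper effectively performs this polling loop against the live DOM.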

You can use PyQt4. By adapting this recipe from webscraping.com and using an HTML parser like BeautifulSoup, this is pretty easy:

(After writing this, I found the webscraping library for Python. It might be worth a look.)

import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    """Load a URL in a WebKit page and block until loading has finished."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # run the Qt event loop until _loadFinished quits it

    def _loadFinished(self, result):
        # Keep a reference to the rendered frame, then stop the event loop
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://hcavirginia.com/home/'
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
# In Python 3.x, don't unicode the output from .toHtml():
# soup = BeautifulSoup(r.frame.toHtml())
# Convert the text inside each matching span to an integer
nums = [int(span.text) for span in soup.find_all('span', class_='ehc-er-digits')]
print nums

Output:

[21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
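
Once the rendered HTML is available, the extraction step itself is ordinary parsing. As a rough, dependency-free Python 3 sketch of that step (not the answer's own code, which uses BeautifulSoup), the standard-library `html.parser` can collect the integers. The `DigitSpanParser` class and the `rendered_html` snippet are made up for illustration; only the class name `ehc-er-digits` comes from the answer.

```python
from html.parser import HTMLParser


class DigitSpanParser(HTMLParser):
    """Collect integer text from <span> elements carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.values = []
        self._depth = 0   # > 0 while inside a matching span (counts nested spans)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside a matching one
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self._depth = 1
            self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.values.append(int("".join(self._buf).strip()))


# Hypothetical stand-in for the HTML returned after the page has rendered
rendered_html = (
    '<div><span class="ehc-er-digits">21</span> min '
    '<span class="other">x</span>'
    '<span class="ehc-er-digits"> 8 </span></div>'
)

parser = DigitSpanParser("ehc-er-digits")
parser.feed(rendered_html)
print(parser.values)
```

Feeding the same parser the unrendered page source would yield an empty list, because the matching spans are not in the markup until the JavaScript has run.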

This was my original answer, using ghost.py:

I managed to hack something together for you using ghost.py (tested on Python 2.7, ghost.py 0.1b3 and PyQt4-4 32-bit). I wouldn't recommend using this in production code, though!

from ghost import Ghost
from time import sleep

ghost = Ghost(wait_timeout=50, download_images=False)
page, extra_resources = ghost.open('http://hcavirginia.com/home/',
                                   headers={'User-Agent': 'Mozilla/4.0'})

# Halt execution of the script until a span.ehc-er-digits is found in
# the document
page, resources = ghost.wait_for_selector("span.ehc-er-digits")

# It should be possible to simply evaluate
# "document.getElementsByClassName('ehc-er-digits');" and extract the data from
# the returned dictionary, but I didn't quite understand the
# data structure - hence this inline javascript.
nums, resources = ghost.evaluate(
    """
    elems = document.getElementsByClassName('ehc-er-digits');
    nums = []
    for (i = 0; i < elems.length; ++i) {
        nums[i] = elems[i].innerHTML;
    }
    nums;
    """)

wt_data = [int(x) for x in nums]
print wt_data
sleep(30) # Sleep a while to avoid the crashing of the script. Weird issue!

Some comments:

As you can see from my comments, I didn't quite figure out the structure of the dict returned by ghost.evaluate("document.getElementsByClassName('ehc-er-digits');"). It's probably possible to find the information needed using such a query, though.
I also had some problems with the script crashing at the end. Sleeping for 30 seconds fixed the issue.
