Tried Python Beautifulsoup And Phantom Js: Still Can't Scrape Websites
Solution 1:
The problem you're facing is that the elements are created by JS, and it might take some time to load them. You need a scraper which handles JS, and can wait until the required elements are created.
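The "wait until the required elements are created" idea can be sketched as a simple polling loop. This is only an illustration under stated assumptions: `wait_for` and `FakePage` are hypothetical names that are not part of any library mentioned here, and `FakePage` merely stands in for a page whose content shows up after a few polls. It is written for Python 3.

```python
import time


def wait_for(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns the truthy value; raises TimeoutError if it never appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page that renders its content late."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def find_digits(self):
        self.polls += 1
        if self.polls < self.ready_after:
            return []  # elements not created by the script yet
        return ["21", "23", "47"]


page = FakePage(ready_after=3)
digits = wait_for(page.find_digits, timeout=2.0, interval=0.01)
print(digits)
```

With a real JS-capable browser object, the predicate would be whatever call looks up the target elements in the rendered document; a fixed `sleep` is the cruder alternative.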
You can use PyQt4. (After writing this, I found the webscraping library for Python. It might be worth a look.)
Adapting this recipe from webscraping.com and pairing it with an HTML parser like BeautifulSoup, this is pretty easy:
import sys

from bs4 import BeautifulSoup
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    """Load a URL in a QtWebKit page and block until loading has finished."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # run the Qt event loop until _loadFinished quits it

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://hcavirginia.com/home/'
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()), 'html.parser')
# In Python 3.x, don't unicode the output from .toHtml():
# soup = BeautifulSoup(r.frame.toHtml(), 'html.parser')
# Convert each span's text (not the Tag object itself) to an int.
nums = [int(span.text) for span in soup.find_all('span', class_='ehc-er-digits')]
print nums
Output:
[21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
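The extraction step can be tried on its own, without Qt, against a static HTML string. The sketch below deliberately uses the standard library's `html.parser` rather than BeautifulSoup so that it runs with no third-party packages (Python 3). `DigitSpanParser`, the sample markup, and the `er-wait` and `ehc-er-label` class names are made up for illustration; only `ehc-er-digits` comes from the answer above. Nested spans are not handled.

```python
from html.parser import HTMLParser


class DigitSpanParser(HTMLParser):
    """Collect the text of every <span class="ehc-er-digits"> element."""

    def __init__(self):
        super().__init__()
        self.in_target = False  # True while inside a matching span
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "ehc-er-digits" in classes:
            self.in_target = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.values[-1] += data


# Made-up sample markup standing in for the rendered page.
sample_html = """
<div class="er-wait">
  <span class="ehc-er-digits">21</span> min
  <span class="ehc-er-label">Hospital A</span>
  <span class="ehc-er-digits"> 8 </span> min
</div>
"""

parser = DigitSpanParser()
parser.feed(sample_html)
nums = [int(text) for text in parser.values]  # int() ignores surrounding spaces
print(nums)
```

On the real page, the string passed to `feed` would be the rendered HTML, for example the output of `r.frame.toHtml()`.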
This was my original answer, using ghost.py:
I managed to hack something together for you using ghost.py (tested on Python 2.7, ghost.py 0.1b3 and PyQt4-4 32-bit). I wouldn't recommend using this in production code, though!
from ghost import Ghost
from time import sleep

ghost = Ghost(wait_timeout=50, download_images=False)
page, extra_resources = ghost.open('http://hcavirginia.com/home/',
                                   headers={'User-Agent': 'Mozilla/4.0'})

# Halt execution of the script until a span.ehc-er-digits is found in
# the document
page, resources = ghost.wait_for_selector("span.ehc-er-digits")

# It should be possible to simply evaluate
# "document.getElementsByClassName('ehc-er-digits');" and extract the data from
# the returned dictionary, but I didn't quite understand the
# data structure - hence this inline javascript.
nums, resources = ghost.evaluate(
    """
    elems = document.getElementsByClassName('ehc-er-digits');
    nums = []
    for (i = 0; i < elems.length; ++i) {
        nums[i] = elems[i].innerHTML;
    }
    nums;
    """)

wt_data = [int(x) for x in nums]
print wt_data
sleep(30)  # Sleep a while to avoid the crashing of the script. Weird issue!
Some comments:
As you can see from my comments, I didn't quite figure out the structure of the dict returned by Ghost.evaluate("document.getElementsByClassName('ehc-er-digits');"). It's probably possible to find the information needed using such a query, though.
I also had some problems with the script crashing at the end. Sleeping for 30 seconds fixed the issue.