Scraping a JavaScript-enabled Website in 2020

As part of our effort to track data related to the Novel Coronavirus (COVID-19), we wanted to scrape a JavaScript-enabled website on Coronavirus from Hong Kong. To complicate matters, the site lazy loads its content based on scroll, so you have to scroll to the bottom to render the whole page; a simple dump of the DOM after load will not suffice. This challenge was a little more difficult than anticipated. This blog outlines our journey and our solution.

Generate HTML for a JavaScript-enabled Website

It was surprisingly hard to find good advice on the internet about the best way to generate HTML for a JavaScript-enabled website. You need some version of a browser to parse and run the JavaScript.

PhantomJS and CasperJS come up a lot, but both advertise themselves as deprecated or no longer actively maintained. Still, a simple PhantomJS script will produce HTML for you:

var fs = require("fs");
var webPage = require("webpage");
var page = webPage.create();

page.open("https://wars.vote4.hk/en/cases/", function (status) {
  // PhantomJS's fs module uses write(path, content, mode), not Node's writeFile.
  fs.write("page.html", page.content, "w");
  phantom.exit();
});

After installing PhantomJS and saving the above as download.js, simply running phantomjs download.js will produce HTML.

After a little more research, we found that the currently supported way to do this is to run a browser headless. I use Chrome, and it has a command-line interface for this:

chrome --headless --disable-gpu --dump-dom https://wars.vote4.hk/en/cases/ > page.html

Simple, once you find it. So, if you want to get the HTML for a JavaScript-enabled page, just use headless Chrome the same way you would use curl for plain HTML websites.

But what if you have to scroll?

After trying in vain to get PhantomJS to do what I wanted, I discovered Puppeteer. Puppeteer is a Chrome automation library developed by Google. Scrolling to the bottom of the page before dumping the DOM was accomplished like so:

const fs = require("fs");
const puppeteer = require("puppeteer");

let url = process.argv[2];
let file = process.argv[3];

if (!url || !file) {
  console.log("Must define a url and file as arguments");
  process.exit(1);
}

(async () => {
  // Set up browser and page.
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });

  console.log("Loading " + url);
  await page.goto(url);

  console.log("Scrolling to the bottom of the page. Please wait.");
  await autoScroll(page);

  // Save extracted items to a file.
  const html = await page.content();
  console.log("Writing HTML to " + file);
  fs.writeFileSync(file, html);

  // Close the browser.
  await browser.close();
})();
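
If you save this as, say, scrape.js (the file name is just an example), you can run it with node scrape.js https://wars.vote4.hk/en/cases/ page.html and the rendered HTML will land in page.html.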

async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve, reject) => {
      var totalHeight = 0;
      var distance = 100;
      var timer = setInterval(() => {
        var scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  });
}

There are a couple of cool things going on here. First, the browser is instantiated in headless mode by default, so if you decide to deploy this on a server somewhere, you don't need to mock out a display. Second, the script scrolls 100 pixels every 200 milliseconds. You can tune both numbers depending on how responsive the website is.
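
If you want to watch the scrolling happen while debugging locally, one small variation is to launch a visible browser; headless is a standard Puppeteer launch option, and the rest of the script stays the same:

const browser = await puppeteer.launch({
  headless: false, // show the browser window while debugging locally
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});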

For infinite-scroll websites, you can change the if condition totalHeight >= scrollHeight to something like totalHeight >= 10000. You have to stop somewhere.
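
As a rough sketch, here is the same scrolling loop with that cap added. The function name autoScrollCapped and the maxHeight parameter are ours; everything else mirrors the autoScroll above:

async function autoScrollCapped(page, maxHeight = 10000) {
  await page.evaluate(async (maxHeight) => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        // Stop at the bottom of the page or at the cap, whichever comes first.
        if (totalHeight >= scrollHeight || totalHeight >= maxHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  }, maxHeight);
}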

Deploying the job on a server

To run the above script, you need the following dependencies installed: Node/NPM, Puppeteer, and Chrome. Our environment uses Docker. Dependencies for installing Puppeteer can be found here. In our Dockerfile we added the following Chrome dependencies:

    gconf-service \
    libasound2 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libc6 \
    libcairo2 \
    libcups2 \
    libdbus-1-3 \
    libexpat1 \
    libfontconfig1 \
    libgcc1 \
    libgconf-2-4 \
    libgdk-pixbuf2.0-0 \
    libglib2.0-0 \
    libgtk-3-0 \
    libnspr4 \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libstdc++6 \
    libx11-6 \
    libx11-xcb1 \
    libxcb1 \
    libxcomposite1 \
    libxcursor1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxi6 \
    libxrandr2 \
    libxrender1 \
    libxss1 \
    libxtst6 \
    ca-certificates \
    fonts-liberation \
    libappindicator1 \
    libnss3 \
    lsb-release \
    xdg-utils \
    wget \

Then, before we run the script, we run npm install puppeteer. By default, installing Puppeteer also downloads a compatible build of Chromium, so the browser itself comes along as a dependency.
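
To give a sense of how these pieces fit together, here is a rough Dockerfile sketch. The base image, file names, and exact apt-get invocation are assumptions on our part; the package list is the one shown above, abbreviated here:

FROM node:12-slim

# Install the Chrome shared-library dependencies listed above (abbreviated).
RUN apt-get update && apt-get install -y --no-install-recommends \
    gconf-service \
    libasound2 \
    # ...the rest of the package list above... \
    xdg-utils \
    wget \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY scrape.js .

# npm install puppeteer also downloads a compatible Chromium build.
RUN npm install puppeteer

CMD ["node", "scrape.js", "https://wars.vote4.hk/en/cases/", "page.html"]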

The Final Product

For this workflow, we wrote a Perl script that defines an Airflow job to clone dolthub/corona-virus, scrape the Hong Kong data using the JavaScript snippet above, extract the data we need, insert it into Dolt, and push the updated repository back to DoltHub.

We think this summary will be helpful for anyone else trying to scrape JavaScript-enabled websites. It took us a couple of days to arrive at what we think is the canonical solution, and we hope we've saved you some time.
