Taking screenshots of a web page programmatically is essential in a testing environment, where it lets you check how the page actually renders. The same technique is also useful for automation, such as having a screenshot of a news website delivered to your inbox every morning or generating a report of candidates' GitHub activity. This was not possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them. Even once such JavaScript libraries were available, R programmers had no way to integrate this functionality into their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with PhantomJS running in the backend.
What is PhantomJS?
PhantomJS is a headless WebKit browser scriptable with a JavaScript API. It has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.
PhantomJS is well suited to the following:
- Headless website testing
- Screen Capture
- Page Automation
- Network Monitoring
Webshot: R Package
The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It can also take screenshots of R Shiny apps and R Markdown documents (both static and interactive).
Install and Load Package
The stable version of webshot is available on CRAN and can be installed and loaded using the code below:
install.packages('webshot')
library('webshot')
The latest development version of webshot is hosted on GitHub and can be installed using the code below:
#install.packages('devtools')
devtools::install_github('wch/webshot')
Initial Setup
As we saw above, the R package webshot works with PhantomJS in the backend, so PhantomJS must be installed on the local machine where the webshot package is used. To assist with that, webshot itself provides a function that installs PhantomJS on your machine.
webshot::install_phantomjs()
The above function automatically downloads PhantomJS from its website and installs it. Note that this is a one-time setup: once both webshot and PhantomJS are installed, these two steps can be skipped when using the package as described in the sections below.
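A setup script can avoid downloading PhantomJS again on every run by checking for it first. The following is a minimal sketch under some assumptions: `phantomjs_available()` and `ensure_phantomjs()` are hypothetical helper names, and the check assumes a webshot release that provides `is_phantomjs_installed()` (older releases may not), falling back to a simple PATH lookup otherwise.

```r
# Hypothetical helper: report whether PhantomJS appears to be available.
# Assumption: recent webshot releases provide is_phantomjs_installed();
# if it is absent, fall back to looking for a 'phantomjs' binary on the PATH.
phantomjs_available <- function() {
  if (!requireNamespace("webshot", quietly = TRUE)) {
    return(FALSE)
  }
  checker <- get0("is_phantomjs_installed",
                  envir = asNamespace("webshot"), inherits = FALSE)
  if (is.function(checker)) {
    return(isTRUE(checker()))
  }
  nzchar(Sys.which("phantomjs"))[[1]]
}

# Hypothetical helper: install PhantomJS only when it seems to be missing.
ensure_phantomjs <- function() {
  if (!requireNamespace("webshot", quietly = TRUE)) {
    stop("Install the webshot package first: install.packages('webshot')")
  }
  if (!phantomjs_available()) {
    webshot::install_phantomjs()
  }
  invisible(phantomjs_available())
}
```

Calling `ensure_phantomjs()` at the top of an automation script would then keep the one-time setup out of the way on later runs.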
The webshot package is now installed, set up, and ready to use. To start with, let us take a PDF copy of a web page.
Screenshot Function
The webshot package provides one simple function, webshot(), which takes a web page URL as its first argument and saves the screenshot to the file name given as its second argument. Note that the file name must include a file extension such as '.jpg', '.png', or '.pdf', which determines the format in which the output is rendered. The basic structure of the call is shown below:
library(webshot)
#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")
If no folder path is given along with the file name, the file is saved in the current working directory, which can be checked with getwd().
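To save screenshots somewhere other than the working directory, the output path can be built up front. This is a small sketch in base R; `screenshot_path()` is a hypothetical helper, the "screenshots" folder name is an arbitrary choice, and the allowed extensions are simply the ones mentioned above.

```r
# Hypothetical helper: build an output path inside a chosen folder,
# creating the folder if it does not exist yet.
screenshot_path <- function(dir, name, ext = "png") {
  # Restrict to the extensions mentioned in this post
  ext <- match.arg(ext, c("png", "jpg", "pdf"))
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  file.path(dir, paste0(name, ".", ext))
}

# Example: a PNG inside a 'screenshots' folder under the session temp directory
out_file <- screenshot_path(file.path(tempdir(), "screenshots"), "listendata")
basename(out_file)

# The resulting path can then be passed as the second argument, e.g.:
# webshot::webshot("https://www.listendata.com/", out_file)
```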
Now that we understand the basics of the webshot() function, it is time to begin with our cases, starting with saving a web page as a PDF copy.
Case #1: PDF Copy of WebPage
Let us assume we would like to download Bill Gates' notes on the Best Books of 2017 as a PDF copy.
#loading the required library
library(webshot)
#PDF copy of a web page / article
webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
"billgates_book.pdf",
delay = 2)
The above code generates a PDF whose (partial) screenshot is below:
Snapshot of PDF Copy
Dissecting the above code, we can see that the webshot() function is supplied with three arguments:
- URL from which the screenshot has to be taken.
- Output file name along with its file extension.
- delay: time to wait before taking the screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.
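The three arguments listed above can also be passed by name, which makes the call easier to read in a longer script. The wrapper below is only a sketch: `save_page_copy()` is a hypothetical name, and its `shot` parameter (defaulting to `webshot::webshot`) exists solely so the wrapper can be tried without PhantomJS or a network connection.

```r
# Hypothetical wrapper: the same call as above, with every argument named.
# 'shot' defaults to webshot::webshot; it is a parameter only so that the
# wrapper can be exercised with a stand-in function.
save_page_copy <- function(url, file, delay = 2, shot = webshot::webshot) {
  stopifnot(is.character(url), is.character(file),
            is.numeric(delay), delay >= 0)
  shot(url = url, file = file, delay = delay)
}

# Usage (requires webshot and PhantomJS):
# save_page_copy("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
#                "billgates_book.pdf", delay = 2)
```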
Case #2: Webpage Screenshot (Viewport Size)
Now, I'd like an automation script that takes a screenshot of a news website and perhaps sends it to my inbox, so that I can see the headlines without opening the browser. Here we will see how to get a simple screenshot of livemint.com, an Indian news website.
#Screenshot of Viewport
webshot('https://www.livemint.com/','livemint.png', cliprect = 'viewport')
While the first two arguments are the same as in the previous call, there is a new third argument, cliprect, which specifies the clipping rectangle.
If cliprect is unspecified, a screenshot of the complete web page is taken (as in the previous case). Since we are interested only in the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips the screenshot to the viewport of the browser, as below.
Screenshot of Viewport of Browser
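The size of the viewport itself can be adjusted through the vwidth and vheight arguments of webshot(). The sketch below only assembles the argument list; `viewport_args()` is a hypothetical helper, and the 1280 x 800 size and the file name livemint_wide.png are arbitrary choices for illustration.

```r
# Hypothetical helper: collect the arguments for a viewport-sized screenshot.
# vwidth and vheight set the viewport size in pixels; cliprect = "viewport"
# clips the image to that area.
viewport_args <- function(url, file, width = 1280, height = 800) {
  list(url = url, file = file,
       vwidth = width, vheight = height,
       cliprect = "viewport")
}

shot_args <- viewport_args("https://www.livemint.com/", "livemint_wide.png")
str(shot_args)

# To take the screenshot (requires webshot and PhantomJS):
# do.call(webshot::webshot, shot_args)
```

Besides the string 'viewport', cliprect also accepts a four-element numeric vector giving the top, left, width, and height of the rectangle, which may help when only a fixed strip of the page is of interest.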
Case #3: Multiple Selector Based Screenshots
So far we have taken simple screenshots of whole pages, dealing with one screenshot and one file at a time, but that is rarely what happens in automation. In most cases we end up performing more than one action, so this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of different URLs (which is quite straightforward), we will take screenshots of different sections of the same web page, each identified by its own CSS selector, and save them in separate files.
#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
file = c("organizations.png","contributions.png"),
selector = list("div.border-top.py-3.clearfix","div.js-contribution-graph"))
In the above code, we take screenshots of two CSS selectors on the GitHub profile page of Hadley Wickham and save them in two PNG files: organizations.png and contributions.png.
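When the number of sections grows, it can be convenient to keep file names and selectors together in one named vector so they cannot drift out of order. The following is a sketch along those lines: `shot_sections()` is a hypothetical helper, and its `shot` parameter (defaulting to `webshot::webshot`) is there only so the helper can be tried with a stand-in function.

```r
# Hypothetical helper: one output file per CSS selector, with each file
# named after the corresponding name in the 'selectors' vector.
shot_sections <- function(url, selectors, ext = "png",
                          shot = webshot::webshot) {
  stopifnot(!is.null(names(selectors)), all(nzchar(names(selectors))))
  files <- paste0(names(selectors), ".", ext)
  # A list with one selector per element yields one screenshot per file
  shot(url, file = files, selector = unname(as.list(selectors)))
}

# The same two selectors as in the call above
sections <- c(
  organizations = "div.border-top.py-3.clearfix",
  contributions = "div.js-contribution-graph"
)

# Usage (requires webshot and PhantomJS):
# shot_sections("https://github.com/hadley", sections)
```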
Thus, we have seen how to use the R package webshot to take screenshots programmatically in R. I hope this post helps fuel your automation needs and helps your organisation improve its efficiency.
Contributions.png
Organizations.png