What is Selenium?
You may not be familiar with Selenium, so some background is useful. Selenium is an open-source tool for automating web browsers and is very popular in the testing domain. It lets you write test scripts in several programming languages, and bindings are available for both R and Python.
Translate Page in Web Scraping in R and Python
In R there is a package named RSelenium, whereas in Python you install the selenium package. The following table lists the languages Chrome supports along with their codes. You need these codes to tell Chrome which language to translate the web page from and which language to translate it into.
Name | Code |
---|---|
Amharic | am |
Arabic | ar |
Basque | eu |
Bengali | bn |
English (UK) | en-GB |
Portuguese (Brazil) | pt-BR |
Bulgarian | bg |
Catalan | ca |
Cherokee | chr |
Croatian | hr |
Czech | cs |
Danish | da |
Dutch | nl |
English (US) | en |
Estonian | et |
Filipino | fil |
Finnish | fi |
French | fr |
German | de |
Greek | el |
Gujarati | gu |
Hebrew | iw |
Hindi | hi |
Hungarian | hu |
Icelandic | is |
Indonesian | id |
Italian | it |
Japanese | ja |
Kannada | kn |
Korean | ko |
Latvian | lv |
Lithuanian | lt |
Malay | ms |
Malayalam | ml |
Marathi | mr |
Norwegian | no |
Polish | pl |
Portuguese (Portugal) | pt-PT |
Romanian | ro |
Russian | ru |
Serbian | sr |
Chinese (PRC) | zh-CN |
Slovak | sk |
Slovenian | sl |
Spanish | es |
Swahili | sw |
Swedish | sv |
Tamil | ta |
Telugu | te |
Thai | th |
Chinese (Taiwan) | zh-TW |
Turkish | tr |
Urdu | ur |
Ukrainian | uk |
Vietnamese | vi |
Welsh | cy |
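Both programs later in this post pass Chrome a pair of these codes through its translation preferences. The following is a minimal Python sketch of how that preference dictionary could be built for any language pair. `build_translate_prefs` and `LANGUAGE_CODES` are hypothetical names introduced for illustration, not part of Selenium, and the dictionary holds only a handful of the codes from the table.

```python
# Hypothetical helper, not part of Selenium: builds the Chrome "prefs"
# dictionary that requests automatic translation of a page.

# A few codes copied from the table above; extend as needed.
LANGUAGE_CODES = {
    "English (US)": "en",
    "French": "fr",
    "German": "de",
    "Russian": "ru",
    "Spanish": "es",
}


def build_translate_prefs(source: str, target: str) -> dict:
    """Return Chrome prefs asking to translate `source` pages into `target`."""
    known = set(LANGUAGE_CODES.values())
    for code in (source, target):
        if code not in known:
            raise ValueError(f"Unknown language code: {code!r}")
    if source == target:
        raise ValueError("Source and target languages must differ")
    return {
        # Maps the page language to the language it should be translated into
        "translate_whitelists": {source: target},
        # Same string value as used in the programs in this post
        "translate": {"enabled": "true"},
    }


# Russian to English, the pair used in this post
prefs = build_translate_prefs("ru", "en")
print(prefs)
```

Validating the codes up front is a design choice in this sketch: a mistyped code then fails immediately with a clear error before the browser is started.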
The programs below translate the following Russian-language page into English: http://premier.gov.ru/events/. At the end, we take a screenshot of the webpage.
R Code
You need to install Docker before running the code below. On the Docker website, go to Products and download Docker Desktop.
Once it is downloaded and installed, run the code below.
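
    library(RSelenium)

    # Start a standalone Chrome Selenium server in Docker
    # (host port 4445 is mapped to container port 4444).
    # shell() is Windows-only; use system() on macOS or Linux.
    shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
    Sys.sleep(5)  # give the container a few seconds to start

    # Chrome options: translate Russian ('ru') pages into English ('en')
    eCaps <- list(chromeOptions = list(
      args = c('--disable-gpu', '--window-size=1920,1080', '--lang=en'),
      prefs = list(
        translate_whitelists = list('ru' = 'en'),
        translate = list('enabled' = 'true')
      )
    ))

    # Connect to the Selenium server running in the container
    remDr <- RSelenium::remoteDriver(
      remoteServerAddr = "localhost",
      port = 4445L,
      browserName = "chrome",
      extraCapabilities = eCaps
    )
    remDr$open(silent = TRUE)
    remDr$navigate("http://premier.gov.ru/events/")
    remDr$screenshot(display = TRUE)  # take and display a screenshot
    remDr$close()

In the above program I am translating from Russian (ru) to English (en).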
Python Code
Make sure to install ChromeDriver before running the following Python program. Check the version of the Chrome browser installed on your machine and download the matching driver. In the code below, specify the location of the downloaded ChromeDriver file.

    from PIL import Image  # Pillow, used only to display the screenshot
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Chrome preferences: translate Russian ("ru") pages into English ("en")
    myoptions = webdriver.ChromeOptions()
    prefs = {
        "translate_whitelists": {"ru": "en"},
        "translate": {"enabled": "true"},
    }
    myoptions.add_experimental_option("prefs", prefs)

    # Path to the downloaded chromedriver; change it to match your machine.
    # Selenium 4 takes the driver path through a Service object.
    service = Service('C:/Users/dbhalla/Downloads/chromedriver_win32/chromedriver')
    d = webdriver.Chrome(service=service, options=myoptions)
    d.get('http://premier.gov.ru/events/')

    # Take screenshot
    d.save_screenshot("image.png")

    # Load and show the image
    image = Image.open("image.png")
    image.show()

    # Close the browser and end the session
    d.quit()

To disable images and CSS styles while web scraping, you can add the options below to prefs (shown here in R syntax):

    prefs = list(
      'profile.managed_default_content_settings.images' = 2,
      'profile.managed_default_content_settings.stylesheet' = 2,
      'profile.managed_default_content_settings.css' = 2
    )

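For the Python program, the same options can be expressed as a dictionary. The sketch below is a minimal illustration under stated assumptions: `build_scraping_prefs` is a hypothetical helper name, and the preference keys are taken as-is from the R list above. It merges the blocking options into an existing preference dictionary so that the translation settings are kept.

```python
# Hypothetical helper, not part of Selenium: adds content-blocking
# preferences on top of an existing Chrome "prefs" dictionary.

BLOCK = 2  # value used in the R list above to block a content type


def build_scraping_prefs(base_prefs=None, block_images=True, block_css=True):
    """Return a new prefs dict with the requested blocking options added."""
    prefs = dict(base_prefs or {})  # copy, so the caller's dict is not modified
    if block_images:
        prefs["profile.managed_default_content_settings.images"] = BLOCK
    if block_css:
        # Key names copied unchanged from the R list above
        prefs["profile.managed_default_content_settings.stylesheet"] = BLOCK
        prefs["profile.managed_default_content_settings.css"] = BLOCK
    return prefs


# Translation settings from the Python program, plus the blocking options
translate_prefs = {
    "translate_whitelists": {"ru": "en"},
    "translate": {"enabled": "true"},
}
prefs = build_scraping_prefs(translate_prefs)
for key, value in prefs.items():
    print(key, "=", value)
```

The resulting dictionary can be passed to Chrome in the same way as in the Python program, with `myoptions.add_experimental_option("prefs", prefs)`.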
Good
This is very helpful. But do you have any idea why it doesn't work when we set options.headless = True? Or does it work only when headless is false?
Will the scraper have performance constraints when dealing with larger tasks? Or would a spider with item pipelines be a better-performing choice?
Is there a translation confidence estimate that we can refer to?