Suppose you need to scrape data from a website after translating the web page, in R and Python. Google Chrome has a built-in option to translate pages written in a foreign language. If you only speak English and want to extract data from a website that offers no option to switch its language to English, this article shows you how to translate the web page before scraping it.
We perform three main actions in the code below. First, we load Selenium and specify the languages for translation. Then we open a blank page in Chrome and navigate to the URL from which we want to extract data. Finally, we take a screenshot of the translated page.
What is Selenium?
You may not be familiar with Selenium, so some background is useful. Selenium is an open-source tool, very popular in the testing domain, used for automating web browsers. It allows you to write test scripts in several programming languages, and it is available in both R and Python.
In R, the relevant package is named RSelenium, whereas in Python you install the selenium package. Following is a list of the languages Chrome supports along with their codes. You need these codes to tell Chrome which language to translate the web page from and which language to translate it into.

Name | Code |
---|---|
Amharic | am |
Arabic | ar |
Basque | eu |
Bengali | bn |
English (UK) | en-GB |
Portuguese (Brazil) | pt-BR |
Bulgarian | bg |
Catalan | ca |
Cherokee | chr |
Croatian | hr |
Czech | cs |
Danish | da |
Dutch | nl |
English (US) | en |
Estonian | et |
Filipino | fil |
Finnish | fi |
French | fr |
German | de |
Greek | el |
Gujarati | gu |
Hebrew | iw |
Hindi | hi |
Hungarian | hu |
Icelandic | is |
Indonesian | id |
Italian | it |
Japanese | ja |
Kannada | kn |
Korean | ko |
Latvian | lv |
Lithuanian | lt |
Malay | ms |
Malayalam | ml |
Marathi | mr |
Norwegian | no |
Polish | pl |
Portuguese (Portugal) | pt-PT |
Romanian | ro |
Russian | ru |
Serbian | sr |
Chinese (PRC) | zh-CN |
Slovak | sk |
Slovenian | sl |
Spanish | es |
Swahili | sw |
Swedish | sv |
Tamil | ta |
Telugu | te |
Thai | th |
Chinese (Taiwan) | zh-TW |
Turkish | tr |
Urdu | ur |
Ukrainian | uk |
Vietnamese | vi |
Welsh | cy |
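If the site you are scraping is in a different language, swap the corresponding code from the table above into the whitelist. The sketch below is a minimal Python illustration, assuming the selenium package is installed; Japanese (ja) is only an example choice, and the full driver setup is shown in the Python section further down.

from selenium import webdriver

# Minimal sketch: map a source-language code from the table above to English.
# 'ja' (Japanese) is only an illustrative choice, not tied to the example site.
opts = webdriver.ChromeOptions()
opts.add_experimental_option("prefs", {
    "translate_whitelists": {"ja": "en"},   # translate Japanese pages to English
    "translate": {"enabled": "true"}
})
# The rest of the driver setup is identical to the Python program shown below.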
In both the R and Python examples below, the URL we extract data from is http://premier.gov.ru/events/, and the final step takes a screenshot of the translated webpage.
R Code
You need to install Docker before running the code below. Go to the Products section of the Docker website and download Docker Desktop. Once it is downloaded and installed, follow the code below.
library(RSelenium)

# Start a standalone Chrome container and expose Selenium on port 4445
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')

# Chrome options: whitelist translation from Russian (ru) to English (en)
eCaps <- list(chromeOptions = list(
  args = c('--disable-gpu', '--window-size=1920,1080', '--lang=en'),
  prefs = list(translate_whitelists = list('ru' = 'en'),
               translate = list('enabled' = 'true'))
))

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost", port = 4445L,
                                 browserName = "chrome", extraCapabilities = eCaps)
remDr$open(silent = TRUE)
remDr$navigate("http://premier.gov.ru/events/")
remDr$screenshot(display = TRUE)
remDr$close()

In the above program, I am translating from Russian (ru) to English (en).
Python Code
Make sure to install ChromeDriver before running the following Python program. Check the version of the Chrome browser installed on your machine and download the matching driver. In the code below, specify the file location where ChromeDriver is installed.

from selenium import webdriver

# Chrome preferences: whitelist translation from Russian (ru) to English (en)
myoptions = webdriver.ChromeOptions()
prefs = {
    "translate_whitelists": {"ru": "en"},
    "translate": {"enabled": "true"}
}
myoptions.add_experimental_option("prefs", prefs)

d = webdriver.Chrome('C:/Users/dbhalla/Downloads/chromedriver_win32/chromedriver', options=myoptions)
d.get('http://premier.gov.ru/events/')

# Take screenshot
d.save_screenshot("image.png")

# Loading the image
from PIL import Image
image = Image.open("image.png")

# Showing the image
image.show()

# Close the session
d.close()
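Since the goal is to scrape data after translation, you will usually want the rendered HTML rather than a screenshot. The sketch below is one way to do this, assuming the beautifulsoup4 package is installed; run it before d.close() in the program above. The fixed sleep is only a rough wait for Chrome to finish translating, not a guaranteed synchronisation method.

import time
from bs4 import BeautifulSoup

time.sleep(5)                     # rough wait so Chrome can finish translating the page
html = d.page_source              # HTML as currently rendered in the browser
soup = BeautifulSoup(html, "html.parser")

# Example: print the text of every link on the translated page
for link in soup.find_all("a"):
    print(link.get_text(strip=True))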
To disable images and CSS styles during web scraping, you can use the options below inside prefs (R syntax):

prefs = list(
  'profile.managed_default_content_settings.images' = 2,
  'profile.managed_default_content_settings.stylesheet' = 2,
  'profile.managed_default_content_settings.css' = 2
)
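For the Python program above, a rough equivalent is to pass the same keys through add_experimental_option, merged with the translation prefs if you need both. This is only a sketch; the keys simply mirror the R example above.

# Python sketch: mirror of the R prefs above for disabling images and CSS
prefs = {
    "profile.managed_default_content_settings.images": 2,
    "profile.managed_default_content_settings.stylesheet": 2,
    "profile.managed_default_content_settings.css": 2
}
myoptions.add_experimental_option("prefs", prefs)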