Web Scraping a Website with R

In this tutorial, we will cover how to extract information from a matrimonial website using R. We will perform web scraping, the process of converting unstructured data on a website into a structured format that can be used for further analysis.

We will use an R package called rvest, created by Hadley Wickham, which simplifies the process of scraping web pages.

Install the required packages

To download and install the rvest package, run the following command. We will also use dplyr which is useful for data manipulation tasks.
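The install commands (run once per machine):

```r
# Install rvest for scraping and dplyr for data manipulation
install.packages("rvest")
install.packages("dplyr")
```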

Load the required Libraries

To load these libraries into your R session, run the program below.
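Loading both libraries for the current session:

```r
# Load the libraries into the current R session
library(rvest)
library(dplyr)
```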

Scrape Information from Matrimonial Website

First we need to understand the structure of the URL. See the URLs below.

https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the webpage showing girls' profiles of the Punjabi community, whereas the second URL shows boys' profiles of the Punjabi community.

We need to split the main URL into different elements so that we can build it programmatically.
Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
The following R code prepares the main URL. In the code, you need to provide the following details -
  1. Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
  2. Select Mother Tongue. For example, punjabi, tamil etc.
# Looking for bride/groom
Bride_Groom = "bride"
# Possible Values : bride, groom

# Select Mother Tongue
Mother_Tongue = "punjabi"
# Possible Values
# punjabi
# tamil
# bengali
# telugu
# kannada
# marathi

if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}
See the output :
[1] "https://www.jeevansathi.com/punjabi-brides-girls"

Extract Profile IDs

First you need to select parts of an HTML document using CSS selectors via html_nodes(). Use SelectorGadget, a free Chrome extension. It is the easiest and quickest way to find out which selector pulls the data you are interested in.

How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). It will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)
1  ZARX0345
2  ZZWX5573
3  ZWVT2173
4  ZAYZ6100
5  ZYTS6885
6  ZXYV9849
7   TRZ8475
8   VSA7284
9  ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18  WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. Explanations of these functions are listed below -
  1. read_html() : creates an html document from a URL
  2. html_nodes() : extracts pieces out of HTML documents
  3. html_nodes(".class") : selects nodes based on CSS class
  4. html_nodes("#class") : selects nodes based on HTML id attribute
  5. html_text() : extracts only the text from an HTML tag
  6. html_attr() : extracts the contents of a single attribute
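As a quick offline illustration of these functions, the sketch below uses a made-up HTML snippet (not the matrimonial site) passed directly to read_html():

```r
library(rvest)

# A tiny HTML document defined inline, purely for demonstration
doc = read_html('<div class="profile"><a href="/p/1">Profile One</a></div>
                 <div class="profile"><a href="/p/2">Profile Two</a></div>')

links = doc %>% html_nodes(".profile a")   # select anchors inside class "profile"
html_text(links)                           # "Profile One" "Profile Two"
html_attr(links, "href")                   # "/p/1" "/p/2"
```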

Difference between .class and #class

1. .class targets the following element:
<div class="class"></div>

2. #class targets the following element:
<div id="class"></div>
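A small offline check of the difference, again with a throwaway HTML snippet of our own:

```r
library(rvest)

doc = read_html('<div class="box">by class</div><div id="box">by id</div>')

doc %>% html_nodes(".box") %>% html_text()   # matches class="box" -> "by class"
doc %>% html_nodes("#box") %>% html_text()   # matches id="box"    -> "by id"
```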

How to find the HTML/CSS code of a website

Perform the steps below -
  1. On Google Chrome, right click and select the "Inspect" option, or use the shortcut Ctrl + Shift + I.
  2. Select a particular section of the website.
  3. Press Ctrl + Shift + C to inspect a particular element.
  4. See the selected code under the "Elements" section.

Inspect element

Get Detailed Information of Profiles

The program below performs the following tasks -
  1. Loop through profile IDs
  2. Pull information about Age, Height, Qualification etc.
  3. Extract details about appearance
  4. Fetch 'About Me' section of profiles
# Get Detailed Information
finaldf = data.frame()
for (i in 1:length(profileIDs$ID)) {
  ID = profileIDs[i, 1]
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
  page = read_html(link)

  # Age, Height, Qualification etc.
  FormattedInfo = data.frame(t(page %>% html_nodes(".textTru li") %>% html_text()))

  # Final Table
  FormattedInfo = data.frame(ProfileID = ID,
                             Description = page %>%
                               html_nodes("#myinfoView") %>%
                               html_text(),
                             Appearance = page %>%
                               html_nodes("#section-lifestyle #appearanceView") %>%
                               html_text(),
                             FormattedInfo)

  finaldf = bind_rows(finaldf, FormattedInfo)
}

# Assign Variable Names
names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")

Download Display Pic

To download a display pic, you first need to fetch the image URL of the profile and then use the download.file() function to download it. In the script below, you need to provide a profile ID.
# Download Profile Pic of a particular Profile
text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
# File saved as match.jpg

We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or copy the content from the website.
Other Functions of rvest
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). Refer to the case study below -

You can collect Google search results by submitting the Google search form with a search term. Here, I entered the search term 'Datascience'.
url       = "http://www.google.com"
pgsession = html_session(url)           
pgform    = html_form(pgsession)[[1]]

# Set search term
filled_form = set_values(pgform, q="Datascience")
session = submit_form(pgsession,filled_form)

# look for headings of first page
session %>% html_nodes(".g .r a") %>% html_text()
 [1] "Data science - Wikipedia"                                          
 [2] "Data Science Courses | Coursera"                                   
 [3] "Data Science | edX"                                                
 [4] "Data science - Wikipedia"                                          
 [5] "DataScience.com | Enterprise Data Science Platform Provider"       
 [6] "Top Data Science Courses Online - Updated February 2018 - Udemy"   
 [7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"        
 [8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
 [9] "Online Data Science Courses | Microsoft Professional Program"      
[10] "News for Datascience"                                              
[11] "Data Science Course - Cognitive Class"    

Important Points related to Web Scraping
Please make sure of the following points -
  1. Use the website's API rather than web scraping wherever one is available.
  2. Too many requests from a single IP address might result in the IP address being blocked. Do not scrape more than 8 keyword requests on Google.
  3. Do not use web scraping for commercial purposes.
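One simple way to keep the request rate low in the profile loop above is to pause between fetches. The helper name below is our own, not part of rvest:

```r
library(rvest)

# Hypothetical helper: fetch a page, then pause before the next request
# so the server is not hammered with back-to-back calls
polite_read_html = function(url, delay = 2) {
  page = read_html(url)
  Sys.sleep(delay)   # wait `delay` seconds before returning
  page
}
```

Inside the scraping loop you would call polite_read_html(link) instead of read_html(link).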


About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like banking, Telecom, HR and Health Insurance.

