Web Scraping Website with R

Deepanshu Bhalla
In this tutorial, we will cover how to extract information from a matrimonial website using R. We will do web scraping, which is the process of converting data available on a website in unstructured format into a structured format that can be used for further analysis.

We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.
Web Scraping in R

Install the required packages

To download and install the rvest package, run the commands below. We will also install dplyr, which is useful for data manipulation tasks.
install.packages("rvest")
install.packages("dplyr")

Load the required Libraries

To load the libraries, run the code below.
library(rvest)
library(dplyr)

Scrape Information from Matrimonial Website

First we need to understand the structure of the URL. See the URLs below.
https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the webpage showing girls' profiles from the Punjabi community, whereas the second URL shows boys' profiles from the Punjabi community.

We need to split the main URL into different elements so that we can build it programmatically.
Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
Check out the following R code to see how to prepare the main URL. In the code, you need to provide the following details -
  1. Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
  2. Select Mother Tongue. For example, punjabi, tamil etc.
# Looking for bride/groom
Bride_Groom = "bride"
# Possible Values : bride, groom

# Select Mother Tongue
Mother_Tongue = "punjabi"
# Possible Values
# punjabi
# tamil
# bengali
# telugu
# kannada
# marathi

# URL
if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}
See the output:
[1] "https://www.jeevansathi.com/punjabi-brides-girls"

Extract Profile IDs

First you need to select parts of an HTML document using CSS selectors with html_nodes(). Use SelectorGadget, a free Chrome extension. It is the easiest and quickest way to find out which selector pulls the data that you are interested in.

How to use SelectorGadget: Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element and highlight in yellow everything that is matched by the selector.
# Extract profile IDs using the CSS selector found with SelectorGadget
text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)
         ID
1  ZARX0345
2  ZZWX5573
3  ZWVT2173
4  ZAYZ6100
5  ZYTS6885
6  ZXYV9849
7   TRZ8475
8   VSA7284
9  ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18  WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. They are explained below, with a short example after the list -
  1. read_html() : creates an HTML document from a URL
  2. html_nodes() : extracts pieces out of HTML documents
  3. html_nodes(".class") : selects nodes based on a CSS class
  4. html_nodes("#class") : selects nodes based on an id attribute (e.g. of a <div> or <span>)
  5. html_text() : extracts only the text from an HTML tag
  6. html_attr() : extracts the contents of a single attribute
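To see these functions together on something small, here is a minimal sketch that parses a made-up HTML snippet from a string instead of a URL. The class name and link below are invented purely for illustration.
# Toy HTML snippet (made up for illustration) parsed straight from a string
library(rvest)

page = read_html('<div class="profile"><a href="/p/ZARX0345">ZARX0345</a></div>')

page %>% html_nodes(".profile a") %>% html_text()        # extracts the link text: "ZARX0345"
page %>% html_nodes(".profile a") %>% html_attr("href")  # extracts the href attribute: "/p/ZARX0345"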

Difference between .class and #class

1. .class targets the following element:
<div class="class"></div>

2. #class targets the following element:
<div id="class"></div>
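A small hedged example of the difference, assuming rvest is already loaded. The snippet below is invented for illustration only.
# Two elements: one carries class="intro", the other id="intro"
doc = read_html('<div class="intro">matched by .intro</div><div id="intro">matched by #intro</div>')

doc %>% html_nodes(".intro") %>% html_text()   # selects the element with class="intro"
doc %>% html_nodes("#intro") %>% html_text()   # selects the element with id="intro"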

How to find the HTML/CSS code of a website

Perform the steps below -
  1. In Google Chrome, right click and select the "Inspect" option, or use the shortcut Ctrl + Shift + I.
  2. Select a particular section of the website.
  3. Press Ctrl + Shift + C to inspect a particular element.
  4. See the selected code under "Elements" section.

Inspect element

Extract attribute information

You can fetch the value of an HTML attribute by using the html_attr() function. In the code below we pull the src attribute of the img tag.
 read_html("https://timesofindia.indiatimes.com/") %>% 
  html_nodes(".main-sprite img") %>% 
  html_attr("src")

Get Detailed Information of Profiles

The program below performs the following tasks -
  1. Loop through profile IDs
  2. Pull information about Age, Height, Qualification etc.
  3. Extract details about appearance
  4. Fetch 'About Me' section of profiles
# Get Detailed Information
finaldf = data.frame()
for (i in 1:length(profileIDs$ID)) {
  ID   = profileIDs[i, 1]
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)

  # Age, height, qualification etc. as one row per profile
  FormattedInfo = data.frame(t(read_html(link) %>% html_nodes(".textTru li") %>%
                                 html_text()))

  # Final Table
  FormattedInfo = data.frame(ProfileID = ID,
                             Description = read_html(link) %>%
                               html_nodes("#myinfoView") %>%
                               html_text(),
                             Appearance = read_html(link) %>%
                               html_nodes("#section-lifestyle #appearanceView") %>%
                               html_text(),
                             FormattedInfo)

  finaldf = bind_rows(finaldf, FormattedInfo)
}

# Assign Variable Names
names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")
Web Scraping Output
Web Scraping Output Part II
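Profile pages can be removed between the time you collect the IDs and the time you request the details, and a single failed request would stop the loop above. Below is a minimal, hedged sketch of a more defensive loop body; it reuses the same link and selectors as above, and tryCatch() and Sys.sleep() are base R functions.
# Defensive variant of the loop above: skip profiles that fail to load
for (i in 1:nrow(profileIDs)) {
  ID   = profileIDs[i, 1]
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
  page = tryCatch(read_html(link), error = function(e) NULL)
  if (is.null(page)) next      # profile removed or request failed - move on
  # ... build FormattedInfo from 'page' exactly as in the loop above ...
  Sys.sleep(2)                 # pause between requests to avoid hammering the server
}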


Download Display Pic

To download a display picture, you first need to fetch the image URL of the profile and then use the download.file() function to download it. In the script below, you need to provide a profile ID.
# Download Profile Pic of a particular Profile
ID = "XXXXXXX"
text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
# File saved as match.jpg

Disclaimer
We have accessed only publicly available data that does not require login or registration. The purpose is not to cause any damage to, or copy content from, the website.
Other Functions of rvest
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). Refer to the case study below -

You can collect Google search results by submitting the Google search form with a search term. You need to supply the search term. Here, I entered 'Datascience'.
library(rvest)
url       = "http://www.google.com"
pgsession = html_session(url)           
pgform    = html_form(pgsession)[[1]]

# Set search term
filled_form = set_values(pgform, q="Datascience")
session = submit_form(pgsession,filled_form)

# look for headings of first page
session %>% html_nodes(".g .r a") %>% html_text()
 [1] "Data science - Wikipedia"                                          
 [2] "Data Science Courses | Coursera"                                   
 [3] "Data Science | edX"                                                
 [4] "Data science - Wikipedia"                                          
 [5] "DataScience.com | Enterprise Data Science Platform Provider"       
 [6] "Top Data Science Courses Online - Updated February 2018 - Udemy"   
 [7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"        
 [8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
 [9] "Online Data Science Courses | Microsoft Professional Program"      
[10] "News for Datascience"                                              
[11] "Data Science Course - Cognitive Class"    

Important Points related to Web Scraping
Please keep the following points in mind -
  1. Use the website's API rather than web scraping whenever one is available (see the sketch after this list).
  2. Too many requests from the same IP address might result in the IP address being blocked. Do not scrape more than 8 keyword requests on Google.
  3. Do not use web scraping for commercial purposes.
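Related to the first point, one quick courtesy check is the site's robots.txt file, which lists the paths the site owner allows or disallows for crawlers. A minimal sketch using base R (any site URL can be substituted):
# Inspect which paths the site allows crawlers to access
robots = readLines("https://www.jeevansathi.com/robots.txt", warn = FALSE)
head(robots, 20)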
