In this tutorial, we will cover how to extract information from a matrimonial website using R. We will use web scraping, the process of converting data that is available on a website in an unstructured format into a structured format that can be used for further analysis.
We will use an R package called rvest, created by Hadley Wickham, which simplifies the process of scraping web pages.
Install the required packages
To download and install the rvest package, run the commands below. We will also use dplyr, which is useful for data manipulation tasks.
install.packages("rvest")
install.packages("dplyr")
Load the required Libraries
To put these libraries to use, load them with the code below.
library(rvest)
library(dplyr)
Scrape Information from Matrimonial Website
First we need to understand the structure of the URL. See the URLs below.
https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys
The first URL takes you to the webpage that shows girls' profiles from the Punjabi community, whereas the second URL shows boys' profiles from the Punjabi community.
We need to split the main URL into its elements so that we can build it in R.
Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
Check out the following R code, which shows how to prepare the main URL. In the code, you need to provide the following details -
- Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
- Select Mother Tongue. For example, punjabi, tamil etc.
# Looking for bride/groom
Bride_Groom = "bride"        # Possible Values : bride, groom

# Select Mother Tongue
Mother_Tongue = "punjabi"    # Possible Values : punjabi, tamil, bengali, telugu, kannada, marathi

# URL
if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}

See the output :
[1] "https://www.jeevansathi.com/punjabi-brides-girls"
Extract Profile IDs
First you need to select parts of an HTML document using CSS selectors with html_nodes(). Use SelectorGadget, a free Chrome extension. It is the easiest and quickest way to find out which selector pulls the data that you are interested in.
How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element and highlight (in yellow) everything that is matched by the selector.
text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)
         ID
1  ZARX0345
2  ZZWX5573
3  ZWVT2173
4  ZAYZ6100
5  ZYTS6885
6  ZXYV9849
7   TRZ8475
8   VSA7284
9  ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18  WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. They are explained below, followed by a short demo -
- read_html() : creates an html document from a URL
- html_nodes() : extracts pieces out of HTML documents
- html_nodes(".class") : selects nodes based on their CSS class
- html_nodes("#class") : selects a node (e.g. <div>, <span>, <pre>) based on its id
- html_text() : extracts only the text from an HTML tag
- html_attr() : extracts the contents of a single attribute
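To see these functions in action without hitting a live website, here is a small demo that parses a made-up inline HTML snippet (the names and links in it are purely illustrative, and rvest is assumed to be loaded as above):

# A small demo of the rvest basics on a made-up inline HTML snippet
demo_page = read_html('<div class="profile"><a href="/p/1">Asha</a></div>
                       <div class="profile"><a href="/p/2">Ravi</a></div>')
demo_page %>% html_nodes(".profile a") %>% html_text()       # "Asha" "Ravi"
demo_page %>% html_nodes(".profile a") %>% html_attr("href")  # "/p/1" "/p/2"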
Difference between .class and #class
1. .class targets the following element:
<div class="class"></div>
2. #class targets the following element:
<div id="class"></div>
How to find the HTML/CSS code of a website
Perform the steps below -
- In Google Chrome, right click and select the "Inspect" option, or use the shortcut Ctrl + Shift + I.
- Select a particular section of the website.
- Press Ctrl + Shift + C to inspect a particular element.
- See the selected code under the "Elements" section.
Extract attribute information
You can fetch the value of an HTML attribute by using the html_attr() function. In the code below, we pull the src attribute of the img tag.
read_html("https://timesofindia.indiatimes.com/") %>% html_nodes(".main-sprite img") %>% html_attr("src")
Get Detailed Information of Profiles
The program below performs the following tasks -
- Loop through profile IDs
- Pull information about Age, Height, Qualification etc.
- Extract details about appearance
- Fetch 'About Me' section of profiles
# Get Detailed Information
finaldf = data.frame()
for (i in 1:length(profileIDs$ID)) {
  ID = profileIDs[i, 1]
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
  FormattedInfo = data.frame(t(read_html(link) %>%
                                 html_nodes(".textTru li") %>%
                                 html_text()))
  # Final Table
  FormattedInfo = data.frame(ProfileID = ID,
                             Description = read_html(link) %>%
                               html_nodes("#myinfoView") %>%
                               html_text(),
                             Appearance = read_html(link) %>%
                               html_nodes("#section-lifestyle #appearanceView") %>%
                               html_text(),
                             FormattedInfo)
  finaldf = bind_rows(finaldf, FormattedInfo)
}

# Assign Variable Names
names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height",
                   "Qualification", "Location", "Profession", "Mother Tongue",
                   "Salary", "Religion", "Status", "Has_Children")
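Once the loop finishes, it is worth inspecting and saving the result before moving on. The file name below is just an example:

# Inspect the scraped table and save it for later analysis
head(finaldf)
write.csv(finaldf, "profiles.csv", row.names = FALSE)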
Download Display Pic
To download the display pic, you first need to fetch the image URL of the profile and then call the download.file() function to download it. In the script below, you need to provide a profile ID.
# Download Profile Pic of a particular Profile
ID = "XXXXXXX"
text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
# File saved as match.jpg
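If you would rather save the display pics of all the profiles collected above, a small loop over the pic data frame works too. This is only a sketch; the file names and the one-second pause are arbitrary choices:

# A sketch : download display pics for every row of the 'pic' data frame
for (j in 1:nrow(pic)) {
  destfile = paste0(pic$ID[j], ".jpg")    # save each pic as <ProfileID>.jpg
  try(download.file(as.character(pic$URL[j]), destfile, mode = "wb"), silent = TRUE)
  Sys.sleep(1)                            # small pause between downloads
}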
Disclaimer
We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or copy the content from the website.
Other Functions of rvest
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). Refer to the case study below -
You can collect Google search results by filling in and submitting the Google search form with a search term. You need to supply the search term; here, I entered 'Datascience'.
library(rvest)
url = "http://www.google.com"
pgsession = html_session(url)
pgform = html_form(pgsession)[[1]]
# Set search term
filled_form = set_values(pgform, q="Datascience")
session = submit_form(pgsession,filled_form)
# look for headings of first page
session %>% html_nodes(".g .r a") %>% html_text()
[1] "Data science - Wikipedia" [2] "Data Science Courses | Coursera" [3] "Data Science | edX" [4] "Data science - Wikipedia" [5] "DataScience.com | Enterprise Data Science Platform Provider" [6] "Top Data Science Courses Online - Updated February 2018 - Udemy" [7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn" [8] "What Is Data Science? What is a Data Scientist? What is Analytics?" [9] "Online Data Science Courses | Microsoft Professional Program" [10] "News for Datascience" [11] "Data Science Course - Cognitive Class"
Important Points related to Web Scraping
Please make sure of the following points -
- Use the website's API rather than web scraping whenever one is available.
- Too many requests from a single IP address might result in the IP address being blocked. Do not scrape more than 8 keyword requests on Google, and pause between requests (see the sketch after this list).
- Do not use web scraping for commercial purposes.
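As an example of pacing requests, here is a minimal sketch of a politer version of the profile loop shown earlier; the two-second delay is an arbitrary choice and failed pages are simply skipped:

# A sketch of a politer scraping loop : pause between requests and skip failures
for (i in 1:length(profileIDs$ID)) {
  ID = profileIDs[i, 1]
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
  page = try(read_html(link), silent = TRUE)
  if (inherits(page, "try-error")) next   # skip profiles that fail to load
  # ... extract the fields from 'page' as shown earlier ...
  Sys.sleep(2)                            # wait 2 seconds before the next request
}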