Web scrape google news with R

This tutorial shows how to extract Google News results with the R programming language. It is useful when you want to display a news feed on a topic of interest in a dashboard. Google News lets you search for articles by keyword.

Make sure to install the rvest, dplyr and xml2 R packages before running the following script. The script returns the following columns:

Title : Headline of the article
Source : Name of the Original Content Creator
Time : When the article was published
Author : Author of the article
Link : URL of the article
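
Before running the script, it can help to confirm that the three packages are actually available. The following is a minimal sketch using only base R; `missing_packages` is a hypothetical helper name introduced here for illustration, and the install call is left commented out so nothing is installed without your say-so.

```r
# Packages the tutorial's script loads
required_packages <- c("rvest", "dplyr", "xml2")

# Return the subset of `pkgs` that is not installed yet
# (`missing_packages` is a hypothetical helper, not part of any package)
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```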

In the code below, we are searching for news related to the 'S&P 500' as specified in the 'query' variable.
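
One note on the query before the full script: the '&' and the space in 'S&P 500' cannot go into a URL as they are, so the query has to be percent-encoded first. The script that follows uses a hand-written helper for this. As an alternative sketch, and assuming you are happy to rely on base R rather than the helper, `utils::URLencode()` with `reserved = TRUE` does the same job:

```r
query <- "S&P 500"

# reserved = TRUE also encodes reserved characters such as '&', '=' and '+',
# which would otherwise be read as part of the URL's query syntax
encoded_query <- utils::URLencode(query, reserved = TRUE)

# Same search URL pattern as in the tutorial's script
search_url <- paste0(
  "https://news.google.com/search?q=", encoded_query,
  "&hl=en-US&gl=US&ceid=US%3Aen"
)
print(search_url)
```

Unlike the helper, this does not lowercase the query and it covers every reserved character, not only the four listed in the helper.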

library(dplyr)
library(xml2)
library(rvest)

query <- 'S&P 500'
encode_special_characters <- function(text) {
  encoded_text <- ''
  special_characters <- list('&' = '%26', '=' = '%3D', '+' = '%2B', ' ' = '%20')  # Add more special characters as needed
  for (char in strsplit(text, '')[[1]]) {
    encoded_text <- paste0(encoded_text, ifelse(is.null(special_characters[[char]]), char, special_characters[[char]]))
  }
  return(tolower(encoded_text))
}

query2 <- encode_special_characters(query)
html_dat <- read_html(paste0("https://news.google.com/search?q=",query2,"&hl=en-US&gl=US&ceid=US%3Aen"))
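# Take the first link inside each <article> node and turn the relative
# "./articles/..." path into an absolute URL
dat <- data.frame(Link = html_dat %>%
                    html_nodes("article") %>%
                    html_node("a") %>%
                    html_attr('href')) %>%
  mutate(Link = sub("^\\./articles/", "https://news.google.com/articles/", Link))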

news_text <- html_dat %>%
  html_nodes("article") %>% 
  html_text2()

x <- strsplit(news_text, "\n")
news_df <- data.frame(
  Title = sapply(x, function(item) item[3]),
  Source = sapply(x, function(item) item[1]),
  Time = sapply(x, function(item) item[4]),
  Author = sub("^By\\s+", "", sapply(x, function(item) item[5])),  # strip only the leading "By "
  Link = dat$Link
)
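
The last step relies on the position of each line inside an article's text, so it is worth seeing what happens when a line is missing. The sketch below uses made-up article text (not real Google News output) laid out the way the script's indices assume: line 1 is the source, line 3 the title, line 4 the time and line 5 the author. The `pick` helper is a hypothetical name used only for this illustration, and the sketch strips just the leading "By " from the author line.

```r
# Hypothetical article text, shaped like the lines the script indexes:
# line 1 = source, line 2 = skipped, line 3 = title, line 4 = time, line 5 = author
sample_text <- c(
  "Example Wire\n(skipped line)\nS&P 500 closes higher\n2 hours ago\nBy Jane Doe",
  "Example Daily\n(skipped line)\nStocks slip at the open\nYesterday"
)

parts <- strsplit(sample_text, "\n")

# Pick the i-th line of every article; a missing line becomes NA
pick <- function(i) vapply(parts, function(item) item[i], character(1))

sample_df <- data.frame(
  Title  = pick(3),
  Source = pick(1),
  Time   = pick(4),
  Author = sub("^By\\s+", "", pick(5)),  # drop only the leading "By "
  stringsAsFactors = FALSE
)
print(sample_df)
```

The second article has no author line, so its Author value is NA rather than an error. If Google News changes the layout of an article block, the indices 1, 3, 4 and 5 will need to be adjusted accordingly.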

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.


14 Responses to "Web scrape google news with R"

Anonymous, June 4, 2021 at 9:00 AM
For some reason, after running this part I am getting an empty dat:

html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-IN&gl=IN&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
html_nodes('.VDXfz') %>%
html_attr('href')) %>%
mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
João Silva, June 21, 2021 at 6:03 PM
Sir, thank you so much for this!
Anonymous, June 26, 2021 at 11:56 AM
No, the function is not working. You will see it if you try to run your script. I took the function apart to check where the problem is, and the issue seems to be there from the beginning. Just run this to see that "dat" is empty.

<<
term<-"Tesla"

require(dplyr)
require(xml2)
require(rvest)

html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-IN&gl=IN&ceid=US%3Aen"))
#html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US:en"))

dat <- data.frame(Link = html_dat %>%
html_nodes('.VDXfz') %>%
html_attr('href')) %>%
mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

news_dat <- data.frame(
Title = html_dat %>%
html_nodes('.DY5T1d') %>%
html_text(),
Link = dat$Link,
Description = html_dat %>%
html_nodes('.Rai5ob') %>%
html_text()
)

>>
Anonymous, July 5, 2021 at 7:56 AM
Thanks for your reply. I tried both and still get 0 observations. I remember checking this a couple of months ago and it was working. I don't know what happened now.
Anonymous, August 14, 2021 at 1:26 AM
Yes
Unknown, August 14, 2021 at 1:52 AM
Replace '.Rai5ob' with '.RZIKme'. It will work.
Ángel, June 10, 2023 at 1:54 PM
Same for me. Maybe Google News changed. Do you have a solution? Thanks in advance.
Raoni Vilela, February 23, 2024 at 11:13 AM
Still not working. The first pipe produces a "dat" object that has no rows.

> html_dat <- read_html(paste0("https://news.google.com/search?q=", pc))
> dat <- data.frame(Link = html_dat %>%
+ html_nodes('.VDXfz') %>%
+ html_attr('href')) %>%
+ mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
> dat
[1] Link
<0 rows> (or 0-length row.names)

Note: pc is my search term, because I'm trying to use it inside a loop.