Web scrape google news with R

By Deepanshu Bhalla

This tutorial outlines how to extract Google News results with the R programming language. It is useful when you want to display a news feed for a topic of interest in a dashboard. Google News lets you search for news using keywords of your choice.

Make sure to install the rvest, dplyr and xml2 R packages before running the following script. The script returns the following columns (information).

  • Title: headline of the article
  • Source: name of the original content creator
  • Time: when the article was published
  • Author: author of the article
  • Link: URL of the article
Scrape Google News

In the code below, we are searching for news related to the 'S&P 500' as specified in the 'query' variable.

library(dplyr)
library(xml2)
library(rvest)

query <- 'S&P 500'
# Percent-encode characters that have a special meaning in URLs
# (the result is also lowercased)
encode_special_characters <- function(text) {
  encoded_text <- ''
  special_characters <- list('&' = '%26', '=' = '%3D', '+' = '%2B', ' ' = '%20')  # Add more special characters as needed
  for (char in strsplit(text, '')[[1]]) {
    encoded_text <- paste0(encoded_text, ifelse(is.null(special_characters[[char]]), char, special_characters[[char]]))
  }
  return(tolower(encoded_text))
}
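As an aside, base R's URLencode() performs similar percent-encoding out of the box (it does not lowercase its result, which Google's search endpoint does not require):

```r
# URLencode() from base R (utils) with reserved = TRUE percent-encodes
# reserved characters such as '&', '=', '+' and spaces
URLencode("S&P 500", reserved = TRUE)
# → "S%26P%20500"
```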

query2 <- encode_special_characters(query)
html_dat <- read_html(paste0("https://news.google.com/search?q=",query2,"&hl=en-US&gl=US&ceid=US%3Aen"))
# Extract each article's link and convert Google's relative URLs
# ("./articles/...") to absolute ones
dat <- data.frame(Link = html_dat %>%
                    html_nodes("article") %>%
                    html_node("a") %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

# Extract the visible text of each article node
news_text <- html_dat %>%
  html_nodes("article") %>%
  html_text2()

# Split each article's text into its component lines; the positions
# below reflect the order in which Google News renders each article
x <- strsplit(news_text, "\n")
news_df <- data.frame(
  Title = sapply(x, function(item) item[3]),
  Source = sapply(x, function(item) item[1]),
  Time = sapply(x, function(item) item[4]),
  Author = gsub("^By\\s+", "", sapply(x, function(item) item[5])),
  Link = dat$Link
)
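Once news_df is built, you will typically want to drop rows where the positional parsing failed before using the data. A minimal sketch, using an illustrative stand-in for the scraped result and an arbitrary output file name:

```r
# Illustrative stand-in for the scraped result (not real data)
news_df <- data.frame(
  Title  = c("Example headline", NA),
  Source = c("Example Source", "Another Source"),
  Time   = c("2 hours ago", NA),
  Author = c("Jane Doe", NA),
  Link   = c("https://news.google.com/articles/a", "https://news.google.com/articles/b"),
  stringsAsFactors = FALSE
)

# Indexing past the end of a short split yields NA, so failed rows
# show up with NA in Title/Time; drop them and export to CSV
news_df_clean <- subset(news_df, !is.na(Title) & !is.na(Time))
write.csv(news_df_clean, "google_news.csv", row.names = FALSE)
nrow(news_df_clean)
# → 1
```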
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

14 Responses to "Web scrape google news with R"
  1. For some reason after running this part I am getting an empty dat

    html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-IN&gl=IN&ceid=US%3Aen"))

    dat <- data.frame(Link = html_dat %>%
    html_nodes('.VDXfz') %>%
    html_attr('href')) %>%
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

    1. term is an argument in the function so it works inside the function.

  2. Sir, thank you so much for this!

  3. No, the function is not working. You will see it if you try to run your script. I took the function apart to check each part for the problem, and the issue seems to be right at the beginning. Just run this to see that "dat" is empty.

    term <- "Tesla"

    require(dplyr)
    require(xml2)
    require(rvest)

    html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-IN&gl=IN&ceid=US%3Aen"))
    #html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US:en"))

    dat <- data.frame(Link = html_dat %>%
      html_nodes('.VDXfz') %>%
      html_attr('href')) %>%
      mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

    news_dat <- data.frame(
      Title = html_dat %>%
        html_nodes('.DY5T1d') %>%
        html_text(),
      Link = dat$Link,
      Description = html_dat %>%
        html_nodes('.Rai5ob') %>%
        html_text()
    )

    1. I ran the code and dat contains 104 observations. Suggestion: don't post as anonymous when seeking support. Can you manually check whether this link works - https://news.google.com/search?q=Tesla&hl=en-IN&gl=IN&ceid=US%3Aen? Also check whether your firewall is blocking the request.

  4. Thanks for your reply. I tried both and still get 0 observations. I remember checking this again a couple of months ago and it was working. I don't know what happened now.

  5. Replace '.Rai5ob' with '.RZIKme'. It will work.

    1. Thank you very much. It still generates an empty data.frame for me, though :(

  6. The same for me. Maybe Google News changed. Do you have any solution? Thanks in advance.

    1. Yes, it changed.

      Solution: change "html_nodes('.VDXfz') %>%" to "html_nodes(xpath = '//a') %>%".

      In fact, all of the nodes changed. The logic still works, but you have to update the code for Google's new nodes.

    2. I have fixed the code. It no longer depends on specific CSS classes. Thanks!

    3. Thanks very much!!! It is working fine now!!!

  7. Still not working. The first pipe produces a "dat" object that has no rows.

    html_dat <- read_html(paste0("https://news.google.com/search?q=", pc))
    dat <- data.frame(Link = html_dat %>%
      html_nodes('.VDXfz') %>%
      html_attr('href')) %>%
      mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
    dat
    [1] Link
    <0 rows> (or 0-length row.names)

    Note: pc is my search term, because I'm trying to use it inside a loop.
