Make sure the rvest, dplyr and xml2 R packages are installed before running the following script; a one-line install command is shown below the column list. The script returns the following columns:
- Title : Headline of the article
- Link : URL of the article
- Description : A one- or two-line summary of the article
- Source : Name of the original content creator
- Time : When the article was published
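If any of the three packages are missing, a one-off install along these lines should take care of it:

install.packages(c("rvest", "dplyr", "xml2"))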
news <- function(term) {

  require(dplyr)
  require(xml2)
  require(rvest)

  # Download the Google News search results page for the given term
  html_dat <- read_html(paste0("https://news.google.com/search?q=", term,
                               "&hl=en-IN&gl=IN&ceid=US%3Aen"))

  # Article links: turn the relative "./articles/..." paths into absolute URLs
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>%
                      html_attr('href')) %>%
    mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

  # Headline, link and short description for each article
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>%
      html_text(),
    Link = dat$Link,
    Description = html_dat %>%
      html_nodes('.Rai5ob') %>%
      html_text()
  )

  # Extract Source and Time (to avoid missing content)
  prod <- html_nodes(html_dat, ".SVJrMe")
  Source <- lapply(prod, function(x) {
    tryCatch(html_node(x, "a") %>% html_text(),
             error = function(err) NA)
  })
  time <- lapply(prod, function(x) {
    tryCatch(html_node(x, "time") %>% html_text(),
             error = function(err) NA)
  })
  mydf <- data.frame(Source = do.call(rbind, Source),
                     Time   = do.call(rbind, time),
                     stringsAsFactors = FALSE)

  # Combine everything and drop duplicated rows
  dff <- cbind(news_dat, mydf) %>%
    distinct(Time, .keep_all = TRUE)

  return(dff)
}

newsdf <- news('indian%20economy')
%20 refers to the space between the two words in the search term.
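If you would rather pass a plain search term and let R do the encoding, base R's URLencode() handles it; a small sketch (term_encoded is just an illustrative name):

# Encode the search term so the space becomes %20
term_encoded <- utils::URLencode("indian economy", reserved = TRUE)
term_encoded                  # "indian%20economy"
newsdf <- news(term_encoded)  # equivalent to news('indian%20economy')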
For some reason, after running this part I am getting an empty dat:

html_dat <- read_html(paste0("https://news.google.com/search?q=", term, "&hl=en-IN&gl=IN&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))
term is an argument of the function, so it only exists inside the function; if you want to run these lines on their own, assign term first (e.g. term <- "Tesla").
Sir, thank you so much for this!
No, the function is not working. You will see it if you try to run your script. I took the function apart to see where the problem is, and the issue appears right at the beginning. Just run this to see that "dat" is empty.
<<
term<-"Tesla"
require(dplyr)
require(xml2)
require(rvest)
html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-IN&gl=IN&ceid=US%3Aen"))
#html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US:en"))
dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text()
)
>>
I ran the code and dat contains 104 observations. A suggestion - don't post as Anonymous when asking for support. Can you manually check whether this link works - https://news.google.com/search?q=Tesla&hl=en-IN&gl=IN&ceid=US%3Aen? Also check whether a firewall is blocking the request.
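To rule out a network or firewall issue, checking the HTTP status is enough; a sketch, assuming the httr package is installed:

library(httr)
resp <- GET("https://news.google.com/search?q=Tesla&hl=en-IN&gl=IN&ceid=US%3Aen")
status_code(resp)  # 200 means the page was fetched and the problem is in the parsing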
Thanks for your reply. I tried both and still get 0 observations. I remember checking this a couple of months ago and it was working. I don't know what has happened now.
The problem is that html_nodes('.Rai5ob') returns an empty character vector.
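A quick way to see which of the selectors used in the script still match anything (a zero count suggests Google renamed that class):

library(rvest)
html_dat <- read_html("https://news.google.com/search?q=Tesla&hl=en-IN&gl=IN&ceid=US%3Aen")
sapply(c(".VDXfz", ".DY5T1d", ".Rai5ob", ".SVJrMe"),
       function(sel) length(html_nodes(html_dat, sel)))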
Yes
Replace '.Rai5ob' with '.RZIKme'. It will work.
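With that change, only the Description line of news_dat differs; a sketch of the patched block, assuming html_dat and dat have been built as in the original script:

news_dat <- data.frame(
  Title       = html_dat %>% html_nodes('.DY5T1d') %>% html_text(),
  Link        = dat$Link,
  Description = html_dat %>% html_nodes('.RZIKme') %>% html_text()
)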
Thank you very much. It still generates an empty data.frame for me, though :(
The same for me. Maybe Google News changed something. Do you have any solution? Thanks in advance.
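One possible workaround, since the class names on the rendered page keep changing, is to read the Google News RSS feed instead of scraping the HTML. This is only a sketch, not tested against the current feed; the news_rss name and the exact tag layout are assumptions:

library(xml2)

news_rss <- function(term) {
  # RSS search endpoint; same query parameters as the original script
  feed  <- read_xml(paste0("https://news.google.com/rss/search?q=", term,
                           "&hl=en-IN&gl=IN&ceid=US%3Aen"))
  items <- xml_find_all(feed, "//item")
  data.frame(
    Title       = xml_text(xml_find_first(items, "title")),
    Link        = xml_text(xml_find_first(items, "link")),
    Description = xml_text(xml_find_first(items, "description")),
    Source      = xml_text(xml_find_first(items, "source")),
    Time        = xml_text(xml_find_first(items, "pubDate")),
    stringsAsFactors = FALSE
  )
}

newsdf <- news_rss("Tesla")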