How to Scrape Google News with Python

This tutorial explains how to scrape Google News for articles related to the topic of your choice using Python.

We want to extract the following information for each news article:

Title : Article Headline
Source : Original News Source or Blogger Name
Time : Publication Date/Time
Author : Article Author
Link : Article Link

Python Code to Scrape Google News

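Step 1 : Install the following Python libraries if they are not already installed.

requests
beautifulsoup4 (the package that provides BeautifulSoup, imported as bs4)
pandas

You can install any Python library using the command pip install library_name, for example pip install requests beautifulsoup4 pandas.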
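To confirm the libraries are available before running the scraper, a small check like the one below may help. It is only a sketch: the helper name missing_modules is hypothetical, and it relies on the standard library function importlib.util.find_spec to look up each import name without actually importing it.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used in this tutorial (BeautifulSoup is imported as bs4)
required = ['requests', 'bs4', 'pandas']
missing = missing_modules(required)
if missing:
    print('Not installed:', ', '.join(missing))
else:
    print('All required libraries are installed.')
```

Any name it reports can then be installed with pip before moving on to the next step.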

Step 2 : Set Search Query

Next, define the search query, which is the topic or term you want to find articles about. In the code below, set it in the 'query' variable.

The code below extracts relevant information such as titles, sources, times, authors and links from Google news related to a specific topic and stores them in a CSV file named 'news.csv'.


import requests
from bs4 import BeautifulSoup
import pandas as pd

# Search Query
query = 'US Economy'

# Encode special characters in a text string
def encode_special_characters(text):
    encoded_text = ''
    special_characters = {'&': '%26', '=': '%3D', '+': '%2B', ' ': '%20'}  # Add more special characters as needed
    for char in text.lower():
        encoded_text += special_characters.get(char, char)
    return encoded_text

query2 = encode_special_characters(query)
url = f"https://news.google.com/search?q={query2}&hl=en-US&gl=US&ceid=US%3Aen"

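# Fetch the results page; fail early on a timeout or an HTTP error status
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')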

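# Each news item is wrapped in an <article> element
articles = soup.find_all('article')

# Take the first link in each article; use 'Missing' so all columns stay the same length
anchors = [article.find('a', href=True) for article in articles]
links = [a['href'].replace("./articles/", "https://news.google.com/articles/") if a else 'Missing' for a in anchors]

# Split each article's visible text into a list of lines
news_text_split = [article.get_text(separator='\n').split('\n') for article in articles]

# Return the line at position i, or 'Missing' when the article has fewer lines
def get_part(parts, i):
    return parts[i] if len(parts) > i else 'Missing'

# The positions below reflect the page layout and may need adjusting if it changes
news_df = pd.DataFrame({
    'Title': [get_part(text, 2) for text in news_text_split],
    'Source': [get_part(text, 0) for text in news_text_split],
    'Time': [get_part(text, 3) for text in news_text_split],
    'Author': [get_part(text, 4).split('By ')[-1] for text in news_text_split],
    'Link': links
})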

# Write to CSV
news_df.to_csv('news.csv', index=False)
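The code above picks each field by its position in the article's text lines. The sketch below shows that idea on its own, using made-up lines rather than real Google News output. The function name split_fields and the sample values are hypothetical, and the positions (0 for source, 2 for title, 3 for time, 4 for author) simply mirror the ones used in the tutorial code.

```python
def split_fields(lines):
    """Map the text lines of one article to named fields by position."""
    def part(i):
        # Fall back to 'Missing' when the article has fewer lines than expected
        return lines[i] if len(lines) > i else 'Missing'
    return {
        'Source': part(0),
        'Title': part(2),
        'Time': part(3),
        # Drop a leading 'By ' from the author line, if present
        'Author': part(4).split('By ')[-1],
    }

# Hypothetical text lines for illustration only
full = ['Example Times', 'More', 'Example headline', '2 hours ago', 'By Jane Doe']
short = ['Example Times', 'More', 'Example headline']

print(split_fields(full))
print(split_fields(short))
```

Because the lookup is purely positional, a change in the page layout would shift the fields, so it is worth checking a few rows of the output by eye.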

Explanation

The function encode_special_characters(text) converts the query to lowercase and replaces special characters such as '&', '=', '+' and spaces with their percent-encoded form (for example, '&' becomes '%26'). This keeps the URL valid, because these characters are either not allowed in a URL or have a special meaning in a query string.
The code sends a request to the Google News URL using requests.get() and parses the HTML content using BeautifulSoup.
It finds all the articles in the HTML and extracts the 'Title', 'Source', 'Time', 'Author' and 'Link' information by position in each article's text. If an article doesn't have a publishing date or author details, the value is set to 'Missing'.
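The hand-written lookup table covers only four characters. As an alternative, the standard library function urllib.parse.quote percent-encodes every reserved character. The sketch below is one possible replacement, not part of the original code: the helper name encode_query is hypothetical, and unlike encode_special_characters it does not lowercase the query.

```python
from urllib.parse import quote

def encode_query(text):
    """Percent-encode every reserved character in a search query."""
    # safe='' also encodes '/', which quote() leaves untouched by default
    return quote(text, safe='')

query = 'US Economy'
url = f"https://news.google.com/search?q={encode_query(query)}&hl=en-US&gl=US&ceid=US%3Aen"
print(url)
```

With this helper, characters such as '#' or '?' in a query are encoded too, so they cannot be mistaken for URL delimiters.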

How to Extract Top Stories

If you want to see the latest news from Google News rather than results for a query, replace the 'url' variable with the line below:

url = "https://news.google.com/home?hl=en-US&gl=US&ceid=US%3Aen"

How to Set Location and Language for Articles?

The URL parameters below control the language and country of the results. You can customize them for your own language and location.

hl=en-US: Language setting for the page where "hl" stands for "host language" and "en-US" refers to US English as the language.
gl=US: Geographical location for the content.
ceid=US:en: Country edition ID, here the US edition in English. In the URL the colon is encoded as %3A, so it appears as ceid=US%3Aen.
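The three parameters can be passed in as arguments instead of being hard-coded in the URL string. The sketch below uses urllib.parse.urlencode from the standard library. The function name build_search_url is hypothetical, and the India values (en-IN, IN, IN:en) are an assumption that follows the same pattern as the US values, so check them against the URL Google News shows in your browser for that edition.

```python
from urllib.parse import quote, urlencode

def build_search_url(query, hl='en-US', gl='US', ceid='US:en'):
    """Build a Google News search URL for a query, language and country edition."""
    params = {'q': query, 'hl': hl, 'gl': gl, 'ceid': ceid}
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return 'https://news.google.com/search?' + urlencode(params, quote_via=quote)

# Default: US edition in English, as used in this tutorial
print(build_search_url('US Economy'))

# Assumed values for the India edition in English
print(build_search_url('Indian Economy', hl='en-IN', gl='IN', ceid='IN:en'))
```

The returned string can be assigned to the 'url' variable in the scraping code in place of the f-string.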

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
