Pollution in India : Real-time AQI Data

Air pollution has become a serious problem across the world in recent years. Its effects are devastating and are not limited to humans; animals and plants are harmed as well. It also contributes to global warming, which is essentially the rise in air and ocean temperatures around the world.

Indian cities have been topping lists of the most polluted cities. To tackle air pollution, the first step is to track it in real time so that people can be alerted to avoid outdoor activities when pollution is high. This post explains how to fetch the real-time Air Quality Index (AQI) of Indian cities. Code is provided in both Python and R, so programmers in either language can pull the pollution data.

You can download the dataset, which contains static information about Indian states, cities and AQI stations. The variables stored in this dataset are used later to fetch real-time data. The first few rows look like this:


         id                                    stationID longitude
1 site_5331 Kariavattom, Thiruvananthapuram - Kerala PCB  76.88650
2  site_252   Plammoodu, Thiruvananthapuram - Kerala PCB  76.94359
3 site_5272          Kacheripady, Ernakulam - Kerala PCB  76.28134
4 site_5276              Thavakkara, Kannur - Kerala PCB  75.37320
5 site_5334             Polayathode, Kollam - Kerala PCB  76.60730
6 site_5271              Palayam, Kozhikode - Kerala PCB  75.78437
   latitude  live avg             cityID stateID
1  8.563700 FALSE  NA Thiruvananthapuram  Kerala
2  8.514909  TRUE  20 Thiruvananthapuram  Kerala
3  9.985653  TRUE  27          Ernakulam  Kerala
4 11.875000  TRUE  56             Kannur  Kerala
5  8.878700  TRUE  54             Kollam  Kerala
6 11.249077  TRUE  70          Kozhikode  Kerala
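
As a minimal sketch of how this table can be used to look up valid filter values, the snippet below lists the distinct cities and the stations of one city. It works on a few rows copied from the table above rather than the full file, and it assumes (this is not confirmed by the dataset itself) that the `live` column flags stations that are currently reporting. The helper name `live_stations` is hypothetical.

```python
import pandas as pd

# A few rows copied from the stations table above (not the full dataset)
stations = pd.DataFrame({
    "id": ["site_5331", "site_252", "site_5272"],
    "stationID": [
        "Kariavattom, Thiruvananthapuram - Kerala PCB",
        "Plammoodu, Thiruvananthapuram - Kerala PCB",
        "Kacheripady, Ernakulam - Kerala PCB",
    ],
    "live": [False, True, True],
    "cityID": ["Thiruvananthapuram", "Thiruvananthapuram", "Ernakulam"],
    "stateID": ["Kerala", "Kerala", "Kerala"],
})

def live_stations(df, city):
    """Return the names of stations in `city` whose `live` flag is True.

    Assumption: `live` marks stations that are currently reporting.
    """
    mask = (df["cityID"] == city) & df["live"]
    return df.loc[mask, "stationID"].tolist()

# Distinct city names that could be passed as a "city" filter
cities = sorted(stations["cityID"].unique())
# Station names that could be passed as a "station" filter
tvm_live = live_stations(stations, "Thiruvananthapuram")

print(cities)
print(tvm_live)
```

The same two lookups should work unchanged on the full stations file once it has been read into a dataframe.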

Step 1: Get API Key

The first step is to log in to https://data.gov.in/ with your Google, Facebook, Twitter, LinkedIn or GitHub profile. Once logged in, go to Dashboard > MyAccount, where you will find your API key.

In case you face any issue registering and getting an API key, you can use the sample API key below for experimentation. The sample key is limited to fetching 10 records in a single run.

579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b

Step 2: Fetch Real Time Air Quality Index

The Government of India provides a data API named "Real Time Air Quality Index From Various Locations", which we will use in this step.

2.1 AQI of Cities

In the code below, the user needs to supply two arguments: the API key and the filter criteria. The filter criteria can include "state", "city", "station" and "pollutant_id". To see the unique values of state, city and station, refer to the dataset shown above. The distinct values of pollutant_id are as follows:

"PM2.5" "PM10"  "NO2"   "NH3"   "SO2"   "CO"    "OZONE"

Python Code

import requests
import json
import pandas as pd
import re
import datetime
import time
import base64
from itertools import product

stationsData = pd.read_csv("https://raw.githubusercontent.com/deepanshu88/Datasets/master/UploadedFiles/stations.csv")

def getData(api, filters):
    # Base URL with the API key; limit sets the maximum number of records per request
    url1 = "https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=" + api + "&format=json&limit=500"
    # (field, value) pairs for every filter, with whitespace encoded as %20
    criteriaAll = [[(k, re.sub(r'\s+', '%20', v)) for v in filters[k]] for k in filters]
    # One URL per combination of filter values
    url2 = [url1 + ''.join(f'&filters[{ls}]={value}' for ls, value in p) for p in product(*criteriaAll)]
    
    pollutionDfAll = pd.DataFrame()
    for i in url2:
        response = requests.get(i, verify=True)
        response_dict = json.loads(response.text)
        pollutionDf = pd.DataFrame(response_dict['records'])
        # Append this batch and keep a clean running index
        pollutionDfAll = pd.concat([pollutionDfAll, pollutionDf], ignore_index=True)
    
    return pollutionDfAll


# Sample key
api = "579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b"
criteria = {'city':["Greater Noida","Delhi"], 'pollutant_id': ["PM10", "PM2.5"]}
mydata = getData(api, criteria)
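
To see what getData requests, the expansion of the filter criteria into URLs can be sketched on its own, without any network call. The snippet below is an illustration rather than a replacement for the function above: `build_urls` is a hypothetical helper, "YOUR_API_KEY" is a placeholder, and it uses `urllib.parse.quote` for encoding, which escapes more characters than whitespace (commas, for instance), on the assumption that the API decodes them in the standard way.

```python
from itertools import product
from urllib.parse import quote

BASE_URL = "https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69"

def build_urls(api_key, filters, limit=500):
    """Return one request URL per combination of filter values."""
    prefix = f"{BASE_URL}?api-key={api_key}&format=json&limit={limit}"
    # For each filter field, list its (field, encoded value) pairs
    pairs = [[(field, quote(value)) for value in values]
             for field, values in filters.items()]
    # product(*pairs) picks one value per field in every possible way
    return [prefix + "".join(f"&filters[{field}]={value}" for field, value in combo)
            for combo in product(*pairs)]

urls = build_urls("YOUR_API_KEY", {"city": ["Greater Noida", "Delhi"],
                                   "pollutant_id": ["PM10", "PM2.5"]})
for u in urls:
    print(u)
```

Two cities and two pollutants give four URLs, so the number of requests grows multiplicatively with the number of values per filter.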

R Code

library(httr)
library(jsonlite)
library(dplyr)

stationsData <- read.csv("https://raw.githubusercontent.com/deepanshu88/Datasets/master/UploadedFiles/stations.csv")

getData <- function(api, filters) {
  
  url1 <- paste0("https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=",api, "&format=json&limit=500")
  url2 <- paste0('%s', paste0('&filters[', names(filters), ']=%s', collapse = ''))
  urlAll <- do.call(sprintf, c(url2, url1, expand.grid(lapply(filters, function(x) gsub("\\s+", "%20", x)))))
  
  pollutionDfAll <- data.frame()
  for (i in urlAll){
    request <- GET(url=i)
    response <- content(request, as = "text", encoding = "UTF-8")
    df <- fromJSON(response, flatten = TRUE)
    pollutionDf <- df[["records"]]
    pollutionDfAll <- rbind(pollutionDfAll, pollutionDf)
  }
    
  return(pollutionDfAll)
}


# Sample key
api <- "579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b"

criteria <- list(city=c("Greater Noida","Delhi"),pollutant_id=c("PM10", "PM2.5"))
mydata <- getData(api, criteria)

Output

    id country         state          city
1 1686   India Uttar_Pradesh Greater Noida
2 1693   India Uttar_Pradesh Greater Noida
3  297   India         Delhi         Delhi
4  304   India         Delhi         Delhi
5  311   India         Delhi         Delhi
6  318   India         Delhi         Delhi
                                      station         last_update
1 Knowledge Park - III, Greater Noida - UPPCB 11-07-2022 05:00:00
2   Knowledge Park - V, Greater Noida - UPPCB 11-07-2022 05:00:00
3                        Alipur, Delhi - DPCC 11-07-2022 05:00:00
4                   Anand Vihar, Delhi - DPCC 11-07-2022 05:00:00
5                   Ashok Vihar, Delhi - DPCC 11-07-2022 05:00:00
6                      Aya Nagar, Delhi - IMD 11-07-2022 05:00:00
  pollutant_id pollutant_min pollutant_max pollutant_avg
1         PM10            57           136            93
2         PM10            56           147            98
3         PM10            45           118            77
4         PM10            96           179           132
5         PM10            80           122            95
6         PM10            38            83            65
  pollutant_unit
1             NA
2             NA
3             NA
4             NA
5             NA
6             NA
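
Before analysing this output, it usually helps to convert the columns to proper types. The sketch below assumes that the pollutant values may arrive as text and that last_update is in day-month-year order; both are assumptions inferred from the sample output above, not documented behaviour. The helper name `clean_records` is hypothetical.

```python
import pandas as pd

# Two records shaped like the output above, with values held as strings
# (assumption: the JSON response may deliver numbers as text)
raw = pd.DataFrame({
    "city": ["Greater Noida", "Delhi"],
    "last_update": ["11-07-2022 05:00:00", "11-07-2022 05:00:00"],
    "pollutant_min": ["57", "45"],
    "pollutant_max": ["136", "118"],
    "pollutant_avg": ["93", "NA"],
})

def clean_records(df):
    """Return a copy with numeric pollutant columns and a parsed timestamp."""
    out = df.copy()
    for col in ["pollutant_min", "pollutant_max", "pollutant_avg"]:
        # Unparseable entries such as "NA" become NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Assumption: timestamps are day-month-year, i.e. 11 July 2022 here
    out["last_update"] = pd.to_datetime(out["last_update"],
                                        format="%d-%m-%Y %H:%M:%S")
    return out

clean = clean_records(raw)
print(clean.dtypes)
```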

2.2 AQI of Stations

You may wish to find the AQI score of individual stations, which is the most granular level of information. You can combine the station filter with the pollutant ID to narrow down the results.

Python Code

criteria = {"station":["Anand Vihar, Delhi - DPCC", "Okhla Phase-2, Delhi - DPCC"], "pollutant_id":["PM10"]}
mydata = getData(api, criteria)

R Code

criteria <- list(station=c("Anand Vihar, Delhi - DPCC", "Okhla Phase-2, Delhi - DPCC"), pollutant_id=c("PM10"))
mydata <- getData(api, criteria)

Top / Least Polluted Indian Cities

Most of the polluted cities in India are in the states of Bihar and Haryana, whereas air pollution poses little or no risk in the cities of Karnataka and Sikkim.

We can pass all cities as a filter criterion to pull the AQI of every city, and then take the median of the AQI values by city.

Python Code

criteria = {'city' : stationsData.cityID.unique(), 'pollutant_id' : ["PM10"]}
mydata = getData(api, criteria)

R Code

criteria <- list(city= unique(stationsData$cityID), pollutant_id= c("PM10"))
mydata <- getData(api, criteria)
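
The median-by-city step mentioned above can be sketched as follows. The snippet uses the six PM10 readings from the sample output earlier in this post instead of a live API call, and it assumes the pollutant_avg column may hold text that needs converting to numbers first. The helper name `median_by_city` is hypothetical.

```python
import pandas as pd

# PM10 readings copied from the sample output earlier in this post
readings = pd.DataFrame({
    "city": ["Greater Noida", "Greater Noida", "Delhi", "Delhi", "Delhi", "Delhi"],
    "pollutant_avg": ["93", "98", "77", "132", "95", "65"],
})

def median_by_city(df):
    """Return the median pollutant_avg per city, most polluted first."""
    # Assumption: values may be strings; anything unparseable becomes NaN
    values = pd.to_numeric(df["pollutant_avg"], errors="coerce")
    return values.groupby(df["city"]).median().sort_values(ascending=False)

ranking = median_by_city(readings)
print(ranking)
```

Applied to the full `mydata` dataframe, `ranking.head()` and `ranking.tail()` would give the most and least polluted cities for the chosen pollutant.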

Get Historical AQI Data

The data.gov.in API used above does not allow users to fetch historical AQI scores. Suppose you wish to see yesterday's AQI score for your location and compare it with today's score.

The program below returns two dataframes: summary and pollutants. The pollutants dataframe holds the scores for the individual pollutants at the location. The function takes two arguments, id and dt. id is the unique identifier assigned to each station (format: site_*) and dt is a datetime object.

Python Code

import requests
import json
import pandas as pd
import re
import datetime
import time
import base64
from itertools import product

def get_data_cpcb(id, dt):
    
    datetime2 = dt.strftime('%Y-%m-%dT%H:%M:%SZ')
    
    key  = '{"station_id":"' + id + '","date":"' + datetime2 + '"}'
    body = base64.b64encode(key.encode()).decode()
    
    timeZoneoffset = int((datetime.datetime.utcnow() - datetime.datetime.now()).total_seconds()/60)
    token = '{"time":' + str(int(time.time())) + ',"timeZoneOffset":'+ str(timeZoneoffset ) +'}'
    accessToken = base64.b64encode(str(token).encode()).decode()
    
    headers = {
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'accesstoken': accessToken,
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'origin': 'https://app.cpcbccr.com',
        'referer': 'https://app.cpcbccr.com/AQI_India/',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'accept-language': 'en-US,en;q=0.9'
    }
    
    response = requests.post('https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters', headers=headers, data=body, verify=True)
    response_dict = json.loads(response.text)
    info = pd.DataFrame({'title':response_dict['title'], 'date':response_dict['date']}, index=[0])
    pollutionDf = pd.concat([pd.DataFrame([response_dict['aqi']]), info], axis=1)    
    pollutants  = pd.concat([pd.DataFrame(response_dict['metrics']), info], axis=1)
    
    return pollutionDf, pollutants


id = stationsData.id[0]
summary, pollutants = get_data_cpcb(id, datetime.datetime(2022, 7, 9, 18, 44, 59, 0))

R Code

library(httr)
library(jsonlite)
library(dplyr)

get_data_cpcb <- function(id, datetime) {

  is.POSIXct <- function(x) inherits(x, "POSIXct")
  if(!is.POSIXct(datetime)) {stop("datetime must be POSIXct object")}
  
  key = paste0('{"station_id":"', id, '","date":"', gsub("\\s+", "T",as.character(datetime)), "Z",'"}')
  body = gsub("\\n","",base64_enc(key))
  
  timeZoneoffset <- ceiling((as.numeric(as.POSIXct(format(datetime),tz="UTC")) - as.numeric(datetime))/60)
  token = paste0('{"time":', ceiling(as.numeric(datetime)), ',"timeZoneOffset":', timeZoneoffset, '}')
  accesstoken = base64_enc(token)
  
  URL <- "https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters"
  headers <- add_headers(`user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
                         accept = "application/json, text/javascript, */*; q=0.01",
                         `accept-encoding` = "gzip, deflate, br",
                         `accept-language` = "en-US,en;q=0.9,id;q=0.8,pt;q=0.7",
                          accesstoken = accesstoken,
                         `content-type`    = "application/x-www-form-urlencoded; charset=UTF-8",
                          origin  = "https://app.cpcbccr.com",
                          referer = 'https://app.cpcbccr.com/AQI_India/',
                         `sec-fetch-dest` = "empty",
                         `sec-fetch-mode` = "cors",
                         `sec-fetch-site`	 = "same-origin")
  
  request  <- POST(URL, headers, body = body, encode = "form")
  response <- content(request, as = "text", encoding = "UTF-8")
  df <- fromJSON(response, flatten = TRUE)
  return(df)
  
}

id <- stationsData$id[1]
datetime <- as.POSIXct("2022-07-08 16:35:00")
df <- get_data_cpcb(id, datetime)
summary <- data.frame(df[c("title","date")], t(unlist(df$aqi)))
pollutants <- data.frame(df[c("title","date")], df$metrics)
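
The request body that get_data_cpcb sends is a base64-encoded JSON string holding the station id and the timestamp. As a rough sketch of the yesterday-versus-today comparison described above, the snippet below builds that body for two consecutive days using only the standard library. `build_body` is a hypothetical helper; it uses `json.dumps` with compact separators so the result matches the hand-built string in the Python function above, and it makes no network call.

```python
import base64
import datetime
import json

def build_body(station_id, dt):
    """Return the base64-encoded JSON body for one station and timestamp."""
    stamp = dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Compact separators reproduce the string built by hand in get_data_cpcb
    key = json.dumps({"station_id": station_id, "date": stamp},
                     separators=(",", ":"))
    return base64.b64encode(key.encode()).decode()

today = datetime.datetime(2022, 7, 9, 18, 44, 59)
yesterday = today - datetime.timedelta(days=1)

body_today = build_body("site_5331", today)
body_yesterday = build_body("site_5331", yesterday)

# Decode again to inspect what would be sent
print(base64.b64decode(body_yesterday).decode())
```

Calling get_data_cpcb once with `today` and once with `yesterday` would then give the two summaries to compare.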

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
