Convert Text to Speech in R

Deepanshu Bhalla

This article explains how to convert text to speech in R for free using Gemini.

Gemini Text-to-Speech (TTS) models automatically detect the input language and support 24 languages. They also offer 30 prebuilt voices with different tones and pitch styles.

Step 1: Get Gemini API Key

Go to Google AI Studio, sign in with your Google account and click 'Create API key' to generate a key using an existing Google Cloud project.
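
Rather than pasting the key directly into a script, one common option is to keep it in an environment variable and read it at run time. The sketch below assumes a variable named GEMINI_API_KEY has been set, for instance in ~/.Renviron. Both that variable name and the get_gemini_key() helper are hypothetical choices for this example, not something Gemini requires.

```r
# Read the Gemini API key from an environment variable.
# GEMINI_API_KEY is an assumed, illustrative variable name.
get_gemini_key <- function(var = "GEMINI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Usage, once the variable has been set:
# gemini_api_key <- get_gemini_key()
```

This keeps the key out of scripts that may be shared or committed to version control.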

Step 2: Install Required Libraries

We need to ensure that these R libraries (httr, jsonlite, base64enc and tuneR) are installed. They are used for making API calls, handling JSON, encoding/decoding base64 and processing audio files.


# Check and install missing packages
packages <- c("httr", "jsonlite", "base64enc", "tuneR")
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed])
}

Step 3: Configure Gemini TTS

In this step, you need to update the following configuration settings for Gemini Text-to-Speech.

  • gemini_api_key: Your API key to access the Gemini model.
  • text: The message you want to convert to speech.
  • instruction: Specifies how the text should be spoken (e.g. tone or style).
  • voice_name: The name of the voice to use from prebuilt voices in Gemini.
  • audio_file_name: The output filename to be used for saving the audio.
  • model_id: The specific Gemini model used for generating speech.

# Load required libraries
library(httr)
library(jsonlite)
library(base64enc)
library(tuneR)

# Configuration
gemini_api_key <- "xxxxxxxxxxxxxxxx"
text <- "Hey, how are you doing? This is my R code. Test text to speech capability of Gemini."
instruction <- "Read aloud in a casual tone."
voice_name <- "Leda"
audio_file_name <- "audio_output.wav"
model_id <- "gemini-2.5-flash-preview-tts"

# TTS Code
generate_content_api <- "streamGenerateContent"
# Combine the instruction and the text into a single prompt
# (named `prompt` so it does not share a name with httr's content() function)
prompt <- paste(instruction, text, sep = "\n")
base_url <- sprintf(
  "https://generativelanguage.googleapis.com/v1beta/models/%s:%s",
  model_id,
  generate_content_api
)

# Build the request body
req_body <- list(
  contents = list(
    list(
      role = "user",
      parts = list(
        list(text = prompt)
      )
    )
  ),
  generationConfig = list(
    responseModalities = list("audio"),
    temperature = 1,
    speech_config = list(
      voice_config = list(
        prebuilt_voice_config = list(
          voice_name = voice_name
        )
      )
    )
  )
)

# Perform the POST request
res <- POST(
  url    = base_url,
  query  = list(key = gemini_api_key),
  body   = req_body,
  encode = "json",
  content_type_json()
)

# Check for HTTP errors
stop_for_status(res)

# Parse the JSON response
resp_json <- content(res, as = "parsed", simplifyVector = TRUE)

# Extract the base64‑encoded audio
inline_data <- resp_json$candidates[[1]]$content$parts[[1]]$inlineData

b64_audio <- inline_data$data
mime_type <- inline_data$mimeType

# Decode base64 into raw PCM
pcm_audio <- base64decode(b64_audio)

# --- WAV Header Construction ---
write_wav <- function(filename, audio_data, mime_type) {
  # Parse mime_type for bits per sample and rate
  rate <- 24000
  bits_per_sample <- 16
  matches <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))
  if (length(matches[[1]]) == 3) {
    bits_per_sample <- as.numeric(matches[[1]][2])
    rate <- as.numeric(matches[[1]][3])
  }
  
  num_channels <- 1
  bytes_per_sample <- bits_per_sample / 8
  block_align <- num_channels * bytes_per_sample
  byte_rate <- rate * block_align
  data_size <- length(audio_data)
  chunk_size <- 36 + data_size
  
  # Construct the 44-byte WAV header (all numeric fields little-endian)
  header <- c(
    charToRaw("RIFF"),
    writeBin(as.integer(chunk_size), raw(), size=4, endian="little"),
    charToRaw("WAVE"),
    charToRaw("fmt "),
    writeBin(as.integer(16), raw(), size=4, endian="little"),
    writeBin(as.integer(1), raw(), size=2, endian="little"),
    writeBin(as.integer(num_channels), raw(), size=2, endian="little"),
    writeBin(as.integer(rate), raw(), size=4, endian="little"),
    writeBin(as.integer(byte_rate), raw(), size=4, endian="little"),
    writeBin(as.integer(block_align), raw(), size=2, endian="little"),
    writeBin(as.integer(bits_per_sample), raw(), size=2, endian="little"),
    charToRaw("data"),
    writeBin(as.integer(data_size), raw(), size=4, endian="little")
  )
  
  # Combine header and audio data
  full_wav <- c(header, audio_data)
  
  # Write to file
  writeBin(full_wav, filename)
  message("Saved proper WAV to: ", filename)
}

# Save WAV file with header
write_wav(audio_file_name, pcm_audio, mime_type)

# Read and play the audio
audio <- readWave(audio_file_name)
play(audio)
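
The write_wav() function above rests on two small pieces of logic: pulling the bit depth and sample rate out of the MIME type, and deriving the byte rate from them. The sketch below isolates that logic so it can be checked without calling the API. The MIME string "audio/L16;codec=pcm;rate=24000" is an assumed example of what the API may return, and parse_pcm_mime() and pcm_duration() are hypothetical helper names introduced only for illustration.

```r
# Parse bit depth and sample rate from a PCM MIME type,
# falling back to 16-bit / 24000 Hz as write_wav() does.
parse_pcm_mime <- function(mime_type) {
  out <- list(bits_per_sample = 16, rate = 24000)
  m <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))[[1]]
  if (length(m) == 3) {
    out$bits_per_sample <- as.numeric(m[2])
    out$rate <- as.numeric(m[3])
  }
  out
}

# Duration in seconds of PCM data of a given size in bytes.
pcm_duration <- function(n_bytes, fmt, num_channels = 1) {
  byte_rate <- fmt$rate * num_channels * fmt$bits_per_sample / 8
  n_bytes / byte_rate
}

fmt <- parse_pcm_mime("audio/L16;codec=pcm;rate=24000")  # assumed example value
fmt$bits_per_sample        # 16
fmt$rate                   # 24000
pcm_duration(96000, fmt)   # 2 seconds: 96000 / (24000 * 1 * 2)
```

With the real response, pcm_duration(length(pcm_audio), parse_pcm_mime(mime_type)) would give a quick sanity check that the decoded audio has a plausible length before it is written to disk.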

Translate Text into Different Languages

Gemini TTS can translate text into multiple languages. The code below instructs Gemini to read English text in Spanish and generate the corresponding audio file. You just need to update the following two parameters.


instruction <- "Read this in Spanish."
text <- "Hey, how are you doing? This is my R code."
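
Because the instruction and the text are simply joined into one prompt, switching the language only changes that string. A minimal sketch of how the prompt is assembled, mirroring the paste() call in the main script (build_prompt() is a hypothetical helper name used for illustration):

```r
# Join the speaking instruction and the text into one prompt,
# separated by a newline as in the main script.
build_prompt <- function(instruction, text) {
  paste(instruction, text, sep = "\n")
}

prompt <- build_prompt(
  "Read this in Spanish.",
  "Hey, how are you doing? This is my R code."
)
cat(prompt)
```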

Multi Speaker Text-to-Speech

Gemini's Text-to-Speech API supports multi-speaker synthesis, which allows you to assign different voices to different text segments. This lets you create dynamic, natural-sounding conversations in your audio output. It is ideal for use cases like dialogues, podcasts or narrated content with multiple characters.
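
As a rough sketch of the idea, each speaker name that appears in the transcript is paired with one prebuilt voice. The two helpers below are hypothetical (they are not part of any Gemini client library). They build a "Speaker: line" transcript and a list in the same shape as the speakerVoiceConfigs field used in the request body of the script that follows.

```r
# Build a "Speaker: line" transcript from parallel vectors.
build_transcript <- function(speakers, lines) {
  stopifnot(length(speakers) == length(lines))
  paste0(speakers, ": ", lines, collapse = "\n")
}

# Build the speakerVoiceConfigs list from a named character
# vector that maps speaker name -> prebuilt voice name.
build_speaker_configs <- function(voices) {
  unname(lapply(names(voices), function(speaker) {
    list(
      speaker = speaker,
      voiceConfig = list(
        prebuiltVoiceConfig = list(voiceName = voices[[speaker]])
      )
    )
  }))
}

text <- build_transcript(
  c("Joe", "Jane"),
  c("How's it going today Jane?", "Not too bad, how about you?")
)
configs <- build_speaker_configs(c(Joe = "Kore", Jane = "Puck"))
```

The speaker names in the mapping need to match the names used as prefixes in the transcript, which is why both are derived from the same inputs here.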


# Load required libraries
library(httr)
library(jsonlite)
library(base64enc)
library(tuneR)

# Configuration
gemini_api_key <- "xxxxxxxxxxxxxxxxxxxxxxxxxxx"
text <- "Joe: How's it going today Jane?\nJane: Not too bad, how about you?"
speaker1 <- "Joe"
speaker2 <- "Jane"
audio_file_name <- "multi_speaker_audio.wav"
model_id <- "gemini-2.5-flash-preview-tts"

# TTS endpoint
generate_content_api <- "streamGenerateContent"
base_url <- sprintf(
  "https://generativelanguage.googleapis.com/v1beta/models/%s:%s",
  model_id,
  generate_content_api
)

# Build the request body for multi-speaker
req_body <- list(
  contents = list(
    list(
      role = "user",
      parts = list(
        list(text = text)
      )
    )
  ),
  generationConfig = list(
    responseModalities = list("AUDIO"),
    speechConfig = list(
      multiSpeakerVoiceConfig = list(
        speakerVoiceConfigs = list(
          list(
            speaker = speaker1,
            voiceConfig = list(
              prebuiltVoiceConfig = list(
                voiceName = "Kore"
              )
            )
          ),
          list(
            speaker = speaker2,
            voiceConfig = list(
              prebuiltVoiceConfig = list(
                voiceName = "Puck"
              )
            )
          )
        )
      )
    )
  )
)

# Perform the POST request
res <- POST(
  url    = base_url,
  query  = list(key = gemini_api_key),
  body   = req_body,
  encode = "json",
  content_type_json()
)

# Check for HTTP errors
stop_for_status(res)

# Parse the JSON response
resp_json <- content(res, as = "parsed", simplifyVector = TRUE)

# Extract base64 audio
inline_data <- resp_json$candidates[[1]]$content$parts[[1]]$inlineData
b64_audio <- inline_data$data
mime_type <- inline_data$mimeType

# Decode to PCM
pcm_audio <- base64decode(b64_audio)

# Write WAV function
write_wav <- function(filename, audio_data, mime_type) {
  rate <- 24000
  bits_per_sample <- 16
  matches <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))
  if (length(matches[[1]]) == 3) {
    bits_per_sample <- as.numeric(matches[[1]][2])
    rate <- as.numeric(matches[[1]][3])
  }
  
  num_channels <- 1
  bytes_per_sample <- bits_per_sample / 8
  block_align <- num_channels * bytes_per_sample
  byte_rate <- rate * block_align
  data_size <- length(audio_data)
  chunk_size <- 36 + data_size
  
  header <- c(
    charToRaw("RIFF"),
    writeBin(as.integer(chunk_size), raw(), size=4, endian="little"),
    charToRaw("WAVE"),
    charToRaw("fmt "),
    writeBin(as.integer(16), raw(), size=4, endian="little"),
    writeBin(as.integer(1), raw(), size=2, endian="little"),
    writeBin(as.integer(num_channels), raw(), size=2, endian="little"),
    writeBin(as.integer(rate), raw(), size=4, endian="little"),
    writeBin(as.integer(byte_rate), raw(), size=4, endian="little"),
    writeBin(as.integer(block_align), raw(), size=2, endian="little"),
    writeBin(as.integer(bits_per_sample), raw(), size=2, endian="little"),
    charToRaw("data"),
    writeBin(as.integer(data_size), raw(), size=4, endian="little")
  )
  
  full_wav <- c(header, audio_data)
  writeBin(full_wav, filename)
  message("Saved WAV to: ", filename)
}

# Save and play
write_wav(audio_file_name, pcm_audio, mime_type)
audio <- readWave(audio_file_name)
play(audio)
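
Both scripts call streamGenerateContent and read audio only from the first element of the response. If the audio happens to arrive split across several streamed chunks, the pieces would need to be joined before writing the WAV file. The sketch below assumes the response was parsed with simplifyVector = FALSE (so it is a plain nested list with one element per chunk) and that each chunk follows the candidates, content, parts, inlineData layout used above. collect_inline_audio() is a hypothetical helper, and the mock chunks are placeholders standing in for a real response.

```r
# Gather base64 audio strings from every chunk of a streamed
# response that was parsed with simplifyVector = FALSE.
collect_inline_audio <- function(chunks) {
  pieces <- character(0)
  for (chunk in chunks) {
    if (length(chunk$candidates) == 0) next
    # Only the first candidate is read, as in the scripts above
    parts <- chunk$candidates[[1]]$content$parts
    for (part in parts) {
      data <- part$inlineData$data
      if (!is.null(data)) pieces <- c(pieces, data)
    }
  }
  pieces
}

# Mock response with two chunks (placeholder strings, not real audio)
mock <- list(
  list(candidates = list(list(content = list(parts = list(
    list(inlineData = list(mimeType = "audio/L16;rate=24000", data = "AAAA"))
  ))))),
  list(candidates = list(list(content = list(parts = list(
    list(inlineData = list(mimeType = "audio/L16;rate=24000", data = "BBBB"))
  )))))
)
pieces <- collect_inline_audio(mock)
```

Each piece could then be decoded separately with base64decode() and the resulting raw vectors concatenated, for example with do.call(c, lapply(pieces, base64decode)), before calling write_wav().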

Supported Voices

Name | Gender | Description
Achernar | Female | Clear mid-range voice; friendly, engaging tone - great for explainers and podcast intros
Aoede | Female | Clear, conversational, thoughtful mid-range - ideal for podcasts, e-learning
Autonoe | Female | Mature, deeper female tone; resonant and calm - suits documentaries and audiobooks
Callirrhoe | Female | Confident, clear mid-range; professional and articulate - excellent for business narration
Despina | Female | Warm, inviting, smooth; friendly and trustworthy mid-range - perfect for commercials
Erinome | Female | Professional, articulate, mid-to-lower mid-range - suited for education and museum guides
Gacrux | Female | Smooth, confident lower-mid-range; authoritative yet approachable
Kore | Female | Energetic, youthful, firm higher pitch - excellent for upbeat ads and tutorials
Laomedeia | Female | Clear, inquisitive mid-range; conversational and engaged - great for explainers
Leda | Female | Youthful, clear slightly higher pitch; composed and professional
Pulcherrima | Female | Bright, energetic higher pitch - youthful and engaging; for commercials, character voices
Sulafat | Female | Warm, confident mid-range; persuasive and articulate - good for marketing narration
Vindemiatrix | Female | Calm, thoughtful, lower pitch - mature and composed; suited for meditation and reflective content
Zephyr | Female | Energetic, bright, perky; distinctly higher pitch - excellent for children’s content
Achird | Male | Youthful mid-to-high male; inquisitive, slightly breathy, friendly - good for tutorials
Algenib | Male | Warm, confident male with deep authority - corporate and documentary use
Alnilam | Male | Energetic mid-range male; clear and direct - ideal for commercials and announcements
Charon | Male | Smooth, assured, approachable mid-low male - great for podcasts and corporate
Enceladus | Male | Energetic, enthusiastic mid-range; promotional and event tone
Fenrir | Male | Friendly, clear mid-range - conversational and approachable
Iapetus | Male | Casual mid-range “everyman” male - ideal for vlogs and informal tutorials
Orus | Male | Mature, resonant male - calming, authoritative narration
Puck | Male | Upbeat, playful, confident mid-range - ideal for demos and how-to guides
Rasalgethi | Male | Conversational mid-range with inquisitive tone - good for podcasts and explainers
Sadachbia | Male | Deeper, slightly raspy mid-low - cool, laid-back with gravitas
Sadaltager | Male | Friendly, articulate mid-range - great for corporate presentations
Schedar | Male | Even, steady mid-low male - approachable and down-to-earth
Umbriel | Male | Smooth, calm lower-mid; friendly authority
Zubenelgenubi | Male | Deep, resonant very low - commanding and formal; suited for trailers
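
A mistyped voice name is an easy mistake to make, so it can help to check voice_name locally before sending a request. The sketch below is one possible approach: supported_voices simply copies the names from the list above, and check_voice() is a hypothetical helper, not part of any Gemini library.

```r
# Prebuilt voice names copied from the list above
supported_voices <- c(
  "Achernar", "Aoede", "Autonoe", "Callirrhoe", "Despina", "Erinome",
  "Gacrux", "Kore", "Laomedeia", "Leda", "Pulcherrima", "Sulafat",
  "Vindemiatrix", "Zephyr", "Achird", "Algenib", "Alnilam", "Charon",
  "Enceladus", "Fenrir", "Iapetus", "Orus", "Puck", "Rasalgethi",
  "Sadachbia", "Sadaltager", "Schedar", "Umbriel", "Zubenelgenubi"
)

# Stop early if the chosen voice is not in the list.
check_voice <- function(voice_name) {
  if (!voice_name %in% supported_voices) {
    stop("Unknown voice: ", voice_name, call. = FALSE)
  }
  invisible(voice_name)
}

check_voice("Leda")
```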
