Convert Text to Speech in R

This article explains how to convert text to speech in R for free using Gemini.

Gemini Text-to-Speech (TTS) models automatically detect the input language and support 24 languages. They also offer 30 prebuilt voices with different tones and pitch styles.

Step 1: Get Gemini API Key

Go to Google AI Studio, sign in with your Google account and click 'Create API key' to generate a key using an existing Google Cloud project.
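
Rather than pasting the key directly into a script, you may prefer to keep it in an environment variable. The sketch below assumes a variable named GEMINI_API_KEY and a helper called get_gemini_key(); both names are chosen here for illustration and are not required by the API.

```r
# Hypothetical helper: read the Gemini API key from an environment variable.
# The variable name GEMINI_API_KEY is an assumption; any name works.
get_gemini_key <- function(var = "GEMINI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Example usage (after setting the variable, e.g. in ~/.Renviron):
# gemini_api_key <- get_gemini_key()
```

This keeps the key out of any script you share or commit to version control.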

Step 2: Install Required Libraries

Make sure the following R libraries are installed: httr, jsonlite, base64enc and tuneR. They handle API calls, JSON parsing, base64 encoding/decoding and audio file processing respectively.


# Check and install missing packages
packages <- c("httr", "jsonlite", "base64enc", "tuneR")
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed])
}

Step 3: Configure Gemini TTS

In this step, you need to update the following configuration settings for Gemini Text-to-Speech.

gemini_api_key: Your API key for accessing the Gemini model.
text: The message you want to convert to speech.
instruction: Specifies how the text should be spoken (e.g. tone or style).
voice_name: The name of the voice to use from prebuilt voices in Gemini.
audio_file_name: The output filename to be used for saving the audio.
model_id: The specific Gemini model used for generating speech.


# Load required libraries
library(httr)
library(jsonlite)
library(base64enc)
library(tuneR)

# Configuration
gemini_api_key <- "xxxxxxxxxxxxxxxx"
text <- "Hey, how are you doing? This is my R code. Test text to speech capability of Gemini."
instruction <- "Read aloud in a casual tone."
voice_name <- "Leda"
audio_file_name <- "audio_output.wav"
model_id <- "gemini-2.5-flash-preview-tts"

# TTS Code
generate_content_api <- "streamGenerateContent"
# Combine the instruction and the text into a single prompt
prompt <- paste(instruction, text, sep = "\n")
base_url <- sprintf(
  "https://generativelanguage.googleapis.com/v1beta/models/%s:%s",
  model_id,
  generate_content_api
)

# Build the request body
req_body <- list(
  contents = list(
    list(
      role = "user",
      parts = list(
        list(text = prompt)
      )
    )
  ),
  generationConfig = list(
    responseModalities = list("AUDIO"),
    temperature = 1,
    speechConfig = list(
      voiceConfig = list(
        prebuiltVoiceConfig = list(
          voiceName = voice_name
        )
      )
    )
  )
)

# Perform the POST request
res <- POST(
  url    = base_url,
  query  = list(key = gemini_api_key),
  body   = req_body,
  encode = "json",
  content_type_json()
)

# Check for HTTP errors
stop_for_status(res)

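# Parse the JSON response without simplification so the nesting stays predictable
resp_json <- content(res, as = "parsed", simplifyVector = FALSE)

# streamGenerateContent returns a list of chunks; wrap a single response the same way
chunks <- if (is.null(names(resp_json))) resp_json else list(resp_json)

# Collect the inlineData of every audio part, in order
inline_data <- list()
for (chunk in chunks) {
  if (length(chunk$candidates) == 0) next
  for (part in chunk$candidates[[1]]$content$parts) {
    if (!is.null(part$inlineData)) {
      inline_data <- c(inline_data, list(part$inlineData))
    }
  }
}
if (length(inline_data) == 0) stop("No audio data found in the response.")

mime_type <- inline_data[[1]]$mimeType

# Decode each base64-encoded chunk into raw PCM and join them
pcm_audio <- do.call(c, lapply(inline_data, function(x) base64decode(x$data)))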

# --- WAV Header Construction ---
write_wav <- function(filename, audio_data, mime_type) {
  # Parse mime_type for bits per sample and rate
  rate <- 24000
  bits_per_sample <- 16
  matches <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))
  if (length(matches[[1]]) == 3) {
    bits_per_sample <- as.numeric(matches[[1]][2])
    rate <- as.numeric(matches[[1]][3])
  }
  
  num_channels <- 1
  bytes_per_sample <- bits_per_sample / 8
  block_align <- num_channels * bytes_per_sample
  byte_rate <- rate * block_align
  data_size <- length(audio_data)
  chunk_size <- 36 + data_size
  
  # Construct the 44-byte WAV header (all numeric fields are little-endian)
  header <- c(
    charToRaw("RIFF"),
    writeBin(as.integer(chunk_size), raw(), size=4, endian="little"),
    charToRaw("WAVE"),
    charToRaw("fmt "),
    writeBin(as.integer(16), raw(), size=4, endian="little"),
    writeBin(as.integer(1), raw(), size=2, endian="little"),
    writeBin(as.integer(num_channels), raw(), size=2, endian="little"),
    writeBin(as.integer(rate), raw(), size=4, endian="little"),
    writeBin(as.integer(byte_rate), raw(), size=4, endian="little"),
    writeBin(as.integer(block_align), raw(), size=2, endian="little"),
    writeBin(as.integer(bits_per_sample), raw(), size=2, endian="little"),
    charToRaw("data"),
    writeBin(as.integer(data_size), raw(), size=4, endian="little")
  )
  
  # Combine header and audio data
  full_wav <- c(header, audio_data)
  
  # Write to file
  writeBin(full_wav, filename)
  message("Saved proper WAV to: ", filename)
}

# Save WAV file with header
write_wav(audio_file_name, pcm_audio, mime_type)

# Read and play the audio
# play() relies on an external player; on macOS/Linux set one with setWavPlayer() first
audio <- readWave(audio_file_name)
play(audio)
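
The header arithmetic in write_wav can be checked by hand. The sketch below repeats the MIME-type parsing and the size calculation for the value "audio/L16;codec=pcm;rate=24000", which is an assumed example consistent with the defaults in the script (16 bits, 24000 Hz). The helper names parse_pcm_mime and pcm_duration are hypothetical.

```r
# Hypothetical helper: extract bits per sample and sample rate from a MIME type,
# falling back to the same defaults used in write_wav().
parse_pcm_mime <- function(mime_type, default_bits = 16, default_rate = 24000) {
  m <- regmatches(mime_type, regexec("L([0-9]+).*rate=([0-9]+)", mime_type))[[1]]
  if (length(m) == 3) {
    list(bits_per_sample = as.numeric(m[2]), rate = as.numeric(m[3]))
  } else {
    list(bits_per_sample = default_bits, rate = default_rate)
  }
}

# Hypothetical helper: playback length in seconds of raw PCM of a given size
pcm_duration <- function(n_bytes, rate, bits_per_sample, num_channels = 1) {
  byte_rate <- rate * num_channels * bits_per_sample / 8
  n_bytes / byte_rate
}

fmt <- parse_pcm_mime("audio/L16;codec=pcm;rate=24000")
# 16-bit mono at 24000 Hz uses 48000 bytes per second,
# so 96000 bytes of PCM correspond to 2 seconds of audio
pcm_duration(96000, fmt$rate, fmt$bits_per_sample)
```

Comparing this duration with the length reported by your audio player is a quick way to confirm that the header fields match the data.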

Translate Text into Different Languages

Gemini TTS can translate text into other languages before speaking it. The code below instructs Gemini to read English text in Spanish and generate the corresponding audio file. You only need to update the following two parameters in the configuration from Step 3.


instruction <- "Read this in Spanish language."
text <- "Hey, how are you doing? This is my R code."
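
Because the script sends the instruction and the text as one prompt joined by a newline, switching languages only changes the first line of that prompt. Here is a small sketch of that idea; the helper name build_tts_prompt is hypothetical, and French is included only as an arbitrary second example.

```r
# Hypothetical helper: build the prompt the same way the script does,
# i.e. instruction and text joined by a newline.
build_tts_prompt <- function(text, language) {
  instruction <- sprintf("Read this in %s language.", language)
  paste(instruction, text, sep = "\n")
}

text <- "Hey, how are you doing? This is my R code."

# One prompt per target language, named by language
prompts <- vapply(c("Spanish", "French"), build_tts_prompt,
                  character(1), text = text)
cat(prompts[["Spanish"]])
```

Each element of prompts can then be sent as the text part of the request body.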

Multi Speaker Text-to-Speech

Gemini's Text-to-Speech API supports multi-speaker synthesis, which lets you assign different voices to different speakers in the text. This produces dynamic, natural-sounding conversations and suits dialogues, podcasts or narrated content with multiple characters. In the example below, each line of the text starts with a speaker name, and that name matches a speaker entry in the request.


# Load required libraries
library(httr)
library(jsonlite)
library(base64enc)
library(tuneR)

# Configuration
gemini_api_key <- "xxxxxxxxxxxxxxxxxxxxxxxxxxx"
text <- "Joe: How's it going today Jane?\nJane: Not too bad, how about you?"
speaker1 <- "Joe"
speaker2 <- "Jane"
audio_file_name <- "multi_speaker_audio.wav"
model_id <- "gemini-2.5-flash-preview-tts"

# TTS endpoint
generate_content_api <- "streamGenerateContent"
base_url <- sprintf(
  "https://generativelanguage.googleapis.com/v1beta/models/%s:%s",
  model_id,
  generate_content_api
)

# Build the request body for multi-speaker
req_body <- list(
  contents = list(
    list(
      role = "user",
      parts = list(
        list(text = text)
      )
    )
  ),
  generationConfig = list(
    responseModalities = list("AUDIO"),
    speechConfig = list(
      multiSpeakerVoiceConfig = list(
        speakerVoiceConfigs = list(
          list(
            speaker = speaker1,
            voiceConfig = list(
              prebuiltVoiceConfig = list(
                voiceName = "Kore"
              )
            )
          ),
          list(
            speaker = speaker2,
            voiceConfig = list(
              prebuiltVoiceConfig = list(
                voiceName = "Puck"
              )
            )
          )
        )
      )
    )
  )
)

# Perform the POST request
res <- POST(
  url    = base_url,
  query  = list(key = gemini_api_key),
  body   = req_body,
  encode = "json",
  content_type_json()
)

# Check for HTTP errors
stop_for_status(res)

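# Parse the JSON response without simplification so the nesting stays predictable
resp_json <- content(res, as = "parsed", simplifyVector = FALSE)

# streamGenerateContent returns a list of chunks; wrap a single response the same way
chunks <- if (is.null(names(resp_json))) resp_json else list(resp_json)

# Collect the inlineData of every audio part, in order
inline_data <- list()
for (chunk in chunks) {
  if (length(chunk$candidates) == 0) next
  for (part in chunk$candidates[[1]]$content$parts) {
    if (!is.null(part$inlineData)) {
      inline_data <- c(inline_data, list(part$inlineData))
    }
  }
}
if (length(inline_data) == 0) stop("No audio data found in the response.")

mime_type <- inline_data[[1]]$mimeType

# Decode each base64-encoded chunk into raw PCM and join them
pcm_audio <- do.call(c, lapply(inline_data, function(x) base64decode(x$data)))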

# Write WAV function
write_wav <- function(filename, audio_data, mime_type) {
  rate <- 24000
  bits_per_sample <- 16
  matches <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))
  if (length(matches[[1]]) == 3) {
    bits_per_sample <- as.numeric(matches[[1]][2])
    rate <- as.numeric(matches[[1]][3])
  }
  
  num_channels <- 1
  bytes_per_sample <- bits_per_sample / 8
  block_align <- num_channels * bytes_per_sample
  byte_rate <- rate * block_align
  data_size <- length(audio_data)
  chunk_size <- 36 + data_size
  
  header <- c(
    charToRaw("RIFF"),
    writeBin(as.integer(chunk_size), raw(), size=4, endian="little"),
    charToRaw("WAVE"),
    charToRaw("fmt "),
    writeBin(as.integer(16), raw(), size=4, endian="little"),
    writeBin(as.integer(1), raw(), size=2, endian="little"),
    writeBin(as.integer(num_channels), raw(), size=2, endian="little"),
    writeBin(as.integer(rate), raw(), size=4, endian="little"),
    writeBin(as.integer(byte_rate), raw(), size=4, endian="little"),
    writeBin(as.integer(block_align), raw(), size=2, endian="little"),
    writeBin(as.integer(bits_per_sample), raw(), size=2, endian="little"),
    charToRaw("data"),
    writeBin(as.integer(data_size), raw(), size=4, endian="little")
  )
  
  full_wav <- c(header, audio_data)
  writeBin(full_wav, filename)
  message("Saved WAV to: ", filename)
}

# Save and play
write_wav(audio_file_name, pcm_audio, mime_type)
audio <- readWave(audio_file_name)
play(audio)
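
The two hand-written speaker entries in the request body above follow the same pattern, so they can also be generated from a named vector. The following is only a sketch: the helper name build_speaker_configs is hypothetical, and the field names are copied from the script above.

```r
# Hypothetical helper: turn a named character vector (speaker name -> voice name)
# into the speakerVoiceConfigs list used in the multi-speaker request body.
build_speaker_configs <- function(voices) {
  unname(lapply(names(voices), function(speaker) {
    list(
      speaker = speaker,
      voiceConfig = list(
        prebuiltVoiceConfig = list(voiceName = voices[[speaker]])
      )
    )
  }))
}

speaker_configs <- build_speaker_configs(c(Joe = "Kore", Jane = "Puck"))

# Inspect the nested structure
str(speaker_configs)
```

The result can be passed as speakerVoiceConfigs = speaker_configs when building req_body. The list is left unnamed so that it serializes as a JSON array.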

Supported Voices

Name	Gender	Description
Achernar	Female	Clear mid-range voice; friendly, engaging tone - great for explainers and podcast intros
Aoede	Female	Clear, conversational, thoughtful mid‑range - ideal for podcasts, e-learning
Autonoe	Female	Mature, deeper female tone; resonant and calm - suits documentaries and audiobooks
Callirrhoe	Female	Confident, clear mid-range; professional and articulate - excellent for business narration
Despina	Female	Warm, inviting, smooth; friendly and trustworthy mid‑range - perfect for commercials
Erinome	Female	Professional, articulate, mid-to-lower mid‑range - suited for education and museum guides
Gacrux	Female	Smooth, confident lower-mid-range; authoritative yet approachable
Kore	Female	Energetic, youthful, firm higher pitch - excellent for upbeat ads and tutorials
Laomedeia	Female	Clear, inquisitive mid-range; conversational and engaged - great for explainers
Leda	Female	Youthful, clear slightly higher pitch; composed and professional
Pulcherrima	Female	Bright, energetic higher pitch - youthful and engaging; for commercials, character voices
Sulafat	Female	Warm, confident mid-range; persuasive and articulate - good for marketing narration
Vindemiatrix	Female	Calm, thoughtful, lower pitch - mature and composed; suited for meditation and reflective content
Zephyr	Female	Energetic, bright, perky; distinctly higher pitch - excellent for children’s content
Achird	Male	Youthful mid-to-high male; inquisitive, slightly breathy, friendly - good for tutorials
Algenib	Male	Warm, confident male with deep authority - corporate and documentary use
Alnilam	Male	Energetic mid-range male; clear and direct - ideal for commercials and announcements
Charon	Male	Smooth, assured, approachable mid-low male - great for podcasts and corporate
Enceladus	Male	Energetic, enthusiastic mid-range; promotional and event tone
Fenrir	Male	Friendly, clear mid-range - conversational and approachable
Iapetus	Male	Casual mid-range “everyman” male - ideal for vlogs and informal tutorials
Orus	Male	Mature, resonant male - calming, authoritative narration
Puck	Male	Upbeat, playful, confident mid-range - ideal for demos and how-to guides
Rasalgethi	Male	Conversational mid-range with inquisitive tone - good for podcasts and explainers
Sadachbia	Male	Deeper, slightly raspy mid‑low - cool, laid-back with gravitas
Sadaltager	Male	Friendly, articulate mid-range - great for corporate presentations
Schedar	Male	Even, steady mid-low male - approachable and down-to-earth
Umbriel	Male	Smooth, calm lower-mid; friendly authority
Zubenelgenubi	Male	Deep, resonant very low - commanding and formal; suited for trailers
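
A typo in voice_name is easy to make, so it can help to check the name against the table before sending a request. This is a minimal sketch: the vector is copied from the table above, the helper name check_voice is hypothetical, and whether the API itself treats names as case-sensitive is not verified here.

```r
# Voice names copied from the table above
gemini_voices <- c(
  "Achernar", "Aoede", "Autonoe", "Callirrhoe", "Despina", "Erinome",
  "Gacrux", "Kore", "Laomedeia", "Leda", "Pulcherrima", "Sulafat",
  "Vindemiatrix", "Zephyr", "Achird", "Algenib", "Alnilam", "Charon",
  "Enceladus", "Fenrir", "Iapetus", "Orus", "Puck", "Rasalgethi",
  "Sadachbia", "Sadaltager", "Schedar", "Umbriel", "Zubenelgenubi"
)

# Hypothetical helper: stop early if the chosen voice is not in the table
check_voice <- function(voice_name, voices = gemini_voices) {
  if (!voice_name %in% voices) {
    stop("Unknown voice: ", voice_name, call. = FALSE)
  }
  invisible(voice_name)
}

check_voice("Leda")
```

Calling check_voice(voice_name) right after the configuration block catches a misspelled name before any API call is made.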

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
