This article explains how to convert text to speech in R for free using Gemini.
Gemini Text-to-Speech (TTS) models automatically detect the input language and support 24 languages. They also offer 30 prebuilt voices with different tones and pitches.
Step 1: Get Gemini API Key
Go to Google AI Studio, sign in with your Google account and click 'Create API key' to generate a key using an existing Google Cloud project.
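Rather than pasting the key directly into a script, one option is to keep it in an environment variable and read it at run time. The sketch below assumes a variable named GEMINI_API_KEY; that name is a choice made here for illustration, not something the API requires, and get_gemini_key() is a hypothetical helper.

```r
# Read the Gemini API key from an environment variable.
# "GEMINI_API_KEY" is an assumed name; use whatever name you set.
get_gemini_key <- function(var = "GEMINI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Example usage (sets the variable for the current session only):
# Sys.setenv(GEMINI_API_KEY = "your-key-here")
# gemini_api_key <- get_gemini_key()
```

To make the variable persist across sessions, it can be added to your ~/.Renviron file instead of calling Sys.setenv().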
Step 2: Install Required Libraries
Make sure the R packages httr, jsonlite, base64enc and tuneR are installed. They handle the API calls, JSON parsing, base64 encoding/decoding and audio file processing, respectively.
# Check and install missing packages
packages <- c("httr", "jsonlite", "base64enc", "tuneR")
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed])
}
Step 3: Configure Gemini TTS
In this step, you need to update the following configuration settings for Gemini Text-to-Speech.
gemini_api_key: Your API key for accessing the Gemini model.
text: The message you want to convert to speech.
instruction: Specifies how the text should be spoken (e.g. tone or style).
voice_name: The name of the voice to use from the prebuilt voices in Gemini.
audio_file_name: The output filename used for saving the audio.
model_id: The specific Gemini model used for generating speech.
# Load required libraries
library(httr)
library(jsonlite)
library(base64enc)
library(tuneR)
# Configuration
gemini_api_key <- "xxxxxxxxxxxxxxxx"
text <- "Hey, how are you doing? This is my R code. Test text to speech capability of Gemini."
instruction <- "Read aloud in a casual tone."
voice_name <- "Leda"
audio_file_name <- "audio_output.wav"
model_id <- "gemini-2.5-flash-preview-tts"
# TTS Code
generate_content_api <- "streamGenerateContent"
content <- paste(instruction, text, sep = "\n")
base_url <- sprintf(
  "https://generativelanguage.googleapis.com/v1beta/models/%s:%s",
  model_id,
  generate_content_api
)
# Build the request body
req_body <- list(
  contents = list(
    list(
      role = "user",
      parts = list(
        list(text = content)
      )
    )
  ),
  generationConfig = list(
    responseModalities = list("audio"),
    temperature = 1,
    speech_config = list(
      voice_config = list(
        prebuilt_voice_config = list(
          voice_name = voice_name
        )
      )
    )
  )
)
# Perform the POST request
res <- POST(
  url = base_url,
  query = list(key = gemini_api_key),
  body = req_body,
  encode = "json",
  content_type_json()
)
# Check for HTTP errors
stop_for_status(res)
# Parse the JSON response
resp_json <- content(res, as = "parsed", simplifyVector = TRUE)
# Extract the base64‑encoded audio
inline_data <- resp_json$candidates[[1]]$content$parts[[1]]$inlineData
b64_audio <- inline_data$data
mime_type <- inline_data$mimeType
# Decode base64 into raw PCM
pcm_audio <- base64decode(b64_audio)
# --- WAV Header Construction ---
write_wav <- function(filename, audio_data, mime_type) {
  # Parse the MIME type for bits per sample and sample rate
  rate <- 24000
  bits_per_sample <- 16
  matches <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))
  if (length(matches[[1]]) == 3) {
    bits_per_sample <- as.numeric(matches[[1]][2])
    rate <- as.numeric(matches[[1]][3])
  }
  num_channels <- 1
  bytes_per_sample <- bits_per_sample / 8
  block_align <- num_channels * bytes_per_sample
  byte_rate <- rate * block_align
  data_size <- length(audio_data)
  chunk_size <- 36 + data_size
  # Construct the 44-byte WAV header (RIFF/WAVE, PCM format)
  header <- c(
    charToRaw("RIFF"),
    writeBin(as.integer(chunk_size), raw(), size = 4, endian = "little"),
    charToRaw("WAVE"),
    charToRaw("fmt "),
    writeBin(as.integer(16), raw(), size = 4, endian = "little"),
    writeBin(as.integer(1), raw(), size = 2, endian = "little"),
    writeBin(as.integer(num_channels), raw(), size = 2, endian = "little"),
    writeBin(as.integer(rate), raw(), size = 4, endian = "little"),
    writeBin(as.integer(byte_rate), raw(), size = 4, endian = "little"),
    writeBin(as.integer(block_align), raw(), size = 2, endian = "little"),
    writeBin(as.integer(bits_per_sample), raw(), size = 2, endian = "little"),
    charToRaw("data"),
    writeBin(as.integer(data_size), raw(), size = 4, endian = "little")
  )
  # Combine header and audio data
  full_wav <- c(header, audio_data)
  # Write to file
  writeBin(full_wav, filename)
  message("Saved proper WAV to: ", filename)
}
# Save WAV file with header
write_wav(audio_file_name, pcm_audio, mime_type)
# Read and play the audio
audio <- readWave(audio_file_name)
play(audio)
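The write_wav() function above reads the bit depth and sample rate from the MIME type of the response. As a small illustration of that step, the sketch below applies the same regular expression to an example MIME string and then works out how long a clip of a given size would play. The string "audio/L16;codec=pcm;rate=24000" is an assumption about what the response looks like, and parse_pcm_mime() and pcm_duration() are hypothetical helper names.

```r
# Parse a PCM MIME type such as "audio/L16;codec=pcm;rate=24000".
# Falls back to 16-bit, 24 kHz when the pattern does not match.
parse_pcm_mime <- function(mime_type, default_bits = 16, default_rate = 24000) {
  m <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))[[1]]
  if (length(m) == 3) {
    list(bits = as.numeric(m[2]), rate = as.numeric(m[3]))
  } else {
    list(bits = default_bits, rate = default_rate)
  }
}

# Duration in seconds of raw PCM audio with the given byte length
pcm_duration <- function(n_bytes, bits, rate, channels = 1) {
  n_bytes / (rate * channels * bits / 8)
}

fmt <- parse_pcm_mime("audio/L16;codec=pcm;rate=24000")
# 16-bit mono at 24 kHz uses 48000 bytes per second,
# so 96000 bytes correspond to 2 seconds of audio
duration <- pcm_duration(96000, fmt$bits, fmt$rate)
print(duration)
```

In the main script, pcm_duration(length(pcm_audio), fmt$bits, fmt$rate) would give a quick sanity check on the decoded audio before it is saved.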
Translate Text into Different Languages
Gemini TTS can translate text into multiple languages. The code below instructs Gemini to read English text in Spanish and generate the corresponding audio file. You just need to update the following two parameters.
instruction <- "Read this in Spanish language."
text <- "Hey, how are you doing? This is my R code."
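If you want the same text in several languages, the instruction can be generated per language and combined with the text in the same way the main script builds its prompt. This is only a sketch: the language list and the output file naming scheme are illustrative choices, not part of the original script.

```r
text <- "Hey, how are you doing? This is my R code."
languages <- c("Spanish", "French", "German")  # illustrative choices

# One instruction per language, following the pattern used above
instructions <- sprintf("Read this in %s language.", languages)

# Combine instruction and text, one prompt per language
prompts <- paste(instructions, text, sep = "\n")
names(prompts) <- languages

# Hypothetical naming scheme for the output files
audio_files <- sprintf("audio_%s.wav", tolower(languages))

cat(prompts[["Spanish"]], "\n")
```

Each element of prompts could then be sent as the request text in turn, saving the result under the matching name in audio_files.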
Multi Speaker Text-to-Speech
Gemini's Text-to-Speech API supports multi-speaker synthesis, which allows you to assign different voices to different text segments. This lets you create dynamic, natural-sounding conversations in your audio output. It is ideal for use cases like dialogues, podcasts or narrated content with multiple characters.
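To show how speakers and voices pair up, here is a minimal sketch that builds the per-speaker voice list and the transcript from simple R objects. The helper build_speaker_configs() and the turns data frame are hypothetical names introduced for illustration; the field names mirror those used in the request body of the script that follows, and the assumption is that speaker labels in the transcript should match the labels in the configuration.

```r
# Map each speaker label used in the transcript to a prebuilt voice
voices <- c(Joe = "Kore", Jane = "Puck")

# Build one configuration entry per speaker
build_speaker_configs <- function(voices) {
  unname(lapply(names(voices), function(speaker) {
    list(
      speaker = speaker,
      voiceConfig = list(
        prebuiltVoiceConfig = list(voiceName = voices[[speaker]])
      )
    )
  }))
}

# Dialogue turns, one row per line of speech
turns <- data.frame(
  speaker = c("Joe", "Jane"),
  line = c("How's it going today Jane?", "Not too bad, how about you?"),
  stringsAsFactors = FALSE
)

# Every speaker in the dialogue needs a voice assigned
stopifnot(all(turns$speaker %in% names(voices)))

speaker_configs <- build_speaker_configs(voices)
text <- paste(sprintf("%s: %s", turns$speaker, turns$line), collapse = "\n")
cat(text, "\n")
```

Building the list this way keeps the speaker names in one place, so changing a voice or a speaker label only means editing the voices vector.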
# Load required libraries
library(httr)
library(jsonlite)
library(base64enc)
library(tuneR)
# Configuration
gemini_api_key <- "xxxxxxxxxxxxxxxxxxxxxxxxxxx"
text <- "Joe: How's it going today Jane?\nJane: Not too bad, how about you?"
speaker1 <- "Joe"
speaker2 <- "Jane"
audio_file_name <- "multi_speaker_audio.wav"
model_id <- "gemini-2.5-flash-preview-tts"
# TTS endpoint
generate_content_api <- "streamGenerateContent"
base_url <- sprintf(
  "https://generativelanguage.googleapis.com/v1beta/models/%s:%s",
  model_id,
  generate_content_api
)
# Build the request body for multi-speaker
req_body <- list(
  contents = list(
    list(
      role = "user",
      parts = list(
        list(text = text)
      )
    )
  ),
  generationConfig = list(
    responseModalities = list("AUDIO"),
    speechConfig = list(
      multiSpeakerVoiceConfig = list(
        speakerVoiceConfigs = list(
          list(
            speaker = speaker1,
            voiceConfig = list(
              prebuiltVoiceConfig = list(
                voiceName = "Kore"
              )
            )
          ),
          list(
            speaker = speaker2,
            voiceConfig = list(
              prebuiltVoiceConfig = list(
                voiceName = "Puck"
              )
            )
          )
        )
      )
    )
  )
)
# Perform the POST request
res <- POST(
  url = base_url,
  query = list(key = gemini_api_key),
  body = req_body,
  encode = "json",
  content_type_json()
)
# Check for HTTP errors
stop_for_status(res)
# Parse the JSON response
resp_json <- content(res, as = "parsed", simplifyVector = TRUE)
# Extract base64 audio
inline_data <- resp_json$candidates[[1]]$content$parts[[1]]$inlineData
b64_audio <- inline_data$data
mime_type <- inline_data$mimeType
# Decode to PCM
pcm_audio <- base64decode(b64_audio)
# Write WAV function
write_wav <- function(filename, audio_data, mime_type) {
  rate <- 24000
  bits_per_sample <- 16
  matches <- regmatches(mime_type, regexec("L(\\d+).*rate=(\\d+)", mime_type))
  if (length(matches[[1]]) == 3) {
    bits_per_sample <- as.numeric(matches[[1]][2])
    rate <- as.numeric(matches[[1]][3])
  }
  num_channels <- 1
  bytes_per_sample <- bits_per_sample / 8
  block_align <- num_channels * bytes_per_sample
  byte_rate <- rate * block_align
  data_size <- length(audio_data)
  chunk_size <- 36 + data_size
  header <- c(
    charToRaw("RIFF"),
    writeBin(as.integer(chunk_size), raw(), size = 4, endian = "little"),
    charToRaw("WAVE"),
    charToRaw("fmt "),
    writeBin(as.integer(16), raw(), size = 4, endian = "little"),
    writeBin(as.integer(1), raw(), size = 2, endian = "little"),
    writeBin(as.integer(num_channels), raw(), size = 2, endian = "little"),
    writeBin(as.integer(rate), raw(), size = 4, endian = "little"),
    writeBin(as.integer(byte_rate), raw(), size = 4, endian = "little"),
    writeBin(as.integer(block_align), raw(), size = 2, endian = "little"),
    writeBin(as.integer(bits_per_sample), raw(), size = 2, endian = "little"),
    charToRaw("data"),
    writeBin(as.integer(data_size), raw(), size = 4, endian = "little")
  )
  full_wav <- c(header, audio_data)
  writeBin(full_wav, filename)
  message("Saved WAV to: ", filename)
}
# Save and play
write_wav(audio_file_name, pcm_audio, mime_type)
audio <- readWave(audio_file_name)
play(audio)
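If a saved file does not play, it can help to confirm that the header was written as expected. The sketch below uses only base R: it builds the same 44-byte PCM header layout as write_wav() for a short block of silence, writes it to a temporary file and reads a few fields back. The mono, 16-bit, 24 kHz settings are the defaults assumed in the scripts above, and int_le() is a hypothetical shorthand.

```r
# Little-endian integer of the given byte size, as raw bytes
int_le <- function(x, size) {
  writeBin(as.integer(x), raw(), size = size, endian = "little")
}

pcm <- as.raw(rep(0, 480))  # 10 ms of silence as stand-in audio
rate <- 24000
bits <- 16
channels <- 1
block_align <- channels * bits / 8

header <- c(
  charToRaw("RIFF"), int_le(36 + length(pcm), 4), charToRaw("WAVE"),
  charToRaw("fmt "), int_le(16, 4), int_le(1, 2), int_le(channels, 2),
  int_le(rate, 4), int_le(rate * block_align, 4), int_le(block_align, 2),
  int_le(bits, 2), charToRaw("data"), int_le(length(pcm), 4)
)

path <- tempfile(fileext = ".wav")
writeBin(c(header, pcm), path)

# Read the header back and decode selected fields
con <- file(path, "rb")
bytes <- readBin(con, "raw", n = 44)
close(con)

riff <- rawToChar(bytes[1:4])
sample_rate <- readBin(bytes[25:28], "integer", size = 4, endian = "little")
data_size <- readBin(bytes[41:44], "integer", size = 4, endian = "little")
cat(riff, sample_rate, data_size, "\n")
```

The same three reads can be pointed at audio_file_name to inspect a file produced by the scripts above; data_size should equal the file size minus 44 bytes.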
Prebuilt Voices in Gemini TTS
Name | Gender | Description |
---|---|---|
Achernar | Female | Clear mid-range voice; friendly, engaging tone - great for explainers and podcast intros |
Aoede | Female | Clear, conversational, thoughtful mid‑range - ideal for podcasts, e-learning |
Autonoe | Female | Mature, deeper female tone; resonant and calm - suits documentaries and audiobooks |
Callirrhoe | Female | Confident, clear mid-range; professional and articulate - excellent for business narration |
Despina | Female | Warm, inviting, smooth; friendly and trustworthy mid‑range - perfect for commercials |
Erinome | Female | Professional, articulate, mid-to-lower mid‑range - suited for education and museum guides |
Gacrux | Female | Smooth, confident lower-mid-range; authoritative yet approachable |
Kore | Female | Energetic, youthful, firm higher pitch - excellent for upbeat ads and tutorials |
Laomedeia | Female | Clear, inquisitive mid-range; conversational and engaged - great for explainers |
Leda | Female | Youthful, clear slightly higher pitch; composed and professional |
Pulcherrima | Female | Bright, energetic higher pitch - youthful and engaging; for commercials, character voices |
Sulafat | Female | Warm, confident mid-range; persuasive and articulate - good for marketing narration |
Vindemiatrix | Female | Calm, thoughtful, lower pitch - mature and composed; suited for meditation and reflective content |
Zephyr | Female | Energetic, bright, perky; distinctly higher pitch - excellent for children’s content |
Achird | Male | Youthful mid-to-high male; inquisitive, slightly breathy, friendly - good for tutorials |
Algenib | Male | Warm, confident male with deep authority - corporate and documentary use |
Alnilam | Male | Energetic mid-range male; clear and direct - ideal for commercials and announcements |
Charon | Male | Smooth, assured, approachable mid-low male - great for podcasts and corporate |
Enceladus | Male | Energetic, enthusiastic mid-range; promotional and event tone |
Fenrir | Male | Friendly, clear mid-range - conversational and approachable |
Iapetus | Male | Casual mid-range “everyman” male - ideal for vlogs and informal tutorials |
Orus | Male | Mature, resonant male - calming, authoritative narration |
Puck | Male | Upbeat, playful, confident mid-range - ideal for demos and how-to guides |
Rasalgethi | Male | Conversational mid-range with inquisitive tone - good for podcasts and explainers |
Sadachbia | Male | Deeper, slightly raspy mid‑low - cool, laid-back with gravitas |
Sadaltager | Male | Friendly, articulate mid-range - great for corporate presentations |
Schedar | Male | Even, steady mid-low male - approachable and down-to-earth |
Umbriel | Male | Smooth, calm lower-mid; friendly authority |
Zubenelgenubi | Male | Deep, resonant very low - commanding and formal; suited for trailers |
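When experimenting with different voices, it can be convenient to keep part of this table in R and check a voice name before sending a request. The sketch below copies a handful of rows from the table (not the full list); voice_table and check_voice() are hypothetical names used for illustration.

```r
# A few voices copied from the table above (not the full list)
voice_table <- data.frame(
  name = c("Kore", "Leda", "Zephyr", "Charon", "Puck", "Orus"),
  gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Stop early if a voice name is not in the local table
check_voice <- function(voice_name, table = voice_table) {
  if (!voice_name %in% table$name) {
    stop("Unknown voice: ", voice_name, call. = FALSE)
  }
  invisible(voice_name)
}

# List the male voices in the local table
male_voices <- voice_table$name[voice_table$gender == "Male"]
print(male_voices)

# Validate the voice used in the single-speaker script
voice_name <- check_voice("Leda")
```

A check like this catches a misspelled voice name locally instead of relying on an error from the API.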