This API converts spoken language into written text, a core capability for applications such as voice assistants, subtitling services, and note-taking apps. This guide describes the parameters you can use with the endpoint to customize and optimize transcription for your needs.
The Audio Transcription API endpoint accepts audio files and transcribes them into text using speech recognition algorithms. The parameters you send with a request determine how the API processes the audio input, so it is worth understanding them before making a request.
Sending Audio for Transcription
Audio transcription endpoint
The POST endpoint for the Gladia Audio Transcription API is:
<https://api.gladia.io/audio/text/audio-transcription/>
Body Params
- audio (optional if audio_url is set, file): The file to transcribe.
- audio_url (optional if audio is set, string): The URL of the file to transcribe; ignored if an audio file is provided. This can be a public file URL or a link from any of the supported social platforms listed in the documentation.
- language_behaviour (optional, enum): Defines how the transcription model detects the audio language. Default value is automatic single language. Possible values:
  - manual: Manually define the language of the transcription using the language parameter.
  - automatic single language: Default value and recommended choice for most cases. The model auto-detects the prominent language in the audio, then transcribes the full audio to that language. Segments in other languages are automatically translated to the prominent language. This mode is also recommended when the audio starts in one language for a short while and then switches to another for the majority of the duration.
  - automatic multiple languages: For specific scenarios where the language changes multiple times throughout the audio (e.g. a conversation between two people, each speaking a different language). The model continuously detects the spoken language and switches the transcription language accordingly. Please note that certain strong accents can cause this mode to transcribe to the wrong language.
- language (optional, string): e.g. "english". If language_behaviour is set to manual, defines the language to use for the transcription.
- toggle_noise_reduction (optional, boolean): "true"/"false". Activates noise reduction to improve transcription quality.
- transcription_hint (optional, string): Provides a custom vocabulary to the model to improve the accuracy of transcribing context-specific words, technical terms, names, etc. If empty, this argument is skipped.
- toggle_diarization (optional, boolean): "true"/"false". Activates diarization of the audio.
- toggle_direct_translate [Beta] (optional, boolean): "true"/"false". Activates direct translation of the audio transcription. Translation is currently in beta.
- target_translation_language [Beta] (optional, string): e.g. "english". If toggle_direct_translate is set to true, defines the language to use for the translation of the transcription.
- webhook_url (optional, url): e.g. "https://webhook.site/df092e71-24c0-4fa7-a672-c6913fa164ec". Webhook URL to send the result to. Make sure it is reachable over the network.
- diarization_num_speakers (optional, integer): Guiding number of speakers. Forces the model to detect an exact number of speakers in the audio.
- diarization_min_speakers (optional, integer): Guiding minimum number of speakers. Forces the model to detect no fewer than this number of speakers in the audio.
- diarization_max_speakers (optional, integer): Guiding maximum number of speakers. Forces the model to detect no more than this number of speakers in the audio.
- output_format (optional, enum): Defines the output format: json, srt, vtt, or txt. srt and vtt are subtitle file formats for video content.
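Several of these parameters depend on one another, and those dependencies can be checked before a request is sent. The following is a minimal sketch, not part of the Gladia API: the helper name build_transcription_fields and its has_audio_file flag are hypothetical, and treating a missing language or target_translation_language as an error is an assumption, since the descriptions above do not say how the API reacts in those cases.

```python
from typing import Dict, Optional

# Allowed enum values, taken from the parameter descriptions above.
LANGUAGE_BEHAVIOURS = {
    "manual",
    "automatic single language",
    "automatic multiple languages",
}
OUTPUT_FORMATS = {"json", "srt", "vtt", "txt"}


def build_transcription_fields(
    audio_url: Optional[str] = None,
    has_audio_file: bool = False,  # hypothetical flag: True if a file is uploaded separately
    language_behaviour: str = "automatic single language",
    language: Optional[str] = None,
    toggle_diarization: bool = False,
    toggle_direct_translate: bool = False,
    target_translation_language: Optional[str] = None,
    output_format: Optional[str] = None,
) -> Dict[str, str]:
    """Return the text form fields for a request, or raise ValueError."""
    if not has_audio_file and not audio_url:
        raise ValueError("provide either an audio file or audio_url")
    if language_behaviour not in LANGUAGE_BEHAVIOURS:
        raise ValueError(f"unknown language_behaviour: {language_behaviour!r}")
    # Assumption: we refuse to send 'manual' without a language.
    if language_behaviour == "manual" and not language:
        raise ValueError("language is required when language_behaviour is manual")
    # Assumption: we refuse to enable translation without a target language.
    if toggle_direct_translate and not target_translation_language:
        raise ValueError("target_translation_language is required for translation")
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")

    fields = {
        "language_behaviour": language_behaviour,
        "toggle_diarization": "true" if toggle_diarization else "false",
        "toggle_direct_translate": "true" if toggle_direct_translate else "false",
    }
    # audio_url is ignored by the API when a file is provided, so omit it then.
    if audio_url and not has_audio_file:
        fields["audio_url"] = audio_url
    if language_behaviour == "manual":
        fields["language"] = language
    if toggle_direct_translate:
        fields["target_translation_language"] = target_translation_language
    if output_format is not None:
        fields["output_format"] = output_format
    return fields
```

The returned dictionary contains only string values, so it can be passed as the text fields of a multipart form in whichever HTTP client you use.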
Sending Video for Transcription
Video transcription endpoint
The POST endpoint for video transcription with the Gladia Audio Transcription API is:
<https://api.gladia.io/video/text/video-transcription/>
Body Params
- video (optional if video_url is set, file): The file to transcribe.
- video_url (optional if video is set, string): The URL of the file to transcribe; ignored if a video file is provided. This can be a public file URL or a link from any of the supported social platforms listed in the documentation.
- language_behaviour (optional, enum): One of the following: manual, automatic single language, automatic multiple languages. Defines how the speaker's language will be detected. Default value is automatic single language.
- language (optional, string): e.g. "english". If language_behaviour is set to manual, defines the language to use for the transcription.
- toggle_noise_reduction (optional, boolean): "true"/"false". Activates noise reduction to improve transcription quality.
- transcription_hint (optional, string): Provides a custom vocabulary to the model to improve the accuracy of transcribing context-specific words, technical terms, names, etc. If empty, this argument is skipped.
- toggle_diarization (optional, boolean): "true"/"false". Activates diarization of the audio.
- toggle_direct_translate [Beta] (optional, boolean): "true"/"false". Activates direct translation of the audio transcription. Translation is currently in beta.
- target_translation_language [Beta] (optional, string): e.g. "english". If toggle_direct_translate is set to true, defines the language to use for the translation of the transcription.
- webhook_url (optional, url): e.g. "https://webhook.site/df092e71-24c0-4fa7-a672-c6913fa164ec". Webhook URL to send the result to. Make sure it is reachable over the network.
- diarization_num_speakers (optional, integer): Guiding number of speakers. Forces the model to detect an exact number of speakers in the audio.
- diarization_min_speakers (optional, integer): Guiding minimum number of speakers. Forces the model to detect no fewer than this number of speakers in the audio.
- diarization_max_speakers (optional, integer): Guiding maximum number of speakers. Forces the model to detect no more than this number of speakers in the audio.
- output_format (optional, enum): Defines the output format: json, srt, vtt, or txt. srt and vtt are subtitle file formats for video content.
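Audio and video files go to different endpoints and use different file fields (audio versus video), so a client that handles both may want to pick the endpoint from the file type. The following is a minimal sketch under the assumption that the file extension reliably reflects the media type; pick_endpoint is a hypothetical helper, not part of the Gladia API, and it relies only on Python's standard mimetypes module.

```python
import mimetypes
from typing import Tuple

# Endpoint URLs as given in this guide.
AUDIO_ENDPOINT = "https://api.gladia.io/audio/text/audio-transcription/"
VIDEO_ENDPOINT = "https://api.gladia.io/video/text/video-transcription/"


def pick_endpoint(path: str) -> Tuple[str, str]:
    """Return (endpoint URL, file field name) for a local media file.

    The decision is based on the MIME type guessed from the file extension.
    """
    mime, _encoding = mimetypes.guess_type(path)
    if mime and mime.startswith("video/"):
        return VIDEO_ENDPOINT, "video"
    if mime and mime.startswith("audio/"):
        return AUDIO_ENDPOINT, "audio"
    raise ValueError(f"cannot tell whether {path!r} is audio or video")
```

For example, pick_endpoint("meeting.mp4") selects the video endpoint and the video field, while pick_endpoint("audio.mp3") selects the audio endpoint and the audio field. Extensions that the standard library does not recognize as audio or video raise ValueError, so the caller has to decide explicitly.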
Code samples
# cURL: transcribe a local audio file
curl -X POST \
  -H 'x-gladia-key: YOUR_GLADIA_TOKEN' \
  -H 'accept: application/json' \
  -F 'audio=@audio.mp3;type=audio/mp3' \
  -F 'toggle_diarization=true' \
  https://api.gladia.io/audio/text/audio-transcription/
# Python: transcribe a local audio file (requires the requests package)
import os
import sys

import requests

headers = {
    'x-gladia-key': '',  # Replace with your Gladia Token
    'accept': 'application/json',  # Accept JSON as a response; the request itself is multipart form data
}

file_path = 'audio.mp3'  # Change with your file path

if not os.path.exists(file_path):  # Stop early if the file is missing
    sys.exit('- File does not exist')
print('- File exists')

file_name = os.path.basename(file_path)  # File name including extension
file_extension = os.path.splitext(file_path)[1]  # e.g. '.mp3'

with open(file_path, 'rb') as f:  # Open the file in binary mode
    files = {
        # Sending a local audio file as (filename, file object, MIME type)
        'audio': (file_name, f, 'audio/' + file_extension[1:]),
        # You can also send a URL for your audio file. Make sure it's the direct link and publicly accessible.
        # 'audio_url': (None, 'http://files.gladia.io/example/audio-transcription/split_infinity.wav'),
        # Then you can pass any parameters you want. Please see: https://docs.gladia.io/reference/pre-recorded
        'toggle_diarization': (None, 'true'),
    }
    print('- Sending request to Gladia API...')
    response = requests.post('https://api.gladia.io/audio/text/audio-transcription/', headers=headers, files=files)

if response.status_code == 200:
    print('- Request successful')
    print(response.json())
else:
    print('- Request failed')
    print(response.text)  # The error body may not be valid JSON

print('- End of work')
// Node.js (CommonJS): transcribe a remote file by URL
const axios = require("axios");
const FormData = require("form-data");

// Retrieve the Gladia key from the command line
const gladiaKey = process.argv[2];
if (!gladiaKey) {
  console.error("You must provide a gladia key. Go to app.gladia.io");
  process.exit(1);
}

async function url() {
  const form = new FormData();
  form.append(
    "audio_url",
    "http://files.gladia.io/example/audio-transcription/split_infinity.wav"
  );
  form.append("output_format", "json");

  const response = await axios.post(
    "https://api.gladia.io/audio/text/audio-transcription/",
    form,
    {
      headers: {
        // form.getHeaders() sets Content-Type with the multipart boundary; do not override it
        ...form.getHeaders(),
        accept: "application/json",
        "x-gladia-key": gladiaKey,
      },
    }
  );
  console.log(JSON.stringify(response.data, null, 2));
}

url().catch((error) => {
  console.error(error);
  process.exit(1);
});
// Node.js (ES modules): transcribe a local audio file
import axios from "axios";
import fs from "fs";
import path from "path";
import FormData from "form-data";

// Retrieve the Gladia key from the command line
const gladiaKey = process.argv[2];
if (!gladiaKey) {
  console.error("You must provide a gladia key. Go to app.gladia.io");
  process.exit(1);
}

const headers = {
  "x-gladia-key": gladiaKey,
  accept: "application/json", // Accept JSON as a response; the request itself is multipart form data
};

const file_path = "audio.mp3"; // Change with your file path

fs.access(file_path, fs.constants.F_OK, (err) => {
  if (err) {
    console.log("- File does not exist");
  } else {
    console.log("- File exists");
    const form = new FormData();
    const stream = fs.createReadStream(file_path);
    // Explicitly set the filename and MIME type; adjust contentType to match your file
    form.append("audio", stream, {
      filename: path.basename(file_path),
      contentType: "audio/mp3",
    });
    // form.append(
    //   "audio_url",
    //   "http://files.gladia.io/example/audio-transcription/split_infinity.wav"
    // );
    form.append("toggle_diarization", "true"); // form-data library requires fields to be string, Buffer or Stream

    console.log("- Sending request to Gladia API...");
    axios
      .post("https://api.gladia.io/audio/text/audio-transcription/", form, {
        // form.getHeaders() provides correctly formatted form-data boundaries
        // https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
        headers: { ...form.getHeaders(), ...headers },
      })
      .then((response) => {
        console.log(response.data); // Get the results
        console.log(response.status); // Status code
        console.log("- End of work");
      })
      .catch((error) => {
        console.error(error);
      });
  }
});
<?php
// PHP: transcribe a local audio file with the cURL extension
$file_path = 'audio.mp3'; // Change with your file path

if (!file_exists($file_path)) { // Stop early if the file is missing
    echo "- File does not exist\n";
    exit(1);
}
echo "- File exists\n";

$file_extension = pathinfo($file_path, PATHINFO_EXTENSION); // Get your audio file extension
// Create a CURLFile object: (path, MIME type, posted file name)
$audio_file = new CURLFile($file_path, 'audio/' . $file_extension, basename($file_path));

$data = [
    // Sending a local audio file
    'audio' => $audio_file,
    // You can also send a URL for your audio file. Make sure it's the direct link and publicly accessible.
    // 'audio_url' => 'http://files.gladia.io/example/audio-transcription/split_infinity.wav',
    // Then you can pass any parameters you want. Please see: https://docs.gladia.io/reference/pre-recorded
    'toggle_diarization' => 'true', // Send as a string; a PHP boolean true would be posted as "1"
];

$headers = [
    'x-gladia-key: {YOUR_GLADIA_TOKEN}', // Replace with your Gladia Token
    'accept: application/json', // Accept JSON as a response; the request itself is multipart form data
];

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://api.gladia.io/audio/text/audio-transcription/');
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, $data);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

echo "- Sending request to Gladia API...\n";
$response = curl_exec($curl);
$status_code = curl_getinfo($curl, CURLINFO_HTTP_CODE);

if ($status_code == 200) {
    echo "- Request successful\n";
    $result = json_decode($response, true);
    print_r($result);
} else {
    echo "- Request failed\n";
    $error = json_decode($response, true);
    print_r($error);
}

curl_close($curl);
echo "- End of work\n";
?>
Understanding billing
Audio hours are billed per track (channel), counting only tracks that are non-silent and not similar to another track.
For instance, a one-hour stereo recording with two channels, L (Local) and R (Remote), as typically used in call centers, is billed as two hours because each channel is counted separately.
However, if one channel contains one hour of non-silent audio and the other channel is silent throughout, only the non-silent channel is billed, amounting to one hour.
Furthermore, when two channels both contain non-silent content, a similarity algorithm assesses the resemblance between the tracks. If the content on the two channels is found to be similar or extremely similar, only one channel is billed. For example, if there is one hour of similar non-silent audio on channel 0 and channel 1, you will only be billed for the audio on channel 0.
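The three cases above can be worked through with a small calculation. This is a minimal sketch for estimating billed hours, not Gladia's billing implementation: the Channel class and its silent and duplicate flags are hypothetical, and the caller supplies them because this guide does not describe how silence or similarity are detected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Channel:
    hours: float             # duration of the channel in hours
    silent: bool = False     # assumed flag: channel is silent throughout
    duplicate: bool = False  # assumed flag: similar to a channel already counted


def billable_hours(channels: List[Channel]) -> float:
    """Sum the hours of channels that are neither silent nor duplicates."""
    return sum(c.hours for c in channels if not c.silent and not c.duplicate)


# One-hour stereo call with two distinct channels: billed as two hours.
two_speakers = [Channel(1.0), Channel(1.0)]

# One channel silent throughout: only the non-silent channel is billed.
one_silent = [Channel(1.0), Channel(1.0, silent=True)]

# Channel 1 is similar to channel 0: only channel 0 is billed.
similar = [Channel(1.0), Channel(1.0, duplicate=True)]

for name, chans in [("distinct", two_speakers), ("silent", one_silent), ("similar", similar)]:
    print(name, billable_hours(chans))
```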