Leveraging AI tools for transcription is a great way to enhance voice chat applications. While some AI services like OpenAI's Whisper use WAV, others, like Google Speech-to-Text, recommend FLAC instead.
We already provide a sample showing how to use our NodeJS SDK to transcribe audio streams in real time with OpenAI's Whisper; it encodes the audio data to WAV. In this article, we will show you how to encode audio data to FLAC for real-time transcription.
Getting started with ODIN
Please refer to the Transcribe sample for the full script and for how to get started with ODIN. With the information provided here, it should be easy to change it to use FLAC instead of WAV.
Why FLAC?
FLAC is a lossless audio format: the audio data is compressed without any loss in quality. This matters for transcription, because transcription quality depends directly on the quality of the audio data.
How to encode audio data to FLAC
We will use the libflacjs module to encode the audio data to FLAC. It is a wrapper around the libflac library, which is written in C and compiled to WebAssembly.
The first thing we need to do is import the module (in TypeScript; please refer to the libflacjs GitHub page for how to import it in JavaScript):
import * as Flac from 'libflacjs/dist/libflac';
import {Encoder} from "libflacjs/lib/encoder";
Next, we need to create an instance of the Encoder class. The constructor takes a Flac object and an options object.
const encoder = new Encoder(Flac, {
channels: 1,
sampleRate: 48000,
bitsPerSample: 16,
verify: false,
compression: 0
});
Please remember the bitsPerSample setting, as it will play an important role in the next section. The compression option maps to FLAC's compression levels (0-8); level 0 encodes fastest, which is a sensible choice for real-time use.
Understanding audio data
Once you have created the encoder, you can start encoding audio data. The encode method of the Encoder class takes an Int32Array as input.
In ODIN, you subscribe to the AudioDataReceived event and receive raw audio data there. Let's have a look at the event structure you get:
/**
* The payload for the AudioDataReceived event.
*/
export declare interface OdinAudioDataReceivedEventPayload {
/**
* The ID of the peer that sent the audio data.
*/
peerId: number;
/**
* The ID of the media that sent the audio data.
*/
mediaId: number;
/**
* The audio data received from the peer as 16-bit PCM samples ranging from -32768 to 32767 as a byte array.
* Use `const samplesArray = new Int16Array(samples16.buffer)` to get an actual array
*/
samples16: Uint8Array;
/**
* The audio data received from the peer as 32-bit PCM samples ranging from -1 to 1.
* Use `const floats = new Float32Array(samples32.buffer)` to get an actual array
*/
samples32: Uint8Array;
}
Most audio libraries expect the audio data to be in a specific format, but most formats are just different ways of expressing the same thing. In the NodeJS SDK, we provide the audio data in two different formats: 16-bit PCM samples and 32-bit PCM samples.
PCM stands for Pulse-Code Modulation. It is a method used to digitally represent sampled analog signals. The samples are stored as a sequence of binary numbers. The number of bits used to represent each sample is called the bit depth. The higher the bit depth, the more accurate the representation of the signal. The bit depth is also called the resolution or the word length.
The bit depth is important because it determines the dynamic range of the audio signal, i.e. the ratio between the loudest and the quietest sound that can be represented. The higher the dynamic range, the more accurately quiet and loud passages can be captured; it is closely related to the achievable signal-to-noise ratio.
The sample rate is the number of samples per second. The higher the sample rate, the more accurate the representation of the signal. The sample rate is also called the sampling frequency.
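As a rule of thumb, each bit of resolution adds roughly 6 dB of dynamic range (20 * log10(2) ≈ 6.02 dB). Here is a quick sketch of that arithmetic:
// Dynamic range in dB for a given bit depth: 20 * log10(2^bits)
const dynamicRangeDb = (bitDepth: number): number => 20 * Math.log10(Math.pow(2, bitDepth));
console.log(dynamicRangeDb(16).toFixed(1)); // ≈ 96.3 dB for 16-bit audio
console.log(dynamicRangeDb(24).toFixed(1)); // ≈ 144.5 dB for 24-bit audio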
In our SDK, samples16 is a raw byte array with 2 bytes per sample; the samples are signed 16-bit integers ranging from -32768 to 32767. samples32 is a raw byte array with 4 bytes per sample; the samples are signed 32-bit floats ranging from -1 to 1.
Some audio libraries, like the WAV encoder used in the example code, accept these Uint8Array objects directly. Others, like the FLAC encoder, expect the audio data in an Int32Array. So, how do we convert the Uint8Array to an Int32Array?
Converting Uint8Array to Int32Array
Typed array constructors like Int32Array take an ArrayBuffer as input. We don't need to build a new ArrayBuffer from the Uint8Array, because every Uint8Array exposes its underlying ArrayBuffer through its buffer property:
const samples32 = new Float32Array(event.samples32.buffer);
const samples16 = new Int16Array(event.samples16.buffer);
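Both views describe the same audio, just at different resolutions. A quick sanity check (a sketch, assuming the 16-bit samples were produced by scaling the floats with a factor of about 32767):
// Both arrays should contain the same number of samples...
console.assert(samples16.length === samples32.length, 'sample counts should match');
// ...and agree up to rounding once the float sample is scaled to the 16-bit range
console.assert(Math.abs(samples16[0] - Math.round(samples32[0] * 32767)) <= 1);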
The FLAC encoder expects the audio data to be in an Int32Array. So, we need to scale the range of -1 to 1 to either -32768 to 32767 for 16 bit (256 * 256 = 65536 values, which split in half for negative and positive numbers gives 32768) or -8388608 to 8388607 for 24 bit. All we need to do is get the samples as a Float32Array, scale each sample to the desired range, and convert it to an Int32Array.
This is the function that does the heavy lifting for us:
const convertAudioSamples = function (samples32: Uint8Array, bitDepth: number): Int32Array {
// Samples from ODIN are 32-bit floats encoded in a Uint8Array, i.e. 4 bytes of the Uint8Array form one sample
// (a 32-bit float value). Instead of converting every 4 bytes by hand, we can simply create a Float32Array
// view on the Uint8Array's underlying buffer
const floats = new Float32Array(samples32.buffer);
// We now have a Float32Array with the samples ranging from -1 to 1. We need to convert them to signed integers
// with the range of the bit depth. So for 16 bit, we need to convert the floats to integers in the range of
// -32768 to 32767. For 24 bit, we need to convert the floats to integers in the range of -8388608 to 8388607.
const scale = Math.pow(2, bitDepth - 1) - 1; // calculate the scale factor based on the bit depth
// Create a new Int32Array with the same length as the Float32Array and convert the floats to integers using
// the scale calculated above
const intArray = new Int32Array(floats.length);
for (let i = 0; i < floats.length; i++) {
intArray[i] = Math.round(floats[i] * scale);
}
return intArray;
}
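To see what the function does, here is a small sanity check with a hand-built buffer (the values are made up purely for illustration):
// Pack three float samples into a Uint8Array, mimicking what ODIN delivers
const testSamples = new Float32Array([0.5, -1.0, 0.0]);
const packed = new Uint8Array(testSamples.buffer);
console.log(Array.from(convertAudioSamples(packed, 16))); // [16384, -32767, 0], since Math.round(0.5 * 32767) = 16384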
Encoding the audio data
Now that we have the audio data in the correct format, we can encode it to FLAC. The encode method of the Encoder class takes an Int32Array as input, so we can pass the result of the convertAudioSamples function to it:
// Add an event listener for audio data received events and feed the samples to the FLAC encoder
room.addEventListener('AudioDataReceived', (event) => {
  const intArray = convertAudioSamples(event.samples32, 16);
  encoder.encode(intArray);
});
Please note: you'll need to use the same bit depth as you used when creating the encoder. In our example, we used 16 bit. If you want to use 24 bit, you need to change the bitsPerSample setting of the encoder to 24 and call convertAudioSamples with 24 as well.
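For example, a 24-bit setup would look like this (a sketch; everything else in the pipeline stays the same):
const encoder = new Encoder(Flac, {
  channels: 1,
  sampleRate: 48000,
  bitsPerSample: 24, // must match the bit depth passed to convertAudioSamples
  verify: false,
  compression: 0
});
// ...and inside the event listener:
const intArray = convertAudioSamples(event.samples32, 24);
encoder.encode(intArray);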
As you can see, that is pretty simple. Once you have finished encoding enough data, i.e. when the user stops talking, you can call the encode method of the Encoder class without arguments to finalize the encoding process:
// Finalize the encoding process
encoder.encode();
// Get the encoded data
const encData = encoder.getSamples();
// Write the file to disk (requires: import fs from 'node:fs';)
fs.writeFileSync("./test.flac", encData);
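Since libflac runs as WebAssembly, the encoder holds native resources; call its destroy method once you are done with it. Putting the pieces together, a minimal helper that finalizes one recording and hands back a fresh encoder for the next talk burst could look like this (finishAndSave is our own hypothetical helper, not part of the SDK or libflacjs):
// Hypothetical helper: finalize the current FLAC stream, write it to disk,
// free the WASM resources and return a fresh encoder for the next talk burst
const finishAndSave = (encoder: Encoder, path: string): Encoder => {
  encoder.encode(); // no arguments: finalize the stream
  fs.writeFileSync(path, encoder.getSamples());
  encoder.destroy(); // release the native resources held by the WASM encoder
  return new Encoder(Flac, { channels: 1, sampleRate: 48000, bitsPerSample: 16, verify: false, compression: 0 });
};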
Conclusion
In this article, we have shown you how to encode audio data to FLAC for real-time transcription and how to convert the audio data from the ODIN SDK into the format required by the FLAC encoder. We hope this article was helpful.