Leveraging AI tools for transcription is a great way to enhance voice chat applications. While some AI services like OpenAI's Whisper use WAV, others, like Google Speech-to-Text, recommend FLAC instead.
We already provide a sample showing how to use our NodeJS SDK to transcribe audio streams in real time with OpenAI's Whisper; it encodes the audio data to WAV. In this article, we will show you how to encode audio data to FLAC for real-time transcription.
Getting started with ODIN
Please refer to the Transcribe sample for the full script and for how to get started with ODIN. With the information provided here, it should be easy to change it to use FLAC instead of WAV.
Why FLAC?
FLAC is a lossless audio format: the audio data is compressed without any loss in quality. This matters for transcription, because transcription quality depends directly on the quality of the audio data.
How to encode audio data to FLAC
We will use the libflacjs module to encode the audio data to FLAC. It is a wrapper around the libflac library, which is written in C and compiled to WebAssembly.
The first thing we need to do is import the module (in TypeScript; please refer to the libflacjs GitHub page for how to import it in JavaScript):
import * as Flac from 'libflacjs/dist/libflac';
import {Encoder} from "libflacjs/lib/encoder";
Next, we need to create an instance of the Encoder class. The constructor takes a Flac object and an options object.
const encoder = new Encoder(Flac, {
channels: 1,
sampleRate: 48000,
bitsPerSample: 16,
verify: false,
compression: 0
});
Please remember the bitsPerSample setting, as it will play an important role in the next section. The compression option maps to FLAC's compression levels (0-8); level 0 encodes fastest, which is a sensible choice for real-time use.
Understanding audio data
Once you have created the encoder, you can start encoding audio data. The encode method of the Encoder class takes an Int32Array as input.
In ODIN, you subscribe to the AudioDataReceived event and receive raw audio data there. Let's have a look at the event structure you get:
/**
* The payload for the AudioDataReceived event.
*/
export declare interface OdinAudioDataReceivedEventPayload {
/**
* The ID of the peer that sent the audio data.
*/
peerId: number;
/**
* The ID of the media that sent the audio data.
*/
mediaId: number;
/**
* The audio data received from the peer as 16-bit PCM samples ranging from -32768 to 32767 as a byte array.
* Use `const samplesArray = new Int16Array(samples16.buffer)` to get an actual array
*/
samples16: Uint8Array;
/**
* The audio data received from the peer as 32-bit PCM samples ranging from -1 to 1.
* Use `const floats = new Float32Array(samples32.buffer)` to get an actual array
*/
samples32: Uint8Array;
}
Most audio libraries expect the audio data to be in a specific format, but most formats are just different ways of expressing the same thing. In the NodeJS SDK, we provide the audio data in two different formats: 16-bit PCM samples and 32-bit PCM samples.
PCM stands for Pulse-Code Modulation. It is a method used to digitally represent sampled analog signals. The samples are stored as a sequence of binary numbers. The number of bits used to represent each sample is called the bit depth. The higher the bit depth, the more accurate the representation of the signal. The bit depth is also called the resolution or the word length.
The bit depth is important because it determines the dynamic range of the audio signal, i.e. the ratio between the loudest and the quietest sound that can be represented. The higher the dynamic range, the more accurately quiet and loud passages can be captured; it is closely related to the achievable signal-to-noise ratio.
The sample rate is the number of samples per second. The higher the sample rate, the more accurate the representation of the signal. The sample rate is also called the sampling frequency.
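As a rule of thumb, each bit of resolution adds roughly 6 dB of dynamic range (20 * log10(2) ≈ 6.02 dB). Here is a quick sketch of that arithmetic:
// Dynamic range in dB for a given bit depth: 20 * log10(2^bits)
const dynamicRangeDb = (bitDepth: number): number => 20 * Math.log10(Math.pow(2, bitDepth));
console.log(dynamicRangeDb(16).toFixed(1)); // ≈ 96.3 dB for 16-bit audio
console.log(dynamicRangeDb(24).toFixed(1)); // ≈ 144.5 dB for 24-bit audio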
In our SDK, samples16 is a raw byte array with 2 bytes per sample; the samples are signed 16-bit integers ranging from -32768 to 32767. samples32 is a raw byte array with 4 bytes per sample; the samples are signed 32-bit floats ranging from -1 to 1.
Some audio libraries, like the WAV encoder used in the example code, accept these Uint8Array objects directly. Others, like the FLAC encoder, expect the audio data in an Int32Array. So, how do we convert the Uint8Array to an Int32Array?
Converting Uint8Array to Int32Array
Typed array constructors like Int32Array take an ArrayBuffer as input. We don't need to build a new ArrayBuffer from the Uint8Array, because every Uint8Array exposes its underlying ArrayBuffer through its buffer property:
const samples32 = new Float32Array(event.samples32.buffer);
const samples16 = new Int16Array(event.samples16.buffer);
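Both views describe the same audio, just at different resolutions. A quick sanity check (a sketch, assuming the 16-bit samples were produced by scaling the floats with a factor of about 32767):
// Both arrays should contain the same number of samples...
console.assert(samples16.length === samples32.length, 'sample counts should match');
// ...and agree up to rounding once the float sample is scaled to the 16-bit range
console.assert(Math.abs(samples16[0] - Math.round(samples32[0] * 32767)) <= 1);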
The FLAC encoder expects the audio data to be in an Int32Array. So, we need to scale the range of -1 to 1 to either -32768 to 32767 for 16 bit (256 * 256 = 65536 values, which split in half for negative and positive numbers gives 32768) or -8388608 to 8388607 for 24 bit. All we need to do is get the samples as a Float32Array, scale each sample to the desired range, and convert it to an Int32Array.
This is the function that does the heavy lifting for us:
const convertAudioSamples = function (samples32: Uint8Array, bitDepth: number): Int32Array {
// Samples from ODIN are 32-bit floats encoded in a Uint8Array, i.e. 4 bytes of the Uint8Array form one sample
// (a 32-bit float value). Instead of converting every 4 bytes by hand, we can simply create a Float32Array
// view on the Uint8Array's underlying buffer
const floats = new Float32Array(samples32.buffer);
// We now have a Float32Array with the samples ranging from -1 to 1. We need to convert them to signed integers
// with the range of the bit depth. So for 16 bit, we need to convert the floats to integers in the range of
// -32768 to 32767. For 24 bit, we need to convert the floats to integers in the range of -8388608 to 8388607.
const scale = Math.pow(2, bitDepth - 1) - 1; // calculate the scale factor based on the bit depth
// Create a new Int32Array with the same length as the Float32Array and convert the floats to integers using
// the scale calculated above
const intArray = new Int32Array(floats.length);
for (let i = 0; i < floats.length; i++) {
intArray[i] = Math.round(floats[i] * scale);
}
return intArray;
}
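To see what the function does, here is a small sanity check with a hand-built buffer (the values are made up purely for illustration):
// Pack three float samples into a Uint8Array, mimicking what ODIN delivers
const testSamples = new Float32Array([0.5, -1.0, 0.0]);
const packed = new Uint8Array(testSamples.buffer);
console.log(Array.from(convertAudioSamples(packed, 16))); // [16384, -32767, 0], since Math.round(0.5 * 32767) = 16384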
Encoding the audio data
Now that we have the audio data in the correct format, we can encode it to FLAC. The encode method of the Encoder class takes an Int32Array as input, so we can pass the result of the convertAudioSamples function to it:
// Add an event listener for audio data received events and feed the samples to the FLAC encoder
room.addEventListener('AudioDataReceived', (event) => {
  const intArray = convertAudioSamples(event.samples32, 16);
  encoder.encode(intArray);
});
Please note: you'll need to use the same bit depth as you used when creating the encoder. In our example, we used 16 bit. If you want to use 24 bit, you need to change the bitsPerSample setting of the encoder to 24 and call convertAudioSamples with 24 as well.
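For example, a 24-bit setup would look like this (a sketch; everything else in the pipeline stays the same):
const encoder = new Encoder(Flac, {
  channels: 1,
  sampleRate: 48000,
  bitsPerSample: 24, // must match the bit depth passed to convertAudioSamples
  verify: false,
  compression: 0
});
// ...and inside the event listener:
const intArray = convertAudioSamples(event.samples32, 24);
encoder.encode(intArray);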
As you can see, that is pretty simple. Once you have finished encoding enough data, i.e. when the user stops talking, you can call the encode method of the Encoder class without arguments to finalize the encoding process:
// Finalize the encoding process
encoder.encode();
// Get the encoded data
const encData = encoder.getSamples();
// Write the file to disk (requires: import fs from 'node:fs';)
fs.writeFileSync("./test.flac", encData);
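Since libflac runs as WebAssembly, the encoder holds native resources; call its destroy method once you are done with it. Putting the pieces together, a minimal helper that finalizes one recording and hands back a fresh encoder for the next talk burst could look like this (finishAndSave is our own hypothetical helper, not part of the SDK or libflacjs):
// Hypothetical helper: finalize the current FLAC stream, write it to disk,
// free the WASM resources and return a fresh encoder for the next talk burst
const finishAndSave = (encoder: Encoder, path: string): Encoder => {
  encoder.encode(); // no arguments: finalize the stream
  fs.writeFileSync(path, encoder.getSamples());
  encoder.destroy(); // release the native resources held by the WASM encoder
  return new Encoder(Flac, { channels: 1, sampleRate: 48000, bitsPerSample: 16, verify: false, compression: 0 });
};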
Conclusion
In this article, we have shown you how to encode audio data to FLAC for real-time transcription and how to convert the audio data from the ODIN SDK into the format required by the FLAC encoder. We hope this article was helpful.