Basics of Representing Digital Audio

I wanted to cover how data is represented for digital audio. A lot of these concepts are highly coupled, so it’s difficult to explain one thing without mentioning things that haven’t been explained yet (PCM, samples, waves, data ranges, etc.), especially without making the article several times longer. So I would suggest reading and understanding what you can, then reading it again afterward; hopefully, all the prerequisites will be covered by the second readthrough.

The article is going to assume that you already know how sound occurs with vibrations.

Anatomy of a Wave

There are a few parts of a waveform I want to introduce.

Some labeled parts of a digital wave.

The first is the idea of a maximum value and a minimum value. These are the highest and lowest values a sample in an audio wave can take.

These limits are created by two things: the maximum power we allow our hardware to handle, and the limits of representing values in whatever medium we’re storing our audio in. For digital audio, this is the data range of the numbers we build from bytes.
As a metaphor, for analog vinyl: this would be the groove’s width and the allowable range the needle can move.

The maximum value will also be referred to as the max clip, because any sample value that attempts to go higher will be set to it (i.e., clipped to it). The same goes for the min clip, except with the minimum value.

DC is the resting position of the wave. It’s what the value of the wave would be if there was no power applied to generate the oscillating pattern of the wave. It’s usually the center of the wave and represents silence.

The amplitude is how far the farthest point of the wave is from the DC. This is generally what determines the loudness of the audio.

And the wavelength is what determines the pitch of the audio.

Although, as we’ve previously covered, complex audio is often made of many combined and weighted waves going in and out as packets of wavelets. So the wavelength and amplitude in this diagram should be read as “audio data here”.

Data Range

Computer data is made up of bits and bytes. A bit holds one of 2 unique values (sometimes referred to as 0 and 1, or off and on). We can group 8 of these to make a byte, which can hold one of 256 unique values. We can then group multiple bytes into a single number to represent an even greater range of unique values; common groupings are 2 bytes, 4 bytes, and 8 bytes.

By doing this, we’re able to greatly increase the numbers we can represent, at the cost of requiring more memory.

unique\ bit\ values = 2\\
bits\ per\ byte = 8\\
unique\ byte\ values = 2^8 = 256\\
unique\ 2\ byte\ values = 2^{(8 \times 2)} = 65536\\
unique\ 3\ byte\ values = 2^{(8 \times 3)} = 16777216\\
unique\ 4\ byte\ values = 2^{(8 \times 4)} = 4294967296
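These counts can be checked with a few lines of Python (a quick sketch; the function name is my own, not anything standard):

```python
# Count the unique values representable by a group of bytes (8 bits each).
def unique_values(num_bytes):
    return 2 ** (8 * num_bytes)

for n in (1, 2, 3, 4):
    print(f"{n} byte(s): {unique_values(n)} unique values")
```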

There are two common types of numbers used in a modern computer program: integers and IEEE floating-point numbers. Because there are only so many different unique values that can be represented with a given grouping of bytes, sample values will have a limit to the range of numbers they can be set to. This is where the min clip and max clip become relevant.

Integer Samples

Integers can only represent whole numbers, nothing fractional. Integers are either signed or unsigned: signed integers can represent positive or negative values, while unsigned integers can only represent positive values.

For 1-byte data, oftentimes the PCM data is unsigned, so each sample is a positive value from 0 to 255. But to get the final values used, 128 is subtracted from each. This value of 128 can also be referred to as the bias.
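As a sketch of removing the bias (the helper name and constant name are mine, not a standard API):

```python
# Unsigned 8-bit PCM stores samples as 0..255; subtracting the bias
# of 128 recenters them around 0 (silence).
BIAS = 128

def u8_to_signed(sample):
    return sample - BIAS

print(u8_to_signed(0))    # -128 (min clip)
print(u8_to_signed(128))  # 0 (the DC / silence)
print(u8_to_signed(255))  # 127 (max clip)
```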

Bytes  Bits  Unique Values (u)  Min Clip (-128)  Max Clip (u - (128 + 1))
1      8     256                -128             127

The minimum and maximum values of signed bytes. Unsigned byte PCM has a bias of 128. We take an extra value away from the max clip because we need a bit configuration to represent the value 0.

For the rest of the discussion of integers, assume I’m talking about signed numbers.

Signed integer values use a system called two’s complement. There’s a lot of technical details involved, but for our purposes, it gives us a way to calculate the range of an integer given how many bytes it has.

We first need to calculate the number of unique values. Half that value, negated, is the lowest negative value we can represent, and half that value minus 1 is the highest positive value we can represent. We take 1 positive value away so we can represent 0.

When using integers, we want to use the entire range of values we can represent. To do this, we make the highest positive value the max clip, and the lowest negative value the min clip.
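These two steps can be sketched in Python (the function name is my own choosing):

```python
# Derive the min and max clip for a signed integer sample of a given
# byte width, per two's complement: half the unique values are negative,
# and one positive value is given up to represent 0.
def clip_range(num_bytes):
    u = 2 ** (8 * num_bytes)          # number of unique values
    return (-(u // 2), (u // 2) - 1)  # (min clip, max clip)

for n in (1, 2, 3, 4):
    print(f"{n} byte(s): {clip_range(n)}")
```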

Bytes  Bits  Unique Values (u)  Min Clip (-u/2)  Max Clip ((u/2) - 1)
1      8     256                -128             127
2      16    65536              -32768           32767
3      24    16777216           -8388608         8388607
4      32    4294967296         -2147483648      2147483647

The minimum and maximum values of signed integer audio samples. This is based on how many unique values a certain number of bytes can hold, and two’s complement.

Floating Point Samples

When dealing with 32-bit values, the PCM samples can be either signed integers or floats. Whether a given file uses floats depends on its format. But when processing and streaming audio on the computer, 32-bit PCM data often implies a floating-point format.

Floating-point values can represent fractional values. So instead of only being able to represent 1 or 2, we can also represent 1.5, or 1.004, or 3.14159. Still, they are made of bytes and have exactly the same limits on the number of unique values they can represent as integers do.

Floating-point values are not digital value types specialized for audio. They’re a generic specification for fractional values. Because of this, the actual numbers they represent can get very high and very low. But, for sanity’s sake, we keep the values we use between -1.0 and 1.0. This means, for floating-point audio samples, the min clip is -1.0 and the max clip is 1.0. This also means we’re throwing a lot of usable values away, but that’s fine. 32-bit floats have more than enough values to represent high-quality audio.
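A sketch of both ideas (the names are mine, and the divisor is an assumption: dividing by 32768 is one common convention for converting 16-bit samples, though some code divides by 32767 instead):

```python
# Clamp a floating-point sample to the agreed-upon [-1.0, 1.0] range.
def clip_float(sample):
    return max(-1.0, min(1.0, sample))

# Map a signed 16-bit integer sample into [-1.0, 1.0].
def int16_to_float(sample):
    return clip_float(sample / 32768.0)

print(clip_float(1.5))         # 1.0 (clipped)
print(int16_to_float(-32768))  # -1.0
```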

Analog vs Digital

There are two different ways to record and store audio: analog, or digital.

Analog

Examples of analog audio technologies.

Analog is a physical medium: it uses physical properties to store recorded audio data to the highest level of quality that the physics will allow. Examples are cassette tapes and vinyl records. Because of this, an analog signal is called continuous: the audio signal it can represent can be as sharp or as smooth as possible.

It’s often argued that to the trained ear, analog audio gives a listening quality and experience that can’t be replicated with digital. That’s an argument for another time – it’s also an argument I don’t have skin in the game for.

But because things in the world are constantly degrading and the world is noisy, it’s impossible to make exact analog copies. Especially if copies are made from copies, that are made from copies, that are … all the way to … the real recording. It’s a proverbial game of telephone that adds slight noise and imperfections with each copy down the line.

Digital

Examples of digital audio technologies.

Digital, on the other hand, represents a signal as a large ordered collection of discrete values. The values we use for samples can only be the values that we can represent based on how many bytes we’re using, and what kind of number system (floating point vs integers) we’re using.

To increase the quality of digital audio, we need to increase the amount of resources we dedicate to it: more memory to represent more samples with a greater range, and more computing power to process them.

There are two upsides to using digital. The first advantage is that we can perfectly replicate digital data with computers – which includes audio. In the end, every piece of data is reduced to bits, values that can each be either a 0 or 1, so when copying those bits, it’s hard to add noise. This means an audio file can be copied many times, and copied from a copy a virtually infinite number of times, and each digital copy will be exactly identical.
It’s even harder to have digital transfer errors when we take into account checksums and error detection/correction schemes.

The second is that digital is the language of the computer, so it’s easy to program, edit, and play with a computer. We can modify and play around with sound using digital signal processing. And, we can effortlessly cut and stitch any parts with non-linear editors and digital audio workstations.

Signal

What is an audio signal? It’s a recording that maps a value to a moment in time. So, for example, if I had a 5-minute audio track, I could “ask” that data questions like “at 3 minutes, 47 seconds, and 5.5 milliseconds, how much voltage should the speaker receive?” And the data will return me a value between the min clip and max clip.
If we’re pedantic with this metaphor, other things besides the audio data, such as system volume and speaker volume, come into play before the final voltage is determined.

This is done by holding a bunch of different values for a bunch of different times. Each of these values for an instantaneous moment of time is called a sample.

Zooming into an audio file. We can see each sample (a dot) followed by another sample form a pattern which is our signal. Note how many samples there are and that this is viewing the timespan of only 4ms.

The time for a sample isn’t explicitly mapped. There isn’t some large directory in the data whose purpose is to map each sample to a time value. Instead, samples are often stored one after another sequentially in time. And we know how to move forwards or backward in time by moving a certain number of samples. For analog audio, moving through time in the audio data often involves traveling across the physical medium (i.e., rolling through a cassette tape, or rotating a needle across a vinyl record). For digital audio, this means jumping to a different place in memory by a certain calculated amount.
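For digital audio, that “calculated amount” is simple arithmetic. A sketch, assuming a mono stream at 44,100 Hz (the function is hypothetical, not from any library):

```python
SAMPLE_RATE = 44100  # samples per second

# Time isn't stored per sample; the index into the sample array is
# computed from the sample rate.
def sample_index(seconds):
    return int(seconds * SAMPLE_RATE)

# "At 3 minutes, 47 seconds, and 5.5 milliseconds..."
print(sample_index(3 * 60 + 47 + 0.0055))
```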

PCM Data

PCM is short for pulse code modulation. It’s how audio is stored digitally. In a nutshell, it’s a large array of numbers that are our audio samples.

The number of samples depends on the sample rate, measured in samples per second. The unit of a sample per second is a hertz, often shortened to Hz. A common high-quality rate is 44,100 Hz, which is the data rate of audio CDs. This value is based on the upper limit of human hearing, which is ~20,000 Hz.
The sample rate needs to be at least double the highest frequency of audio you wish to capture.

For example, an audio signal of 3 minutes, with 44,100 Hz data, recorded with 24bit (3 byte) samples…

3 \times 60 = 180\ seconds\\
180 \times 44100 = 7938000\ samples\\
7938000 \times 3 = 23814000\ bytes

23,814,000 bytes, or 23.81 megabytes.
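The same arithmetic as a sketch in Python (the function name is mine):

```python
# Uncompressed PCM size: duration x sample rate x bytes per sample.
def pcm_size_bytes(seconds, sample_rate, bytes_per_sample):
    return seconds * sample_rate * bytes_per_sample

print(pcm_size_bytes(3 * 60, 44100, 3))  # 23814000 bytes (~23.81 MB)
```

(A multi-channel recording would also multiply by the channel count, which this example leaves out.)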

Conclusion

To recap:

  • Digital audio is PCM data that represents varying voltage values sent to the speaker over time.
  • Values are tightly packed and sequentially ordered in time.
  • More memory in the form of wider-byte numbers and higher sample rates results in higher quality.

Below is a demo. Three different sample sounds. You can simulate degrading the audio, hear the difference, and see the waveform difference. The bitness has been exaggerated to go down to 2 bits so you can really experience degraded audio quality.


– William Leu

Explore more articles here.
Explore more audio articles here.

“Vinyl Record” by wuestenigel is licensed under CC BY 2.0
“Philips CC-1 Continuous Cassette – Tape – As New – ISO” by stuart.childs is licensed under CC BY 2.0
“favourite gift – mp3 player” by Ambernectar 13 is licensed under CC BY-ND 2.0
“cds freejpg.com.ar” by Freejpg.com.ar is licensed under CC BY 2.0