Audio Basics: WeChat Voice as an Example to Explain the Entire Process of Sound Digitization (ADC)

Research in voice AI inevitably involves concepts such as sample rate, bit depth, PCM, and WAV; analog-to-digital conversion (ADC) and digital-to-analog conversion (DAC) are essential means of sound signal processing!
Audio knowledge on the Internet feels a bit scattered, so this article organizes the basics to make them easy for beginners to read and understand!
Basics of acoustics: what is sound?
Sound is produced by vibrating objects, which generate sound waves; when those waves travel to the human ear, we perceive them as sound. Sound waves are the main carrier by which sound information is transmitted.
Generation & Propagation of Sound Waves
The short version: a vibrating object disturbs the air and produces sound waves, which travel through media such as air, water, or metal, and finally reach our ears to be heard.
Now the detailed version. Air consists of an enormous number of molecules, oxygen molecules, nitrogen molecules, and so on, that constantly move and collide, exerting a static pressure on anything the air touches. This pressure depends on the density and temperature of the air; near the Earth's surface, gravity squeezes the air together to a pressure of about 100,000 newtons per square meter (100,000 pascals), which we call atmospheric pressure.
When the air is disturbed by the motion and vibration of an object (a sound source), its density changes continuously. As the vibrating object moves outward, nearby air molecules get pushed and squeezed together, slightly raising the local density and pressure and forming a compression. As the object moves inward, the molecules spread out to fill the vacated space, producing a rarefaction where density and pressure drop slightly. These traveling pressure changes are what we call sound waves.
The particles carrying a sound wave (the various molecules in the air) oscillate back and forth along the direction the wave travels. A vivid analogy is to picture the air as a chain of golf balls connected by springs: push the leftmost ball to the right and the first spring compresses, which nudges the next ball to the right, which compresses the next spring, and so on down the chain to every ball. This is how a sound wave propagates.
Particle motion of sound waves
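To make the analogy concrete, here is a minimal sketch (my own illustration, not from the original figure) of that golf-ball-and-spring chain: a 1-D row of masses coupled by springs, stepped forward with symplectic Euler integration. All constants are arbitrary.

```python
import numpy as np

N, k, m, dt = 50, 100.0, 1.0, 0.001   # balls, spring constant, mass, time step
pos = np.zeros(N)    # displacement of each "golf ball" from its rest position
vel = np.zeros(N)
pos[0] = 1.0         # push the leftmost ball to the right

for _ in range(2000):
    force = np.zeros(N)
    # each spring pulls/pushes its two neighboring masses toward equilibrium
    force[1:]  += k * (pos[:-1] - pos[1:])
    force[:-1] += k * (pos[1:] - pos[:-1])
    vel += force / m * dt
    pos += vel * dt
# The initial displacement travels down the chain ball by ball -- just as a
# compression travels through air as a sound wave.
```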
The Three Elements of Sound Waves
A waveform diagram can completely describe a piece of sound. From the waveform we can tell what the sound is like; in fact, storing, transmitting, and even synthesizing a piece of sound later all come down to storing, transmitting, and synthesizing the waveform that corresponds to it. With the waveform information in hand, we can replay the sound at any time, in any place.
Let's look at the three elements of a sound wave: pitch, loudness, and timbre.
Pitch
Pitch describes how high or low a sound is. It depends on the frequency of the sound source's vibration, i.e., the number of times the object vibrates per second, measured in hertz: the higher the frequency, the higher the pitch. For example, when you fill a thermos with boiling water, the sound gradually changes. While filling, the sound source is the air column above the water surface; as the water level rises, the air column keeps getting shorter and vibrates ever faster, so the pitch climbs higher and higher and the sound grows more and more shrill. The human ear can hear frequencies from about 20 Hz to 20,000 Hz.
On the waveform, frequency corresponds to how densely packed the cycles are: the denser the waveform, the higher the frequency and the higher the pitch.
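As a quick illustration (a sketch with arbitrary parameters, not code from the article): two sine waves with the same amplitude but different frequencies differ only in pitch.

```python
import numpy as np

fs = 44100                     # samples per second
t = np.arange(fs) / fs         # one second of timestamps

# Same amplitude (same loudness), different frequency -> different pitch.
low_a  = 0.5 * np.sin(2 * np.pi * 220 * t)   # 220 Hz: the A below middle C
high_a = 0.5 * np.sin(2 * np.pi * 880 * t)   # 880 Hz: two octaves higher
```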
Loudness
Loudness, also called volume, refers to the strength of a sound; it depends on the amplitude of the sound source's vibration and on the distance from the source.
On the waveform, the distance between the highest and lowest points reflects the amplitude: the greater that distance, the greater the loudness.
In everyday life we treat decibels as the measure of loudness: the higher the decibel number, the louder the sound. Nothing wrong with that, but one extra point is worth making: decibels do not express the absolute loudness of a sound; they describe loudness relative to some reference sound. In theory, sound intensity corresponds to sound pressure, but sound pressure spans an enormous range, varying by factors of many thousands; the sound pressure of a rifle shot, for instance, is tens of thousands of times that of a car. Describing loudness directly by sound pressure would be neither intuitive nor convenient, so the concept of the "decibel" was introduced: pick one sound pressure p₀ as the reference, divide any other sound pressure p by it, take the base-10 logarithm, and multiply by 20, i.e., L = 20·log₁₀(p/p₀) dB. This turns an exponentially growing physical quantity into a linearly growing one. On this scale a rifle shot is about 171 dB and a car about 80 dB, which looks far more intuitive. The reference pressure for 0 dB is 2×10⁻⁵ Pa (20 μPa), roughly a mosquito flying 3 meters away. So 0 dB is not truly the absence of sound, and negative decibel values can even exist.
Analogies for various levels of volume
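In code, the decibel computation described above looks like this (a minimal sketch; the sound-pressure values for the car and the rifle are just the ballpark figures from the text):

```python
import math

P0 = 2e-5   # 0 dB reference sound pressure: 20 micropascals

def spl_db(pressure_pa):
    """Sound pressure level in dB relative to the 20 uPa reference."""
    return 20 * math.log10(pressure_pa / P0)

print(spl_db(2e-5))    # 0.0    -- the reference itself is 0 dB
print(spl_db(0.2))     # 80.0   -- roughly a passing car
print(spl_db(7000))    # ~170.9 -- roughly a rifle shot
```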
Timbre
Timbre refers to the characteristic quality of a sound. For example, even when different people speak at the same pitch and volume, they sound different to us; that difference is timbre.
On the waveform this shows up as a difference in wave shape: different sound generators produce different waveforms, which brings in concepts like the fundamental frequency and harmonic frequencies. By processing the signal into a spectrogram and computing the similarity between spectrograms, we can perform voiceprint recognition (e.g., decide whether two recordings come from the same speaker), as sketched below.
Sound waves of different tones
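Here is a small sketch of the idea (illustrative only; real voiceprint systems are far more involved): two tones share the same 220 Hz fundamental, i.e., the same pitch, but the second adds harmonics, and the spectrum exposes exactly that difference.

```python
import numpy as np

fs = 44100
t = np.arange(fs) / fs   # one second of audio

tone_a = np.sin(2 * np.pi * 220 * t)             # pure tone: fundamental only
tone_b = (np.sin(2 * np.pi * 220 * t)            # same fundamental...
          + 0.50 * np.sin(2 * np.pi * 440 * t)   # ...plus 2nd harmonic
          + 0.25 * np.sin(2 * np.pi * 660 * t))  # ...plus 3rd harmonic

spectrum_a = np.abs(np.fft.rfft(tone_a))   # single peak at 220 Hz
spectrum_b = np.abs(np.fft.rfft(tone_b))
freqs = np.fft.rfftfreq(len(t), 1 / fs)

# The three strongest components of tone_b are the fundamental plus the
# harmonics that give it its distinctive timbre.
print(sorted(freqs[np.argsort(spectrum_b)[-3:]]))   # [220.0, 440.0, 660.0]
```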
Audio Digitization
Now that we understand what sound is in nature, let's look at how computers process it.
Computers need to digitize sound, i.e., turn real-world sounds (songs, speech, and so on) into audio files on the computer (e.g., forget-me-not.mp3). There are three main steps: sampling, quantization, and encoding. We explain each of them next.
Sampling
What is meant by sampling?
First we need to understand two concepts: analog signals and digital signals. Sound in the real world is an analog signal, which is continuous; what a computer handles is a digital signal, which is discrete, a finite stream of 0/1 data. Sampling is the key step in turning an analog signal into a digital one. It is by nature an approximation: a continuous signal contains infinitely many points, and we cannot process infinite data, so we settle for recording values at fixed intervals to approximate the original.
Analog and digital signals
As a real-life example, sampling is a lot like spot-checking. Say you are a middle-school teacher and your no-good principal assigns you 500 students. At the end of winter break, hundreds of thick homework booklets get handed in, and the principal asks you to fully assess how well the students have mastered the material. How do you do it? In theory the best way is to carefully grade every page of every student's homework, but that could take weeks; the cost is too high, and you still have to prepare lessons, lecture, and assign and grade new homework for the new semester. So you do the next best thing: randomly sample the work of 50 students, and let that sample indicate the students' overall mastery.
Sampling in audio is the same idea. In audio processing, sampling means measuring and recording the audio signal at fixed intervals. The sampling rate is how many samples are taken per second. For example, a sampling rate of 44.1 kHz means 44,100 sample points are taken per second, like reading one point of the analog signal every 1/44,100 of a second; each sample records the state of the original sound wave at that instant.
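A minimal sketch of sampling (my own illustration; a sine wave stands in for the continuous microphone signal):

```python
import numpy as np

fs = 44100                                # sampling rate: 44.1 kHz
duration = 0.01                           # capture 10 ms of sound
t = np.arange(int(fs * duration)) / fs    # one timestamp every 1/44,100 s

# Stand-in for the continuous analog signal: a 440 Hz sine wave.
samples = np.sin(2 * np.pi * 440 * t)

print(len(samples))   # 441 -- the number of samples in 10 ms of audio
```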
In fact the sampling rate is much like an image's pixel density: the more pixels per unit area, the sharper the picture, and the same goes for the sampling rate. So how high a sampling rate is enough? The Nyquist sampling theorem from information theory states:
When the sampling frequency is more than twice the highest frequency in the signal, the sampled digital signal completely retains the information in the original signal.
Since, as mentioned earlier, the highest frequency the human ear can perceive is about 20 kHz, the mainstream sampling rates of 44.1 kHz and 48 kHz can theoretically satisfy the needs of the human ear.
Different sampling rates capture the sound wave's information to different degrees of completeness.
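The theorem is easy to verify numerically. In this sketch (illustrative parameters of my choosing), a 6 kHz tone is sampled at only 8 kHz, below twice its frequency, and its samples become indistinguishable from those of a 2 kHz tone: the original information is lost to aliasing.

```python
import numpy as np

fs = 8000                           # sample rate: too low for a 6 kHz tone
n = np.arange(32)                   # sample indices
tone_6k = np.sin(2 * np.pi * 6000 * n / fs)
tone_2k = np.sin(2 * np.pi * 2000 * n / fs)

# 6000 Hz > fs/2 = 4000 Hz, so the 6 kHz samples exactly match a
# sign-flipped 2 kHz tone -- the original frequency cannot be recovered.
print(np.allclose(tone_6k, -tone_2k))   # True
```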
Quantization
Quantization, as the name suggests, means assigning each audio sample a meaningful number, and it carries a notion of precision. To reuse the homework-sampling analogy: quantization is like scoring each assignment you sampled. The simplest scheme has just two grades, 0 and 1: 0 is failing, 1 is passing. But that is quite coarse and the final assessment may suffer, so you could switch to a 10-point scale, which distinguishes levels like fair, good, and excellent. Going further still, you could use a 100-point scale, weighting each problem by difficulty and importance and summing to a final mark that reflects mastery even more accurately.
It is the same with audio: each collected sample needs a value to represent it, and this is where an audio file's bit depth comes in. A bit depth of 16, for example, means each sample is stored as one of 2^16 = 65,536 possible levels.
Digital signal after sampling and quantization
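A rough sketch of quantization (a simplified uniform quantizer of my own, not production code): the fewer the bits, the coarser the grid the samples get snapped to.

```python
import numpy as np

def quantize(x, bits):
    """Snap samples in [-1, 1] to the nearest of 2**bits uniform levels."""
    step = 2.0 / (2 ** bits)            # width of one quantization step
    return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)

x = np.sin(2 * np.pi * np.linspace(0, 1, 8, endpoint=False))
print(quantize(x, 3))    # 8 levels: visibly coarse rounding
print(quantize(x, 16))   # 65,536 levels: error far too small to hear
```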
Encoding
Encoding is the process of converting the quantized values above into a sequence of binary bytes (e.g., the sample values in the figure below). After this transformation we have a stream of 0s and 1s: digital data that the computer can "understand" and process. Such data is called PCM (Pulse Code Modulation) data, the rawest form of digital audio. For actual storage and transmission it is usually processed further: a WAV file typically just wraps the raw PCM in a simple header, while codecs such as MP3 compress it heavily.
Sound encoding
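Putting the three steps together, here is a sketch (using Python's standard wave module; the file name is made up) that quantizes a tone to 16-bit PCM and wraps it in a WAV header:

```python
import wave
import numpy as np

fs = 44100
t = np.arange(fs) / fs                        # 1 second of timestamps
samples = 0.5 * np.sin(2 * np.pi * 440 * t)   # a 440 Hz tone in [-1, 1]

# Quantize to 16-bit signed integers: this byte stream IS the PCM data.
pcm = (samples * 32767).astype(np.int16)

# A WAV file is essentially a small header describing the PCM parameters
# (channels, bit depth, sample rate) followed by the raw PCM bytes.
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 2 bytes per sample = 16-bit depth
    f.setframerate(fs)      # 44.1 kHz
    f.writeframes(pcm.tobytes())
```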
Real-world example: WeChat voice
Finally, let's walk through the whole audio digitization process with an example: Xiaoming sends a voice message on WeChat and then plays his own recording back.
Step 1: Vocal Cord Vocalization
When Xiaoming opens his mouth to speak, his vocal cords vibrate and produce sound waves, which then pass through the vocal tract; the tract acts as a filter that shapes the spectrum, and the sound waves finally radiate out through the open mouth.
Step 2: The cell phone microphone receives the analog signal
The transmitted sound wave is converted into an electrical signal when it reaches the phone's microphone. Note that this electrical signal is still an analog signal: a continuously varying voltage.
Cell phone microphones are usually condenser microphones. When Xiaoming's speech reaches the microphone, the sound wave makes the microphone's diaphragm vibrate, changing the distance between the diaphragm and the backplate and therefore the capacitance: the capacitance rises as the gap narrows and falls as it widens. The readout circuit then converts this changing capacitance into a continuously varying electrical signal (an analog signal), as the toy calculation below illustrates.
Analog Signal Received by Cell Phone Microphone
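A back-of-the-envelope illustration of that capacitance change, using the parallel-plate formula C = ε₀·A/d with made-up numbers (not real microphone specs):

```python
EPSILON_0 = 8.854e-12    # vacuum permittivity, F/m
A = 1e-4                 # plate area: 1 cm^2 (illustrative)

for d in (19e-6, 20e-6, 21e-6):   # diaphragm-to-backplate gap, meters
    c = EPSILON_0 * A / d
    print(f"gap {d * 1e6:.0f} um -> {c * 1e12:.1f} pF")
# Narrower gap -> higher capacitance; wider gap -> lower capacitance.
# The readout circuit turns this varying capacitance into a voltage.
```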
Step 3: Analog-to-digital conversion by the sound card
Our computers and cell phones have built-in hardware called a sound card. In this step the sound card performs analog-to-digital conversion, turning the analog signal (a continuously changing electrical signal) into a digital signal. The sampling, quantization, and encoding described above all happen here; after this conversion, the audio becomes a stream of binary 0/1 data.
Sound card analog-to-digital conversion
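For completeness, here is how you could trigger this ADC path yourself in a few lines, assuming the third-party sounddevice package (my sketch, not WeChat's actual code):

```python
import sounddevice as sd

fs = 48000   # ask the sound card's ADC to sample at 48 kHz

# Record 3 seconds from the default microphone; dtype='int16' requests
# 16-bit quantization, so the result is raw PCM data.
recording = sd.rec(int(3 * fs), samplerate=fs, channels=1, dtype="int16")
sd.wait()    # block until recording finishes
```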
Step 4: Storage
WeChat stores our chats and other files on the phone (which is why the longer we use it, the more space WeChat occupies), so the digital signal we just produced is saved as a file in the phone's local storage (phones use flash memory rather than an actual hard disk). That way, whether we reboot the phone or go offline, we can still read and play the audio. If you disconnect from the network and tap a voice message, you will find it plays normally, which shows it lives locally rather than only on a remote server; otherwise it would have to be fetched over the network first.
Step 5: Digital-to-analog conversion by the sound card
When the recording is tapped, the WeChat app reads the corresponding audio file (a stream of binary 0/1 audio data) from the phone's storage and hands the data, via the operating system, to the sound card for digital-to-analog conversion, the reverse of Step 3: the DAC in the sound card converts the digital signal back into an analog signal (a continuously changing electrical signal).
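The playback side can be sketched the same way (again assuming the sounddevice package, with a hypothetical file name):

```python
import wave
import numpy as np
import sounddevice as sd

# Read the stored PCM bytes back from disk.
with wave.open("voice.wav", "rb") as f:
    fs = f.getframerate()
    pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

# play() hands the samples to the sound card, whose DAC turns them back
# into a continuously varying electrical signal for the speaker.
sd.play(pcm, fs)
sd.wait()
```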
Step 6: Cell Phone Speaker
The analog signal from the sound card is then turned into sound waves by the speaker. The figure below shows a typical cell phone speaker, which has four main components: a magnet, a coil, a diaphragm, and a bracket.
Typical cell phone speaker structure
When the continuously changing electrical signal from the sound card flows through the copper coil, it induces a changing magnetic field inside the coil. This induced field interacts with the permanent field of the magnet, so the coil, and the diaphragm attached to it, move along with it; as the current keeps changing, so does the position of the coil and diaphragm (the bracket's job is to protect and stabilize the diaphragm). Throughout this process the moving diaphragm is the sound source: it disturbs the surrounding air and generates the corresponding sound waves.
The coil and diaphragm move to generate sound waves
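As a side note (standard textbook physics, not from the original figure), the force that drives the coil is the usual current-in-a-magnetic-field law
F = B · I · ℓ
where B is the magnet's flux density, I is the instantaneous signal current, and ℓ is the length of wire sitting in the field. B and ℓ are fixed by construction, so the force, and hence the diaphragm's motion, tracks the signal current directly.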
Step 7: Perception by the human ear
The sound waves produced by the speaker travel through the air to the human ear. The eardrum vibrates and drives the ossicles, which strike the cochlea (a bit like beating a drum). The snail-shell-shaped structure on the right of the figure is the cochlea; it is filled with fluid and lined with hair cells. When the fluid is set in motion by the striking, the hair cells generate nerve signals that travel along the auditory nerve into the brain, which then analyzes and recognizes the meaning of the sound.
The human ear perceives sound waves
At this point, the whole process is over.