“Speech recognition systems work well in laboratory environments, but not in real-world environments. For actual use, this problem must be solved. Otherwise, they will be nothing more than simple toys.”
This was the thought D. Yook had in mind when he began his research in the field of Automatic Speech Recognition (ASR) in the late 1990s. He therefore focused on implementing robust speech recognition using techniques such as microphone arrays (1995), neural networks (1996), and adaptation (1998). Later, in his doctoral research, he discovered that noisy speech in real-world environments is non-linearly distorted, and he showed that neural networks, which are well suited to modeling such non-linear distortion, can compensate for it in Hidden Markov Model (HMM)-based speech recognition systems deployed in real-world environments.
D. Yook joined Korea University in 2001, established the Speech Information Processing Laboratory (SIPL), and began research on robust speech processing technology. Around 2010, as the laboratory’s research expanded into various related fields of artificial intelligence and the knowledge accumulated over more than a decade became applicable to general artificial intelligence, the laboratory was renamed the Artificial Intelligence (AI) Laboratory, the name it retains to this day.
The 1st doctor, Donghyun Kim (2008), continued D. Yook’s research on efficient adaptation techniques for HMM-based speech recognition systems. Through in-depth analysis of noise distortion, he discovered that distortions caused by noise are non-linear in the cepstral domain, which speech recognition used at the time, but can be modeled as linear distortions when transformed into the spectral domain. Instead of non-linear modeling, which requires many parameters, he developed an HMM-based adaptation technique that models and compensates for noise with a small number of parameters across diverse noisy environments: cepstral-domain features are converted back to the spectral domain, the linear distortion is efficiently and robustly estimated and corrected there with a few parameters, and the result is converted back to the cepstral domain.
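For the curious reader, here is a minimal numpy sketch of that pipeline. The single global gain vector, the helper names, and the use of a plain (I)DCT are my illustrative assumptions, not the exact formulation of the dissertation:

```python
import numpy as np
from scipy.fft import dct, idct

def cep_to_spec(c):
    """Cepstrum -> linear spectrum: inverse DCT gives the log-spectrum,
    exponentiation gives the linear spectrum."""
    return np.exp(idct(c, norm='ortho'))

def spec_to_cep(s):
    """Linear spectrum -> cepstrum: log, then DCT."""
    return dct(np.log(np.maximum(s, 1e-10)), norm='ortho')

def adapt_means(clean_cep_means, noisy_cep_frames):
    """Estimate one global spectral gain vector from a little adaptation
    data and apply it to every HMM mean: a linear, few-parameter model
    of the noise distortion, fitted in the spectral domain."""
    clean_avg = cep_to_spec(np.mean(clean_cep_means, axis=0))
    noisy_avg = cep_to_spec(np.mean(noisy_cep_frames, axis=0))
    gain = noisy_avg / np.maximum(clean_avg, 1e-10)   # the small parameter set
    return np.stack([spec_to_cep(cep_to_spec(m) * gain)
                     for m in clean_cep_means])
```

The point to notice is that the entire noise model collapses to one gain vector, estimated once and then applied to every HMM mean.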
The 2nd doctor, Youngkyu Cho (2011), focused on microphone arrays. By virtually steering microphones through filter operations, sound can be amplified at any location within a given space, thereby improving the signal-to-noise ratio (SNR) and enhancing speech processing. He concentrated on improving the GCC-PHAT and SRP-PHAT algorithms, which combine the Phase Transform (PHAT) with the Generalized Cross-Correlation (GCC) and Steered Response Power (SRP) algorithms, respectively. Although SRP-PHAT demonstrated much higher accuracy than GCC-PHAT, it was computationally excessive because it searched the given space by brute force, dividing it into numerous small blocks. GCC-PHAT, on the other hand, found the target location immediately but was vulnerable to reverberation and severe noise. He aimed to bridge the gap between the two algorithms by increasing the accuracy of GCC-PHAT and reducing the computational load of SRP-PHAT.
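As a point of reference, here is the textbook GCC-PHAT delay estimator in numpy (a sketch of the standard algorithm, not of his improvements); SRP-PHAT, in essence, sums such PHAT-weighted correlations over all microphone pairs at every candidate point in the room:

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the time delay between two microphone signals with
    GCC-PHAT: the cross-power spectrum is whitened by its magnitude,
    leaving only phase, which sharpens the correlation peak."""
    n = len(x) + len(y)                        # zero-pad to avoid wrap-around
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.maximum(np.abs(R), 1e-12)          # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                    # optionally bound the search
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)   # delay in seconds
```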
The 3rd doctor, Hyeopwoo Lee (2015), worked on multimodal information processing, motivated by the fact that humans use both visual and auditory information to improve accuracy in noisy environments. Since people make eye contact during conversation, his study first recognized the user’s face and gaze direction to detect the direction of speech and the state of the mouth, and integrated this information into voice activity detection. Then, when it was confirmed that the user was speaking toward the recognition system, a beamforming algorithm was applied to amplify the voice signal arriving from that direction through a microphone array, thereby improving sound quality.
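The specific beamformer is beyond the scope of this summary, but the textbook delay-and-sum variant conveys the idea (integer-sample delays for simplicity; real systems typically use fractional-delay filters):

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Steer a microphone array toward a source by removing each
    channel's propagation delay and averaging: coherent speech adds up
    while incoherent noise partially cancels."""
    out = np.zeros(len(signals[0]))
    for sig, d in zip(signals, delays):
        # Integer-sample alignment; np.roll wraps at the edges,
        # which is acceptable for a sketch.
        out += np.roll(sig, -int(round(d * fs)))
    return out / len(signals)
```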
The 4th doctor, Inchul Yoo (2015), turned his attention to the problem of Voice Activity Detection (VAD). He believed that for voice-based interfaces to be convenient, they should not require separate input devices like push-to-talk buttons and should be operable through voice alone. Speech itself is a very complex signal, and consonants such as ‘sh’ are particularly difficult to process because they resemble white noise. However, he noted that consonants always appear alongside vowels, and that vowels possess distinct spectral peaks that are robust to noise. He exploited this property to develop a new VAD algorithm that operates stably even in noisy environments.
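A toy numpy illustration of the underlying observation (my own simplification, not his algorithm): a frame whose formant band shows a high peak-to-mean ratio likely contains a vowel, and therefore speech. The 3 kHz band edge and the threshold are arbitrary placeholders:

```python
import numpy as np

def frame_is_speech(frame, fs, threshold=8.0):
    """Flag a frame as speech when a strong spectral peak stands out
    above the average level of the vowel formant band; such peaks
    tend to survive additive noise."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n)))
    band = spec[:max(1, int(3000 * n / fs))]   # roughly the vowel formant region
    return band.max() / (band.mean() + 1e-10) > threshold
```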
The 5th doctor, Hyeontaek Lim (2016), took an interest in the practical application of voice processing technology to various devices, including smartphones and robots. In real-world environments, situations with multiple noise sources and several speakers talking simultaneously occur frequently. Traditionally, in such environments, microphone arrays were used to detect the location producing the highest energy, beamforming was applied to improve the sound quality at that location, and voice processing technologies such as speech recognition were applied afterward. However, this step-by-step processing simply finds and emphasizes the direction of the loudest sound, which risks amplifying unwanted noise or a voice other than the intended speaker’s. He combined voice activity detection (VAD) and speaker recognition algorithms to identify sound sources, enabling the system to emphasize the location most likely to belong to the desired speaker.
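Schematically, the selection rule might look like the following sketch, where beamform, vad_score, and speaker_score are assumed interfaces standing in for the actual components:

```python
import numpy as np

def pick_target_direction(mic_frames, directions, beamform, vad_score, speaker_score):
    """Score each candidate direction not by raw energy but by how
    speech-like the beamformed signal is (vad_score) and how well it
    matches the enrolled speaker (speaker_score)."""
    scores = [vad_score(y) * speaker_score(y)
              for y in (beamform(mic_frames, d) for d in directions)]
    return directions[int(np.argmax(scores))]
```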
The 6th doctor, Taewoo Lee (2016), recognized the potential of GPU-accelerated computing and took an interest in maximizing GPU efficiency by exploiting the characteristics of both the algorithm and the GPU architecture. In particular, for accelerating the SRP-PHAT algorithm on GPUs, he noted that existing research applied generic parallelization methods without properly considering the hardware characteristics of GPUs, so he drew on detailed knowledge of GPU architecture, such as the memory hierarchy, registers, thread blocks, and context-switch overhead. He demonstrated the performance of the proposed method by applying it to both frequency-domain and time-domain SRP-PHAT on three GPU systems with different core counts and memory capacities.
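To see why SRP-PHAT parallelizes so well, here is the brute-force frequency-domain search in numpy (a CPU sketch; the GPU mapping noted in the comments is the general idea, not his specific optimization):

```python
import numpy as np

def srp_phat(sigs, mic_pos, grid, fs, c=343.0):
    """Brute-force frequency-domain SRP-PHAT.
    sigs: (num_mics, num_samples); mic_pos: (num_mics, 3); grid: (num_points, 3).
    Every grid point is an independent work item, which is what makes
    the search map naturally onto one GPU thread block per point, with
    the whitened cross-spectra cached in fast on-chip memory for reuse."""
    n = sigs.shape[1]
    specs = np.fft.rfft(sigs, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    m = len(mic_pos)
    # PHAT-whitened cross-spectra, computed once and reused by all points.
    cross = {}
    for i in range(m):
        for j in range(i + 1, m):
            R = specs[i] * np.conj(specs[j])
            cross[(i, j)] = R / np.maximum(np.abs(R), 1e-12)
    power = np.empty(len(grid))
    for g, p in enumerate(grid):
        dist = np.linalg.norm(mic_pos - p, axis=1)
        acc = 0.0
        for (i, j), R in cross.items():
            tau = (dist[i] - dist[j]) / c      # pair TDOA toward point p
            acc += np.real(np.sum(R * np.exp(2j * np.pi * freqs * tau)))
        power[g] = acc
    return grid[int(np.argmax(power))]         # point with the loudest steered response
```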
The 1st external doctor, Seonggyun Leem, completed his doctorate at UT Dallas (2024). His research focused on recognizing emotional speech in severely noisy environments. Interestingly, despite studying abroad, he inherited the approach to the noisy-speech recognition problem that D. Yook had addressed in his doctoral dissertation: suppressing noise both during feature extraction in the preprocessing stage and through model adaptation. In particular, he demonstrated that exploiting domain knowledge can significantly improve the accuracy of the recognition system compared to simply training a model on noisy speech.
As of March 2026, the next external doctoral candidate, Hyunwoo Oh, who researched parallel DNN computation using pipeline algorithms during his master’s course, is studying various topics such as audio-to-image conversion and video captioning. Having been honed by the incredibly difficult pipeline techniques of his master’s course, any research he undertakes will likely feel easy by comparison.
The 7th doctoral candidate, Hyungpil Chang, is dedicated to researching methods for collecting large-scale training data using semi-supervised learning. He judged that as the scale and complexity of DNN models grow, human-labeled training data alone will eventually be insufficient to meet the requirements of model training. He is therefore devising a new method to generate reliable labeled training datasets while minimizing human intervention.
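His specific method is still in progress, so as background, here is the standard confidence-filtered pseudo-labeling baseline that semi-supervised pipelines of this kind typically start from (model is an assumed interface mapping an example to a vector of class probabilities):

```python
import numpy as np

def pseudo_label(model, unlabeled, threshold=0.95):
    """Confidence-filtered pseudo-labeling: a seed model trained on
    human-labeled data labels the unlabeled pool, and only confident
    predictions are kept as new training examples, minimizing the
    need for human annotation."""
    kept = []
    for x in unlabeled:
        probs = model(x)                        # assumed: class probabilities
        if np.max(probs) >= threshold:          # keep only confident labels
            kept.append((x, int(np.argmax(probs))))
    return kept
```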
This text summarizes 25 years of research in our lab. Although we are few in number, I believe that every member of our lab is an exceptional talent capable of making groundbreaking discoveries in their respective fields. Thank you for reading. I would like to conclude by quoting a famous passage from Shakespeare’s Henry V.
From this day to the ending of the world,
But we in it shall be remember’d;
We few, we just few, we band of people;
May the 4th be with you, whatever that means😉.
Written by the 4th doctor on a warm day in March 2026.