Sound is an important communication medium in human society. It can not only express emotions but also reflect one’s physical condition.

In our interview with Professor Wu Mengyue from the Department of Computer Science and Engineering of Shanghai Jiao Tong University, we delve into the world of speech, from multi-modal interaction to medical applications, and explore the mysteries of sound. Please find the podcast of the interview below.

 

Question: Please introduce your research background. Why are you interested in this research area?

Professor Wu: My main research direction now is rich audio analysis. When we listen to a certain sound, if it is speech, we care not only about what the person said but also about the way he or she speaks, that is, the speaker’s mood and emotion. Thinking further, speaking can reflect one’s mental state or cognitive status. This view regards speech or language functions as external manifestations of the brain’s cognitive functions. Therefore, we can do a lot of pathological analysis from a phonetic perspective.

On the other hand, the sounds we hear include not only speech but also all the sounds in nature or in our environment. For a long time, people who traditionally study speech have regarded these natural sounds as “noise”. In fact, however, when we process all auditory information, every small sound provides a lot of information. We now call this field “rich audio analysis”. The “richness” comes from two aspects: firstly, the human voice has many layers from which abundant information can be extracted; secondly, the environment itself is rich. My work at the moment is to think about how we can combine the two.

Question: What are the application scenarios of rich audio analysis?

Professor Wu: In fact, some corresponding application scenarios can be clearly found in the research content we just talked about. For example, speech analysis, especially when combined with pathology, has a wide range of application scenarios in the medical field.

Speech research from a pathological perspective is divided into several categories. One category is related to organic disorders. For example, adenoid hypertrophy may affect the overall airflow and thus obstruct pronunciation, so such organic lesions can cause differences in the speech signal. Therefore, our research has many aspects related to otolaryngology. We can identify changes in a person’s voice through his or her speech, including the diagnosis of pathological lesions such as adenoid hypertrophy, and even the early prediction of laryngeal cancer.

In addition to speaking, people also produce other sounds, some of which are likewise related to organic changes, such as snoring. There are now many studies that monitor sleep by detecting snoring, or check whether there are problems with the sleeper’s respiratory system.

In addition, during the global COVID-19 pandemic, there were also some studies that, for example, identify the root cause of a person’s cough from the sound of the coughing. These studies can be used not only to diagnose COVID-19 but also in more general scenarios, especially in pediatrics. Cough is a very common respiratory symptom among children, and there are many possible causes behind it. We cooperated with the Shanghai Children’s Medical Center and invented a device that is easy for children to carry and can be worn for a long time. It looks like a microphone or a button, so we can monitor changes throughout the coughing process. Based on the frequency of coughing and all the sounds it generates, we can infer whether it is a dry cough or a wet one. Then we can further analyze whether the cough is caused by a common upper respiratory tract infection or a certain type of pneumonia. These are some very specific application scenarios.

In addition to applications in organic diseases, speech research can also be applied to neurodegenerative diseases or diseases directly related to emotional disorders, such as depression, anxiety, Parkinson’s disease, and Alzheimer’s disease. When we analyze and compare the speech of Alzheimer’s patients, we find that it shares certain similarities with that of patients with depression and Parkinson’s disease. On the one hand, most patients with Alzheimer’s disease show symptoms of depression over a long period of time. On the other hand, this disease, like Parkinson’s disease, is a neurodegenerative disease. The internal connection between these diseases allows our system to be applied in these scenarios.

In other respects, there is a very direct application: the detection of crying babies. For example, you can place a detector at home. It can collect the crying sound of a child and then analyze it to determine what the child needs.

In addition, some time ago we cooperated with the public security departments. When monitoring population movements, if you want to know who has returned from other places, you can place a microphone array at the door of the returnee’s home; several households can share one microphone array. Through the array’s recognition of the sound of doors opening and closing, public security officers can determine whether someone has entered or left the home.

This research can also be applied to the travel safety of passengers using car-hailing apps. To check the safety of passengers taking a taxi via a car-hailing app, recording is turned on in real time, but even so, no one will check all the recordings in real time. Therefore, when processing recordings, it is necessary to detect and identify abnormal events, to check whether someone is screaming, arguing, or calling for help. These are all part of the rich audio analysis we’ve discussed.

Taking a step further, we can explore how to describe a piece of audio content entirely in natural language. For example, you can use ASR to directly get a transcription of the speech; if you use natural language to describe the current scene, it could be described as “Several people are having a webinar discussion; what are they discussing?”, or you could describe a piece of audio directly as “Someone is walking by, and a bird is chirping at the same time…” This can greatly help people with hearing impairment. Even if they cannot hear the sound, they can understand through text what is happening in the auditory world at this moment. Some mobile phone manufacturers have already begun research in this area, aiming to further meet the needs of people with hearing impairment or weakened hearing.

These are the application scenarios I can think of that directly correspond to rich audio analysis.

Question: In the research process, data is the basis of everything. What types of data do you primarily work with? How is this data collected and analyzed?

Professor Wu: This is a very critical issue. Whether in the medical field or in environmental sound, this type of sound data is still relatively scarce compared with the speech data we have studied for a long time. For sound data in the medical field, we cooperate with hospitals, but such cooperation focuses more on inventing, creating, or adapting existing technologies into a form more suitable for the application scenario, and then collecting the audio data before analyzing it in the laboratory.

As for environmental audio, first of all, there are many environmental sounds, but the biggest problem is how to label them. When it comes to labeling, there are new scientific questions to be answered, for instance, whether it is possible to describe environmental audio in a weakly supervised way. The largest dataset on environmental audio is AudioSet, launched by Google in 2017, which contains 527 different types of sound events. Each piece of audio carries multiple labels, but there is actually no way to locate the labels precisely in time, for example, that one event occurs from the first to the third second and another from the fourth to the eighth second. Such strong labeling is very time-consuming, labor-intensive, and resource-intensive, so what exists now is clip-level labeling. A big challenge in our research field is how to train from these weak labels first and still obtain strong, frame-level labels in the end.
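As a rough, hypothetical illustration of the weak-to-strong labeling problem described above (this is not Professor Wu’s actual system; the model structure, sizes, and pooling choice are assumptions), a sound event detector can be trained with only clip-level tags while its per-frame outputs are later read off as frame-level labels:

```python
# Minimal weakly supervised sound event detection sketch: the model sees only
# clip-level tags during training, but its per-frame probabilities can be
# thresholded at inference time to estimate when each event occurs.
import torch
import torch.nn as nn

NUM_EVENTS = 527  # e.g. the AudioSet label set


class WeakSED(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_events=NUM_EVENTS):
        super().__init__()
        # Frame-level encoder over a log-mel spectrogram (batch, time, n_mels).
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.frame_head = nn.Linear(2 * hidden, n_events)

    def forward(self, mel):
        h, _ = self.rnn(mel)                            # (batch, time, 2*hidden)
        frame_prob = torch.sigmoid(self.frame_head(h))  # per-frame event probabilities
        # Linear-softmax pooling: frames with higher probability get more weight,
        # so the clip-level score is dominated by the most active frames.
        clip_prob = (frame_prob ** 2).sum(dim=1) / frame_prob.sum(dim=1).clamp(min=1e-7)
        return clip_prob, frame_prob


model = WeakSED()
criterion = nn.BCELoss()

mel = torch.randn(8, 500, 64)                                # a batch of spectrograms
weak_labels = torch.randint(0, 2, (8, NUM_EVENTS)).float()   # clip-level tags only

clip_prob, frame_prob = model(mel)
loss = criterion(clip_prob, weak_labels)   # supervision happens only at clip level
loss.backward()
# Thresholding frame_prob afterwards yields onset/offset estimates, i.e. the
# "strong" frame-level labels the interview refers to.
```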

In addition, we first proposed the task of audio captioning in 2018, that is, describing audio content with a paragraph of natural-language text. Compared with previous labeling studies, this method is closer to human auditory perception.

If you just heard a loud noise, you wouldn’t describe it as “bang, semicolon, cry for help, semicolon”. Instead, you would describe it in a natural sentence. This is what we expect future machines to output directly in the process of auditory perception. Of course, when we create such a new task, we also need a new dataset to support it.

In short, the data we study either comes from authentic scenarios, through cooperation with hospitals or from nature, or we invent new labeling methods on top of existing basic datasets to solve our current problems.

Question: One of your recent studies mentioned a model called CLAP. What are the key datasets used to train such a model, and how are they constructed?

Professor Wu: In the past few years, there have been many large-scale pre-trained models that combine vision and natural language, but there are very few in the audio field. The main reason is the lack of datasets. Last year, however, there were three papers, including ours, that referred to a model called CLAP. Since the earlier CLIP model pairs images with language, the model in these papers is called CLAP because we replaced the image modality with audio.

In fact, our training method is very similar to the original CLIP. The key is where the datasets in the audio field, especially audio paired with text, come from.

One method is to train a model on the original audio captioning dataset, and then use this model to automatically caption all other applicable audio.

Another method, applied before this automatic captioning, is to combine all the discrete labels as guidance and use them to steer the audio captioning model, so that the generated captions are more faithful to the original audio content. By labeling massive amounts of data in this way, we have, to a certain extent, constructed a dataset of paired audio and text.

On this basis, we use contrastive learning: two encoders take the audio and the text as input respectively, and a contrastive loss is added, so that the resulting pre-trained model performs much better on many downstream tasks related to audio or text.
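The contrastive setup just described can be sketched as follows. This is a minimal illustration of a CLIP-style symmetric contrastive loss with stand-in encoders, not the actual CLAP implementation; the encoder modules and feature dimensions here are placeholders:

```python
# Sketch of a CLIP-style contrastive objective for paired audio/text:
# matching pairs are pulled together, mismatched pairs pushed apart.
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; classify each row (audio->text)
    # and each column (text->audio).
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2


# Stand-ins for a real audio encoder (e.g. over log-mel features) and a real
# text encoder (e.g. a transformer over caption tokens).
audio_encoder = nn.Linear(64, 256)
text_encoder = nn.Linear(300, 256)

mel_features = torch.randn(16, 64)       # pooled audio features for 16 clips
caption_features = torch.randn(16, 300)  # pooled text features for their captions

loss = contrastive_loss(audio_encoder(mel_features), text_encoder(caption_features))
loss.backward()
```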

In short, if you want to do pre-training, the source, quality, and quantity of the data are very important. On the one hand, a model can be trained to generate labels; on the other hand, ChatGPT can be used to generate natural-language descriptions for more audio data.

Question: Many experiments face the problem of real-world application, where speech signals may be affected by various factors, such as background noise, the speaker’s accent, speaking rate, and intonation changes. The use of different recording equipment and microphones may also cause differences in speech signals. So, how do lab-trained speech recognition systems process speech signals in the real world?

Professor Wu: Compared to natural language processing, the hardest part of audio analysis is really handling the variability of audio signals. Much of the data in our research comes from real scenes. When collecting sounds in hospitals, we specify a unified recording device or sampling rate so that we can obtain a better-optimized model. In the final model training, we also use different methods to give the model better adaptability or robustness. For example, we may simulate different noises or mix in additional noise, which makes the original training dataset more varied.
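A minimal sketch of the noise-mixing augmentation just mentioned, assuming recordings are handled as NumPy arrays; the SNR range and noise source here are illustrative, not the lab’s actual settings:

```python
# Add background noise to a clean recording at a randomly chosen
# signal-to-noise ratio, so the training distribution covers noisier
# real-world conditions.
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean` so the result has roughly `snr_db` dB SNR."""
    # Loop or trim the noise clip to match the clean signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean_power / scaled_noise_power == 10^(snr/10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)     # placeholder for 1 s of speech at 16 kHz
ward_noise = rng.standard_normal(8000)  # placeholder background-noise clip

augmented = mix_at_snr(speech, ward_noise, snr_db=rng.uniform(5, 20))
```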

In this way, any situation that may be encountered in real tests is included in the distribution of the original training data. However, it is still difficult to apply this work in the real world, no matter who is around or how noisy the environment is, and achieve the same performance as in the lab. Therefore, the key is to determine the acceptable range of performance decline in authentic settings.

For this problem, traditional speech recognition research also faces real-world challenges, namely how to get better results in non-cooperative environments. We have made many efforts and attempts, but so far this problem has not been solved.

Question: You just mentioned that an important step of research is the labeling and description of environmental sounds. With the launch of GPT, AI models have also become a powerful tool in scientific research. We know that GPT-4 can realize the analysis, understanding, integration, and output of multi-modal data. So can it be helpful for labeling and describing environmental sounds?

Professor Wu: This is a very interesting question. If you ask a person to describe in words the difference between the sound of a violin and that of a cello, or how the sounds in a cafe differ from those in a restaurant, it will be difficult to describe the differences clearly. But if you pose the question to ChatGPT, both GPT-3.5 and GPT-4 can provide very reasonable answers. This shows that ChatGPT actually makes up for the shortcomings of the acoustic encoder through its powerful text capabilities. We believe that ChatGPT may do a better job than humans in describing environmental sounds.

The key to the problem now is what kind of prompts should be given to ChatGPT so that it can conform to our requirements and description habits, and at the same time be able to accurately describe the specific characteristics of the sound. Some time ago, the University of Surrey published a relevant study. Although this study only used ChatGPT to assist the research in the first step, I think this is overall a very promising direction.

However, even if ChatGPT is used in a speech model, images or speech cannot be fed into it directly as material for multi-modal training. In the future, we may need to fine-tune or do joint training in our own laboratory. There are indeed application scenarios here: ChatGPT’s current ability to understand information of different modalities can assist us in part of the analysis and processing of information media.

Question: Based on ChatGPT, what other attempts has your research team made?

Professor Wu: The application of ChatGPT is still text-based. If there are few samples in the model training process, ChatGPT can be used to label the data, especially when dealing with very subtle differences in emotion. Beyond the analysis of the voice itself, ChatGPT can also be used for other research, such as letting it simulate the entire dialogue-based consultation between doctor and patient: using ChatGPT as two simulators, one imitating the patient and the other imitating the doctor, and then comparing the simulated clinical interviews with real psychiatric interviews to explore the limitations of ChatGPT in understanding and processing natural language compared with real scenarios.

Among all the AI models we have worked with, ChatGPT’s capability of understanding natural language has reached its limit. In the next step, we want to use ChatGPT to study how human-machine clinical interviews can achieve the same effect as those conducted in the real world. If ChatGPT can no longer improve its ability to understand natural language, then what factors distinguish natural dialogues from model-simulated dialogues? These are the questions we are most concerned about now.

Question: You mentioned that ChatGPT can simulate the clinical consultations between doctors and patients. Can the simulated data it generates be used in authentic research? Are the findings based on such synthetic data meaningful?

Professor Wu: At present, it doesn’t really work. It can simulate some relatively basic cases, but there is still a certain gap between simulations and reality.

Specifically, when it comes to simulating doctors, there are certain differences in questioning format or style between ChatGPT and real doctors. ChatGPT may use more formal expressions, while in a normal consultation doctors are likely to ask questions more casually in order to put patients at ease. In reality, a patient seeing a doctor may not answer frankly, and many patients do not even know what their symptoms are. But when ChatGPT plays the role of a patient, if we ask it to resist at first, it may resist once or twice; if you then ask from the opposite direction, it will immediately tell you the answer. It gives you the impression of “I have an answer, but since you told me not to give it directly, I will hide it for a while.” The psychological gap between ChatGPT and real patients is still very large.


So, I think it can be used to augment data to a certain degree. However, if such simulated data is used as complete training data, it may be too far from the actual application scenarios.

When it comes to the application of ChatGPT, you can compare the data simulated by ChatGPT playing a patient with the data of real patients. This part of the work has already produced preliminary results and will be published soon. The most intuitive conclusion at present is: if a better prompt is given to ChatGPT, and the patient is cooperative, the simulated scene can be very close to a real consultation; but when a patient is not cooperative, there are still difficulties to be dealt with. So the differences depend on the complexity of the actual scenarios that ChatGPT has to simulate.

Question: At the “AI Helps Tackle Brain Diseases Symposium” held some time ago, you mentioned that you have been doing research on the diagnosis of depression, Parkinson’s disease and other disorders based on language function for a long time. What is the connection between speech and brain diseases? How to use voice to detect diseases?

Professor Wu: For example, Parkinson’s disease is a neurodegenerative disease that affects motor control in the brain. Motor control affects not only the hands and feet but also the speech preparation stage. There is a buffering process between the moment the brain generates the intention to speak and the moment it controls the vocal organs to produce sound. When motor control is affected, even though the brain has come up with the words to be said, the person cannot produce the sound in time because the brain cannot control the vocal organs. Therefore, many Parkinson’s patients may have unclear pronunciation or keep repeating a certain sound, and they may also have long pauses in their speech while preparing for the next sound.

Therefore, Parkinson’s patients show characteristic acoustic patterns, such as slower speaking speed, a smaller overall vocabulary, longer pauses between words, and more repetitions than healthy speakers. These are features that can be quantified and computed. By incorporating these quantified features, the final detection model can reflect many disease-related characteristics through voice alone.
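For illustration only, here is a small sketch of how such quantifiable features (speaking rate, vocabulary size, pauses, repetitions) might be computed from a hypothetical time-aligned transcript; it is not the clinical feature set actually used in this research:

```python
# Compute simple, interpretable speech features from a time-aligned transcript,
# given as a list of (word, start_sec, end_sec) tuples in temporal order.
from collections import Counter


def speech_features(aligned_words):
    words = [w for w, _, _ in aligned_words]
    total_time = aligned_words[-1][2] - aligned_words[0][1]

    speaking_rate = len(words) / total_time        # words per second
    vocabulary_size = len(set(words))              # distinct words used
    pauses = [b_start - a_end
              for (_, _, a_end), (_, b_start, _) in zip(aligned_words, aligned_words[1:])]
    mean_pause = sum(pauses) / len(pauses) if pauses else 0.0
    long_pauses = sum(p > 0.5 for p in pauses)     # pauses longer than 500 ms
    repetitions = sum(c - 1 for c in Counter(words).values() if c > 1)

    return {
        "speaking_rate": speaking_rate,
        "vocabulary_size": vocabulary_size,
        "mean_pause_sec": mean_pause,
        "long_pauses": long_pauses,
        "repetitions": repetitions,
    }


example = [("I", 0.0, 0.2), ("I", 0.4, 0.6), ("feel", 1.4, 1.7), ("tired", 2.6, 3.1)]
print(speech_features(example))
```

Features like these can then be fed, alongside learned acoustic representations, into the detection model mentioned above.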

Question: What is the current accuracy rate of speech-based diagnosis? Has some research been applied in the medical field? Are there potential ethical issues?

Professor Wu: In fact, there are reports on the accuracy of such research in domestic and foreign news, for example on the University of Southern California dataset used for depression detection. Using this dataset as the benchmark, the accuracy of speech-based diagnosis can reach 80% to 90% after experimental parameter tuning. However, when the model is placed in a real scene, or a similar scene where data is collected in different ways, its transferability is still very poor. If it is tested on different datasets without any parameter optimization, the accuracy may drop to 60% to 70%. Faced with this situation, on the one hand, different modalities can be combined for detection; on the other hand, it may be necessary to search further for features that are not affected by environmental or dataset factors, so as to achieve a more robust or transferable detection method.

Certain ethical issues arise in this process. The first is whether such model-based detection can replace doctors. First of all, this technology itself can help doctors. For example, a person receiving treatment can check his or her recent mental status through a psychological-status screening mini-program, without needing to go to the hospital for every review, which substantially improves the convenience of diagnosis. But even if the model achieves a good accuracy rate in experiments, it cannot replace the outcome of a doctor’s consultation.

In addition, the reason we emphasize using voice for detection is that many other sources of information, such as facial information and gait, may involve more private content than voice, even though voice detection still carries privacy risks. For example, face-to-face consultations are mostly used to diagnose depression or other mental illnesses, and if the diagnosis is based solely on the patient’s description of his or her condition, objectivity decreases. Therefore, we are considering whether wearable devices can be used to monitor patients’ sleep and activity over the long term as a reference for their actual status. But even this measure involves another type of ethical issue: do doctors have the right to obtain a patient’s daily life trajectory for condition monitoring? So I think, from a macro perspective, there may be certain conflicts and contradictions among medical care, personal care, and public health.

Technology itself keeps moving forward, but there are many constraints, and many factors need to be considered before it can be applied in real life.

Question: With the rapid development of AI technology, what breakthroughs do you think will be achieved in the voice field?

Professor Wu: A PhD graduate from our laboratory is now working on a multilingual speech recognition project at Google. The project aims to build a speech recognition system that can recognize many languages, even 100 different languages, and it also makes use of the correspondence between sound and text. During speaking, there is a strong correspondence between phonemes and the written language (characters or letters), and phonemes plus durations can help materialize the correspondence between text and speech.
There is a similarly strong correspondence in rich audio analysis. For example, there is a strong link between the label “bird chirp” and audio that contains bird chirps, and this link can be used in reverse to encode the audio. Therefore, the relation between text and sound can also help us understand or analyze sounds across multiple modalities.

So I would think a very promising direction in the future is to use language as a clue that brings in knowledge to assist research, which may be very helpful in any research field related to speech.

Question: After the advent of ChatGPT, which direction do you think will be the next stage of development for AGI? Will artificial intelligence eventually evolve to be like real humans?

Professor Wu: There was a sci-fi movie called Her released a long time ago. In the movie, everyone has a virtual system and can talk with it through an earphone, and there is no gap in information understanding between machine and person. This is my preliminary vision of future AGI. Another example is the companion robot dog that Boston Dynamics wants to make, which is also a research direction. The information processing needed to realize these functions must be multimodal. If the gap between the information obtained by the machine and by humans is too wide, AGI will have no way to help people make decisions. Therefore, technically speaking, there are still parts of the model that need to be corrected. Only by exploring and bridging the gap between humans and machines can AI become more human-like.


Now, in the interaction between humans and machines, the machine itself exists more as a tool. Human-machine interaction will become more similar to human-human interaction when machines can proactively engage in conversation instead of being limited to prompted answers.

Besides, when you know that the other party is a robot, will you say “thank you” or “sorry” to the robot?

During our simulations, we found that if the doctor knew in advance that the patient was played by ChatGPT, the doctor would not feel empathy for the “patient” and would be more inclined simply to go through the process, checking during the diagnosis whether ChatGPT acted as a qualified patient. The same is true when ChatGPT plays the role of a doctor dealing with patients. Therefore, it is also necessary to understand the gaps between human-human interaction and human-machine interaction; exploring this gap is key to achieving genuine AGI.

Question: Do you regard it as a good thing or a bad thing to develop more human-like machines?

Professor Wu: I think making machines more similar to humans can, on the one hand, help machines achieve better performance; on the other hand, when machines have capabilities similar to humans, humans can communicate with them more naturally, otherwise there is still a gap between the two. As for whether we want robots to be more human-like, that is to a larger extent an ethical discussion. For example, the AI robot Moss in the movie The Wandering Earth may have begun to show its own consciousness. Is the appearance of consciousness a good thing or a bad thing for robots, and what is the value and significance of their existence? I think these questions will be discussed by philosophers.

From a technical point of view, we definitely hope that AGI will be more like humans. When robots have abilities similar to humans, it will be of great help to humans by liberating them from a lot of repetitive labor. As for whether human beings will improve or reduce their behavioral capabilities after the liberation, that is a consequence no one can predict at the moment.

 

Interview by Ashely Wang
Edited by Xu Yunke