It has taken computer scientists decades to build automatic speech recognition AI models that approach human-level performance. These purely engineered models have abandoned earlier modeling frameworks based on linguistic theory in favor of data-driven, end-to-end, large-scale pre-trained deep neural networks. So how closely does such a model resemble the auditory pathway of the human brain?


To address this question, Professor Yuanning Li’s team at the School of Biomedical Engineering, ShanghaiTech University, collaborated with Professor Edward Chang at the University of California, San Francisco, and the team of Professors Jinsong Wu and Junfeng Lu at Fudan University. The researchers combined self-supervised pre-trained deep speech models, high-density intracranial electroencephalography, and single-neuron simulation modeling, and used a cross-linguistic (English and Chinese) controlled experimental paradigm to investigate the computational and representational similarities between AI speech models and the auditory pathway of the human brain.


In a traditional neural encoding model, researchers extract features from the speech stimulus at different levels (from purely acoustic spectrogram features, to phonetic features such as vowels, consonants and articulation, to relative pitch, which carries contextual information) and then use sliding time windows over these features to predict neural responses. If a particular type of feature accurately predicts neural activity in a particular region, it is usually assumed that neural activity in that region encodes that type of feature.
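The sliding-window prediction described above can be illustrated with a minimal sketch. It is not the study's code: the ridge penalty, the number of lags, the train/test split and all function names are assumptions chosen for illustration, and the "neural response" is synthetic. A response driven by a speech-envelope-like feature is predicted well from that feature and poorly from an unrelated one.

```python
import numpy as np

def lagged_design(features, n_lags):
    """Stack each feature at lags 0..n_lags-1 (the sliding time window)."""
    n_t, n_f = features.shape
    X = np.zeros((n_t, n_f * n_lags))
    for lag in range(n_lags):
        # Column block `lag` holds the feature values from `lag` frames earlier.
        X[lag:, lag * n_f:(lag + 1) * n_f] = features[:n_t - lag]
    return X

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def prediction_score(features, response, n_lags=10, alpha=1.0, train_frac=0.8):
    """Fit on the first part of the recording, correlate on the held-out rest."""
    X = lagged_design(features, n_lags)
    split = int(train_frac * len(X))
    w = fit_ridge(X[:split], response[:split], alpha)
    pred = X[split:] @ w
    return np.corrcoef(pred, response[split:])[0, 1]

# Synthetic data (hypothetical): the response is a delayed, smoothed copy
# of an envelope-like feature plus noise.
rng = np.random.default_rng(0)
envelope = rng.standard_normal((2000, 1))
unrelated = rng.standard_normal((2000, 1))
kernel = np.array([0.0, 0.5, 1.0, 0.5])
response = np.convolve(envelope[:, 0], kernel)[:2000]
response = response + 0.1 * rng.standard_normal(2000)

score_env = prediction_score(envelope, response)
score_unrel = prediction_score(unrelated, response)
print(f"envelope feature:  r = {score_env:.2f}")
print(f"unrelated feature: r = {score_unrel:.2f}")
```

Under the usual interpretation, the high held-out correlation for the envelope feature would be read as evidence that this simulated site encodes the envelope.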


Over the past decade or so, intracranial electrophysiological recordings combined with neural encoding models have revealed several important properties of neural coding. For example, different neural populations in the secondary auditory cortex of the superior temporal gyrus encode features ranging from the speech envelope to specific vowel and consonant phonemes.


In this study, the researchers went beyond the secondary auditory cortex, which is closely related to speech. Using high-density intracranial EEG recordings and high-precision single-neuron biophysical simulations, they obtained neural responses covering the entire auditory pathway, from the auditory nerve to the brainstem to the auditory cortex.


Given that AI models and brain auditory circuits are capable of receiving the same speech input and performing similar cognitive functions, are there computational and representational similarities between the two? This is the key question on which this study focuses.


To address this question, the researchers constructed a new deep neural encoding model. This purely data-driven model extracts feature representations from deep neural networks pre-trained on speech and uses these features to build a linear encoding model. The model's predictions are then correlated with real auditory responses from the brain to measure how similar the internal feature representations of the deep neural networks are to the activity of different neural populations along the auditory pathway.
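As a rough illustration of this kind of analysis, the sketch below fits one linear (ridge) encoding model per network layer and scores it by cross-validated correlation with the recorded response at each site. Everything here is assumed for illustration: the layer activations are random stand-ins rather than features from a real pre-trained speech model, the responses are simulated, and the regularization strength and fold count are arbitrary.

```python
import numpy as np

def ridge_predict(X_train, Y_train, X_test, alpha):
    """Fit ridge weights on the training data and predict the test data."""
    n = X_train.shape[1]
    W = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(n),
                        X_train.T @ Y_train)
    return X_test @ W

def columnwise_corr(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return (A * B).sum(axis=0) / (
        np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0))

def layer_scores(layer_feats, responses, alpha=10.0, n_folds=5):
    """Cross-validated prediction correlation per layer and recording site."""
    n_t = responses.shape[0]
    folds = np.array_split(np.arange(n_t), n_folds)
    scores = {}
    for name, X in layer_feats.items():
        fold_r = []
        for test_idx in folds:
            train = np.ones(n_t, dtype=bool)
            train[test_idx] = False
            pred = ridge_predict(X[train], responses[train], X[test_idx], alpha)
            fold_r.append(columnwise_corr(pred, responses[test_idx]))
        scores[name] = np.mean(fold_r, axis=0)
    return scores

# Hypothetical stand-ins for per-layer activations of a speech network.
rng = np.random.default_rng(1)
n_t, n_dim = 1500, 16
layer_feats = {f"layer_{i}": rng.standard_normal((n_t, n_dim)) for i in range(3)}

# Simulated sites: site 0 is driven by layer_0, site 1 by layer_2.
responses = np.column_stack([
    layer_feats["layer_0"] @ rng.standard_normal(n_dim),
    layer_feats["layer_2"] @ rng.standard_normal(n_dim),
]) + 0.5 * rng.standard_normal((n_t, 2))

scores = layer_scores(layer_feats, responses)
best_layer = [max(scores, key=lambda k: scores[k][site]) for site in range(2)]
print(best_layer)
```

Finding the best-predicting layer for each recording site is one simple way to ask which stage of a network a given stage of the auditory pathway most resembles.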


The researchers built encoding models on several pre-trained speech networks and compared how well each predicted neural activity at different nodes of the auditory pathway. The comparison revealed a strong similarity between the hierarchical structure of end-to-end pre-trained speech networks and that of the auditory circuit.


First, across the entire auditory pathway, the encoding model based on deep neural network features consistently outperforms the traditional linear feature model based on linguistic theory. This indicates that computation along the entire auditory pathway is strongly nonlinear. Second, models of different complexity best match different regions of the auditory pathway. In addition, within a single self-supervised speech model, the overall layer hierarchy corresponds to the hierarchy of the auditory pathway from the auditory nerve (AN) to the inferior colliculus (IC) to the superior temporal gyrus (STG).


Having established the representational similarities between the deep speech model and the auditory pathway, the researchers further explored the computational mechanisms driving these representational similarities, focusing on the best-performing HuBERT model.


The results show that as the network deepens, the attention weights aligned to long-range contextual structures become progressively larger. It is worth emphasizing that the HuBERT model used here is fully self-supervised: its training includes no explicit information about either context structure or speech content. This result suggests that a self-supervised speech model can learn key contextual structure related to language and semantics from natural speech.
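One plausible way to quantify how well attention weights align with a contextual structure is to correlate an attention matrix with a binary template of that structure. The sketch below is an assumption for illustration only: the exact metric used in the study is not described here, the attention matrices are toy constructions rather than HuBERT outputs, and the word boundaries are hypothetical.

```python
import numpy as np

def same_segment_template(segment_ids):
    """1 where two frames belong to the same segment (e.g. the same word)."""
    ids = np.asarray(segment_ids)
    return (ids[:, None] == ids[None, :]).astype(float)

def local_template(n_frames, width):
    """1 where two frames are at most `width` frames apart."""
    idx = np.arange(n_frames)
    return (np.abs(idx[:, None] - idx[None, :]) <= width).astype(float)

def alignment_score(attention, template):
    """Pearson correlation between flattened attention weights and template."""
    return np.corrcoef(attention.ravel(), template.ravel())[0, 1]

def softmax_rows(logits):
    """Row-wise softmax, so each frame's attention weights sum to 1."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
segment_ids = np.repeat(np.arange(6), 10)  # 6 hypothetical words, 10 frames each
n = len(segment_ids)
word_tpl = same_segment_template(segment_ids)
local_tpl = local_template(n, width=1)

# Toy attention maps: a "deep" layer attending within words and a
# "shallow" layer attending only to neighbouring frames.
deep_attn = softmax_rows(3.0 * word_tpl + 0.1 * rng.standard_normal((n, n)))
shallow_attn = softmax_rows(3.0 * local_tpl + 0.1 * rng.standard_normal((n, n)))

deep_word = alignment_score(deep_attn, word_tpl)
shallow_word = alignment_score(shallow_attn, word_tpl)
deep_local = alignment_score(deep_attn, local_tpl)
shallow_local = alignment_score(shallow_attn, local_tpl)
print(f"word-level alignment: deep {deep_word:.2f}, shallow {shallow_word:.2f}")
print(f"local alignment:      deep {deep_local:.2f}, shallow {shallow_local:.2f}")
```

In this toy setup the "deep" map scores higher on the word-level template and the "shallow" map scores higher on the local template, which mirrors the direction of the trend reported for the real network.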


In the secondary auditory cortex of the superior temporal gyrus, which is closely related to speech processing, the more closely the self-attention weights align with the contextual structures in speech, the better the neural network predicts brain activity. Conversely, in the primary auditory cortex, the auditory nerve and the brainstem, the more strongly the network represents local, transient temporal information, the more closely it resembles the brain signals.


Finally, the study analyzed whether the self-supervised model can learn more advanced contextual information. The results show that it learns higher-level, language-specific contextual information, and that this language-specific information is significantly correlated with computation and representation in the speech cortex of the brain.


This study provides a new biological perspective on opening the “black box” of deep neural networks, especially self-attention models such as the Transformer.


Read the paper on Nature Neuroscience