Voice production is a complex neuromuscular coordination process. Air moves out of the lungs toward the vocal folds through the coordinated action of the diaphragm, the abdominal and chest muscles, and the rib cage. The vocal folds vibrate and modulate airflow through the glottis, producing voiced sound. This sound then travels through the vocal tract, where it is selectively amplified or attenuated at different frequencies. Prior clinical research has shown that mental health disorders such as depression affect the voice production process; for example, the voice of a depressed person has been characterized as slow, monotonous, and disfluent, with high jitter and shimmer. Such characteristics (or features/representations) in the voice, known as voice biomarkers, can be used to assess or diagnose a condition or disease. Wonder Tech owns cutting-edge AI technology for depression assessment using voice biomarkers.
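To make jitter and shimmer concrete, here is a minimal sketch of their common "local" definitions: the mean absolute difference between consecutive glottal cycles, divided by the mean, applied to cycle periods (jitter) or cycle peak amplitudes (shimmer). The function names and sample measurements are hypothetical and for illustration only; this is not a description of Wonder Tech's feature pipeline.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two glottal cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def jitter_local(periods_s):
    """Cycle-to-cycle variation of the glottal period (frequency perturbation)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (amplitude perturbation)."""
    return local_perturbation(peak_amplitudes)


if __name__ == "__main__":
    # Hypothetical per-cycle measurements from a sustained vowel (assumed values).
    periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]  # seconds per cycle
    amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]         # arbitrary linear units
    print(f"jitter (local):  {jitter_local(periods):.2%}")
    print(f"shimmer (local): {shimmer_local(amplitudes):.2%}")
```

Higher values mean less stable vocal fold vibration. In practice the per-cycle periods and amplitudes would first have to be extracted from the recording by a pitch-tracking tool.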
To ensure high model accuracy, we follow the highest standards when collecting training data. Our multi-center research was designed and led by Peking University Sixth Hospital, one of the leading mental health institutes in China. Patients were diagnosed and recruited by psychiatrists from six mental health hospitals across the country, following DSM-5 standards. Patients were given an H5 mini program for voice sample collection, and the collection process was carefully designed to cover long vowels, number counting, rainbow passages, speech under cognitive load, open questions, and more. Our mental health dataset, Oizys, now contains more than 43,000 audio sessions collected from patients with depression, patients with anxiety, and people with neither condition, among others. It is by far the largest audio dataset from DSM-5-diagnosed patients.
Leveraging the most advanced deep learning and transfer learning technology, Wonder Tech owns a leading AI model for depression assessment using voice biomarkers. Our AI model can give accurate assessment results based on 30-second voice recordings (16 kHz, 16-bit).
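As an illustration of the recording format mentioned above, the sketch below checks a WAV file against a 16 kHz sample rate, 16-bit samples, and a 30-second length using Python's standard `wave` module. The function name is hypothetical, and treating 30 seconds as a minimum length is an assumption; the actual input validation used by the model is not described here.

```python
import wave

EXPECTED_RATE_HZ = 16_000        # 16 kHz sample rate
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
MIN_DURATION_S = 30.0            # assumption: 30 seconds treated as a minimum


def check_recording(path):
    """Return a list of format problems; an empty list means the file matches."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {8 * width} bits, expected 16 bits")
    duration = frames / rate if rate else 0.0
    if duration < MIN_DURATION_S:
        problems.append(f"duration is {duration:.1f} s, expected at least {MIN_DURATION_S:.0f} s")
    return problems
```

Calling `check_recording("sample.wav")` on a hypothetical file returns an empty list when the recording matches, or a human-readable list of mismatches otherwise.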
We first use self-supervised learning to learn latent voice feature representations from unlabeled voice data. These latent feature representations are then used as the input to another neural network, which is trained on the Oizys dataset. Compared with AI models built on traditional feature engineering (MFCCs etc.), our model achieves much higher performance (AUC 0.902) and is more robust in real-world scenarios.
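For readers unfamiliar with the AUC metric quoted above, here is a minimal sketch of how it can be computed: AUC is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as half. The function name, labels, and scores are hypothetical examples, not data or code from the Oizys study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical labels (1 = depression, 0 = control) and model scores.
    labels = [0, 0, 1, 1]
    scores = [0.10, 0.40, 0.35, 0.80]
    print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of samples, which is fine for illustration; rank-based implementations scale better on large datasets.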
Clinical study: A deep learning-based model for detecting depression in senior population