The Speech Signal Processing Workshop is an annual academic event organized by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). This year's invited speakers include Prof. Tan Lee of The Chinese University of Hong Kong; Prof. Fei Chen of the Southern University of Science and Technology, China; Prof. Yun-Nung Chen of the Department of Computer Science and Information Engineering, National Taiwan University; Dr. Yen-Fu Cheng of the Department of Medical Research, Taipei Veterans General Hospital; Dr. Hantao Huang of MediaTek; Dr. Syu Siang Wang of the Research Center for Information Technology Innovation, Academia Sinica; and Dr. Chi-Te Wang of Far Eastern Memorial Hospital. The talks cover speech signal processing, hardware development for speech technologies, natural language processing, and applications of speech in medicine, making this an event not to be missed by interested researchers and practitioners in Taiwan's academia and industry.
Prof. Jui-Feng Yeh, Department of Computer Science and Information Engineering, National Chiayi University: Deep-Encoded Multimodal Techniques with Visual and Acoustic Annotations for Image Caption Generation (Year 1)
Prof. Kuan-Yu Chen, National Taiwan University of Science and Technology: Neural-Network-Based Language Models: Innovation, Future, and Applications
Prof. Hung-yi Lee, Department of Electrical Engineering, National Taiwan University: Towards Unsupervised Speech Understanding
Prof. Tai-Shih Chi, Department of Electrical Engineering, National Chiao Tung University: A Binaural Auditory Scene Analysis Model Based on Deep-Neural-Network Perceptual Models (1/3)
Prof. Lung-Hao Lee, Department of Electrical Engineering, National Central University: Wavelet Analysis of EEG for Seizure Detection in Stroke Patients
Prof. Chung-Hsien Wu, National Cheng Kung University: A Spoken-Dialogue Home Companion and Recommendation System for the Elderly
Prof. Chen-Yu Chiang, Department of Communication Engineering, National Taipei University: A Text-to-Speech System Developed with Deep Learning Techniques
Prof. Yuan-Fu Liao, Department of Electronic Engineering, National Taipei University of Technology: Advanced Deep-Learning-Based Speech-Enabled Applications: Automatic Transcription and Summarization of Multilingual TV and Radio Programs, Corpus Construction, and Content Retrieval
Prof. Hsin-Min Wang, Institute of Information Science, Academia Sinica: Voice Conversion and Its Applications
Prof. Yu Tsao, Research Center for Information Technology Innovation, Academia Sinica: Developing Novel Objective Functions and Model Compression Techniques for Deep-Learning-Based Speech Enhancement Systems
08/07 SWS 2020!
Tan Lee is currently an Associate Professor at the Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK). He has been working on speech- and language-related research for over 20 years. His research covers spoken language technologies, speech enhancement and separation, audio and music processing, speech and language rehabilitation, and the neurological basis of speech and language. He led the development of Cantonese-focused spoken language technologies that have been widely licensed for industrial applications. His current work focuses on applying signal processing and machine learning methods to atypical speech and language related to different kinds of human communication and cognitive disorders. He is an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing and the EURASIP Journal on Advances in Signal Processing. He is the Vice Chair of the ISCA Special Interest Group on Chinese Spoken Language Processing, and served as an Area Chair on the Technical Programme Committees of INTERSPEECH 2014, 2016, and 2018.
Yun-Nung (Vivian) Chen is currently an assistant professor in the Department of Computer Science & Information Engineering at National Taiwan University. She earned her Ph.D. degree from Carnegie Mellon University, where her research interests focus on spoken dialogue systems, language understanding, natural language processing, and multimodality. She received Google Faculty Research Awards, MOST Young Scholar Fellowship, FAOS Young Scholar Innovation Award, Student Best Paper Awards, and the Distinguished Master Thesis Award. Prior to joining National Taiwan University, she worked in the Deep Learning Technology Center at Microsoft Research Redmond.
Dr. Syu Siang Wang received his Ph.D. degree in 2018 from the Graduate Institute of Communication Engineering, National Taiwan University, with a dissertation on wavelet speech enhancement and feature compression. He won the Ph.D. Thesis Award from ACLCLP. In addition, he was twice a summer intern: at the National Institute of Information and Communications Technology, Japan, in September 2015, and at the Department of Electrical and Electronic Engineering, SUSTech, China, in June 2016.
From August 2018 to July 2019, he was a postdoctoral researcher at the MOST Joint Research Center for AI Technology and All Vista Healthcare, where he developed algorithms for healthcare applications. Several papers were published based on his research achievements.
Currently, he is a postdoctoral researcher at the Research Center for Information Technology Innovation, Academia Sinica. His research interests include speech and speaker recognition, acoustic modeling, audio coding, and bio-signal processing.
Fei Chen received the B.Sc. and M.Phil. degrees from the Department of Electronic Science and Engineering, Nanjing University, in 1998 and 2001, respectively, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong, in 2005. He continued his research as a postdoctoral fellow and senior research fellow at the University of Texas at Dallas and The University of Hong Kong, and joined the Southern University of Science and Technology (SUSTech) as a faculty member in 2014. Dr. Chen leads the speech processing research group at SUSTech, with research focusing on speech perception, speech intelligibility modeling, speech enhancement, and assistive hearing technology. He has published over 80 journal papers and over 80 conference papers in venues including IEEE journals and conferences, Interspeech, and the Journal of the Acoustical Society of America. He received the best presentation award at the 9th Asia Pacific Conference of Speech, Language and Hearing, and a 2011 National Organization for Hearing Research Foundation Research Award in the United States. Dr. Chen serves as an associate editor or editorial board member of Frontiers in Psychology, Biomedical Signal Processing and Control, and Physiological Measurement.
Yen-Fu Cheng is a surgeon-scientist at the Department of Medical Research, and director of research and attending physician at the Department of Otolaryngology-Head and Neck Surgery, Taipei Veterans General Hospital. He is also an adjunct assistant professor at the Institute of Brain Science, Faculty of Medicine, National Yang-Ming University. He is currently the Principal Investigator of the Laboratory of Auditory Physiology and Genetic Medicine.
Yen-Fu's research focuses on auditory neuroscience and clinical otology. In basic research, he is dedicated to applying cutting-edge gene transfer and gene editing methods to understand inner ear disorders and develop therapies for them. In clinical research, he is interested in using state-of-the-art methods to approach clinical otology issues, such as next-generation sequencing for genetic medicine and artificial intelligence for hearing-related diseases.
Yen-Fu received his medical degree from Taipei Medical University and his doctoral degree from the Massachusetts Institute of Technology, where he studied Speech and Hearing Bioscience and Technology at the Harvard-MIT Division of Health Sciences and Technology. He was a post-doctoral research fellow at Harvard Medical School before starting his lab at VGH-TPE/NYMU.
Hantao Huang (S’14) received the B.S. and Ph.D. degrees from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2013 and 2018, respectively. Since 2018, he has been a Staff Engineer with MediaTek, Singapore, where he is involved in natural language processing algorithms, neural network compression and quantization for edge devices. His current research interests include speech recognition, machine-learning algorithms, and low power systems.
Dr. Chi-Te Wang received his MD degree from National Taiwan University, Taipei, Taiwan, in 2003. After residency training from 2003 to 2008, he joined Far Eastern Memorial Hospital as an attending physician. He received his PhD degree from the Institute of Epidemiology and Preventive Medicine at National Taiwan University in 2014. During his professional career, he visited Mount Sinai Hospital (NYC, 2009), the Mayo Clinic (Arizona, 2012), the Isshiki voice center (Kyoto, 2015), the UC Davis voice and swallow center (Sacramento, 2018), and the UCSF voice and swallow center (San Francisco, 2018) for continued exposure to expert practice. He is a corresponding member of the American Laryngological Society and a council member of the Taiwan Otolaryngological Society and the Taiwan Voice Society. He has wide clinical and academic interests and has published a dozen papers in different fields, including phonosurgery, automatic detection and classification of voice disorders, real-time monitoring of phonation, and telepractice. He is the inventor of multiple international patents on voice detection, classification, and treatment. He co-hosted the Big Data Cup Challenge at the 2018 and 2019 IEEE International Conference on Big Data. He won the Society for Promotion of International Oto-Rhino-Laryngology (SPIO) Award in 2015, the Best Synergy Award of the Far Eastern Group in 2018, and the National Innovation Award of Taiwan in 2019.
|08:30 - 09:00||Registration|
|09:00 - 09:10||Opening Remarks||President 郭旭崧||-|
|09:10 - 09:50||Deep Learning Approaches to Automatic Assessment of Speech and Language Impairment||Prof. Tan Lee||Prof. Hsin-Min Wang|
|09:50 - 10:00||Intermission|
|10:00 - 10:40||Towards Superhuman Conversational AI||Prof. Yun-Nung Chen||Prof. Yu Tsao|
|10:40 - 10:50||Intermission|
|10:50 - 11:30||Ambulatory Phonation Monitoring Using Wireless Microphone Based on Energy Envelope||Dr. Chi-Te Wang||Dr. 力博宏|
|11:30 - 13:30||MOST Outcomes Presentation|
|13:30 - 14:10||Minimum Acoustic Information Required for an Intelligible Speech||Prof. Fei Chen||Prof. Chi-Chun Lee|
|14:10 - 14:20||Intermission|
|14:20 - 15:00||A New Era of Otology and Hearing Research: NGS, CRISPR, App, AI and Beyond||Dr. Yen-Fu Cheng||Dr. 廖文輝|
|15:00 - 15:10||Intermission|
|15:10 - 15:50||Make a Power-efficient Voice UI on Edge Devices||Dr. Hantao Huang||Prof. Yuan-Fu Liao|
|15:50 - 16:00||Intermission|
|16:00 - 16:40||Single- and Multi-channel Speech Enhancement System||Dr. Syu Siang Wang||Prof. Shih-Hau Fang|
|16:40 - 17:00||Closing Remarks||Prof. Ying-Hui Lai, Prof. 王坤卿||-|
Speech is a natural and preferred means of expressing one's thoughts and emotions for communication purposes. Speech and language impairments negatively impact the daily lives of a large population worldwide. Speech impairments are manifested as atypical articulation and phonation, while language impairments can be present across multiple linguistic levels in the use of spoken or written language. Timely and reliable assessment of the type and severity of impairment is crucial to effective treatment and rehabilitation. Conventionally, speech assessment is carried out by professional speech and language pathologists (SLPs). In view of the shortage of qualified SLPs with the relevant linguistic and cultural background, objective assessment techniques based on acoustic signal analysis and machine learning models are expected to play an increasingly important role in assisting clinical assessment. This presentation will cover a series of our recent studies on applying deep learning models to the automatic assessment of different types of speech and language impairments. The types of impairments we have tackled include voice disorders in adults, phonology and articulation disorders in children, and neurological disorders in elderly people. All of our work focuses on spoken Cantonese. The use of Siamese networks and auto-encoder models has been investigated to address the challenges of scarce training speech and the absence of reliable labels. Findings from attempting an end-to-end approach to speech assessment will also be shared.
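The appeal of Siamese networks for label-scarce assessment is that one shared embedding is trained on pairs rather than on individually labeled samples. A minimal pure-Python sketch of the shared-weights idea with a contrastive loss (the weight values and two-dimensional inputs are purely illustrative, not the talk's model):

```python
import math

# Shared embedding: one linear map applied to BOTH inputs of a pair
# (the "shared weights" that define a Siamese network).
W = [[0.6, -0.2], [0.1, 0.8]]  # illustrative 2x2 weight matrix

def embed(x):
    """Apply the shared weights to one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(x1, x2, same, margin=1.0):
    """same=1 pulls a pair together; same=0 pushes it beyond the margin."""
    d = distance(embed(x1), embed(x2))
    return d ** 2 if same else max(0.0, margin - d) ** 2
```

Training then only needs pairwise "same impairment category or not" judgments, which are cheaper to obtain than per-utterance severity labels.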
Even though conversational systems have attracted a lot of attention recently, current systems sometimes fail due to errors from their different components. This talk presents potential directions for improvement: 1) learning language embeddings tailored to practical scenarios for better robustness, and 2) a novel learning framework for natural language understanding and generation built on their duality for better scalability. Both directions enhance the robustness and scalability of conversational systems and point to promising future research areas.
Real-world environments always contain stationary and/or time-varying noises that are received together with speech signals by recording devices. The received noise inevitably degrades the performance of human-to-human and human-to-machine interfaces, and this issue has attracted significant attention over the years. To address it, an important front-end speech process, namely speech enhancement, extracts the clean components from noisy input and can improve the quality and intelligibility of noise-deteriorated speech. Speech enhancement systems fall into two categories in terms of physical configuration: single-channel and multi-channel. In a single-channel system, the speech waveform is recorded by a single microphone and then enhanced by a system derived from the temporal information of the input. In a multi-channel system, multiple microphones record the input speech, and the system is designed by simultaneously exploiting the spatial diversity and temporal structure of the received signals. In this talk, we present our recent research achievements in applying machine learning and signal processing to improve speech perception in both configurations.
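A classical baseline for the single-channel case is spectral subtraction: estimate the noise magnitude spectrum, subtract it from each noisy frame, and keep the noisy phase. A toy pure-Python sketch with a naive DFT (the frame size and flooring constant are arbitrary illustrative choices, not the speaker's system):

```python
import math, cmath

def dft(x):
    """Naive O(N^2) DFT; fine for a toy frame."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtract(noisy_frame, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude per bin; keep the noisy phase."""
    X = dft(noisy_frame)
    cleaned = []
    for Xk, Nk in zip(X, noise_mag):
        # the spectral floor prevents negative magnitudes
        mag = max(abs(Xk) - Nk, floor * abs(Xk))
        cleaned.append(mag * cmath.exp(1j * cmath.phase(Xk)))
    return idft(cleaned)
```

In practice the noise magnitudes would be estimated from speech-free frames, and the framing would use overlap-add with windowing; learning-based enhancers replace the fixed subtraction rule with a trained mapping.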
The speech signal carries a lot of information that is redundant for speech understanding, and many studies have shown that the loss of some acoustic information does not significantly affect speech intelligibility as long as the important acoustic information is preserved. Due to their hearing loss, hearing-impaired listeners are unable to perceive some acoustic information (e.g., temporal fine structure). Hence, studying the acoustic information minimally required for intelligible speech in different listening environments can guide the design of novel assistive hearing technologies. In this talk, I will first introduce early work on the relative importance of commonly used acoustic cues for speech intelligibility, focusing on a vocoder model of speech intelligibility. Then, I will present recent studies towards reconstructing intelligible speech from cortical EEG signals, including Mandarin tone imagery and speech reconstruction.
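The noise-excited vocoder used in such intelligibility studies can be caricatured in a few lines: extract the slowly varying amplitude envelope (rectify, then smooth) and use it to modulate a noise carrier, discarding the temporal fine structure. This single-band, moving-average sketch is an illustrative simplification (real vocoder studies split the signal into several frequency bands first):

```python
import random

def envelope(signal, win=8):
    """Full-wave rectify, then smooth with a centered moving average."""
    rect = [abs(s) for s in signal]
    env = []
    for i in range(len(rect)):
        lo, hi = max(0, i - win // 2), min(len(rect), i + win // 2 + 1)
        env.append(sum(rect[lo:hi]) / (hi - lo))
    return env

def noise_vocode(signal, win=8, seed=0):
    """Replace fine structure with noise while keeping the envelope."""
    rng = random.Random(seed)
    return [e * rng.uniform(-1.0, 1.0) for e in envelope(signal, win)]
```

Listeners can often understand speech processed this way with only a handful of bands, which is the classic demonstration that the envelope alone carries much of the intelligibility.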
The fields of clinical otology and hearing research are advancing at the forefront of innovation in medicine and technology. Promising progress in genetic medicine and digital technology has started to change traditional medical and hearing research. Next-generation sequencing, novel gene therapy vectors, CRISPR-Cas9 gene editing technologies, mobile-phone apps, and artificial intelligence all generate enormous creative energy. In this talk, I will introduce how these revolutionary technologies are changing physicians' practice and research.
As privacy concerns grow, the voice user interface (UI) is transitioning from the cloud to edge devices. However, deploying a neural-network-based voice/language model on edge devices with efficient power consumption is very challenging. In this talk, we will first introduce MediaTek NeuroPilot, which tackles this challenge at the platform level. Then, more specifically, we investigate the algorithm perspective, including algorithm trends and deployment opportunities. Finally, we show some preliminary results on speech recognition and natural language understanding.
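One standard ingredient of such edge deployment is post-training weight quantization. A minimal sketch of symmetric per-tensor int8 quantization (the weight values are illustrative; production toolchains also quantize activations and calibrate on real data): scale the float weights into [-127, 127], round, and dequantize, so the round-trip error stays within half a quantization step.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [int(round(w / scale)) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [qi * scale for qi in q]

weights = [0.52, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Storing 8-bit codes instead of 32-bit floats cuts model size by 4x and lets integer arithmetic units do the heavy lifting, which is where most of the power saving comes from.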
Voice disorders mainly result from chronic vocal overuse or abuse, particularly among teachers and other occupational voice users. Previous studies have proposed a contact microphone attached to the anterior neck for ambulatory voice monitoring; however, the inconvenience associated with taping and wiring, together with the lack of real-time processing, has limited its daily application.
Starting in 2015, we founded a research group collaborating with experts from National Yang-Ming University, Yuan Ze University, and Far Eastern Memorial Hospital, and proposed a system using a wireless microphone for real-time ambulatory voice monitoring. We invited 10 teachers to participate in a pilot study. We designed an adaptive threshold (AT) function to detect the presence of speech based on the energy envelope. All participants wore a wireless microphone during a teaching class (around 40-60 minutes) in a quiet classroom (background noise < 55 dB SPL). We developed software for manually labeling speech segments according to the time and frequency domains. We randomly selected 25 utterances (10 s each) from the recorded audio files to calculate the coefficients of the AT function via a genetic algorithm. Another five randomly selected utterances were used to test the accuracy of the automatic speech detection (ASD) system, using the manually labeled data as the ground truth. We measured the phonation ratio (speech frames / total frames) and the length of speech segments as a proxy for the phonation habits of the users. We also mimicked noisy-background scenarios by manually mixing four different types of noise into the original recordings. An adjuvant noise-reduction function based on the log-MMSE algorithm was applied to counteract the influence of noise on detection accuracy.
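The energy-envelope detection and phonation-ratio computation described above can be sketched as follows; the frame length, noise-floor estimate, and fixed threshold multiplier below are illustrative stand-ins for the coefficients the study fitted with a genetic algorithm:

```python
def frame_energies(signal, frame_len=160):
    """Mean squared amplitude per non-overlapping frame."""
    return [sum(s * s for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def detect_speech(energies, factor=4.0):
    """Flag frames whose energy exceeds a multiple of the noise floor,
    here estimated as the quietest frame's energy."""
    floor = min(energies)
    thresh = factor * (floor + 1e-12)
    return [e > thresh for e in energies]

def phonation_ratio(flags):
    """Speech frames divided by total frames."""
    return sum(flags) / len(flags)
```

With 160-sample frames at a 16 kHz sampling rate this corresponds to 10 ms per frame; the detected flags also yield speech-segment lengths by counting consecutive True runs.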
The study results exhibited detection accuracy (for speech) ranging from 81% to 94%. Subsequent analyses revealed a phonation ratio between 50% and 78%, with most phonation segments shorter than 10 s. Although the presence of background noise reduced the accuracy of the ASD system (to 25%-79%), the adjuvant noise-reduction function effectively improved the accuracy by up to 45.8%, especially under stationary noise (e.g., white noise).
This study demonstrated good detection accuracy for the proposed system. Preliminary results on phonation ratio and speech segments were all comparable to those of previous research. Although the wireless microphone was susceptible to background noise, the additional noise-reduction function could overcome this limitation. These results indicate that the proposed system can be applied to ambulatory voice monitoring for occupational voice users.