Permalink : https://doi.org/10.15002/00025229
Permalink : https://hdl.handle.net/10114/00025229
Item type |
Title |
Author |
Language |
DOI |
Start page |
End page |
Year of publication |
Author version flag |
Degree number |
Date of degree conferral |
Degree name |
Degree-granting institution |
Keywords |
Description |
Abstract |
In this research, we aim to extract paralinguistic and nonverbal information such as emotions, speaking style, and speaker attributes for a human-like empathetic dialogue system. Empathy, the ability to map another person's feelings and thoughts onto one's own knowledge, plays an important role in human communication. In particular, personalization and the understanding of emotion are essential for an advanced dialogue system. This research therefore focuses on methods that, like a human agent, estimate speaker attributes, personal speaking style, and emotion category, which relate to personalization and emotion, in real time from a small amount of speech. By integrating the methods proposed here, more human-like recognition of paralinguistic and nonverbal information becomes possible for speech-based automatic dialogue systems. This doctoral dissertation consists of five chapters. Chapter 1 provides the introduction.
In Chapter 2, we propose a method for identifying speaker attributes, which are nonverbal information in speech; in this chapter we specifically focus on distinguishing male and female speech. To extract speaker attributes, a speech segment must first be detected in a sound signal that mixes speech and non-speech segments, and the attributes must then be identified within that segment. Conventional speaker attribute identification first detects the endpoint of a sufficiently long stretch of continuous speech, then extracts features for attribute identification and classifies the segment. However, this introduces a delay, since identification starts only after the end of speech has been detected. In our method, a single neural network computes, for each time frame, the probabilities of the speaker attributes together with the probabilities of speech and non-speech. The framework identifies speaker attributes sequentially based on their accumulated probabilities. This method makes it possible to classify male and female speech with high accuracy while maintaining the accuracy of speech segment detection.
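As an illustration only, the following Python sketch shows one way the outputs of a single frame-level model could be accumulated to make a sequential male/female decision while also serving as speech/non-speech detection. The three-way output layout, the speech threshold, and the log-probability accumulation rule are assumptions made for the example, not the exact formulation used in the dissertation.

```python
import numpy as np

def frame_sync_attribute_id(frame_probs, speech_threshold=0.5):
    """Sequentially accumulate per-frame evidence for a speaker attribute.

    frame_probs: (T, 3) array whose columns are, per frame, the softmax
    outputs of one neural network: P(male speech), P(female speech),
    P(non-speech).  A decision is available after every frame, so no
    end-of-speech detection is needed before identification starts.
    """
    log_male = log_female = 0.0
    decisions = []
    for p_male, p_female, p_nonspeech in frame_probs:
        p_speech = p_male + p_female              # frame-level speech probability
        if p_speech >= speech_threshold:          # frame judged to be speech
            log_male += np.log(p_male / p_speech + 1e-10)
            log_female += np.log(p_female / p_speech + 1e-10)
        decisions.append("male" if log_male >= log_female else "female")
    return decisions

# Example: three frames that look increasingly like female speech.
probs = np.array([[0.10, 0.10, 0.80],
                  [0.20, 0.60, 0.20],
                  [0.15, 0.75, 0.10]])
print(frame_sync_attribute_id(probs))   # ['male', 'female', 'female']
```

Because the attribute evidence is updated frame by frame, the decision improves continuously instead of becoming available only after an end-of-speech detector fires.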
In Chapter 3, we propose a phoneme discrimination method that leads to the detection of low-intelligibility speech. In low-intelligibility speech, the phonemes in the affected parts are unclear and differ significantly in character from phonemes in ordinary speech. Since phoneme features depend on the relative phoneme position, phonemes would normally have to be clustered by position, with a discriminative model trained for each cluster to decide whether a phoneme is clearly uttered or not. We therefore propose a discriminator that contains phoneme environment-dependent clusters internally, which makes it possible to discriminate phonemes without pre-clustering and to calculate an intelligibility score.
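A discriminator with internal phoneme environment-dependent clusters could be organized, for example, as a softly gated set of per-cluster clarity heads. The PyTorch sketch below is a minimal, hypothetical layout (layer sizes, cluster count, and gating are placeholders) intended only to show how clustering and discrimination can live inside one model so that no separate pre-clustering step is required.

```python
import torch
import torch.nn as nn

class ClusteredIntelligibilityDiscriminator(nn.Module):
    """Discriminator with internal phoneme-environment-dependent clusters.

    A gating network softly assigns each phoneme feature vector to one of
    `n_clusters` environment clusters; a per-cluster binary head scores
    whether the phoneme is clearly uttered.  The final intelligibility
    score is the gate-weighted combination of the head outputs.
    (Hypothetical sizes; not the dissertation's exact model.)
    """
    def __init__(self, feat_dim=40, n_clusters=8, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, n_clusters),
                                  nn.Softmax(dim=-1))
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1), nn.Sigmoid())
            for _ in range(n_clusters))

    def forward(self, x):                       # x: (batch, feat_dim)
        gates = self.gate(x)                    # (batch, n_clusters)
        scores = torch.cat([h(x) for h in self.heads], dim=-1)
        return (gates * scores).sum(dim=-1)     # score in [0, 1] per phoneme
```

Under this layout, scores near zero would flag poorly articulated phonemes, and averaging the scores over a span gives a rough intelligibility estimate for that region.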
In Chapter 4, we propose a method for extracting paralinguistic and nonverbal information such as fillers and word fragments. Fillers and word fragments have many variations, and it is not practical to register all of their patterns in a language dictionary in advance. Existing methods therefore use two-pass decoding, detecting fillers and word fragments from a confusion network produced by first-pass recognition together with a sub-word language model. However, this approach is unsuitable for real-time applications because processing can only start after the end of the utterance has been decoded. To solve this problem, we propose learning the acoustic patterns of fillers and word fragments as filler symbols and word fragment symbols, respectively, and incorporating a detection process that uses these symbols into a WFST speech recognition decoder, so that detection is handled in a single decoding pass.
Since the proposed method treats fillers and word fragments each as a single acoustic symbol, there is no need to register all of their speech patterns in a language dictionary. With this method, fillers and word fragments can be detected in real time while the speech is recognized in one pass without degrading accuracy. For fillers, the number of detections can be controlled using a confidence score based on the number of occurrences of filler symbols.
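To make the occurrence-based confidence concrete, the sketch below assumes that the single-pass WFST decoder emits dedicated output symbols (here called `<filler>` and `<frag>`; hypothetical names) and that an n-best list is available. The scoring rule, keeping a filler only when enough competing hypotheses also contain it, is one illustrative way to control the number of detections with a confidence score, not necessarily the dissertation's exact formulation.

```python
from collections import Counter

FILLER = "<filler>"   # hypothetical output symbol for filler acoustics
FRAGMENT = "<frag>"   # hypothetical output symbol for word fragments

def filter_fillers(nbest, threshold=0.5):
    """Control filler/fragment output with an occurrence-based confidence.

    nbest: list of hypotheses, each a list of output symbols from a single
    decoding pass whose WFST contains <filler>/<frag> symbol arcs.  A filler
    at position i of the best hypothesis is kept only if the fraction of
    n-best hypotheses that also emit a filler/fragment symbol at that
    position exceeds `threshold`.
    """
    best = nbest[0]
    counts = Counter()
    for hyp in nbest:
        for i, sym in enumerate(hyp):
            if sym in (FILLER, FRAGMENT):
                counts[i] += 1
    kept = []
    for i, sym in enumerate(best):
        if sym in (FILLER, FRAGMENT) and counts[i] / len(nbest) < threshold:
            continue                       # low-confidence filler: drop it
        kept.append(sym)
    return kept

nbest = [["I", FILLER, "think", "so"],
         ["I", FILLER, "think", "so"],
         ["I", "thing", "think", "so"]]
print(filter_fillers(nbest))               # ['I', '<filler>', 'think', 'so']
```

Raising the threshold suppresses more filler detections, which is the knob the occurrence-based confidence provides.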
In Chapter 5, we propose a method for recognizing emotions, which are paralinguistic and nonverbal information. At present, classification accuracy for 7 or 8 emotion categories is only around 70-80%, even when the emotions are uttered intentionally, so further improvement is needed. Emotional cues in speech are contained in both short and long spans of the signal, and many efforts have therefore been made to improve emotion classification by incorporating features at various temporal resolutions. Conventional emotion recognition methods tried to improve performance with a single neural network encompassing multiple temporal resolutions, but they have not achieved a substantial improvement because emotional speech databases are small.
We consider that the performance of emotion classification methods using high-level statistical functions (HSFs), which already show high accuracy, can be improved by extracting and combining HSFs from windows of multiple temporal resolutions instead of a single fixed window length. In this dissertation, we aim to improve accuracy by extending the HSFs that existing methods extract from a single fixed window to HSFs generated from more than 30 windows of different temporal resolutions. In addition, to reduce the number of parameters that must be learned simultaneously from a small amount of data, stacking with Gradient Boosting Decision Trees (GBDT) is applied when combining the features of multiple temporal resolutions. As a result, we obtained the highest emotion classification performance on the American emotional speech database. Furthermore, although the method initially uses more than 30 temporal resolutions, an analysis based on GBDT shows that the same classification performance can be obtained with features from only 15 temporal resolutions.
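As a rough illustration of the multi-resolution HSF and stacking idea, the Python sketch below computes a few statistical functionals of a frame-level descriptor over several window lengths and combines per-resolution base classifiers with a GBDT meta-learner. The functionals, window lengths, and base learner are placeholder choices, and scikit-learn's GradientBoostingClassifier stands in for whichever GBDT implementation the dissertation uses.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def hsfs(lld, win):
    """HSFs (mean/std/max/min) of a low-level descriptor over windows of `win` frames."""
    windows = [lld[s:s + win] for s in range(0, max(len(lld) - win + 1, 1), win)]
    per_window = np.array([[w.mean(), w.std(), w.max(), w.min()] for w in windows])
    return per_window.mean(axis=0)                 # utterance-level summary

def multi_resolution_hsfs(lld, resolutions=(10, 25, 50, 100)):
    """Concatenate HSFs computed at several temporal resolutions (window lengths)."""
    return np.concatenate([hsfs(lld, w) for w in resolutions])

def train_stacked(features_per_res, labels):
    """Stack per-resolution base classifiers with a GBDT meta-learner,
    keeping the number of parameters trained jointly on a small corpus low."""
    bases = [LogisticRegression(max_iter=1000).fit(X, labels)
             for X in features_per_res]
    meta_in = np.hstack([b.predict_proba(X) for b, X in zip(bases, features_per_res)])
    meta = GradientBoostingClassifier().fit(meta_in, labels)
    return bases, meta
```

In practice the meta-learner would be fed out-of-fold base predictions to avoid overfitting, and the importance scores of the trained GBDT are what would indicate which temporal resolutions can be pruned without hurting accuracy.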
Resource type |
Index |