US 11,894,008 B2
Signal processing apparatus, training apparatus, and method
Naoya Takahashi, Kanagawa (JP)
Assigned to SONY CORPORATION, Tokyo (JP)
Appl. No. 16/769,122
Filed by SONY CORPORATION, Tokyo (JP)
PCT Filed Nov. 28, 2018, PCT No. PCT/JP2018/043694
§ 371(c)(1), (2) Date Jun. 2, 2020,
PCT Pub. No. WO2019/116889, PCT Pub. Date Jun. 20, 2019.
Claims priority of application No. 2017-237401 (JP), filed on Dec. 12, 2017.
Prior Publication US 2021/0225383 A1, Jul. 22, 2021
Int. Cl. G10L 21/00 (2013.01); G10L 25/00 (2013.01); G10L 21/007 (2013.01); G10L 21/028 (2013.01); G10L 21/013 (2013.01)
CPC G10L 21/007 (2013.01) [G10L 21/013 (2013.01); G10L 21/028 (2013.01)]
15 Claims
 
1. A signal processing apparatus, comprising:
a central processing unit (CPU) configured to:
receive first acoustic data of a sound of an input sound source;
receive a voice quality converter parameter, wherein
the voice quality converter parameter is trained based on a discriminator parameter, a speaker ID of a target sound source, and first training data of the sound of the input sound source,
the discriminator parameter is trained based on the first training data of the sound of the input sound source, second training data of a sound of the target sound source, and third training data of a sound of a sound source different from the input sound source and the target sound source,
the target sound source is different from the input sound source,
the discriminator parameter discriminates the input sound source of the first acoustic data,
the first training data and the second training data are based on second acoustic data of a mixed sound,
the mixed sound includes the sound of the input sound source and the sound of the target sound source, and
the second acoustic data is different from parallel data and clean data; and
convert the first acoustic data of the input sound source to third acoustic data of voice quality of the target sound source, wherein the conversion of the first acoustic data to the third acoustic data is based on the voice quality converter parameter.
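The data flow recited in claim 1 can be illustrated with a toy sketch. Everything below is an assumption made for illustration only and is not the patented implementation: the "discriminator parameter" is modeled as one mean feature vector per speaker ID, the "voice quality converter parameter" as a single offset vector, and the function names (`train_discriminator`, `discriminate`, `train_converter`, `convert`), the speaker labels, and the two-dimensional feature values are all hypothetical. The sketch also assumes the first and second training data have already been obtained from the mixed sound; that separation step is not modeled.

```python
from statistics import fmean


def column_means(frames):
    """Mean of each feature dimension over a list of feature frames."""
    return [fmean(col) for col in zip(*frames)]


def train_discriminator(first, second, third):
    """Toy 'discriminator parameter': one centroid per speaker ID.

    first  -> training data of the input sound source
    second -> training data of the target sound source
    third  -> training data of a different sound source
    """
    return {
        "input": column_means(first),
        "target": column_means(second),
        "other": column_means(third),
    }


def discriminate(disc, frame):
    """Return the speaker ID whose centroid is nearest to the frame."""
    def sq_dist(speaker_id):
        return sum((a - b) ** 2 for a, b in zip(frame, disc[speaker_id]))
    return min(disc, key=sq_dist)


def train_converter(disc, target_id, first):
    """Toy 'voice quality converter parameter'.

    Uses the discriminator parameter, the target speaker ID, and the
    input-speaker training data: the offset that moves the input
    speaker's mean onto the target speaker's centroid.
    """
    source_mean = column_means(first)
    return [t - s for t, s in zip(disc[target_id], source_mean)]


def convert(frame, converter):
    """Apply the converter parameter to one frame of acoustic data."""
    return [x + d for x, d in zip(frame, converter)]


# Hypothetical two-dimensional feature frames (illustrative values only).
first_training = [[0.0, 1.0], [0.2, 0.8]]
second_training = [[5.0, 5.0], [5.2, 4.8]]
third_training = [[-4.0, 9.0], [-4.2, 9.2]]

disc = train_discriminator(first_training, second_training, third_training)
converter = train_converter(disc, "target", first_training)

first_acoustic = [0.1, 1.1]                         # input sound source
third_acoustic = convert(first_acoustic, converter)  # converted output

print(discriminate(disc, first_acoustic))
print(discriminate(disc, third_acoustic))
```

In this sketch the discriminator is trained first, the converter is then trained against it for a chosen target speaker ID, and only the converter parameter is needed at conversion time, which mirrors the ordering in the claim. A real system would use learned models rather than centroids and offsets.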