US 12,254,893 B2
Electronic device for recognizing sound and method thereof
Jubum Han, Suwon-si (KR); Hosang Sung, Suwon-si (KR); Yeaseul Song, Suwon-si (KR); and Jeonghoon Lee, Suwon-si (KR)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Feb. 8, 2023, as Appl. No. 18/107,185.
Application 18/107,185 is a continuation of application No. PCT/KR2023/000604, filed on Jan. 12, 2023.
Claims priority of application No. 10-2022-0032999 (KR), filed on Mar. 16, 2022; and application No. 10-2022-0122409 (KR), filed on Sep. 27, 2022.
Prior Publication US 2023/0298614 A1, Sep. 21, 2023
Int. Cl. G10L 25/51 (2013.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01); G10L 21/12 (2013.01); G10L 21/14 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)
CPC G10L 25/51 (2013.01) [G10L 15/063 (2013.01); G10L 21/12 (2013.01); G10L 21/14 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01); G10L 15/22 (2013.01); G10L 2015/223 (2013.01)]
15 Claims
 
1. A sound recognition method comprising:
sampling input sound based on a preset sampling rate; and
performing Fast Fourier Transform (FFT) on the sampled input sound based on at least one of random FFT numbers or random hop lengths, and generating a two-dimensional (2D) feature map, with a time axis and a frequency axis, from the sampled input sound on which FFT is performed,
wherein the generating of the 2D feature map comprises:
transforming the sampled input sound into first FFT data based on at least one of a first FFT number among the random FFT numbers or a first hop length among the random hop lengths,
generating a first 2D feature map including a first feature from the first FFT data,
transforming the sampled input sound into nth FFT data based on at least one of an nth FFT number among the random FFT numbers and an nth hop length among the random hop lengths, and
generating an nth 2D feature map including an nth feature from the nth FFT data, where n is greater than 1; and
training a neural network model, which recognizes sound, with a plurality of 2D feature maps including the first 2D feature map and the nth 2D feature map as training data.
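
The feature-map generation recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (stft_feature_map, random_feature_maps), the Hann window, the log-magnitude scaling, the 16 kHz sampling rate, the candidate FFT numbers and hop lengths, and the synthetic 440 Hz test tone are all assumptions chosen for illustration; the claim specifies none of them. The sketch stops before the training step.

```python
import numpy as np


def stft_feature_map(signal, n_fft, hop_length):
    """Return a 2D log-magnitude map of shape (frequency bins, time frames).

    The Hann window and log1p scaling are illustrative assumptions.
    """
    if len(signal) < n_fft:
        signal = np.pad(signal, (0, n_fft - len(signal)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)  # (n_frames, n_fft // 2 + 1)
    return np.log1p(np.abs(spectrum)).T     # (frequency, time)


def random_feature_maps(signal, n_maps, fft_choices, hop_choices, seed=0):
    """Build n_maps feature maps, each with a randomly drawn FFT number and hop length."""
    rng = np.random.default_rng(seed)
    maps = []
    for _ in range(n_maps):
        n_fft = int(rng.choice(fft_choices))
        hop = int(rng.choice(hop_choices))
        maps.append((n_fft, hop, stft_feature_map(signal, n_fft, hop)))
    return maps


# Assumed preset sampling rate and a synthetic one-second tone standing in for input sound.
SAMPLE_RATE = 16000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
sound = np.sin(2 * np.pi * 440.0 * t)

maps = random_feature_maps(
    sound,
    n_maps=4,
    fft_choices=[256, 512, 1024],  # hypothetical candidate FFT numbers
    hop_choices=[64, 128, 256],    # hypothetical candidate hop lengths
)
for n_fft, hop, fmap in maps:
    print(n_fft, hop, fmap.shape)
```

Each map has n_fft // 2 + 1 frequency bins and a frame count that depends on the hop length, so the same sound yields maps with different time and frequency resolution. Maps of different shapes would need to be resized or padded to a common shape before being batched for a typical neural network; the claim text shown here does not say how that is handled.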