US 12,008,336 B2
Multimodal translation method, apparatus, electronic device and computer-readable storage medium
Mei Tu, Beijing (CN); Fan Zhang, Beijing (CN); and Lijie Wang, Beijing (CN)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Sep. 20, 2021, as Appl. No. 17/479,195.
Claims priority of application No. 202011003004.7 (CN), filed on Sep. 22, 2020.
Prior Publication US 2022/0092276 A1, Mar. 24, 2022
Int. Cl. G06F 40/58 (2020.01); G06F 18/25 (2023.01); G06F 40/30 (2020.01); G06F 40/44 (2020.01); G06N 3/04 (2023.01); G06V 30/262 (2022.01)
CPC G06F 40/58 (2020.01) [G06F 18/25 (2023.01); G06F 40/30 (2020.01); G06F 40/44 (2020.01); G06N 3/04 (2013.01); G06V 30/274 (2022.01)] 20 Claims
OG exemplary drawing
 
1. A method for providing multimodal translation of a content in a source language, the method comprising:
receiving a user input with respect to a translation request of text included in the content;
in response to receiving the user input, acquiring a multimodal input from the content, the multimodal input including location information related to the content and other multimodal inputs, the location information corresponding to a geographical location from where the content is derived;
generating scene information representing the multimodal input related to the content by using a fusion layer configured to fuse location entity semantic information extracted based on the location information, and source text semantic information and a multimodal feature extracted based on the other multimodal inputs, into a fusion result such that the scene information is generated based on the fusion result;
identifying a candidate word set in a target language;
determining at least one candidate word from the candidate word set based on the scene information; and
translating the text included in the content into the target language using a translation model based on the determined at least one candidate word.
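The claimed steps form a pipeline: extract semantics from the location and the source text, fuse them with other multimodal features into scene information, use the scene information to select candidate target-language words, and bias the translation toward those candidates. The following is a minimal illustrative sketch of that data flow, not the patented implementation: the toy `embed` function, the additive `fusion_layer`, the dot-product `score`, and the lexicon-based `translate` are all hypothetical stand-ins (the claim does not disclose a concrete fusion operator or translation model).

```python
def embed(text, dim=4):
    # Toy deterministic embedding; stands in for a learned semantic encoder.
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += (ord(ch) % 17) / 17.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def fusion_layer(location_sem, text_sem, multimodal_feat):
    # Fuse location entity semantics, source text semantics, and another
    # multimodal feature into one scene-information vector (additive fusion
    # chosen here purely for illustration).
    return [a + b + c for a, b, c in zip(location_sem, text_sem, multimodal_feat)]

def score(candidate, scene):
    # Relevance of a candidate target-language word to the scene information.
    return sum(a * b for a, b in zip(embed(candidate), scene))

def translate(source_text, location, image_feature, candidate_set, lexicon):
    # Mirrors the claim: fuse inputs into scene information, determine at
    # least one candidate word from the candidate set, then translate
    # preferring the determined candidate(s).
    scene = fusion_layer(embed(location), embed(source_text), image_feature)
    chosen = {max(candidate_set, key=lambda w: score(w, scene))}
    out = []
    for word in source_text.split():
        options = lexicon.get(word, [word])
        preferred = [w for w in options if w in chosen]
        out.append((preferred or options)[0])
    return " ".join(out)
```

For example, an ambiguous source word such as "bank" could map to both "riverbank" and "financial_bank" in the lexicon; scene information derived from a location near a river would then steer candidate selection. In the claimed method this role is played by location entity semantics extracted from the geographical location where the content was derived.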