US 12,148,063 B2
Compute optimizations for low precision machine learning operations
Elmoustapha Ould-Ahmed-Vall, Chandler, AZ (US); Sara S. Baghsorkhi, San Jose, CA (US); Anbang Yao, Beijing (CN); Kevin Nealis, San Jose, CA (US); Xiaoming Chen, Shanghai (CN); Altug Koker, El Dorado Hills, CA (US); Abhishek R. Appu, El Dorado Hills, CA (US); John C. Weast, Portland, OR (US); Mike B. Macpherson, Portland, OR (US); Dukhwan Kim, San Jose, CA (US); Linda L. Hurd, Cool, CA (US); Ben J. Ashbaugh, Folsom, CA (US); Barath Lakshmanan, Chandler, AZ (US); Liwei Ma, Beijing (CN); Joydeep Ray, Folsom, CA (US); Ping T. Tang, Edison, NJ (US); and Michael S. Strickland, Sunnyvale, CA (US)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Oct. 5, 2022, as Appl. No. 17/960,611.
Application 17/960,611 is a continuation of application No. 17/720,804, filed on Apr. 14, 2022, granted, now 11,468,541.
Application 17/720,804 is a continuation of application No. 16/983,080, filed on Aug. 3, 2020, granted, now 11,308,574, issued on Apr. 19, 2022.
Application 16/983,080 is a continuation of application No. 16/446,265, filed on Jun. 19, 2019, granted, now 11,138,686, issued on Oct. 5, 2021.
Application 16/446,265 is a continuation of application No. 16/197,821, filed on Nov. 21, 2018, granted, now 10,853,906, issued on Dec. 1, 2020.
Application 16/197,821 is a continuation of application No. 15/789,565, filed on Oct. 20, 2017, granted, now 10,242,423, issued on Mar. 26, 2019.
Application 15/789,565 is a continuation of application No. 15/581,167, filed on Apr. 28, 2017, granted, now 10,726,514, issued on Jul. 28, 2020.
Prior Publication US 2023/0061331 A1, Mar. 2, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06T 1/20 (2006.01); G06F 7/483 (2006.01); G06F 9/30 (2018.01); G06F 9/38 (2018.01); G06F 9/50 (2006.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/063 (2023.01); G06N 3/084 (2023.01); G06N 20/00 (2019.01); G06T 1/60 (2006.01); G06F 3/14 (2006.01); G06T 15/00 (2011.01)

CPC G06T 1/20 (2013.01) [G06F 7/483 (2013.01); G06F 9/30014 (2013.01); G06F 9/30185 (2013.01); G06F 9/3863 (2013.01); G06F 9/5044 (2013.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/063 (2013.01); G06N 3/084 (2013.01); G06N 20/00 (2019.01); G06F 3/14 (2013.01); G06T 1/60 (2013.01); G06T 15/005 (2013.01)]

27 Claims

1. A multi-chip module accelerator usable to execute tensor data processing instructions, the multi-chip module accelerator comprising:

a multi-chip module comprising:

an interconnect to a host processor;

a plurality of distinct chips integrated on the multi-chip module;

a memory stack including multiple memory dies; and

parallel processor circuitry communicatively coupled to the memory stack, the parallel processor circuitry comprising a plurality of multiprocessor cores distributed across the plurality of distinct chips, each of the plurality of multiprocessor cores configured to execute a single instruction to perform multiple matrix multiplication and accumulate operations;

wherein:

the matrix multiplication and accumulate operations comprise floating-point operations;

the floating-point operations are configurable to comprise two-dimensional matrix multiply and accumulate operations involving inputs that have differing floating-point precisions, the two-dimensional matrix multiply and accumulate operations including a plurality of concurrent multiply operations;

the floating-point operations comprise a first operation at a first precision and a second operation at a second precision; and

the first operation comprises a multiply having at least one 16-bit floating-point input and the second operation comprises an accumulate having a 32-bit floating-point input.
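Claim 1 recites a two-dimensional matrix multiply and accumulate in which the multiply takes at least one 16-bit floating-point input and the accumulate takes a 32-bit floating-point input. The following is a minimal numerical sketch of that arithmetic only, not of the claimed multi-chip hardware. It rests on several assumptions that the claim does not state. The 16-bit and 32-bit formats are taken to be IEEE 754 binary16 and binary32 (the claim would also read on other 16-bit formats such as bfloat16). Rounding is round-to-nearest-even. Both multiply operands are FP16, and the K products are summed in a fixed sequential order, whereas hardware would issue the multiplies concurrently. The function name `mixed_precision_mma` and its `accumulate_in_fp32` switch are hypothetical illustrations. The switch exists only to contrast FP32 accumulation with FP16 accumulation.

```python
import struct
from typing import List

Matrix = List[List[float]]


def to_fp16(x: float) -> float:
    """Round a Python float (IEEE 754 binary64) to the nearest binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (IEEE 754 binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_mma(a: Matrix, b: Matrix, c: Matrix,
                        accumulate_in_fp32: bool = True) -> Matrix:
    """Compute D = A x B + C (hypothetical reference model).

    A (M x K) and B (K x N) are rounded to FP16 before multiplying.
    C (M x N) and the running sums are held in FP32 when
    accumulate_in_fp32 is True, or in FP16 otherwise (for comparison).
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B must match")
    acc_round = to_fp32 if accumulate_in_fp32 else to_fp16

    # Multiply inputs at the lower (16-bit) precision.
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]

    d: Matrix = []
    for i in range(m):
        out_row = []
        for j in range(n):
            acc = acc_round(c[i][j])
            # Hardware would issue these K multiplies concurrently; this
            # reference loop runs them one after another.
            for p in range(k):
                # An FP16 x FP16 product (at most 22 significant bits) is exact
                # in FP32, so this rounding can only lose bits in FP16 mode.
                prod = acc_round(a16[i][p] * b16[p][j])
                acc = acc_round(acc + prod)
            out_row.append(acc)
        d.append(out_row)
    return d


if __name__ == "__main__":
    ones_row = [[1.0, 1.0, 1.0, 1.0]]
    ones_col = [[1.0] for _ in range(4)]
    print(mixed_precision_mma(ones_row, ones_col, [[2048.0]]))         # [[2052.0]]
    print(mixed_precision_mma(ones_row, ones_col, [[2048.0]], False))  # [[2048.0]]
```

The usage example shows why a wider accumulator is useful. Near 2048, adjacent binary16 values are 2 apart. Each FP16 sum 2048 + 1 therefore rounds back to 2048 under ties-to-even, and all four increments are lost. The FP32 accumulator represents every intermediate sum exactly.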