US 11,669,329 B2
Instructions and logic for vector multiply add with zero skipping
Supratim Pal, Bangalore (IN); Sasikanth Avancha, Bangalore (IN); Ishwar Bhati, Bangalore (IN); Wei-Yu Chen, San Jose, CA (US); Dipankar Das, Pune (IN); Ashutosh Garg, Folsom, CA (US); Chandra S. Gurram, Folsom, CA (US); Junjie Gu, Santa Clara, CA (US); Guei-Yuan Lueh, San Jose, CA (US); Subramaniam Maiyuran, Gold River, CA (US); Jorge E. Parra, El Dorado Hills, CA (US); Sudarshan Srinivasan, Bangalore (IN); and Varghese George, Folsom, CA (US)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Apr. 18, 2022, as Appl. No. 17/723,312.
Application 17/723,312 is a continuation of application No. 16/724,831, filed on Dec. 23, 2019, granted, now 11,314,515.
Prior Publication US 2022/0326953 A1, Oct. 13, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/38 (2018.01); G06F 9/30 (2018.01)

CPC G06F 9/3802 (2013.01) [G06F 9/3001 (2013.01); G06F 9/30018 (2013.01); G06F 9/30145 (2013.01)]

20 Claims

1. A graphics processor comprising:

an instruction cache to store a set of instructions for execution;

a plurality of processing resources configured to execute instructions; and

circuitry configured to:

fetch a hardware macro instruction having a predicate mask, a repeat count, and a set of initial operands, wherein the hardware macro instruction is to cause the plurality of processing resources to perform a set of multiply and add operations on input associated with a set of matrices;

atomically execute the set of multiply and add operations via the plurality of processing resources in response to the hardware macro instruction, the set of multiply and add operations executed based on the predicate mask and the repeat count, wherein to atomically execute the set of multiply and add operations includes to execute a first multiply and add operation associated with a first active bit within the predicate mask, bypass execution of a second multiply and add operation for a first inactive bit within the predicate mask, and execute a third multiply and add operation for a second active bit within the predicate mask; and

retire the hardware macro instruction upon completion of the set of multiply and add operations.
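
The behavior recited in claim 1 can be illustrated informally with a small software model. This is a sketch under stated assumptions, not the claimed hardware: the function names `mask_from_operands` and `macro_multiply_add`, the operand layout (one scalar and one row vector per step), and the rule that bit i of the predicate mask governs step i are all hypothetical choices made for illustration. The model expands one macro instruction into `repeat_count` multiply and add steps, executes the steps whose mask bit is active, bypasses the steps whose mask bit is inactive, and returns only after every step has been handled, which loosely mirrors retiring the instruction on completion.

```python
from typing import List, Tuple


def mask_from_operands(scalars: List[float]) -> int:
    """Build a predicate mask with bit i set when scalars[i] is nonzero.

    Assumption for illustration: a zero multiplier marks a step that can be
    skipped without changing the accumulated result.
    """
    mask = 0
    for i, s in enumerate(scalars):
        if s != 0:
            mask |= 1 << i
    return mask


def macro_multiply_add(
    acc: List[float],
    scalars: List[float],
    rows: List[List[float]],
    predicate_mask: int,
    repeat_count: int,
) -> Tuple[List[float], int]:
    """Model one macro instruction as repeat_count multiply and add steps.

    Step i computes acc[j] += scalars[i] * rows[i][j] for every lane j, but
    only when bit i of predicate_mask is active. Returns the final
    accumulator and the number of steps actually executed.
    """
    result = list(acc)  # work on a copy; callers see only the final state
    executed = 0
    for step in range(repeat_count):
        if not (predicate_mask >> step) & 1:
            continue  # inactive bit: bypass this multiply and add operation
        result = [r + scalars[step] * x for r, x in zip(result, rows[step])]
        executed += 1
    return result, executed


if __name__ == "__main__":
    scalars = [2, 0, 3]                 # the middle multiplier is zero
    rows = [[1, 2], [5, 5], [1, 1]]
    mask = mask_from_operands(scalars)  # bits 0 and 2 active, bit 1 inactive
    print(macro_multiply_add([0, 0], scalars, rows, mask, repeat_count=3))
```

With the operands shown, the first and third steps run and the second is bypassed, so the accumulator matches what a dense evaluation of all three steps would produce while performing one fewer multiply and add.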