US 12,293,174 B1
Method and system for memory management within machine learning inference engine
Nikhil Bernard John Stephen, Sunnyvale, CA (US); Senad Durakovic, Palo Alto, CA (US); Chien-Chun Chou, Morgan Hill, CA (US); Pranav Jonnalagadda, San Jose, CA (US); and Ulf Hanebutte, Gig Harbor, WA (US)
Assigned to Marvell Asia Pte Ltd, Singapore (SG)
Filed by Marvell Asia Pte Ltd, Singapore (SG)
Filed on Jul. 26, 2023, as Appl. No. 18/226,719.
Claims priority of provisional application 63/467,915, filed on May 19, 2023.
Int. Cl. G06F 8/41 (2018.01)
CPC G06F 8/453 (2013.01) [G06F 8/443 (2013.01); G06F 8/451 (2013.01)] 58 Claims
OG exemplary drawing
 
1. A computer implemented method, comprising:
receiving a machine learning (ML) network model comprising a plurality of ML operations in high-level code;
partitioning the ML network model into a plurality of sub-graphs;
generating an internal representation (IR) for each sub-graph of the plurality of sub-graphs, wherein the IR is mapped to one or more components in a multi-processing tile device;
identifying two or more processing tiles of the multi-processing tile device having a same dimension for an input tensor data as one another and performing a same primitive function based on the IR;
determining whether the two or more processing tiles of the multi-processing tile device have a same dimension for their respective output tensor data for the same primitive function based on the IR;
responsive to determining that the two or more processing tiles have the same dimension for their respective output tensor data for the same primitive function, allocating a same memory address range within a respective on-chip memory (OCM) of the two or more processing tiles for the same primitive function;
linking the memory address range within the respective OCM of the two or more processing tiles to one another to form a grouped memory space within the respective OCM of the two or more processing tiles; and
compiling each sub-graph of the plurality of sub-graphs based on the linking.
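The allocation and linking steps recited above can be sketched conceptually as follows. This is an illustrative sketch only, not the patented implementation: tiles running the same primitive function with identical input and output tensor dimensions receive the same address range within their respective OCMs, and those ranges are recorded as one grouped memory space. All names (`group_allocate`, `tensor_bytes`, the tuple-based assignment records) are hypothetical, as is the 2-byte element size.

```python
import math
from collections import defaultdict

def tensor_bytes(dims, elem_size=2):
    # Hypothetical size model: 16-bit elements, dense layout.
    return math.prod(dims) * elem_size

def group_allocate(assignments, base_addr=0):
    """assignments: list of (tile_id, op_name, in_dims, out_dims) tuples,
    one per primitive function mapped to a processing tile by the IR.

    Returns (alloc, groups):
      alloc  -- {tile_id: {op_name: (start, end)}}, the address range
                allocated within that tile's own OCM
      groups -- [(op_name, [tile_ids], (start, end))] for each set of two
                or more tiles whose ranges were linked into one grouped
                memory space
    """
    # Tiles whose primitive function, input dims, and output dims all
    # match share one allocation signature.
    by_sig = defaultdict(list)
    for tile_id, op_name, in_dims, out_dims in assignments:
        by_sig[(op_name, in_dims, out_dims)].append(tile_id)

    alloc = defaultdict(dict)
    groups = []
    addr = base_addr
    for (op_name, in_dims, out_dims), tile_ids in by_sig.items():
        size = tensor_bytes(out_dims)
        rng = (addr, addr + size)
        # Same address range within each matching tile's respective OCM.
        for t in tile_ids:
            alloc[t][op_name] = rng
        # Link ranges across two or more tiles into a grouped memory space.
        if len(tile_ids) >= 2:
            groups.append((op_name, sorted(tile_ids), rng))
        addr += size
    return dict(alloc), groups
```

For example, two tiles running the same `matmul` with identical tensor dimensions would receive the identical `(start, end)` range in their own OCMs and appear together in one group, while a tile running a different primitive would be allocated a disjoint range.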