US 11,989,888 B2
Image sensor with integrated efficient multiresolution hierarchical deep neural network (DNN)
Kevin Chan, San Jose, CA (US); Ping Wah Wong, Sunnyvale, CA (US); and Sa Xiao, Valkenswaard (NL)
Assigned to Sony Semiconductor Solutions Corporation, Kanagawa (JP)
Filed by Sony Semiconductor Solutions Corporation, Kanagawa (JP)
Filed on Aug. 4, 2021, as Appl. No. 17/393,652.
Prior Publication US 2023/0039592 A1, Feb. 9, 2023
Int. Cl. G06T 7/207 (2017.01); G06F 3/01 (2006.01); G06F 18/214 (2023.01); G06N 3/045 (2023.01); G06T 7/246 (2017.01)

CPC G06T 7/207 (2017.01) [G06F 3/017 (2013.01); G06N 3/045 (2023.01); G06T 7/246 (2017.01); G06F 18/214 (2023.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)]

21 Claims

1. A stacked image sensor comprising:

a pixel array layer configured to capture an image and transfer image data of the captured image; and

a logic and deep neural network (DNN) layer, the logic and DNN layer including a first DNN that is a preliminary DNN, and a second DNN that is different than the first DNN, wherein

the logic and DNN layer is configured to:

receive the image data of the captured image directly from the pixel array layer;

process first image data using the first DNN to determine whether the first image data includes a predetermined object of one or more predetermined objects and to produce first output data; and

based on determining that the first image data includes the predetermined object:

process second image data in combination with the first output data using the second DNN to produce second output data; and

output the second image data in combination with the second output data to a communication bus of an electronic device, wherein

the first image data includes or is decomposed from the received image data of the captured image, and the second image data is decomposed from the received image data of the captured image or includes image data of another captured image that is different than the captured image, wherein

the first DNN is configured to detect or predict motion between frames and the second DNN is configured to recognize a gesture in the frames where motion was detected, and

the logic and DNN layer is configured to execute the second DNN, which is the gesture recognition DNN, only when the number of image regions where motion is positively detected is above a first predefined threshold and/or below a second predefined threshold that is different than the first predefined threshold.
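
The gating behavior recited in claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names (`count_motion_regions`, `should_run_gesture_dnn`, `process_frame`), the per-region boolean motion flags, and the stub DNN callables are all hypothetical. The sketch also assumes the conjunctive reading of "above a first predefined threshold and/or below a second predefined threshold", in which the gesture recognition DNN runs only when the motion-region count lies strictly between the two thresholds.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def count_motion_regions(motion_flags: Sequence[bool]) -> int:
    """Count the image regions in which the first DNN positively detected motion."""
    return sum(1 for flag in motion_flags if flag)


def should_run_gesture_dnn(n_motion: int, lower: int, upper: int) -> bool:
    """Assumed conjunctive gate: run only if lower < n_motion < upper."""
    return lower < n_motion < upper


def process_frame(
    frame: Any,
    motion_dnn: Callable[[Any], Sequence[bool]],
    gesture_dnn: Callable[[Any, Sequence[bool]], Any],
    lower: int,
    upper: int,
) -> Optional[Tuple[Any, Any]]:
    """Run the preliminary DNN, then the second DNN only when the gate passes."""
    # First DNN (hypothetical stub): per-region motion flags, the "first output data".
    motion_flags = motion_dnn(frame)
    n_motion = count_motion_regions(motion_flags)

    # Skip the more expensive second DNN when the gate is not satisfied.
    if not should_run_gesture_dnn(n_motion, lower, upper):
        return None

    # Second DNN consumes the image data together with the first output data.
    gesture = gesture_dnn(frame, motion_flags)

    # The image data is output in combination with the second output data.
    return frame, gesture


if __name__ == "__main__":
    # Toy stand-ins for the two networks; a real sensor would run trained models.
    toy_motion_dnn = lambda f: [value > 0 for value in f]
    toy_gesture_dnn = lambda f, flags: "wave"

    print(process_frame([0, 1, 1, 1, 0], toy_motion_dnn, toy_gesture_dnn, 2, 5))
    print(process_frame([0, 0, 0, 0, 0], toy_motion_dnn, toy_gesture_dnn, 2, 5))
```

Under this reading, too few motion regions (for example, sensor noise) and too many (for example, a global scene change) both suppress the gesture recognition DNN. A disjunctive reading of "and/or" would change only the comparison in `should_run_gesture_dnn`.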