US 12,395,590 B2
Reduction and geo-spatial distribution of training data for geolocation prediction using machine learning
Sunil Bhat, Plano, TX (US)
Assigned to NetScout Systems Texas, LLC, Westford, MA (US)
Filed by NetScout Systems Texas, LLC, Plano, TX (US)
Filed on Sep. 9, 2021, as Appl. No. 17/470,900.
Prior Publication US 2023/0075690 A1, Mar. 9, 2023
Int. Cl. G06N 20/00 (2019.01); H04M 15/00 (2006.01); H04W 4/029 (2018.01); H04W 64/00 (2009.01)

CPC H04M 15/41 (2013.01) [G06N 20/00 (2019.01); H04W 4/029 (2018.02); H04W 64/006 (2013.01)]

17 Claims

1. A method of limiting an amount of training data for a machine learning (ML) model implemented by a computing system, the method comprising:

receiving first configuration parameters including a grid box dimension X and a maximum number of entries per grid box N, wherein X>0;

defining grids, each of the grids having multiple grid boxes and covering a corresponding geographic area defined by a cell list of a communication network, each of the grids corresponding to a different cell list, wherein each grid box of a grid covers a different portion of the corresponding geographic area;

receiving call records from a control plane in association with user equipment (UE) events for communication by user equipment via the communication network;

selecting truth call records from the call records received that include truth data, wherein the truth data includes reported geolocation (GL) data that indicates a GL at which the call record was generated;

for each truth call record, determining a grid box of the multiple grid boxes covering a geographic area that includes the GL indicated by the GL data included in the truth call record;

calculating a density metric for each grid box, which quantitatively evaluates a concentration of truth call records assigned to that grid box relative to its geographic area, thereby enabling efficient data management;

selectively assigning the respective truth call records to the grid box determined for the truth call record based on the density metric and in a fashion to not exceed the maximum number of entries per grid box N, wherein the selective assignment control mechanism prevents disproportionate truth data from clustered UEs that could introduce bias;

updating the assignment of truth call records as environment or cell locations change to maintain accurate geographic distribution using re-evaluation algorithms that dynamically respond to environmental changes; and

outputting as training data for training the ML model the truth data and signal detail data for only the truth call records that are assigned to any of the grid boxes of the multiple grids;

training a machine learning model using the output truth data and signal detail data to generate a trained model configured to predict geolocation of user equipment based on call records that do not include truth data, wherein the trained model provides improved prediction accuracy through decreased bias and enhanced data processing;

wherein selectively assigning the truth call records leverages optimization techniques to reduce processing and storage requirements and balance geographic distribution of training data;

receiving second configuration parameters including a division parameter Ng and a factor ƒ;

when defining the grids:

determining whether a size of a particular geographic area covered by a particular common area defined by one of the cell lists exceeds a threshold; and

when determined that the particular geographical area exceeds the threshold:

defining a second grid that covers the particular geographic area;

dividing the second grid using the division parameter into multiple second grid boxes, wherein the amount of second grid boxes is determined by the division parameter and the second grid boxes have a dimension X1 that is larger than the grid box dimension X, wherein X1>X and a value of X1 is obtained based on a size of the particular geographic area and the division parameter;

for each truth call record having a cell list that defines the particular geographic area:

determining a second grid box of the multiple second grid boxes covering a geographic area that includes the GL indicated by the GL data included in the truth call record; and

instead of selectively assigning the respective truth call records to the grid box determined, selectively assigning the truth call record to the second grid box determined in a fashion to not exceed a new maximum number N1, wherein the new maximum number N1 is a function of the factor ƒ, and

outputting as training data for training the ML model the truth data and signal detail data for only the truth call records that are assigned to any of the second grid boxes.
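
For illustration only, the grid-box lookup, the per-box cap, and the fallback to a coarser second grid recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: locations are assumed to be already projected to planar metres, the "size" of an area is taken to be its side length, records are admitted first-come until a box is full, and the formulas X1 = side / Ng and N1 = round(f * N) are hypothetical choices, since the claim states only that X1 depends on the area size and the division parameter and that N1 is a function of the factor. All names (Grid, make_grid, select_training_records) are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Grid:
    """Square-box grid over the area of one cell list (planar metres, assumed)."""
    x0: float          # west edge of the covered area
    y0: float          # south edge of the covered area
    box_dim: float     # grid box dimension: X, or X1 for a second grid
    max_entries: int   # cap per grid box: N, or N1 for a second grid
    boxes: dict = field(default_factory=dict)

    def box_for(self, x, y):
        """Return the (col, row) index of the grid box containing (x, y)."""
        return (int((x - self.x0) // self.box_dim),
                int((y - self.y0) // self.box_dim))

    def density(self, key):
        """Assumed density metric: assigned records per square metre of the box."""
        return len(self.boxes.get(key, [])) / (self.box_dim ** 2)

    def try_assign(self, record):
        """Assign the record to its grid box unless the box is already full."""
        box = self.boxes.setdefault(self.box_for(record["x"], record["y"]), [])
        if len(box) >= self.max_entries:
            return False  # cap reached: the record is left out of the training data
        box.append(record)
        return True


def make_grid(x0, y0, side, X, N, threshold, Ng, f):
    """Build the grid for an area with the given side length.

    Assumption: an area whose side exceeds the threshold gets a second grid
    with Ng boxes per side (X1 = side / Ng) and a cap of N1 = round(f * N).
    """
    if side > threshold:
        X1 = side / Ng
        if X1 > X:  # the claim requires X1 > X
            return Grid(x0, y0, X1, max(1, round(f * N)))
    return Grid(x0, y0, X, N)


def select_training_records(grid, truth_records):
    """Keep only the truth call records that were assigned to some grid box."""
    return [r for r in truth_records if grid.try_assign(r)]


# Usage: three records fall in the same 100 m box, but N = 2 admits only two.
grid = make_grid(0, 0, side=1000, X=100, N=2, threshold=5000, Ng=10, f=3)
records = [{"x": 10, "y": 10}, {"x": 20, "y": 30},
           {"x": 90, "y": 99}, {"x": 150, "y": 10}]
training = select_training_records(grid, records)
```

Admitting records first-come keeps the sketch short. A selection driven by the density metric, as the claim recites, would replace the simple length check in try_assign with whatever policy the implementer chooses.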