US 11,699,434 B2
Systems, computer-implemented methods, and computer program products for data sequence validity processing
Daniel da Silva De Paiva, Brighton (GB); Gowri Somayajulu Sripada, Westhill (GB); and Craig Thomson, Balmedie (GB)
Assigned to ARRIA DATA2TEXT LIMITED, Aberdeen (GB)
Filed by ARRIA DATA2TEXT LIMITED, Aberdeen (GB)
Filed on Dec. 4, 2020, as Appl. No. 17/112,670.
Prior Publication US 2022/0180863 A1, Jun. 9, 2022
Int. Cl. G10L 15/183 (2013.01); G06F 16/33 (2019.01); G10L 15/26 (2006.01); G10L 15/06 (2013.01)

CPC G10L 15/183 (2013.01) [G06F 16/3346 (2019.01); G10L 15/063 (2013.01); G10L 15/26 (2013.01)]

18 Claims

1. An apparatus comprising at least one processor and at least one memory, the at least one memory having computer-coded instructions stored thereon, wherein the computer-coded instructions, in execution with the at least one processor, configure the apparatus to:

for each data sequence of a plurality of data sequences, each data sequence comprising a token sequence:

generate, utilizing a language model, a perplexity value set associated with the data sequence, wherein the perplexity value set comprises a perplexity value for each data token in the token sequence of the data sequence, wherein the language model comprises a trained machine learning model configured to generate the perplexity value for each data token; and

generate a probabilistic ranking set for the plurality of data sequences, the probabilistic ranking set including a probabilistic ranking for each data sequence in the plurality of data sequences, and the probabilistic ranking set generated based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence of the plurality of data sequences, wherein generating the probabilistic ranking set comprises:

generating a bucket-based sequence perplexity value set including a bucket-based sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences:

determining an unacceptable bucket token count associated with the data sequence;

determining the bucket-based sequence perplexity values for the data sequence based at least in part on the unacceptable bucket token count associated with the data sequence; and

generating the probabilistic ranking set based at least in part on the bucket-based sequence perplexity value set; and

generate an arrangement of the plurality of data sequences based at least in part on the probabilistic ranking set.
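
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The per-token perplexity values are invented for illustration rather than produced by a trained language model, and the names `UNACCEPTABLE_THRESHOLD`, `unacceptable_bucket_token_count`, `bucket_based_sequence_perplexity`, `rank_sequences`, and `arrange` are hypothetical. The claim does not state how the bucket-based value is derived from the unacceptable bucket token count, so the sketch assumes one plausible choice: the fraction of tokens whose perplexity exceeds a fixed cutoff.

```python
from typing import Dict, List, Sequence

# Hypothetical cutoff: a token whose perplexity exceeds this value is
# treated as falling into the "unacceptable" bucket. The claim does not
# specify any bucket boundaries.
UNACCEPTABLE_THRESHOLD = 100.0


def unacceptable_bucket_token_count(
    perplexities: Sequence[float],
    threshold: float = UNACCEPTABLE_THRESHOLD,
) -> int:
    """Count the tokens whose perplexity lies above the cutoff."""
    return sum(1 for p in perplexities if p > threshold)


def bucket_based_sequence_perplexity(
    perplexities: Sequence[float],
    threshold: float = UNACCEPTABLE_THRESHOLD,
) -> float:
    """Assumed scoring rule: share of tokens in the unacceptable bucket."""
    if not perplexities:
        return 0.0
    return unacceptable_bucket_token_count(perplexities, threshold) / len(perplexities)


def rank_sequences(
    perplexity_sets: Dict[str, Sequence[float]],
    threshold: float = UNACCEPTABLE_THRESHOLD,
) -> Dict[str, int]:
    """Map each sequence to a rank (1 = best, i.e. lowest bucket-based value)."""
    scores = {
        seq: bucket_based_sequence_perplexity(values, threshold)
        for seq, values in perplexity_sets.items()
    }
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(scores, key=lambda seq: (scores[seq], seq))
    return {seq: rank for rank, seq in enumerate(ordered, start=1)}


def arrange(
    perplexity_sets: Dict[str, Sequence[float]],
    threshold: float = UNACCEPTABLE_THRESHOLD,
) -> List[str]:
    """Return the sequences ordered by their rank."""
    ranking = rank_sequences(perplexity_sets, threshold)
    return sorted(ranking, key=ranking.get)


# Invented per-token perplexity value sets for three candidate sequences.
example_sets = {
    "the cat sat on the mat": [12.0, 35.0, 48.0, 20.0, 11.0, 60.0],
    "mat the on sat cat the": [12.0, 240.0, 310.0, 95.0, 180.0, 130.0],
    "the cat sat on mat the": [12.0, 35.0, 48.0, 20.0, 150.0, 90.0],
}

if __name__ == "__main__":
    for position, sequence in enumerate(arrange(example_sets), start=1):
        print(position, sequence)
```

With these invented values, the sequence with no tokens above the cutoff is placed first and the sequence with four such tokens is placed last. A real system would obtain the per-token values from the trained language model recited in the claim, and could combine the bucket-based value with other sequence arrangement metrics.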