US 11,681,825 B2
Digital watermarking without significant information loss in anonymized datasets
Jason McFall, London (GB); and Paul Mellor, Tonbridge (GB)
Assigned to PRIVITAR LIMITED, Cambridge (GB)
Appl. No. 15/780,801
Filed by PRIVITAR LIMITED, Cambridge (GB)
PCT Filed Dec. 1, 2016, PCT No. PCT/GB2016/053776 § 371(c)(1), (2) Date Jun. 1, 2018, PCT Pub. No. WO2017/093736, PCT Pub. Date Jun. 8, 2017.
Claims priority of application No. 1521134 (GB), filed on Dec. 1, 2015.
Prior Publication US 2020/0250338 A1, Aug. 6, 2020
Int. Cl. G06F 7/04 (2006.01); H04N 7/16 (2011.01); G06F 21/62 (2013.01); G06F 21/16 (2013.01); H04L 9/06 (2006.01); H04L 9/08 (2006.01); H04L 9/32 (2006.01)

CPC G06F 21/6254 (2013.01) [G06F 21/16 (2013.01); H04L 9/0643 (2013.01); H04L 9/0869 (2013.01); H04L 9/3213 (2013.01); H04L 9/3239 (2013.01); H04L 2209/42 (2013.01); H04L 2209/56 (2013.01); H04L 2209/608 (2013.01)]

28 Claims

1. A computer-implemented process of altering original data in a dataset, comprising the steps of:

(a) anonymising the original data, in which anonymizing the original data is achieved using a non-hashing algorithm, and in which the original raw data comprises text strings;

(b) including a digital watermark in the anonymised data to generate a watermarked data release, the digital watermark being taken from a source that is extrinsic to the dataset, and in which the digital watermark comprises a text string and is embedded in the anonymized data text strings and not in any metadata or redundant data; and

in which the digital watermark associates the watermarked data release with an audit trail of which user or users are authorised to use that watermarked data release;

in which watermark carriers are at the cell level and depend on whether the cell data type is for numeric values or tokenised values;

in which, where the cell data type is for numeric values, N*M digits of the watermark are used to mutate the N least significant bits of each value, with a precision of M, and for each of the N watermark digits, the digit is divided by 10^M to derive the probability that this bit will be set in one of the values, and values in the cell are then mutated, setting this bit with the required probability; and when reading the file back, the process is to stream through the values and derive the probability of zero for each of the N carrier bits, to a precision of M, to reveal the N*M original digits;

and in which, where the cell data type is for tokenised values, tokenised cell values are generated consistent with some regular expression and analysis of this regular expression gives a lexicographically ordered list of all possible output tokens.
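
The two cell-level carriers recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. All function names (embed_numeric, read_numeric, ordered_tokens) are hypothetical. For the numeric carrier, the sketch assumes integer values, reads each group of M watermark digits as one number divided by 10^M, and maps the i-th group to bit i counting from the least significant bit; on read-back it counts set bits rather than zero bits, which carries the same information. For the tokenised carrier, the claim only states that analysing the regular expression yields a lexicographically ordered list of possible tokens, so the sketch stops at building that list and assumes a simplified pattern given as one character class per position.

```python
import itertools
import random
import re
import string


def embed_numeric(values, watermark, n_bits, precision, rng):
    """Mutate the n_bits least significant bits of each integer value."""
    if len(watermark) != n_bits * precision:
        raise ValueError("watermark must have N*M digits")
    scale = 10 ** precision
    # Assumption: the i-th group of M digits drives carrier bit i (bit 0 = LSB).
    probs = [int(watermark[i * precision:(i + 1) * precision]) / scale
             for i in range(n_bits)]
    marked = []
    for v in values:
        for bit, p in enumerate(probs):
            if rng.random() < p:
                v |= 1 << bit      # set the carrier bit with probability p
            else:
                v &= ~(1 << bit)   # otherwise clear it
        marked.append(v)
    return marked


def read_numeric(values, n_bits, precision):
    """Estimate each carrier bit's set-frequency to M digits of precision."""
    scale = 10 ** precision
    groups = []
    for bit in range(n_bits):
        ones = sum((v >> bit) & 1 for v in values)
        estimate = min(round(ones / len(values) * scale), scale - 1)
        groups.append(str(estimate).zfill(precision))
    return "".join(groups)


def ordered_tokens(char_classes):
    """List every token for a fixed-length pattern in lexicographic order."""
    # Sorted columns make itertools.product emit tuples in lexicographic order.
    columns = [sorted(set(chars)) for chars in char_classes]
    return ["".join(t) for t in itertools.product(*columns)]


# Numeric carrier: N = 3 carrier bits, precision M = 1, watermark digits "427".
rng = random.Random(2016)
original = [rng.randrange(1000, 100000) for _ in range(20000)]
marked = embed_numeric(original, "427", n_bits=3, precision=1, rng=rng)
recovered = read_numeric(marked, n_bits=3, precision=1)

# Tokenised carrier: the regular expression [A-C][0-9]{2} written as classes.
token_regex = re.compile(r"[A-C][0-9]{2}")
tokens = ordered_tokens(["ABC", string.digits, string.digits])
```

With 20,000 values and M = 1, the read-back recovers the digits "427" while no value changes by more than 7, since only the three lowest bits are touched. Because the digits are recovered statistically, each extra digit of precision needs roughly a hundred times more values for the estimate to round correctly. The token list has 300 entries running from "A00" to "C99", each matching the regular expression.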