US 11,681,825 B2
Digital watermarking without significant information loss in anonymized datasets
Jason McFall, London (GB); and Paul Mellor, Tonbridge (GB)
Assigned to PRIVITAR LIMITED, Cambridge (GB)
Appl. No. 15/780,801
Filed by PRIVITAR LIMITED, Cambridge (GB)
PCT Filed Dec. 1, 2016, PCT No. PCT/GB2016/053776
§ 371(c)(1), (2) Date Jun. 1, 2018,
PCT Pub. No. WO2017/093736, PCT Pub. Date Jun. 8, 2017.
Claims priority of application No. 1521134 (GB), filed on Dec. 1, 2015.
Prior Publication US 2020/0250338 A1, Aug. 6, 2020
Int. Cl. G06F 7/04 (2006.01); H04N 7/16 (2011.01); G06F 21/62 (2013.01); G06F 21/16 (2013.01); H04L 9/06 (2006.01); H04L 9/08 (2006.01); H04L 9/32 (2006.01)
CPC G06F 21/6254 (2013.01) [G06F 21/16 (2013.01); H04L 9/0643 (2013.01); H04L 9/0869 (2013.01); H04L 9/3213 (2013.01); H04L 9/3239 (2013.01); H04L 2209/42 (2013.01); H04L 2209/56 (2013.01); H04L 2209/608 (2013.01)] 28 Claims
 
1. A computer-implemented process of altering original data in a dataset, comprising the steps of:
(a) anonymising the original data, in which anonymizing the original data is achieved using a non-hashing algorithm, and in which the original raw data comprises text strings;
(b) including a digital watermark in the anonymised data to generate a watermarked data release, the digital watermark being taken from a source that is extrinsic to the dataset, and in which the digital watermark comprises a text string and is embedded in the anonymized data text strings and not in any metadata or redundant data; and
(c) providing the watermarked data release;
and in which the digital watermark associates the watermarked data release with an audit trail of which user or users are authorised to use that watermarked data release;
in which watermark carriers are at the cell level and depend on whether the cell data type is for numeric values or tokenised values;
in which, where the cell data type is for numeric values, N*M digits of the watermark are used to mutate the N least significant bits of each value, with a precision of M, and for each of the N watermark digits, the digit is divided by 10^M to derive the probability that this bit will be set in one of the values, and values in the cell are then mutated, setting this bit with the required probability; and when reading the file back, the process is to stream through the values and derive the probability of zero for each of the N carrier bits, to a precision of M, to reveal the N*M original digits;
and in which, where the cell data type is for tokenised values, tokenised cell values are generated consistent with some regular expression and analysis of this regular expression gives a lexicographically ordered list of all possible output tokens.
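
Illustrative sketch (not part of the claim). The numeric-carrier and tokenised-carrier steps recited above can be sketched in Python under several assumptions that the claim does not state: the values are integers, each group of M watermark digits maps to one carrier bit starting from the least significant bit, and the regular expression is restricted to a fixed-length sequence of character classes. The function names `embed_numeric`, `extract_numeric` and `ordered_tokens`, and the `seed` parameter, are hypothetical and chosen only for illustration.

```python
import itertools
import random


def embed_numeric(values, digits, n_bits, precision, seed=0):
    """Set carrier bit i of each value with probability group_i / 10**precision.

    `digits` is a string of n_bits * precision decimal digits; group i
    (precision digits wide) drives bit i, counting from the least
    significant bit. That mapping is an assumption of this sketch.
    """
    assert len(digits) == n_bits * precision
    rng = random.Random(seed)
    probs = [
        int(digits[i * precision:(i + 1) * precision]) / 10 ** precision
        for i in range(n_bits)
    ]
    marked = []
    for v in values:
        for bit, p in enumerate(probs):
            mask = 1 << bit
            # Overwrite the carrier bit: set it with probability p, else clear it.
            v = (v | mask) if rng.random() < p else (v & ~mask)
        marked.append(v)
    return marked


def extract_numeric(values, n_bits, precision):
    """Stream through the values and recover the digits from bit frequencies."""
    total = len(values)
    groups = []
    for bit in range(n_bits):
        zeros = sum(1 for v in values if not (v >> bit) & 1)
        p_set = 1 - zeros / total  # probability of zero, converted back
        group = min(round(p_set * 10 ** precision), 10 ** precision - 1)
        groups.append(f"{group:0{precision}d}")
    return "".join(groups)


def ordered_tokens(char_classes):
    """List every token of a fixed-length pattern in lexicographic order.

    `char_classes` holds one string of allowed characters per position,
    e.g. ["AB", "01"] stands in for the regular expression [AB][01].
    """
    pools = [sorted(set(c)) for c in char_classes]
    return ["".join(t) for t in itertools.product(*pools)]
```

Because the digits are recovered from observed bit frequencies, recovery in this sketch is statistical: the number of values must be large relative to 10^M for the rounded frequency to land on the embedded digit group. Only the N carrier bits are overwritten, so each value changes by less than 2^N.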