Question 179

You are building a real-time prediction engine that streams files, which may contain personally identifiable information (PII), into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys.
How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?

  • A. Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
  • B. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
  • C. Scan every table in BigQuery, and mask the data it finds that has PII.
  • D. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.

Correct answer: D.

Reference: https://cloud.google.com/dlp/docs/pseudonymization#supported-methods

Format preserving encryption: An input value is replaced with a value that has been encrypted using the FPE-FFX encryption algorithm with a cryptographic key, and then prepended with a surrogate annotation, if specified. By design, both the character set and the length of the input value are preserved in the output value. Encrypted values can be re-identified using the original cryptographic key and the entire output value, including surrogate annotation.
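
To make this concrete, here is a minimal sketch of such a de-identification request using the Python client (google-cloud-dlp). The project ID, KMS key name, wrapped key, custom alphabet, and the EMAIL_TOKEN surrogate name are all illustrative placeholders, not values from the question; the alphabet must cover every character a matched value can contain.

```python
from google.cloud import dlp_v2

# Placeholders (assumptions): supply your own project and Cloud KMS details.
PROJECT_ID = "my-project"
KMS_KEY_NAME = "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key"
WRAPPED_KEY = b"..."  # an AES key previously wrapped with KMS_KEY_NAME
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@._-"

dlp = dlp_v2.DlpServiceClient()
parent = f"projects/{PROJECT_ID}/locations/global"

# Detect email addresses and replace each match with an FPE-FFX token.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": WRAPPED_KEY,
                                "crypto_key_name": KMS_KEY_NAME,
                            }
                        },
                        "custom_alphabet": ALPHABET,
                        # The surrogate annotation enables re-identification later.
                        "surrogate_info_type": {"name": "EMAIL_TOKEN"},
                    }
                },
            }
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "Order 42 was placed by alice@example.com"},
    }
)

# The same input and key always produce the same token, so the pseudonymized
# emails can still be used as join keys across tables.
print(response.item.value)
```

Because FPE-FFX is deterministic for a given key, the same email always maps to the same token, which is exactly what preserves referential integrity as the data flows from Cloud Storage into BigQuery.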

Why the other options are less suitable:

A (Cryptographic tokens plus a locked-down bucket): Tokenization itself can work, but keeping a non-tokenized copy of the PII in a separate bucket adds operational complexity and leaves the raw data exposed to anyone who gains access to that bucket.

B (Redaction and a locked-down bucket): Redaction removes the sensitive values entirely, which destroys the join keys and breaks referential integrity; it also leaves an unredacted copy that must be separately protected.

C (Scanning and masking in BigQuery): Scanning after load means unmasked PII lands in BigQuery before it is protected, and masking in place is less efficient than pseudonymizing the data in the streaming pipeline before it reaches BigQuery; masking also breaks referential integrity.
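
As the quoted documentation notes, tokens produced this way can be re-identified with the original key. A hedged sketch of the reverse call, reusing the illustrative placeholders and the hypothetical EMAIL_TOKEN surrogate from the example above:

```python
from google.cloud import dlp_v2

# Same illustrative placeholders as in the de-identification sketch.
PROJECT_ID = "my-project"
KMS_KEY_NAME = "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key"
WRAPPED_KEY = b"..."
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@._-"

# Paste the tokenized output of the earlier deidentify_content call here.
tokenized_text = "..."

dlp = dlp_v2.DlpServiceClient()

reidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": WRAPPED_KEY,
                                "crypto_key_name": KMS_KEY_NAME,
                            }
                        },
                        "custom_alphabet": ALPHABET,
                        "surrogate_info_type": {"name": "EMAIL_TOKEN"},
                    }
                }
            }
        ]
    }
}

# The inspector must be told that EMAIL_TOKEN(...) annotations mark surrogates.
inspect_config = {
    "custom_info_types": [
        {"info_type": {"name": "EMAIL_TOKEN"}, "surrogate_type": {}}
    ]
}

response = dlp.reidentify_content(
    request={
        "parent": f"projects/{PROJECT_ID}/locations/global",
        "reidentify_config": reidentify_config,
        "inspect_config": inspect_config,
        "item": {"value": tokenized_text},
    }
)
print(response.item.value)  # the original email, restored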