PII
find & fix – TagManager

RESOURCES

  • MORE INFO
  • https://clickety-clack.click/pii-removal-from-analytics/
  • ARTICLES + TOOLS
  • NUKE CURRENT: https://www.practicalecommerce.com/removing-personal-data-from-google-analytic
  • https://www.simoahava.com/gtm-tips/remove-pii-google-analytics-hits/
  • NOT IN GA4: https://brianclifton.com/blog/2017/09/07/remove-pii-from-google-analytics/#comment-160455
  • DATA STUDIO REPORT: https://datastudio.google.com/u/0/reporting/1MI0l7m79xrEo6HnSiVtvsX9yx_q9fE_w/page/Reo0
    • CURRENT JS CODE
    • https://brianclifton.com/blog/2017/09/07/remove-pii-from-google-analytics/

How to Find and Fix PII in Google Analytics Data

Find PII in analytics



what to do with current pii + nuclear option


REDACT PII in Analytics

https://brianclifton.com/blog/2017/09/07/remove-pii-from-google-analytics/
This is my PII extension to the initial post by the excellent Simo Ahava (his post: Remove PII From Google Analytics Hits).

Essentially, I had been looking for a way to block Personally Identifiable Information (PII) hits at the collection level i.e. using GTM, before the hit is sent to Google Analytics.

Why do this?

Putting the obvious requirement to not gather personal data to one side, if you are adding filters to your analytics views to delete PII, it is simply too late – the problem has already occurred and GDPR compliance has been broken! See my related post on why filters are not sufficient.

Previously, by using GTM I would simply drop any hits containing page URLs with an @ symbol i.e. in case the URL contained an email address. Apart from being quite blunt (not all URLs with an @ symbol contain an email address), this approach would not tackle email addresses being present in other hit types e.g. events, e-commerce data etc. It also did not tackle other PII types – such as telephone numbers, zip codes, usernames etc. Hence, the much better approach of Simo’s method – using GTM’s new customTask feature – was very interesting to me!

In this post, I extend his method by building out the regex more – for a more sophisticated email detection, and to capture other PII types…

Redact, rather than remove PII

The important thing to remember is that we are redacting the PII, not blocking or removing it. This is an important distinction. If PII is present, it is almost certain that the same PII is being logged elsewhere on your network, in your web server logfile at the very least. Reporting it in Google Analytics in redacted form gives you a monitoring system that flags the issue to your web dev/IT team, so they can fix it and keep on top of it. Essentially, to be compliant, PII issues need to be fixed at their source by your organisation. Alternatively, if you deleted the PII data from your reports or simply stopped collecting it in GA, you would metaphorically be sweeping the problem under the carpet.

Here is my adjusted code for your Custom JavaScript variable.

IMPORTANT: This is a straight replacement to Simo’s code. Replace example\.com with the domain of your website (lines 7 and 11). More on what this is for later. Thank you to the excellent David Vallejo for his JavaScript help – my skills are simply too rusty nowadays! As always, when working with code it’s up to you to test it and ensure it works correctly. No liability accepted!

UPDATE: This code was rewritten 29-Aug-2018 for better handling of the GA hit. In particular, it now works with GTM’s native YouTube trigger.  Simply swap out the original code for this new one.

function() {
  return function(model) {
    try{
      // Add the PII patterns into this array as objects
      var piiRegex = [{
        name: 'EMAIL',
        regex: /[^\/]{4}(@|%40)(?!example\.com)[^\/]{4}/gi,
        group: ''
      },{
        name: 'SELF-EMAIL',
        regex: /[^\/]{4}(@|%40)(?=example\.com)[^\/]{4}/gi,
        group: ''
      },{
        name: 'TEL',
        regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi,
        group: '$1'
      },{
        name: 'NAME',
        regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi,
        group: '$1'     
      },{
        name: 'PASSWORD',
        regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi,
        group: '$1'
      },{
        name: 'ZIP',
        regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi,
        group: '$1'
      }

    ];        
      // Fetch reference to the original sendHitTask
      var originalSendTask = model.get('sendHitTask');
      var i, hitPayload, data, val;


      model.set('sendHitTask', function(sendModel) {
          // Read the payload from the model handed to this task
          hitPayload = sendModel.get('hitPayload');
          // Convert the current query string into a key/value object
          data = {};
          hitPayload.replace(/^\?/, '').split('&').forEach(function(pair) {
            var parts = pair.split('=');
            data[parts[0]] = parts[1];
          });
          // Loop through all keys and values
          for (var key in data) {
              piiRegex.forEach(function(pii) {
                // Decode the value before matching it against the regexes
                var val = decodeURIComponent(data[key]);
                // Does the value match?
                if (val.match(pii.regex)) {
                  // Redact the match, keep any anchor group, and re-encode the value
                  data[key] = encodeURIComponent(val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']'));
                }
              });
          }
          // Convert the data object back into a query string
          sendModel.set('hitPayload', Object.keys(data).map(function(key) { return key + '=' + data[key]; }).join('&'), true);
          // Send the redacted hit
          originalSendTask(sendModel);
      });
    }catch(e){}
  };
}
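Since the code needs testing before it goes live, one way to exercise a customTask outside the browser is with a stand-in for the analytics.js model. The sketch below is only a test harness under assumptions: makeMockModel, the sample payload, and the cut-down piiTask (a single regex) are hypothetical, and only get, set, sendHitTask and hitPayload are taken from the code above.

```javascript
// Hypothetical stand-in for the analytics.js model: only get() and set().
function makeMockModel(fields) {
  return {
    get: function(name) { return fields[name]; },
    set: function(name, value) { fields[name] = value; }
  };
}

// Cut-down customTask with a single pattern, for illustration only.
function piiTask(model) {
  var emailRegex = /[^\/]{4}(@|%40)(?!example\.com)[^\/]{4}/gi;
  var originalSendTask = model.get('sendHitTask');
  model.set('sendHitTask', function(sendModel) {
    var redacted = sendModel.get('hitPayload').split('&').map(function(pair) {
      var parts = pair.split('=');
      var val = decodeURIComponent(parts[1]);
      parts[1] = encodeURIComponent(val.replace(emailRegex, '[REDACTED EMAIL]'));
      return parts.join('=');
    }).join('&');
    sendModel.set('hitPayload', redacted, true);
    originalSendTask(sendModel);
  });
}

// Record what would have been sent instead of making a network request.
var sent = [];
var model = makeMockModel({
  hitPayload: 'v=1&t=pageview&dp=' + encodeURIComponent('/thanks?email=brian@me.com'),
  sendHitTask: function(m) { sent.push(m.get('hitPayload')); }
});

piiTask(model);                   // analytics.js would run customTask first
model.get('sendHitTask')(model);  // then the (now wrapped) sendHitTask
console.log(decodeURIComponent(sent[0]));
```

The recorded payload can then be compared with what you expect to see in your reports, without sending anything to Google Analytics.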

Edit Your Tags

In order to function as intended, the customTask field needs to be added to ALL Google Analytics tags. That of course is cumbersome and does not scale with the volume of tags used. Therefore it is much better to apply this as a one-time fix in a Google Analytics settings variable. You can read more about the power of the Universal Analytics settings variable approach from Simo.

Now any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type]. For example, a URL with path:

/test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello

would be replaced with:

/test?tel=[REDACTED TEL]&email=b[REDACTED EMAIL]om&other=bcli[REDACTED SELF-EMAIL]IN.com&firstName=[REDACTED NAME]&password=[REDACTED PASSWORD]
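A result like this can be checked by running the patterns over the sample path on their own. The sketch below assumes a hypothetical helper called redact, and that the site's own domain is domain.com (so example\.com in the patterns is swapped for domain\.com). It exercises the regexes only, not the full customTask.

```javascript
var piiRegex = [
  { name: 'EMAIL', regex: /[^\/]{4}(@|%40)(?!domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'SELF-EMAIL', regex: /[^\/]{4}(@|%40)(?=domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'TEL', regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi, group: '$1' },
  { name: 'NAME', regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi, group: '$1' },
  { name: 'PASSWORD', regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi, group: '$1' }
];

// Hypothetical helper: apply every pattern in turn to one decoded value.
function redact(value) {
  piiRegex.forEach(function(pii) {
    value = value.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
  });
  return value;
}

var path = '/test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello';
var redactedPath = redact(path);
console.log(redactedPath);
```

With the patterns as listed, the label is whatever the name property holds (so a telephone match reads [REDACTED TEL]), and each email pattern consumes four characters on either side of the @.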

The Regex Changes Explained

-Extending the Email regex

For the EMAIL check, I make two changes to Simo’s original regex:

regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi,

Firstly, this matches four characters that are not a forward slash, followed by @. Then, so long as the @ is not followed by domain.com, it matches the next four characters that are not a forward slash. (The full code above also accepts %40, the URL-encoded form of @.)

So apart from looking for an email address, I am doing two extra things:

1. I exclude any “innocent” links that may be captured as outbound links containing an @. Common examples are Google Maps and Flickr links, which contain a forward slash – the [^\/] part. Example links:

  • www.google.com/maps/place/University+of+San+Francisco+-+Folger+Bldg,+101+Howard+St,+San+Francisco,+CA+94105/@37.7908871,-122.3925594,17z/data=!3m1!
  • www.flickr.com/photos/123456@N06/sets/721576344/

2. I exclude the domain of the website itself from this check using a negative lookahead, the (?!….) part. Remember to replace domain\.com (written as example\.com in the full code above) with your own domain e.g. brianclifton\.com in my case. I match for this separately next.
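Both exclusions can be checked against sample links. This is a small sketch, assuming a hypothetical helper looksLikeEmail and domain.com as the site's own domain. The Google Maps link is shortened, and the last two strings are made up for illustration.

```javascript
// Non-global copy of the EMAIL pattern, so test() keeps no state between calls.
var emailPattern = /[^\/]{4}@(?!domain\.com)[^\/]{4}/i;

function looksLikeEmail(value) {
  return emailPattern.test(value);
}

var samples = [
  'www.google.com/maps/place/University+of+San+Francisco/@37.7908871,-122.3925594,17z',
  'www.flickr.com/photos/123456@N06/sets/721576344/',
  '/signup?email=simo@hissite.com',
  '/contact?to=mysite@domain.com'
];

var results = samples.map(looksLikeEmail);
samples.forEach(function(s, i) {
  console.log(results[i], s);
});
```

Only the third sample is flagged. The Maps link has a slash directly before the @, the Flickr link has a slash within four characters after it, and the last sample is skipped by the negative lookahead.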

My suggestion for a separate regex is to catch and redact any payloads containing the SAME email domain as the site itself, with a different “name” value to the regular email redaction. That way such emails will be reported differently in Google Analytics, allowing the site owner to ignore these and monitor real PII infringements.

For example:

  • If a visitor comes to my site and I capture their email address, e.g. simo@hissite.com, the redaction message is [REDACTED EMAIL]
  • If a visitor comes to my site and I capture my own email address, for example from an outbound click-through to the site owner such as mysite@brianclifton.com, the redaction message is [REDACTED SELF-EMAIL]

As the site owner, the first message is the one I should be paying attention to. The second message (not really PII as it belongs to the site owner) keeps me compliant with Google’s terms of service.

For the SELF-EMAIL check, the regex is almost identical:

regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi,

The difference now is that I do wish to include my own domain in the match and this is achieved via a positive look ahead – the (?=….) part.
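The two lookaheads split email addresses into two labels. A minimal sketch, assuming brianclifton.com is the site's own domain and using a hypothetical helper redactEmails:

```javascript
var emailRegex = /[^\/]{4}@(?!brianclifton\.com)[^\/]{4}/gi;
var selfEmailRegex = /[^\/]{4}@(?=brianclifton\.com)[^\/]{4}/gi;

// Hypothetical helper: run the negative-lookahead pattern, then the positive one.
function redactEmails(value) {
  return value
    .replace(emailRegex, '[REDACTED EMAIL]')
    .replace(selfEmailRegex, '[REDACTED SELF-EMAIL]');
}

var visitor = redactEmails('simo@hissite.com');
var owner = redactEmails('mysite@brianclifton.com');
console.log(visitor); // [REDACTED EMAIL]ite.com
console.log(owner);   // my[REDACTED SELF-EMAIL]nclifton.com
```

Because a lookahead consumes nothing, both patterns still swallow the four characters after the @. Only the label differs.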

-Extending the regex to capture other PII

The original post by Simo was a simple pattern match – easy to use and maintain when you know the structure of the match you are looking for e.g. an @ symbol to match email addresses, or a well structured set of characters and numbers for strings like personal ID and social security numbers. However, I want to extend this to match less structured PII, for example people’s names, addresses, telephone numbers, zip codes etc.

To do this, we need a regex anchor. That is, a common string likely to contain such PII. I am assuming all such matches are contained within URL strings as query parameters (though name=value pairs in the URL path are also matched) e.g.

/test?tel=+46(0)12398765&firstname=Brian&zip=abc123

The anchor is the query name and we match for common PII culprits – these are tel, firstname and zip in my example. Of course these should be adjusted for your particular language. Anchors are the reason why the group key is required:

name: 'ZIP',
regex: /((postcode=)|(zipcode=)|(zip=))[^\/\?&]+/gi,
group: '$1'

In this case, $1 is the value of the string (our anchor) just before and including the = sign. We keep this in place for the data hit, and redact what follows. Without applying the grouping, the entire name=value pair would be redacted making troubleshooting difficult. I use [^&\/\?] in order to conclude the match within paths, or query parameters…
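The effect of the group key can be seen by running the same replacement with and without $1. This is a sketch using the ZIP pattern and the sample path from above. The variable names and the /zip=abc123/page path are made up for illustration.

```javascript
var zipRegex = /((postcode=)|(zipcode=)|(zip=))[^\/\?&]+/gi;
var url = '/test?tel=+46(0)12398765&firstname=Brian&zip=abc123';

// With the group: the anchor (zip=) stays in place, only the value is redacted.
var withGroup = url.replace(zipRegex, '$1' + '[REDACTED ZIP]');

// Without the group: the whole name=value pair is replaced.
var withoutGroup = url.replace(zipRegex, '' + '[REDACTED ZIP]');

// name=value pairs inside the path are matched too; the match stops at the next slash.
var inPath = '/zip=abc123/page'.replace(zipRegex, '$1' + '[REDACTED ZIP]');

console.log(withGroup);
console.log(withoutGroup);
console.log(inPath);
```

Keeping the anchor means a report row still shows which parameter leaked, which is what makes the redacted data useful for troubleshooting.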

Happy compliance testing 

BTW, you do know I am building a data auditing and compliance tool to measure and monitor Google Analytics data quality, right?



IMPORTANT NOTE IN COMMENTS

  1. https://brianclifton.com/blog/2017/09/07/remove-pii-from-google-analytics/#comment-160455
    May 4, 2021 at 6:55 am
    Hi Brian,
    How do I do it for GA4 property? Should I just replace the page_location and page_referrer in the “Fields to Set” section in the GTM configuration tag?


    • http://www.advanced-web-metrics.com/ – Brian Clifton
      May 4, 2021 at 8:05 am
      Hello Prabhu – note that customTask is not available in GA4 and I suspect is unlikely to ever be available. Essentially, the customTask method was an unsupported and undocumented feature of Universal Analytics – it was a hack, albeit a very powerful one.
       



simoahava.com – remove PII (original article)

https://www.simoahava.com/gtm-tips/remove-pii-google-analytics-hits/

Sending personally identifiable information (PII) to Google Analytics is one of the things you should really avoid doing. For one, it’s against the terms of service of the platform, but also you will most likely be in violation of national, federal, or EU legislation drafted to protect the privacy of individuals online.

In this #GTMTips post, I’ll show you a way to make sure that any tags you configure this solution with will not contain strings that might be construed as PII. The tip is for Google Tag Manager, but with very few modifications it will work with Universal Analytics, too.

(UPDATE 8 September 2017: Check out Brian Clifton’s great extension of this solution: Remove PII from Google Analytics)

TIP 64: REMOVE PII FROM HITS TO GOOGLE ANALYTICS

The solution hinges around customTask, which has fast become my favorite new feature in the analytics.js library. See the following articles to understand why I think so:

Anyway, to make the whole thing run, create the following Custom JavaScript variable:

function() {
  return function(model) {
    // Add the PII patterns into this array as objects
    var piiRegex = [{
      name: 'EMAIL',
      regex: /.{4}@.{4}/g
    },{
      name: 'HETU',
      regex: /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi
    }];
    
    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';
    
    // Fetch reference to the original sendHitTask
    var originalSendTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');
  
    var i, hitPayload, parts, val;
    
    // Overwrite sendHitTask with PII purger
    model.set('sendHitTask', function(sendModel) {
      hitPayload = sendModel.get('hitPayload').split('&');
      for (i = 0; i < hitPayload.length; i++) {
        parts = hitPayload[i].split('=');
        // Double-decode, to account for web server encode + analytics.js encode
        try {
          val = decodeURIComponent(decodeURIComponent(parts[1]));
        } catch(e) {
          val = decodeURIComponent(parts[1]);
        }
        piiRegex.forEach(function(pii) {
          val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
        });
        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }
      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}

Once you add this variable to your Universal Analytics tags as the customTask field, any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type].
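Outside GTM, with analytics.js on the page, the same task can be registered on the tracker with the set command. The sketch below is an assumption about how you might wire it up: registerPiiTask and gaStub are hypothetical, UA-XXXXX-Y is a placeholder tracking ID, and the ga queue is passed in so the example can run against a stub.

```javascript
// Hypothetical helper: `ga` is the analytics.js command queue.
function registerPiiTask(ga, trackingId, piiTask) {
  ga('create', trackingId, 'auto');
  // customTask has to be set before the hit is sent
  ga('set', 'customTask', piiTask);
  ga('send', 'pageview');
}

// Stub command queue that only records the calls it receives.
var calls = [];
function gaStub() {
  calls.push(Array.prototype.slice.call(arguments));
}

registerPiiTask(gaStub, 'UA-XXXXX-Y', function(model) {
  // The PII purger from the variable above would go here.
});

console.log(calls.map(function(c) { return c[0] + ' ' + c[1]; }));
```

Note that the GTM variable is an outer function that returns the task. On the page you would pass the inner function(model) { ... } directly.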

At the beginning of the code snippet, you’ll see the configuration object piiRegex. It’s an array of object literals, where each object has two properties: name and regex. The first is what will be used in the replace string after “REDACTED”. So if name is “EMAIL”, you’ll see “[REDACTED EMAIL]” in your Google Analytics reports wherever PII data was removed.

The second parameter, regex, is where you’ll add the regular expression literal for whatever PII pattern you want to redact. In the example above, I have two patterns:

  • /.{4}@.{4}/g – this matches all @ symbols plus the four preceding and four following characters. So if ANY part of the payload (URL, Custom Dimension, Event Label, etc.) has the @ symbol, then the string will be obfuscated. Thus, simo.s.ahava@gmail.com becomes simo.s.a[REDACTED EMAIL]l.com.
  • /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi – this is a reasonably good abstraction of the Finnish personal identity code. It’s not perfect, because the personal identity code is actually a calculation, so you can’t use simple pattern matches to only find valid codes. This regular expression will probably result in many false positives, especially if your GA hits include UUIDs or any type of alphanumeric hashes. But it’s still better than collecting this sensitive data.
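Both patterns can be tried on sample strings. In this sketch the identity code and the hash are made-up values, chosen only to fit the shape of the regex.

```javascript
var emailRegex = /.{4}@.{4}/g;
var hetuRegex = /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi;

var email = 'simo.s.ahava@gmail.com'.replace(emailRegex, '[REDACTED EMAIL]');

// A made-up string in the shape of a personal identity code
var hetu = 'id=010190-123A'.replace(hetuRegex, '[REDACTED HETU]');

// False positive: part of a made-up hex hash happens to fit the same shape
var hash = 'hash=9f123456-789ab'.replace(hetuRegex, '[REDACTED HETU]');

console.log(email); // simo.s.a[REDACTED EMAIL]l.com
console.log(hetu);  // id=[REDACTED HETU]
console.log(hash);  // hash=9f[REDACTED HETU]b
```

The last line shows the kind of false positive mentioned above: a fragment of a hash is redacted even though no identity code is present.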

You can add your own regular expression patterns as new objects of the array.
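As an illustration, here is one way a third pattern might be added. The SSN entry is a hypothetical example for strings shaped like a US social security number, not part of the original variable, and redact is a helper made up for this sketch.

```javascript
var piiRegex = [{
  name: 'EMAIL',
  regex: /.{4}@.{4}/g
},{
  name: 'HETU',
  regex: /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi
},{
  // Hypothetical addition: three digits, two digits, four digits
  name: 'SSN',
  regex: /\b\d{3}-\d{2}-\d{4}\b/g
}];

// Hypothetical helper mirroring the replace step inside the variable.
function redact(val) {
  piiRegex.forEach(function(pii) {
    val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
  });
  return val;
}

var out = redact('ssn=123-45-6789&q=shoes');
console.log(out); // ssn=[REDACTED SSN]&q=shoes
```

Like the HETU pattern, a shape-only match of this kind will also catch strings that merely look like the identifier, so test any new pattern against your own payloads.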

When you add this variable into the customTask field of a Universal Analytics tag, the code will run through the entire payload, looking for matches to the regular expressions you provide in the configuration array. If any matches are made, they are redacted.
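The payload pass can be sketched as a standalone function, which also shows why the variable decodes each value twice. The name redactPayload and the sample payload are assumptions for illustration. The loop body mirrors the code above.

```javascript
var piiRegex = [{ name: 'EMAIL', regex: /.{4}@.{4}/g }];

// Hypothetical standalone version of the loop inside the sendHitTask override.
function redactPayload(payload) {
  return payload.split('&').map(function(pair) {
    var parts = pair.split('=');
    var val;
    // Double-decode, to account for web server encode + analytics.js encode
    try {
      val = decodeURIComponent(decodeURIComponent(parts[1]));
    } catch (e) {
      val = decodeURIComponent(parts[1]);
    }
    piiRegex.forEach(function(pii) {
      val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
    });
    parts[1] = encodeURIComponent(val);
    return parts.join('=');
  }).join('&');
}

// The page URL already contained %40, so the encoded hit carries %2540.
var payload = 'v=1&t=pageview&dl=https%3A%2F%2Fshop.test%2Fthanks%3Femail%3Dsimo.s.ahava%2540gmail.com';
var cleaned = decodeURIComponent(redactPayload(payload));
console.log(cleaned);
```

A single decode would leave %40 in the value, which the @ pattern does not match. The second decode is what exposes the address to the regex.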

Do you have other, useful regular expressions for finding and weeding out personally identifiable information?
