What is data hygiene and why is it important?
Definition
Data hygiene is defined by Wikipedia as “the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.”
Data hygiene is not generally focused on the user. Searches for data hygiene return articles about databases and about ensuring that business logic has accurate data to process. While that is obviously important for running a business, this post focuses on the other side of data hygiene: user data. It looks at where that data goes, how to minimize its spread across services, and how to keep more, rather than less, of it within a user’s control.
User data is gold
User data is considered the new ‘gold’ of the modern age. Services and products that purport to be ‘free’ harvest user data for sale through advertisement networks. The rapid advancement of cloud-connected devices and the widespread adoption of mobile handsets have forced industries to recalibrate their offerings and compete in the global space. Some industries were ripe for this disruption, such as the publishing industry, which had to rapidly transform from a ‘product’ (e.g., newspapers, magazines) to a ‘service’ (e.g., ad clicks on articles). The gaming industry is another example of this type of transition, with advertising networks hooked directly into digital marketplaces, giving them direct access to the purchasing decisions of the user so that they can turn around and offer targeted micro-transactions. If user data is gold in the modern age, then users should protect their data as if it were indeed a precious metal. How does one go about protecting their data from spread and leaks?
Data spread
To decrease data spread, one first needs to understand where that data is stored and who has access to it. One shocking statistic indicates the average person uses up to 37 cloud services in one day. Each node that sends and receives user data adds another endpoint, another company policy and another potential leak risk. The effect of data spread compounds with every additional service.
Every data node that sends or receives user data can be a potential risk for data leaks.
A hypothetical but typical email signup workflow demonstrates this scenario.
- User A signs up for an email service that works directly in the browser
- Email service’s website requires cookies/opt-in for browser enhancements
- Email service uses Google’s reCAPTCHA to prevent brute-force attacks
- Email service uses multiple external libraries for browser functionality
- When User A sends an email, the service injects a footer to help advertise for the free plan
- The footer includes an image and tracking link
- At login, User A provides their information to the email service:
  - IP address (geolocation)
  - Times of use
  - User-agent (the browser in use)
- User A is then enrolled in advertisement networks
Now the question is: who controls User A’s data?
Answer: every single service in use by the email provider.
(Don’t think this is a big deal? Look at just how much information your browser leaks about you)
Understanding the compounding effect of data spread is essential to controlling where your data goes, who has access to it, how long it’s stored, and what happens to it when the controlling entity exits...
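The signup workflow above can be sketched as a small model of who ends up holding which piece of User A’s data. This is only an illustration: the party names and the fields assigned to each party are assumptions made for the example, not measurements of any real email provider.

```python
from collections import defaultdict

# Hypothetical data flows for the email signup scenario.
# Each key is a party involved in the workflow; each value is the set of
# data fields that party could plausibly receive (illustrative assumptions).
DATA_FLOWS = {
    "email service": {"email address", "ip address", "times of use", "user agent"},
    "captcha provider": {"ip address", "user agent"},
    "external library hosts": {"ip address", "user agent"},
    "advertisement networks": {"email address", "ip address", "user agent"},
}


def holders_by_field(flows):
    """Invert the mapping: for each data field, collect the parties that receive it."""
    holders = defaultdict(set)
    for party, fields in flows.items():
        for field in fields:
            holders[field].add(party)
    return dict(holders)


if __name__ == "__main__":
    # Print each field with the number of parties that hold a copy of it.
    for field, parties in sorted(holders_by_field(DATA_FLOWS).items()):
        print(f"{field}: {len(parties)} parties ({', '.join(sorted(parties))})")
```

Adding one more entry to `DATA_FLOWS` raises the holder count for every field that party receives, which is the compounding effect described above.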
Limiting Data Spread
With a firm grasp of how far-reaching data spread can be, users can be empowered to limit where their data ends up. For each service, ask:
- Do you know what happens if the service disappears?
- What breach policy/policies does the service have?
- How many external libraries does the service use?
- Are those libraries maintained?
Make a list of every service you use and take some time to review their policies. Limit the number of services you use and lock down each privacy leak. For example, browser tools exist that can plug JavaScript leaks. Use services that have strong privacy policies. It may also be sensible to sign up for a breach notification tool, so that if your data is leaked you will be able to react accordingly.
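One way to keep such a list is a small script that records the answers to the questions above for each service and flags the ones that need attention. The sketch below is a minimal example under stated assumptions: the `ServiceReview` class, its field names and the sample entries are all hypothetical, and the answers have to be filled in by hand after reading each service’s policies.

```python
from dataclasses import dataclass


@dataclass
class ServiceReview:
    """Hand-filled answers to the review questions for one service (hypothetical schema)."""
    name: str
    known_shutdown_plan: bool    # do you know what happens if the service disappears?
    has_breach_policy: bool      # does the service publish a breach policy?
    external_libraries: int      # how many external libraries does it use?
    libraries_maintained: bool   # are those libraries maintained?

    def concerns(self):
        """Return a list of open concerns for this service."""
        issues = []
        if not self.known_shutdown_plan:
            issues.append("unknown what happens to data if the service disappears")
        if not self.has_breach_policy:
            issues.append("no breach policy found")
        if self.external_libraries > 0 and not self.libraries_maintained:
            issues.append(f"{self.external_libraries} external libraries, not all maintained")
        return issues


if __name__ == "__main__":
    # Sample entries are invented for illustration only.
    inventory = [
        ServiceReview("example-mail", True, True, 2, True),
        ServiceReview("example-notes", False, False, 12, False),
    ]
    for service in inventory:
        issues = service.concerns()
        status = "OK" if not issues else "; ".join(issues)
        print(f"{service.name}: {status}")
```

Services that accumulate several concerns are natural candidates to drop when limiting the number of services you use.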
If a company depends on your data for revenue, then your activity, your life, and your thoughts are considered gold to them.