Data minimization

As Big Data grows in influence and importance, the concept of data minimization is becoming more prominent as well, from both a business and a regulatory standpoint.


Hoarding as much data as possible is proving inefficient, both for storage and for later use. Storage is cheap compared to the past, but keeping huge volumes of data still costs money. Keeping virtually useless data also makes it harder to find and aggregate the useful data. Security is another issue, and a major concern in its own right: hoarded data increases the potential damage of a breach, which could lead to much bigger fines, all for data that was never of any use to you in the first place.

The GDPR is a good example of the regulatory side of data minimization. Article 5 of the European regulation states that:

“(1) Personal data shall be:

(b) collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes; further processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes shall, in accordance with Article 89(1), not be considered to be incompatible with the initial purposes (‘purpose limitation’);…

(c) adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed (‘data minimisation’);…”

This means that data should have a predetermined purpose, and that the types of data collected should be limited to those needed to achieve that purpose. The principle attaches real responsibility to the collection of data, and it applies across different IT environments, such as production, test and data warehousing environments.
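
One way to picture the principle is an allow-list that ties each declared purpose to the fields it actually needs, with everything else discarded before storage. The following is only a minimal sketch in Python: the purposes, the field names and the minimize function are hypothetical and chosen for illustration, not taken from the GDPR or from any particular product.

```python
# Hypothetical mapping from a declared processing purpose to the only
# fields that purpose needs. Names are illustrative assumptions.
ALLOWED_FIELDS = {
    "order_shipping": {"name", "street", "city", "postal_code"},
    "newsletter": {"email"},
}


def minimize(record, purpose):
    """Return a copy of record holding only the fields allowed for purpose."""
    if purpose not in ALLOWED_FIELDS:
        # No declared purpose means there is no basis for keeping anything.
        raise ValueError(f"no declared purpose: {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


submitted = {
    "name": "Ada Example",
    "email": "ada@example.com",
    "birth_date": "1990-01-01",  # not needed for a newsletter
}
print(minimize(submitted, "newsletter"))  # {'email': 'ada@example.com'}
```

The design choice here is that an unknown purpose raises an error instead of silently storing the whole record, so collecting data without a stated purpose has to be a deliberate decision.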

For the production environment, the question is the usual one: “is the data needed?” As for test environments, it is a principle of IT security that they should never use real data, because of their overall nature (fewer security controls, unpredictability, access by a wider range of individuals, etc.). Data minimization should therefore be of no concern in test environments, since the data there would ideally be masked in an appropriate way while still creating a realistic scenario. In data warehouses, even when the gathered information has a purpose, it becomes outdated at some point; it then loses much of its usefulness but retains its risk, so it should be properly managed.
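
The masking and retention ideas from the paragraph above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete solution: the one-year retention period, the salt value and the function names mask_email and is_expired are all hypothetical. Note also that a salted hash is pseudonymization rather than full anonymization, so a real masking process would need more care.

```python
import hashlib
from datetime import date, timedelta

# Assumed retention period, for illustration only; the real value
# depends on the purpose the data was collected for.
RETENTION = timedelta(days=365)


def mask_email(email, salt="test-env"):
    """Replace an address with a stable, realistic-looking stand-in.

    The same input always gives the same output, so relations between
    masked records still line up in a test environment.
    """
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return f"user_{digest[:12]}@example.com"


def is_expired(collected_on, today):
    """Return True when a record is older than the retention period."""
    return today - collected_on > RETENTION


print(mask_email("ada@example.com"))
print(is_expired(date(2020, 1, 1), date(2021, 6, 1)))  # True
```

A warehouse job could run a check like is_expired on a schedule and delete or aggregate the records it flags, so that outdated data stops carrying risk once it has stopped being useful.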

