Data purging strategy

VIJAY AGRAWAL
May 16, 2021


Organizations want to define the life of their data by some means.

Some want to completely erase the data from disk; others prefer to temporarily move older data sets to an archive location (perhaps a cheaper storage device such as tape).

Most modern cloud-based systems have built-in housekeeping (background) tasks that periodically do this job on your behalf.

Why would someone need a data purge?

There are a few main reasons why data purging becomes a must-have task:

1. The high cost of storing historical data that you may no longer need.

2. Organisational, client, or regional data storage contracts (GDPR, CCPA, and others) and retention policies that force you to delete data on demand or by an expiration date tagged when the dataset is created (downloaded).

3. As the data grows, your system's performance deteriorates.

In this process, identifying the data objects that need to be purged becomes the key activity. The deciding factors could be the expiration date of the stored data, or the owner (a user, an organisation, etc.) who has asked for their data to be erased.
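As a rough illustration, here is a minimal Python sketch of that identification step, assuming the expiration date and owner are stored as metadata alongside each object. The `DataObject` record, the `is_purge_candidate` function, and the `erasure_requests` set are hypothetical names used only for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical record describing one stored data object and its retention metadata.
@dataclass
class DataObject:
    path: str
    owner: str
    expires_at: datetime  # expiration date tagged when the dataset was created


def is_purge_candidate(obj: DataObject, erasure_requests: set[str]) -> bool:
    """An object is eligible for purging if it has expired, or if its
    owner has asked for their data to be erased."""
    now = datetime.now(timezone.utc)
    return obj.expires_at <= now or obj.owner in erasure_requests


# Example: one expired report plus an erasure request from a different owner.
objects = [
    DataObject("/data/reports/2019.csv", "org-42",
               datetime(2021, 1, 1, tzinfo=timezone.utc)),
]
to_purge = [o for o in objects if is_purge_candidate(o, erasure_requests={"user-7"})]
```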

Implementing an auto-purge system for a file/directory

- Define the data retention schedule at creation time, i.e., how long the data will remain in the system. This life span can be mandated by system admins or overridden by the creator.

- Once you have decided when to purge, the next question is where and how your data is stored. Is it stored in such a way that you can erase it easily?

- When I say where/how it is stored: check whether the data is scattered across multiple locations (directories). Do you have the option to bucket it so that you know which bucket to delete and when? If not, you will have to run a find query (command) to look for the data files that are eligible for deletion.

- Now that you know what to delete and when to delete it, go ahead and run your data purging tool (commands); see the sketch after this list.
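To make those last two steps concrete, below is a minimal Python sketch that walks a directory tree, selects files older than a retention period (roughly the scripted equivalent of `find <root> -type f -mtime +N -delete`), and deletes them. The `purge_expired_files` function, the `/data/archive` path, and the 90-day retention period are assumptions chosen for illustration only.

```python
import os
import time
from pathlib import Path


def purge_expired_files(root: Path, retention_days: int, dry_run: bool = True) -> list[Path]:
    """Walk `root`, collect files whose modification time is older than the
    retention period, and delete them unless dry_run is set."""
    cutoff = time.time() - retention_days * 24 * 60 * 60
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.stat().st_mtime < cutoff:
                if not dry_run:
                    path.unlink()  # actually erase the file from disk
                expired.append(path)
    return expired


# Hypothetical usage: report first, then purge the archive location after 90 days.
candidates = purge_expired_files(Path("/data/archive"), retention_days=90, dry_run=True)
print(f"{len(candidates)} files eligible for deletion")
```

Running with `dry_run=True` first lets you review the candidate list before anything is actually erased, which is a sensible safeguard for any purge job.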
