Data is a valuable business asset, which is why many organizations have a policy of never deleting any of it. But as data volumes continue to grow, keeping everything around can become very expensive. An estimated 30% of the data organizations store is redundant, obsolete, or trivial (ROT), while a Splunk survey found that 60% of organizations say half or more of their data is “dark,” meaning its value is unknown.
Some outdated data can pose a risk as businesses face the growing threats of ransomware and cyberattacks; this data may not be sufficiently protected, yet it can be valuable to hackers. In addition, internal policies or industry regulations may require organizations to delete certain data after a set period of time, such as ex-employee records, financial data, or personally identifiable information (PII).
Another problem with storing large amounts of outdated data is that it can overcrowd file servers, reducing productivity. A 2021 study by Wakefield Research found that 54% of US office professionals agreed that they spend more time searching for documents and files than replying to emails and messages.
Being a responsible steward of the enterprise IT budget does not mean preserving every file down to the last byte. Nor does it mean deleting data prematurely while it still has value. A responsible deletion strategy should be implemented in stages: inactive cold data should move to less expensive storage and backup resources, and when data becomes obsolete, there should be a methodical way to confine and delete it. The question is: how do you build an efficient data-erasure process that identifies, confines, and deletes data systematically?
Barriers to Data Deletion
Cultural: We are all data hoarders by nature, and without analysis to show which data is truly outdated, it is difficult to change an organization’s keep-everything-forever mindset. Unfortunately, that mindset is no longer tenable, given the astronomical growth of unstructured data in recent years, from genomics and medical imaging to streaming video, electric cars, and IoT products. While deleting data that has no current or potential future purpose is not really a data loss, most storage administrators have faced the ire of users who accidentally deleted files and then blamed IT.
Legal/Regulatory: Some data must be kept for a period of time, although usually not forever. In other cases, company policy dictates that data, such as PII, may be retained only for a specified period. How do you know which data falls under which rule, and how do you prove that you comply with it?
Lack of systematic tools to understand data usage: Manually figuring out which data is out of date, and getting users to act on it, is tedious and time-consuming, and therefore never gets done.
Tips for Deleting Data
Create a well-defined data management policy
Developing a sustainable data lifecycle management policy requires the right analytics. You want to understand data usage so you can determine what can be deleted based on data type, such as temporary or intermediate data, and on usage, such as data that has not been accessed for a long time. This also helps win buy-in from business users, since removal is based on objective criteria rather than a subjective decision.
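As a minimal sketch of this kind of usage analysis, the snippet below walks a file tree and flags files whose last-access time is older than a cutoff. The function name and the one-year “cold” threshold are illustrative assumptions, not a standard; real data management platforms gather much richer analytics.

```python
import os
import time

COLD_AGE_DAYS = 365  # assumption: "cold" means not accessed in a year


def find_cold_files(root, age_days=COLD_AGE_DAYS):
    """Walk a directory tree and yield (path, size_in_bytes) for files
    whose last-access time is older than the cutoff."""
    cutoff = time.time() - age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                yield path, st.st_size
```

Note that some file systems mount with access-time updates disabled (e.g. `noatime`), in which case `st_atime` is unreliable and modification time may be a better proxy.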
With this knowledge, you can map out how data will move over time: from primary storage to cooler tiers, possibly in the cloud, to archive storage, then out of user-visible space into a hidden confinement area and, finally, to deletion.
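Such a lifecycle map can be expressed as a simple schedule of idle-time thresholds. The tier names and day counts below are hypothetical placeholders; any real policy would set them per dataset and regulation.

```python
# Hypothetical lifecycle schedule: (idle-days threshold, tier name).
# Both the thresholds and the tier names are assumptions to illustrate
# the idea, not recommendations.
LIFECYCLE = [
    (90,   "primary"),      # accessed within 90 days: keep on fast storage
    (365,  "cool-tier"),    # 90 days to 1 year idle: cheaper (cloud) tier
    (1095, "archive"),      # 1 to 3 years idle: archive storage
    (1460, "confined"),     # 3 to 4 years idle: hidden, pending deletion
]


def tier_for_idle_days(idle_days):
    """Return the storage tier for data idle this many days; data past
    the final threshold is eligible for deletion."""
    for threshold, tier in LIFECYCLE:
        if idle_days < threshold:
            return tier
    return "delete"
```

Keeping the schedule as data rather than hard-coded branches makes it easy to review with stakeholders and adjust per dataset.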
Considerations that may affect policy include regulation, potential long-term value of data, and the cost of storage and backups at every stage from primary storage to archive storage. These decisions can have huge consequences if, for example, data sets are deleted and later needed for analysis or forecasting.
Develop a communication plan for users and stakeholders
For a given workload or data set, data owners need to understand the costs versus benefits of retaining data. Ideally, the data lifecycle policy is agreed upon by all stakeholders, if not dictated by an industry regulation. Communicate data usage analytics and policies to stakeholders so they understand when data expires and whether there is a grace period during which data is kept in a restricted “undelete” container. Confinement makes it easier for users to agree to data-deletion workflows, because they know that if they need the data, they can “undelete” it and get it back within the grace period.
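The confine/undelete/purge workflow can be sketched as three small operations. Everything here is an assumption for illustration: the function names, the 30-day grace period, and the idea of using a hidden directory as the confinement area.

```python
import os
import shutil
import time

GRACE_DAYS = 30  # assumed grace period; set by your own policy


def confine(path, quarantine_dir):
    """Move a file into a hidden confinement area instead of deleting it."""
    os.makedirs(quarantine_dir, exist_ok=True)
    dest = os.path.join(quarantine_dir, os.path.basename(path))
    shutil.move(path, dest)
    return dest


def undelete(name, quarantine_dir, original_dir):
    """Restore a confined file within the grace period."""
    shutil.move(os.path.join(quarantine_dir, name),
                os.path.join(original_dir, name))


def purge_expired(quarantine_dir, grace_days=GRACE_DAYS):
    """Permanently delete confined files older than the grace period."""
    cutoff = time.time() - grace_days * 86400
    for name in os.listdir(quarantine_dir):
        p = os.path.join(quarantine_dir, name)
        if os.stat(p).st_mtime < cutoff:
            os.remove(p)
```

A production system would also record the original path of each confined file so restores are automatic, and would log every purge for compliance reporting.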
For long-term data that needs to be retained, make sure users understand the costs and any additional steps required to access data from deep archive storage. For example, it can take several hours for data to become accessible from AWS Glacier Deep Archive, and retrieval fees often apply.
Plan for technical issues that may arise
Deleting data is not a free operation. We usually think only of read/write performance, but deletion also consumes system resources. Take this example from a theme park: 100,000 guest photos are taken per day, and each is kept for 30 days after the customer leaves the park. From day 30 onward, the load on the storage system doubles: it needs the capacity to ingest 100,000 new photos and delete 100,000 expired ones every day.
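The theme-park arithmetic works out as follows (the variable names are just for illustration):

```python
DAILY_PHOTOS = 100_000   # photos ingested per day (from the example)
RETENTION_DAYS = 30      # each photo is kept for 30 days

# Steady-state capacity: once retention kicks in, the system always
# holds 30 days' worth of photos.
steady_state_objects = DAILY_PHOTOS * RETENTION_DAYS  # 3,000,000 photos

# From day 30 onward, every day the system must both write 100K new
# photos and delete the 100K that just expired, so the daily object
# workload doubles.
daily_ops = DAILY_PHOTOS + DAILY_PHOTOS  # 200,000 operations/day
```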
Deletion performance workarounds, sometimes called “lazy deletes,” can smooth out the delete workload, but if the system can’t delete data at least as fast as new data is ingested, you’ll need to add storage capacity just to hold expired data. In scale-out systems, you may even need to add nodes to handle deletions.
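To see why a delete rate below the ingest rate forces you to buy capacity, here is a toy simulation of the expired-but-undeleted backlog. It is a simplification under the assumption of constant daily rates; the function and its parameters are invented for illustration.

```python
def backlog_after(days, ingest_per_day, delete_per_day, retention_days):
    """Simulate the backlog of expired-but-undeleted objects when lazy
    deletes run at a fixed rate (all rates in objects per day)."""
    backlog = 0
    for day in range(days):
        if day >= retention_days:
            backlog += ingest_per_day  # another day's worth expires
        backlog -= min(backlog, delete_per_day)  # lazy deleter catches up
    return backlog
```

With the theme-park numbers, a deleter that handles only 80,000 photos a day falls behind by 20,000 objects daily once expiration starts, and that backlog occupies storage until capacity is added or the delete rate improves.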
A better approach is to tier cold data off the primary file system and then confine and delete it there, removing the unwanted load and performance impact from the active file system.
Put the data management plan into action
After the policy has been determined for each dataset, you will need an implementation plan. An independent data management platform provides a unified approach across all data sources and storage technologies, offering better visibility and reporting on business datasets while also automating data management actions. Collaboration between IT and line-of-business (LOB) teams is an integral part of execution, and it reduces friction because LOB teams feel they have a say in data management. Department heads are often surprised to find that 70% of their data is infrequently accessed.
Given the current trajectory of data growth worldwide — data is expected to almost double from 97 ZB in 2022 to 181 ZB in 2025 — enterprises have little choice but to rethink data erasure policies and find a way to delete more data than in the past.
Without the right tools and collaboration, this can become a political battlefield. But by making data deletion a well-planned tactic in the overall data management strategy, IT gains a more manageable data environment that delivers better user experiences and value for money spent on storage, backups, and data protection.
Kumar Goswami is CEO and co-founder of Komprise.