How to build your secure data chain

There is a great need to secure your data chain, but doing so isn't easy. Here is a guide to what you need to consider.

The need for an end-to-end data chain

Effective data management has become increasingly critical for organizations today, driven by exponential growth in data needs, ever-faster deployment of new applications, the emergence of “always on” mobile platforms, improved technology for harnessing data, decreasing data-storage costs, and the sheer amount of new data being generated all the time.

The artificial intelligence (AI) market is large and growing, with machine learning (ML) as its main driver. The rising deployment of AI and ML raises the stakes for data management, since decision algorithms generate prodigious amounts of new data.

Since the late 1990s, new data has been created at an exponentially increasing rate. For example, data needs across the US federal government are expanding rapidly to address missions ranging from taxation to national defense.

This growth will not slow down anytime in the foreseeable future. The International Data Corporation (IDC) predicts that the amount of information generated will increase 10-fold by 2025. It’s important to understand the real meaning of such statistics. The data revolution is not, at its core, about the volume of data or the plummeting cost of electronic storage, meaningful as those trends might be. The crucial consideration for enterprises – public sector and private sector alike – is not the quantity of data but its quality. The organizations that thrive in this era will be those that identify the optimal uses of their data and derive maximum value from it.

While data is increasingly recognized as an enterprise asset, its management as a distinct discipline is nascent. This is especially true across US federal departments and agencies (e.g., the Army), which must:

  • Operationalize solutions to cost-effectively move data from on-premises systems to the cloud
  • Embed frameworks for assessing data investments in support of IT portfolios (existing frameworks apply mainly to application-centric IT projects)
  • Achieve zero-trust controls for information technology (IT) and operational technology (OT) networks, as well as weapon system platforms

The increasing relevance of data and the relatively limited know-how for managing it require data management professionals to share their knowledge. This paper reflects the knowledge and practices we have gained by making our clients’ data chains efficient and effective.

The links in the chain

The data chain involves the steps required to identify, collect, and process data as effectively and efficiently as possible.

The first step is to identify what kind of data is needed to solve a problem, answer a question, or monitor a process. Next, one must establish a process for collecting data efficiently at scale – and decisions made here can affect the quality and usability of the data in downstream steps.

The last phase involves processing data to ensure it is correctly recorded, classified, and stored in formats that allow further use. Masses of raw data are worth nothing if they are inconsistent, incomplete, or unstructured, and most machine-learning models cannot cope with such flaws. Hence, data preparation is key. This includes data conversion, cleansing, enhancement, formatting, and labeling. Labeling is especially important because supervised models need matched inputs and outputs to learn from.
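As a minimal illustration of these preparation steps – a sketch only, with hypothetical column names and a hypothetical labeling rule – the Python snippet below converts, cleanses, formats, and labels a small raw feed using pandas.

    import pandas as pd

    # Hypothetical raw feed; column names and values are placeholders.
    raw = pd.DataFrame({
        "event_time": ["2024-01-05", "2024-01-06", None, "2024-01-08"],
        "sensor_reading": ["10.5", "11.2", "9.8", "bad_value"],
        "site": ["Alpha", "alpha", "BRAVO", "bravo"],
    })

    # Conversion: coerce fields into consistent types.
    raw["event_time"] = pd.to_datetime(raw["event_time"], errors="coerce")
    raw["sensor_reading"] = pd.to_numeric(raw["sensor_reading"], errors="coerce")

    # Cleansing: drop records left incomplete after conversion.
    clean = raw.dropna(subset=["event_time", "sensor_reading"])

    # Formatting: standardize categorical values.
    clean = clean.assign(site=clean["site"].str.upper())

    # Labeling: attach the output a supervised model would learn to predict
    # (the threshold is a hypothetical business rule).
    clean["anomaly"] = (clean["sensor_reading"] > 11.0).astype(int)

    print(clean)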

Managing costs

Many organizations are unaware of just how much they are spending on data capabilities because the costs are spread across the enterprise. Third-party data expenditures might come out of one unit’s budget, for example, while data-storage spending is managed within the IT function.

The price tag for such efforts typically ranges from 5 to 8 percent of operating costs – from roughly $200 million for a midsize organization to $1 billion or more for the largest ones. Data is thus a large source of costs and must be managed efficiently.

The costs of data spending across the end-to-end data capability chain are illustrated in the charts below.

[Chart 1 and Chart 2: data spending across the end-to-end data capability chain]

Targeted improvements across each area (data sourcing, architecture, governance, and consumption) can help minimize waste and put high-quality data within easier reach. These efforts can typically cut annual data spending by about 5 percent in the short term.

Over the longer term, organizations can nearly double savings by redesigning and automating core processes, integrating advanced technologies, and embedding new ways of working.

A better approach

Lessons learned from client experiences across public- and private-sector organizations globally suggest several effective measures for achieving better results with organizational data – while also saving money.

Rationalize third-party data sourcing. A handful of third-party data sources generally account for the majority of use cases. By eliminating unused and underused feeds, defining clearer permissions for data access, and allowing data to be reused for longer periods, organizations can substantially reduce data-sourcing costs. It is also important to apply procurement discipline to data-vendor contracts: set up a central vendor-management team, with business-unit and function-level gatekeepers, to oversee data subscriptions, usage terms, and renewal dates, and to compare vendor contracts on an ongoing basis. Instituting usage caps for the most commonly used feeds can provide additional gains. Through these measures, organizations can typically cut data costs by 15 to 20 percent.

Simplify data architecture. Organizations must revisit their core data architecture to protect against fragmented data stores, which can eat up between 10 and 20 percent of the average IT budget. A lack of standardization in data-management protocols can create a validation nightmare, resulting in lost productivity as teams chase down needed information (or discover too late that they are using the wrong data). In the short term, organizations can generate savings by optimizing infrastructure – for example, by offloading historical data to lower-cost storage and increasing server utilization. More widespread use of application programming interfaces (APIs) allows companies to retrieve data from their legacy systems without having to design custom access mechanisms. Over the longer term, migrating data repositories to a common data platform (for example, a data lake) and evolving to a cloud-centric model can reduce the capacity required to handle spikes in data computation and storage.
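As one hedged, concrete example of the short-term optimization described above, the sketch below uses AWS S3 lifecycle rules (via boto3) to tier historical data into lower-cost storage classes. The bucket name, prefix, and transition ages are hypothetical assumptions, and other cloud providers offer equivalent mechanisms.

    import boto3

    # Hypothetical bucket and prefix; requires valid AWS credentials.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="enterprise-data-archive",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "offload-historical-data",
                    "Filter": {"Prefix": "historical/"},
                    "Status": "Enabled",
                    "Transitions": [
                        # Move rarely accessed data to cheaper tiers over time.
                        {"Days": 90, "StorageClass": "STANDARD_IA"},
                        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )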

Differentiate data governance. Data users can spend between 25 and 40 percent of their time searching for data if a clear inventory of available data does not exist – and devote another 20 to 30 percent of their time to data cleansing if robust data controls are not in place. Effective data governance can eliminate this waste. Establishing data dictionaries, creating traceable data lineage, and implementing quality controls can improve productivity and performance significantly.

At the same time, a broad-based approach does not work. Companies must focus on high-value data priorities based on up-to-date assessments of needs, value, and risk.

Leading organizations, for example, often restrict the scope of data governance to no more than 50 reports covering a maximum of 2,000 data elements.
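To make the data-dictionary and quality-control ideas above concrete, the sketch below shows one way a governed subset of elements might be expressed in Python. The element definitions, owners, and rules are hypothetical; production deployments would typically rely on a dedicated governance or data-quality tool.

    import pandas as pd

    # Hypothetical data dictionary: a governed subset of high-value elements,
    # each with a definition, an owner, and a quality rule.
    DATA_DICTIONARY = {
        "customer_id": {
            "definition": "Unique identifier for a customer record",
            "owner": "CRM team",
            "rule": lambda s: s.fillna("").str.match(r"^C\d{6}$"),
        },
        "order_total": {
            "definition": "Order value in US dollars",
            "owner": "Finance",
            "rule": lambda s: s.notna() & (s >= 0),
        },
    }

    def run_quality_checks(df: pd.DataFrame) -> dict:
        """Return the share of rows passing each governed element's rule."""
        results = {}
        for column, spec in DATA_DICTIONARY.items():
            if column not in df.columns:
                results[column] = "missing column"
                continue
            passed = spec["rule"](df[column])
            results[column] = f"{passed.mean():.0%} pass"
        return results

    # Hypothetical sample data to exercise the checks.
    orders = pd.DataFrame({
        "customer_id": ["C000123", "C000456", None],
        "order_total": [250.0, -10.0, 99.9],
    })
    print(run_quality_checks(orders))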

Rationalize data reporting and analytics. In our experience, between 20 and 40 percent of the data reports that businesses generate are duplicative, and others go unused. To manage consumption more effectively, companies should map their reports to a handful of key metrics and tie those metrics to clearly defined actions. They should then redesign their data-gathering processes, automate pipelines, and explore new ways to model and visualize data. Prototyping and rapid testing cycles with business stakeholders help trim the excess. This approach ensures that the reports and metrics generated are useful, non-duplicative, and relatively easy to curate. Using methods such as these, clients have reduced the number of reports by 60 to 80 percent, and reporting-related costs by about 60 percent.
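One lightweight way to surface duplicative reports – sketched below with a hypothetical report catalog and metric names – is to group reports by the set of metrics they present and flag any group containing more than one report.

    from collections import defaultdict

    # Hypothetical report catalog: report name -> metrics it presents.
    REPORT_CATALOG = {
        "weekly_sales": {"revenue", "units_sold", "region"},
        "regional_revenue": {"revenue", "region"},
        "sales_dashboard": {"revenue", "units_sold", "region"},  # same as weekly_sales
        "inventory_status": {"stock_level", "reorder_point"},
    }

    # Group reports by their (frozen) metric set; any group with more than
    # one report is a candidate for consolidation.
    groups = defaultdict(list)
    for report, metrics in REPORT_CATALOG.items():
        groups[frozenset(metrics)].append(report)

    for metrics, reports in groups.items():
        if len(reports) > 1:
            print(f"Candidate duplicates: {reports} (metrics: {sorted(metrics)})")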

Take a strategy-first (not technology-first) approach. Prioritize critical use cases based on organizational priorities, determine the required enterprise data-analytics capabilities, and focus on value creation. Outline a strategic enterprise-data chain roadmap. Consider input from a cross-section of departments, units, and geographic locations.

Think big, start small, scale up quickly. Design a holistic future-state view of the data-chain approach, but get started right away with a single tightly focused, high-priority use case. Do not try to build the perfect, fully scaled solution from the start, as this will take too long to achieve. Moreover, there will inevitably be pivots from the initial direction.

Focus on enabling outcomes, crafting the information flow, deploying key foundational pieces of the future-state architecture, and learning lessons along the way.

Start with a data proof of concept (POC). Get going with a POC that delivers tangible value. Determine tenets for POC use cases. Select use cases based on business priorities and relevant timelines. Articulate sources of value and what success looks like. Select use cases that pressure-test the POC, existing enterprise capabilities, and overall organizational readiness in order to maximize lessons for the future.

Cheaper, cleaner, faster, better results

The speed at which an organization can make the right decisions – and change direction if needed – is an increasingly significant determinant of success. An organization that takes too long to react invites disaster. This is especially true in defense and national-security environments. With a good data chain, information retrieval and threat identification can proceed much faster and more smoothly.

Effective data-chain management makes organizations more productive. It avoids unnecessary duplication and makes it easier for employees to find and understand the information they need to do their jobs. It also allows staff to easily validate results or conclusions against sound data.

In the end, information is only as good as its source. If decision-makers across the organization are analyzing different data to make decisions without effective data-chain processes in place, costly errors may result.