The conversations you have about big data a year from now will be very different from today's.
We’ve recently seen data science shift markedly from a peripheral capability to a core function, with larger teams tackling increasingly complex analytics problems. We’ve watched rapid advances in data science platforms and their big implications for data and analytics teams. But what surprises are in store in the realm of data, analytics and machine learning going forward?
What new developments in data science will we be talking about in a year?
Here are our three predictions:
1. Big data’s diminishing returns: De-emphasizing the size of the data. We are increasingly seeing that bigger data is often not better. Companies are realizing that extracting more data may not help them address certain problems more effectively.
While more data can be useful if it is clean, the vast majority of business use-cases experience diminishing marginal returns. More data can actually slow innovation, making it harder for data scientists to iterate quickly as testing takes longer and requires more infrastructure.
Experimenting and iterating faster will lead to better models and outcomes than running fewer experiments on larger data sets. “If companies want to get value from their data, they need to focus on accelerating human understanding of data, scaling the number of modeling questions they can ask of that data in a short amount of time,” writes MIT researcher Kalyan Veeramachaneni.
Indeed, Fortune 500 companies will take a more agile and iterative approach by focusing on learning more from higher-quality samples of data. They will use sampling techniques to extract more representative examples, which lets them draw better conclusions from these sub-samples. For example, rather than process petabytes of call center recordings, they will sample the last 2-3 months, run dozens of experiments, and more quickly deliver a churn prediction to their team for feedback.
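To make the call center example concrete, here is a minimal sketch of that kind of sampling, using only the Python standard library. The record fields (`call_id`, `timestamp`, `churned`), the 90-day window, the 10% fraction, and the toy data generator are all assumptions made for illustration; they do not describe any particular company's pipeline.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def recent_stratified_sample(records, now, window_days=90, fraction=0.1, seed=0):
    """Keep records from the last `window_days`, then sample each label group.

    Sampling every label group at the same fraction keeps the churn rate of
    the sample close to the churn rate of the recent window.
    """
    cutoff = now - timedelta(days=window_days)
    by_label = defaultdict(list)
    for rec in records:
        if rec["timestamp"] >= cutoff:
            by_label[rec["churned"]].append(rec)

    rng = random.Random(seed)  # fixed seed so experiments are repeatable
    sample = []
    for _label, group in sorted(by_label.items()):
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def make_toy_records(now, n=1000, seed=1):
    """Hypothetical stand-in for call center metadata (not real data)."""
    rng = random.Random(seed)
    return [
        {
            "call_id": i,
            "timestamp": now - timedelta(days=rng.randint(0, 365)),
            "churned": rng.random() < 0.2,
        }
        for i in range(n)
    ]


if __name__ == "__main__":
    now = datetime(2018, 1, 1)
    records = make_toy_records(now)
    sample = recent_stratified_sample(records, now)
    print(len(records), "records ->", len(sample), "sampled")
```

Because the sample is small and seeded, a team can rerun many modeling experiments against the same subset quickly, then widen the window only if the results call for it.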
2. CIOs dealing with the data science Wild West: IT teams bringing order to data and analytics. IT organizations have traditionally managed analytical data infrastructure, such as data warehouses and production processes. Driven by a desire to experiment, data scientists, who sit in the middle of the stack between IT and business consumers, are increasingly creating their own shadow IT infrastructure. They download tools and install them locally on their desktops or on shared servers scattered across departments. They use RStudio, Jupyter, Anaconda and a myriad of open source packages that improve almost daily.
This Wild West of tooling creates a plethora of governance challenges. Many CIO teams are realizing that data scientists need consistent and secure tooling that does not constrain their ability to experiment and innovate.
Over the next year, organizations will increasingly bring data science tooling under the umbrella of IT as a service offering. By centralizing a solution that provides infrastructure for data scientists, CIOs will gain transparency into what data and tools are being used, enforce best practices, and better manage costs of the specialized hardware critical to data science workflows.
3. The need to show your work: Increasing model risk management and oversight. With the EU GDPR going into effect in May 2018, along with increased worldwide scrutiny of how data models are used in regulated industries, data governance is more important than ever. Many of the most data-oriented industries – finance, banking, insurance, healthcare, utilities – are among the most heavily regulated. In these sectors, key decisions about pricing, lending, marketing and product availability are increasingly driven by data science, and policymakers are taking notice. Regulation extends beyond what data is used to how it is used and by whom, adding significant complexity.
One example is the U.S. Federal Reserve Board's SR 11-7, whose requirements include “[requiring] banks to separate model use and development from validation, set up a consolidated firm-wide model risk function, maintain an inventory of all models, and fully document their design and use,” writes Nazneen Sherif of Risk.net.
The same sort of increased demands and scrutiny are starting to reach other regulated industries as well. The U.S. Justice Department blocked the proposed health insurance mergers of Anthem with Cigna and Aetna with Humana in part because the companies did not do enough to prove their data-driven efficiency and pricing claims. With data science and analytics driving more organizational decisions going forward – as well as representing a bigger part of the decision-making equation – expect even more data model scrutiny, both internally and externally.
The example of SR 11-7 is illustrative. Three years after the regulation was issued, the Fed concluded that some bank holding companies (BHCs) fell short of its requirements for rigorous testing of stress-test models. The Fed then mandated that BHCs “have a reliable system of record to collate the information required for CCAR submissions,” summarizes one recent Oracle blog post. “If the system of record provides auditability, traceability, and security to the confidential information, it serves as an assurance to the top management and the regulators as well.”
Accordingly, organizations will need to better document and show their data science work.
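As a rough illustration of what a "system of record" for models might look like, the sketch below keeps an inventory of models, documents who built and who validated each one, and refuses entries where those are the same party. This is only a toy under stated assumptions: the class names, fields, and the 365-day revalidation threshold are hypothetical and are not taken from SR 11-7 or any regulator's guidance.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ModelRecord:
    """One inventory entry; all field names are illustrative assumptions."""
    name: str
    purpose: str
    developer: str
    validator: str
    training_data: str
    last_validated: date
    notes: list = field(default_factory=list)


class ModelInventory:
    """In-memory inventory that enforces developer/validator separation."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Validation should be independent of development.
        if record.developer == record.validator:
            raise ValueError(
                f"{record.name}: validator must differ from developer"
            )
        self._records[record.name] = record

    def stale(self, today, max_age_days=365):
        """Names of models whose last validation is older than the threshold.

        The 365-day default is an arbitrary choice for this sketch.
        """
        return sorted(
            name
            for name, rec in self._records.items()
            if (today - rec.last_validated).days > max_age_days
        )


if __name__ == "__main__":
    inventory = ModelInventory()
    inventory.register(ModelRecord(
        name="churn_v1",
        purpose="retention scoring",
        developer="analytics-team",
        validator="model-risk-team",
        training_data="call center sample, last 90 days",
        last_validated=date(2017, 1, 15),
    ))
    print(inventory.stale(today=date(2018, 6, 1)))
```

In practice such a record would live in a governed, auditable store; the point here is only that documenting design, ownership and validation status can be captured as structured data from the start.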
Going forward, many companies and data scientists will continue to focus on the predictable: big data, the latest new algorithms, and relentless expansion of machine learning. But we think the new developments outlined above represent a wake-up call, a key inflection point in the maturing of organizational data science.