Why do we need more productive data teams?

Dominik Vach

Founder & CEO

Product

September 17, 2021

We built a tool to solve the big problem.

"If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team" — Andrew Ng

Data has changed how businesses are being built and run. With the explosion of data available, the ability of companies to extract insights efficiently serves today as the bedrock for competitive advantage and the creation of new business models.

Now, more and more companies are transitioning to the next step of being “data-driven” — using machine learning (ML) and artificial intelligence (AI).

This change is happening fast, and it is happening now. ML is defining personalized user experiences, allowing for proactive decision making and eliminating repetitive business processes.

Despite the huge potential of ML and AI, most companies have yet to realize its business value. In fact, most ML projects fail before they get to production¹. Even for projects that do get started, the gap to actual business use remains to be bridged: 80% of ML projects are stuck in the proof-of-concept stage².

The consequence of this is a huge gap between the “AI-native” companies and the rest.

The problem is not the models, it is the data

In response, many solutions have emerged to streamline the building and deployment of ML models. But these solutions have mostly focused on the model itself, not on the more challenging part of ML: the data.

Today, the process to go from raw data to ML in production is slow, costly and error-prone. ML brings a higher level of complexity and greater data needs, yet 70% of companies struggle to extract value from data even for analytics³.

Data Scientists are not productive (and it is not their fault)

The main reason behind the struggle of slow, failure-prone and expensive ML projects is, in our view, that the main actor here — the Data Scientist — is not productive enough. Below we break down why this is the case and how we intend to solve it with Forloop.ai.

I: Knowledge gap between data workers

A reason why data science teams are not productive is the lack of standardization in the process to go from raw data to models in production. It is a complex effort involving multiple steps and stakeholders, illustrated below. This often causes delays for the Data Scientist due to a lack of skills or ownership.

Fig 1: The ML lifecycle, where the Data Scientist needs to coordinate their work across teams and disciplines

Recent years have given rise to the role of Data Engineer. Data Engineers serve the very important function of making data consumable for analysts and Data Scientists, through their expertise in working efficiently with databases and creating scalable pipelines that extract and load data to a destination (typically a data warehouse or a data lake).

While Data Scientists typically require data from the same source systems as business analysts, the data must be organized and structured very differently to work with machine learning algorithms. Further, Data Scientists often require additional data, some of which is external.

However, the preparation work to get the data ready for modelling does not fall squarely into either domain. As a result, many Data Scientists are forced to learn how to work with databases and APIs and to build data pipelines, when they really should be building machine learning models and exploring data for analysis.

We believe the answer is not more Data Engineers, but rather better tools for Data Scientists, so that they can drive the data to value process more independently while freeing up Data Engineers.

II: Time and effort spent on data preparation (rather than modelling)

Another reason for the lack of data science productivity is the time and effort needed to prepare data and assure its quality. It is not uncommon to see Data Scientists devoting 70–80% of their time to these tasks, rather than modelling and analysis⁴.

Though a very important step, it’s a mundane, time-consuming and manual process: gathering the data, structuring and cleaning it, joining sources, engineering features and validating the data quality. Not surprisingly, one can often find quotes such as:

"… data preparation takes valuable time away from real data science work and has a negative impact on overall job satisfaction"⁵.

This frustration is most tangible when data is streamed from different sources, which may have different formats, structures, and metadata. Combining this into a single dataset suitable for ML is often a very tedious process, especially when data needs to be prepared continuously.
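To make this kind of harmonization concrete, here is a minimal sketch in plain Python. The two sources, their field names, date formats and values are hypothetical and chosen only for illustration; nothing here describes a real system or how Forloop.ai works internally.

```python
from datetime import datetime

# Hypothetical records from two sources that describe the same kind of entity
# but disagree on field names, date formats and how missing values look.
crm_rows = [
    {"CustomerID": "17", "SignupDate": "2021-03-04", "Revenue": "1,200.50"},
    {"CustomerID": "42", "SignupDate": "2021-07-19", "Revenue": ""},
]
billing_rows = [
    {"customer_id": 42, "signup": "19/07/2021", "revenue_eur": 310.0},
    {"customer_id": 99, "signup": "01/09/2021", "revenue_eur": 75.25},
]


def parse_date(text, fmt):
    """Parse a date string using the format of its source."""
    return datetime.strptime(text, fmt).date()


def parse_amount(value):
    """Return the amount as a float, or None when the value is missing."""
    if value is None or value == "":
        return None
    if isinstance(value, str):
        value = value.replace(",", "")  # drop thousands separators
    return float(value)


def from_crm(row):
    """Map one CRM record onto the shared schema."""
    return {
        "customer_id": int(row["CustomerID"]),
        "signup_date": parse_date(row["SignupDate"], "%Y-%m-%d"),
        "revenue": parse_amount(row["Revenue"]),
        "source": "crm",
    }


def from_billing(row):
    """Map one billing record onto the shared schema."""
    return {
        "customer_id": int(row["customer_id"]),
        "signup_date": parse_date(row["signup"], "%d/%m/%Y"),
        "revenue": parse_amount(row["revenue_eur"]),
        "source": "billing",
    }


# One list with one schema, ready for cleaning and feature engineering.
unified = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Even in this toy case, every new source needs its own hand-written mapping, and each mapping has to be maintained as the source changes. That is the repetitive work described above.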

To make things worse, Data Scientists are often expected to solve data challenges based on their own experience and intuition, and each person in a team often does this separately.

Furthermore, poor data quality and a reluctance to get one’s hands dirty with data preparation may lead to data cascades (see fig 2): compounding negative effects of bad data that affect not only the productivity of the data science team but also the business directly.

Fig 2: Data cascades - How upstream data issues (e.g. collection) impact downstream outputs (models)⁶.

III: Lack of tooling that is unifying, but still flexible

Another bottleneck we see is the growing proliferation of tools and languages (well illustrated here). As a data manager or startup founder, how do you choose what tools you should have?

We see two problems here.

First, many tools are focused on a specific role or a step. As a result, even within a single company, there might be multiple data stacks, as different roles and units create their own. This may lead to a lack of cohesion between Data Scientists, Analysts, Engineers, and business users, despite the fact that they are all working towards the same business initiatives.

Second, the idea of a unified platform for all doesn’t really hold. It will simply not suit the different personas, skills and use-cases. The most prominent tools that take a “unified platform approach” lack adoption from many Data Scientists. Besides cost, common reasons include lack of flexibility and control, lock-in, slow iterations, black-box behavior, and difficulty moving from the test to the production environment.

We think businesses should invest in tools and platforms that bridge, rather than fight, the openness and fragmentation between tools and personas. Thus, it’s important that tooling is closely connected to how data science is done today.

Future of data science

Our vision is a future where the current 80/20 rule is flipped: a Data Scientist spends 80% of their time not on fixing data, but on the creative aspects that drive the business forward.

For this to happen, we need to:

Decrease the complexity and lack of collaboration in the data to value process.
Augment the data worker with machine intelligence to make data preparation less repetitive and pipelines less error-prone.

Introducing Forloop.ai

With Forloop.ai we are introducing a more autonomous approach to data preparation, in a no-code data pipeline environment. Ultimately, this makes Data Scientists and data teams more productive and enables companies to fully leverage their data.

We call it: An easy-to-use data pipeline and data preparation tool, with augmented intelligence.

Forloop.ai is based on two main principles:

I) Decrease time for data preparation with augmented intelligence

Forloop.ai embeds deep-learning algorithms and statistical methods to make the data cleaning, joining and feature engineering less manual.

As data is inserted into the platform, it learns from the structure of the data, the sequence of transformations made and the high-level objective of the user. Over time, it automates more and more of the mundane data preparation tasks and data quality assurance.

This can be used to:

Accelerate data preparation by identifying and recommending actions for data quality issues, such as: missing values, outliers, value range, imputation, formatting of values and more.
Identify joins between datasets: not only on keys or column names but also based on the value ranges within them.
Establish a standard with common definitions of features and of the methods used to engineer them.
Monitor the data quality and get alerts when values fall outside expected thresholds.
Help users automatically synthesize data preparation processes and data pipelines.
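Two of the checks listed above can be sketched with the Python standard library alone. This is an illustrative sketch, not Forloop.ai's implementation: the function names `profile_column` and `join_candidates`, the 1.5 × IQR outlier rule and the Jaccard overlap score are all assumptions picked for simplicity. In particular, the overlap score is a simple stand-in for matching on value ranges.

```python
import statistics


def profile_column(values, expected_range=None):
    """Summarise common quality issues for one numeric column.

    `values` may contain None for missing entries. `expected_range` is an
    optional (low, high) pair used as an alerting threshold.
    """
    present = [v for v in values if v is not None]
    report = {
        "missing": len(values) - len(present),
        "outliers": [],
        "out_of_range": [],
        # The median is one possible value to suggest for imputation.
        "median": statistics.median(present) if present else None,
    }
    if len(present) >= 4:
        # Flag values further than 1.5 * IQR outside the quartiles.
        q1, _, q3 = statistics.quantiles(present, n=4)
        spread = 1.5 * (q3 - q1)
        report["outliers"] = [
            v for v in present if v < q1 - spread or v > q3 + spread
        ]
    if expected_range is not None:
        low, high = expected_range
        report["out_of_range"] = [v for v in present if not low <= v <= high]
    return report


def join_candidates(left, right, min_overlap=0.5):
    """Rank column pairs by how much their distinct values overlap.

    `left` and `right` map column names to lists of values. The score is the
    Jaccard index, so column names play no role in the match.
    """
    pairs = []
    for left_name, left_values in left.items():
        for right_name, right_values in right.items():
            left_set, right_set = set(left_values), set(right_values)
            union = left_set | right_set
            score = len(left_set & right_set) / len(union) if union else 0.0
            if score >= min_overlap:
                pairs.append((left_name, right_name, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical data, for illustration only.
ages = [34, 29, None, 41, 38, 250, 31, -3]
age_report = profile_column(ages, expected_range=(0, 120))

orders = {"cust": [1, 2, 3, 4], "amount": [10.0, 25.5, 7.0, 3.2]}
customers = {"id": [2, 3, 4, 5], "age": [34, 29, 41, 38]}
suggested_joins = join_candidates(orders, customers)
```

Here `age_report` flags the one missing entry and the values 250 and -3, and `suggested_joins` proposes joining `cust` with `id` even though the two columns share no name. A real tool would need more robust statistics and type handling than this sketch shows.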

II) Collaborative data workflow environment

To decrease the complexity and lack of collaboration in the data pipeline process, we believe in a combined visual and code approach.

The visual, drag-and-drop, environment is suitable for:

Collaborating across multiple actors (from business people to engineers) to create and oversee the data to value process.
Getting an overview of the processes, from connecting data sources and running data procedures (e.g. clean, join and transform) to model integration (e.g. SageMaker, Jupyter notebooks).
Orchestrating and scheduling data processes for ML and metrics.

And, the code environment is more suitable for:

Full flexibility and transparency of what happens “under the hood”.
Fast iterations between data and ML prototyping, with support for import, export and sharing of code (Python).
Version control and data lineage to track the history, edits and relationship between the code and the data behind the models and metrics.

Ultimately, this brings the best of both worlds in terms of team collaboration and flexibility to get the job done.

Fig 3: Overview of the Forloop.ai platform

We want to hear from you

We are currently working closely with a few companies, helping them to go from raw data to ML models in production. If your team is working through similar challenges and is interested in a hands-on partnership, we’d love to chat (sebastian@forloop.ai).

We are also hiring! If you love to move fast and enjoy the challenge of flipping the current 80/20 paradigm, please reach out or check our website.

References
