[Tech Dive] A Practical Guide to AI Project Development, Part 2: Data

Yinghua Guan

Jan 10, 2020

The lifeblood of AI is data. Without data, there is nothing to train AI models on. Yet data is also one of the biggest challenges in AI projects: according to IBM Corp. executive Arvind Krishna’s remarks at The Wall Street Journal’s Future of Everything Festival, about 80% of the work in an AI project is collecting and preparing data. To properly prepare a dataset for an AI project, the data stage is broken into three parts: data requirements, data collection, and exploratory data analysis (EDA).

 

Data Requirements

 

Below are some good starting questions to ask when preparing data for a machine-learning project:

 

  • What data will the project need?
  • Where will the data come from?
  • What format will the data be in? Structured (rows and columns) or unstructured (e.g. images and text)?
  • How often will we need the data? (Real-time, batched, manually, etc.)
  • How will the data be used?

 

Data requirements are hard to define because they often change over time, especially as a project matures. Requirements failures are most common in projects with multiple stages, such as a proof of concept followed by a production phase: shortcuts and ‘hacks’ that make a proof of concept work often don’t scale in production.
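One lightweight way to keep requirements from drifting silently between a proof of concept and production is to write them down as a small, machine-readable spec that is reviewed and versioned alongside the code. Below is a minimal sketch in Python; the fields simply mirror the questions above, and all names and values are hypothetical rather than part of any particular project:

from dataclasses import dataclass

@dataclass
class DataRequirement:
    """One dataset the project depends on, answering the questions above."""
    name: str         # what data the project needs
    source: str       # where the data will come from
    data_format: str  # structured (rows and columns) or unstructured (images, text)
    refresh: str      # how often it is needed: real-time, batched, manual, ...
    usage: str        # how the data will be used

# Hypothetical example entry; entries like this would be revisited whenever
# the project moves from proof of concept to production.
requirements = [
    DataRequirement(
        name="customer_transactions",
        source="internal data warehouse",
        data_format="structured (CSV export, rows and columns)",
        refresh="daily batch",
        usage="training labels for the churn model",
    ),
]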

 

Data Collection

 

Data collection often goes wrong within the process itself. Because many collection processes are – or can be – automated, they may fail silently, with nobody noticing until someone tries to use the data: a piece of software may crash, a scheduler may fail to run, or the data may arrive with unexpected errors. If the software that’s collecting data hasn’t been programmed to alert users when it runs into problems, its users may never know. Another common cause of failure is a change in the data format that’s unannounced or poorly documented by a third party; the collection software must be adapted to match, and ideally anticipate, such changes. The best way to avoid data collection mistakes is regular maintenance and validation.
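To make “maintenance and validation” concrete, a collection job can check the data it just fetched against simple expectations and alert someone instead of failing silently. Here is a minimal sketch in Python; the URL, expected columns, and the send_alert notification hook are hypothetical placeholders to be replaced with whatever the project actually uses:

import csv
import io
import logging
import urllib.request

logging.basicConfig(level=logging.INFO)

EXPECTED_COLUMNS = {"user_id", "timestamp", "amount"}  # hypothetical schema

def send_alert(message: str) -> None:
    """Placeholder: wire this up to email, Slack, a pager, etc."""
    logging.error("ALERT: %s", message)

def collect(url: str) -> list[dict]:
    """Download a CSV feed and validate it before anyone depends on it."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            text = resp.read().decode("utf-8")
    except Exception as exc:
        send_alert(f"Data collection failed for {url}: {exc}")
        raise

    rows = list(csv.DictReader(io.StringIO(text)))

    # Empty feeds and silent schema changes are the most common failures.
    if not rows:
        send_alert(f"No rows returned from {url}")
        return rows
    missing = EXPECTED_COLUMNS - set(rows[0].keys())
    if missing:
        send_alert(f"Schema change detected at {url}: missing columns {missing}")
    return rows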

 

Exploratory Data Analysis

 

After data collection, the formal process of exploratory data analysis (EDA) begins.

 


What is EDA?

 

As the name implies, EDA is all about exploring your data to learn its characteristics and anomalies. This step essentially validates the data requirements defined earlier:

 

  • Was the proper dataset collected according to our data requirements?
  • Is the data “good” – i.e. statistically valid and free of errors or missing data?
  • Does the data fit defined needs?
  • What are the main characteristics and attributes of the data?
  • What else came along with the data that could be useful in our approach?
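In practice, even a few lines of pandas can start answering these questions: the shape, column names, and types show whether the right dataset was collected; missing-value and duplicate counts hint at whether the data is “good”; and the summary statistics and raw rows reveal its main characteristics and anything extra that came along with it. A minimal sketch, assuming the collected data sits in a hypothetical collected_data.csv:

import pandas as pd

# Load the collected dataset (hypothetical file name).
df = pd.read_csv("collected_data.csv")

# Was the proper dataset collected? Check shape, columns, and dtypes.
print(df.shape)
print(df.dtypes)

# Is the data "good"? Look for missing values and duplicate rows.
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Main characteristics and attributes: summary statistics per column.
print(df.describe(include="all"))

# What else came along with the data? Inspect a few raw rows.
print(df.head())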

 

Where companies go wrong here is by skipping one or more of these steps – assuming this process is adopted at all. This is especially common for companies that don’t have trained data scientists and data analysts on staff. When a machine learning project first kicks off, a thorough EDA is essential. Until an EDA is performed, the company doesn’t know what it doesn’t know, which creates an enormous risk of failure for the project.

 

In Part 3, we will go over the next step of the AI project development lifecycle: Feature Definition. Stay tuned!