Yinghua Guan
Jan 10, 2020
The lifeblood of AI is data. Without data, there is nothing to train AI models with. Yet data is also one of the hardest parts of an AI project. According to a speech by IBM Corp. executive Arvind Krishna at The Wall Street Journal’s Future of Everything Festival, about 80% of the work in an AI project is collecting and preparing data. To prepare a dataset properly, the data stage is broken into three parts: data requirements, data collection, and exploratory data analysis (EDA).
Below are some good starting questions to ask when preparing data for a machine-learning project:
Data requirements are hard to define because they often change over time, especially as projects mature. Failures in data requirements are most common in projects that have multiple stages, such as a proof of concept followed by a production phase. Shortcuts and ‘hacks’ that make a proof of concept work often don’t scale in production.
Data collection often goes wrong inside the collection process itself. Because many collection processes are, or can be, automated, they may fail silently, and nobody notices until someone tries to use the data. For example, a piece of software may crash, a scheduler may fail to run, or the data may contain errors nobody anticipated. If the software that collects the data hasn’t been programmed to alert users when it runs into errors, those users may never know. Another common cause of failure is a change in data format that a third party makes without announcing or properly documenting it. The collection software must be updated to match these changes and, where possible, written to anticipate them. The best way to avoid data collection mistakes is regular maintenance and validation.
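As a rough illustration of the validation described above, here is a minimal Python sketch that checks one collected batch for the three failure modes mentioned: a job that didn't run (too few rows), a format change (missing, unexpected, or wrongly typed fields), and stale data. The field names, the 24-hour freshness window, and the `validate_batch` helper are all assumptions made for this example, not part of any particular tool.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("collection_checks")

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_FIELDS = {"sensor_id": str, "reading": float, "collected_at": datetime}


def validate_batch(records, now, max_age=timedelta(hours=24), min_rows=1):
    """Return a list of human-readable problems found in one collected batch.

    `now` and every `collected_at` value are assumed to be timezone-aware.
    """
    problems = []

    # An empty or undersized batch often means a job or scheduler did not run.
    if len(records) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(records)}")

    for i, rec in enumerate(records):
        # Missing or unexpected keys suggest the upstream format changed.
        missing = EXPECTED_FIELDS.keys() - rec.keys()
        extra = rec.keys() - EXPECTED_FIELDS.keys()
        if missing:
            problems.append(f"row {i}: missing fields {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected fields {sorted(extra)}")
        # A field that is present but has the wrong type is also a format change.
        for name, expected in EXPECTED_FIELDS.items():
            if name in rec and not isinstance(rec[name], expected):
                actual = type(rec[name]).__name__
                problems.append(f"row {i}: {name} is {actual}, expected {expected.__name__}")

    # If even the newest record is old, the collector may have stopped quietly.
    stamps = [r["collected_at"] for r in records
              if isinstance(r.get("collected_at"), datetime)]
    if stamps and now - max(stamps) > max_age:
        problems.append(f"stale data: newest record is from {max(stamps).isoformat()}")

    # Log every problem so that a failure is never silent.
    for problem in problems:
        logger.error(problem)
    return problems
```

A scheduled job could call `validate_batch(records, datetime.now(timezone.utc))` after each collection run and send an alert whenever the returned list is not empty. Logging is used here only as a stand-in for whatever alerting channel a team actually relies on.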
After data collection, the formal process of exploratory data analysis (EDA) begins.
What is EDA?
As the name implies, it’s all about exploring your data to learn its characteristics and anomalies. This step essentially validates data requirements:
Companies go wrong here by skipping one or more of these steps, assuming they adopt the process at all. This is especially common at companies that don’t have trained data scientists and data analysts on staff. When a machine learning project first kicks off, a thorough EDA is essential. Until an EDA is performed, the company doesn’t know what it doesn’t know, which creates an enormous risk of failure for the project.
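To make "learning the characteristics and anomalies" of a dataset a little more concrete, here is a minimal first-pass EDA sketch using only the Python standard library. It reports missing values per column, basic statistics for numeric columns, and the most frequent values for everything else. The `profile_column` and `profile_rows` helpers and the 1.5 × IQR outlier rule are illustrative choices for this example, not a prescribed EDA procedure. In practice, a team would likely reach for a dataframe library and plots as well.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarise one column: missing count, then numeric stats or top categories."""
    # Treat None and empty strings as missing values.
    present = [v for v in values if v is not None and v != ""]
    summary = {"count": len(values), "missing": len(values) - len(present)}

    # A column counts as numeric only if every present value is an int or float.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if present and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = statistics.mean(numeric)
        summary["median"] = statistics.median(numeric)
        if len(numeric) >= 4:
            # Flag values beyond 1.5 * IQR from the quartiles as possible anomalies.
            q1, _, q3 = statistics.quantiles(numeric, n=4)
            spread = 1.5 * (q3 - q1)
            summary["outliers"] = [v for v in numeric
                                   if v < q1 - spread or v > q3 + spread]
    else:
        # For non-numeric columns, the most common values are a useful first look.
        summary["top_values"] = Counter(present).most_common(3)
    return summary


def profile_rows(rows):
    """Profile every column found in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}
```

Running `profile_rows` on a handful of records is usually enough to surface the first surprises: a column that is mostly empty, a value such as an age of 400, or a category spelled several different ways.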
In Part 3 we shall go over the next step of the AI Development project lifecycle: Feature Definition. Stay tuned!
© 2025 Convergence Concepts Inc.