Machine learning algorithms require training data. Changing the algorithm can affect accuracy, but the quality of the training data has a more profound effect on the accuracy of the model.
However, getting access to quality data is still hard, despite the abundance of data. In this blog post, let’s uncover proven methods to assess the quality of your dataset.
Significance of Data Quality
The purpose of data is to extract meaningful information from it; e.g., to train machine learning models. High-quality data leads to high-quality information. Conversely, bad data leads to a significant deviation from the intended result.
The problem with low-quality data is that it still produces results. These results, though wrong, merely appear “different” from those obtained from high-quality data. This leads to poor decision making and sometimes even economic losses for the business.
Therefore, vetting data before using it to extract information is of prime importance.
What Does High Quality Data Look Like?
Data used to train machine learning algorithms comes as labeled and unlabeled image, video, audio or CSV files. It’s so varied that it often seems there can’t be a set definition of what high-quality data should look like. Fortunately, there are clear attributes that all high-quality data shares – and it’s not “accuracy”.
“High-quality training data is data that is secure, ethically sourced and free of errors that might compromise the intelligence of your algorithm.” – Via Samasource
Let’s dive deep into what the factors of quality training data are.
Factors of Quality Training Data
The five characteristics below are what set high-quality training data apart.
1. Completeness
Machine learning algorithms can work on partially incomplete data by filling gaps with the column average, by inferring the missing value from a pattern in other complete columns, or by skipping the row altogether. However, you should try to procure clean data that is largely complete. Incomplete data will lead to skewed results and reduce the quality of your training data.
For example, the pandas library in Python (a popular data analytics tool, used here alongside NumPy) offers several ways to fill in null values.
import numpy as np
import pandas as pd
# Example of a Series with null values (None is also treated as NaN)
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
# Fills all null values with 0
data.fillna(0)
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
# Forward fills all null values (fillna(method='ffill') is deprecated in recent pandas)
data.ffill()
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
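The other two strategies mentioned above, filling with the column average and skipping incomplete rows, can be sketched in the same way. This is a minimal illustration on a made-up table; the column names and values are assumptions, not data from a real project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only.
df = pd.DataFrame({
    "height_cm": [170.0, np.nan, 180.0, 160.0],
    "weight_kg": [65.0, 72.0, np.nan, 55.0],
})

# Share of missing values per column, a quick completeness check.
missing_share = df.isna().mean()

# Option 1: replace each missing value with the mean of its column.
mean_filled = df.fillna(df.mean(numeric_only=True))

# Option 2: skip every row that has at least one missing value.
complete_rows = df.dropna()

print(missing_share)
print(mean_filled)
print(complete_rows)
```

Mean filling keeps every row but pulls values toward the average, while dropping rows keeps only observed values at the cost of a smaller dataset. Which trade-off is acceptable depends on how much data is missing.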
2. Validity
Gather data from valid sources. For example, to create a summary that assesses customer experience, you must use consistent and comparable scales. If one question asks users to answer on a 5-point scale and another asks them to answer on a 3-point scale, the raw values cannot be compared directly and the normalized values will vary significantly.
Each record should also represent a unique observation. This means no duplicates, because duplicates cause bias!
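Both checks can be sketched with pandas. The survey table below is hypothetical, and the `rescale` helper is an illustrative name rather than a library function; it simply maps each scale onto the range 0 to 1 so the two questions become comparable.

```python
import pandas as pd

# Hypothetical survey responses; the last row is an exact duplicate of the first.
responses = pd.DataFrame({
    "customer_id": [1, 2, 3, 1],
    "q1_5pt": [5, 3, 1, 5],  # answered on a 1-5 scale
    "q2_3pt": [3, 2, 1, 3],  # answered on a 1-3 scale
})

# Drop exact duplicate rows so that no response is counted twice.
unique = responses.drop_duplicates().reset_index(drop=True)


def rescale(series: pd.Series, low: int, high: int) -> pd.Series:
    """Map answers from the range [low, high] onto [0, 1]."""
    return (series - low) / (high - low)


# Put both questions on the same 0-1 scale before comparing them.
unique["q1_norm"] = rescale(unique["q1_5pt"], 1, 5)
unique["q2_norm"] = rescale(unique["q2_3pt"], 1, 3)

print(unique)
```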
3. Timeliness
It is important to have data that is both accurate and current. Data should be available as close as possible to the time it is needed. Fresher data will be more representative of the problem you’re trying to solve and will uncover more actionable insights. This is an important characteristic of quality training data.
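One simple way to check freshness is to measure the age of each record against a cutoff. The sketch below assumes a `collected_at` timestamp column, a fixed reference date and a 90-day window; all three are illustrative choices, not recommendations.

```python
import pandas as pd

# Hypothetical records with a collection timestamp.
records = pd.DataFrame({
    "reading": [10, 12, 11, 13],
    "collected_at": pd.to_datetime(
        ["2023-01-05", "2023-06-20", "2023-11-02", "2023-12-15"]
    ),
})

# Assumed reference date and freshness window.
as_of = pd.Timestamp("2023-12-31")
max_age = pd.Timedelta(days=90)

# Age of each record relative to the reference date.
age = as_of - records["collected_at"]

# Keep only records inside the window and report how much data is stale.
fresh = records[age <= max_age]
stale_share = (age > max_age).mean()

print(fresh)
print(f"stale share: {stale_share:.0%}")
```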
4. Eventless
When harvesting data from IoT devices or system logs, it is important to look for data that represents the average system state. There could be instances where the device faces extreme usage or an attack. Gleaning information at this point would result in skewed data.
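A common way to separate the average state from such events is an outlier rule. The sketch below uses the interquartile range (IQR) rule on made-up CPU readings; the IQR rule and the 1.5 multiplier are assumptions chosen for illustration, and other anomaly detection techniques may suit your logs better.

```python
import pandas as pd

# Hypothetical CPU load readings; the 95 stands in for a usage spike or attack.
cpu_load = pd.Series([20, 22, 21, 23, 19, 95, 22, 20], name="cpu_percent")

# Interquartile range of the readings.
q1, q3 = cpu_load.quantile([0.25, 0.75])
iqr = q3 - q1

# Readings outside 1.5 * IQR of the quartiles are treated as events.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_typical = cpu_load.between(lower, upper)

typical = cpu_load[is_typical]
events = cpu_load[~is_typical]

print(typical.tolist())
print(events.tolist())
```

Flagged readings are worth reviewing before they are discarded, since an event can also be exactly the signal you want to study.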
5. Consistency
Structured data is made up of rows and columns. Oftentimes, each column has a data type associated with it. For example, a column that measures weight is numerical, relative to a specific unit. All rows for the data must use a consistent unit of weight measurement. For example, in cases where the column has weight represented in both kilograms (kg) and pounds (lb), results derived from it will be all over the place!
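If the unit of each row is recorded, mixed units can be normalized before training. The sketch below assumes a hypothetical `unit` column next to the weight and converts everything to kilograms.

```python
import pandas as pd

LB_TO_KG = 0.45359237  # one international pound in kilograms

# Hypothetical column that mixes kilograms and pounds.
weights = pd.DataFrame({
    "weight": [70.0, 154.0, 82.5, 200.0],
    "unit": ["kg", "lb", "kg", "lb"],
})

# Convert only the rows recorded in pounds; keep kilogram rows as they are.
is_lb = weights["unit"] == "lb"
weights["weight_kg"] = weights["weight"].where(~is_lb, weights["weight"] * LB_TO_KG)

print(weights)
```

When the unit is not recorded at all, there is no reliable way to recover it afterwards, which is why consistency is best enforced at collection time.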
Data Quality Assurance Methods
Data used for training machine learning models needs to be both accurately labeled and validated before it is used for learning. The assurance process involves both manual validation by human operators and some automated procedures.
- Overlap or Consensus Method
This is a popular method for labeling training data accurately. The overlap, or consensus, method measures the consistency and agreement within a group of annotators. It does so by dividing the number of agreeing annotations by the total number of annotations.
It is the most commonly used method for assuring data quality with quick turnaround times. It can be performed by distributing the same data to several human annotators and tallying the number of agreements. The result can then be used to increase the accuracy of the labels.
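The calculation can be sketched in a few lines. The example below treats the majority label of each item as the consensus and counts an annotation as agreeing when it matches that label; this interpretation, the item names and the labels are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labels from three annotators for each item.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["dog", "cat", "dog"],
    "img_003": ["dog", "dog", "dog"],
}


def consensus(labels):
    """Return the majority label and the share of annotators who chose it."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


# Consensus label and agreement score for each item.
per_item = {item: consensus(labels) for item, labels in annotations.items()}

# Overall score: agreeing annotations divided by all annotations.
agreeing = sum(Counter(labels).most_common(1)[0][1] for labels in annotations.values())
total = sum(len(labels) for labels in annotations.values())
overall_agreement = agreeing / total

print(per_item)
print(f"overall agreement: {overall_agreement:.2f}")
```

Items with a low agreement score, such as `img_002` here, are natural candidates for a second look.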
- Auditing Method
Use this method to measure the accuracy of training data. It does so by having an expert review either a cross-section or all of the labels. This is useful for data on which annotators can reach a consensus, but where an objective check is still needed to determine whether that consensus is right or wrong.
This method achieves high levels of accuracy, but takes a lot of time.
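A spot-check audit can be sketched as sampling a cross-section of labels and comparing them with the expert's verdicts. Everything below is hypothetical: the labels are generated, the expert is simulated with a dictionary, and `audit` is an illustrative helper name.

```python
import random

# Hypothetical annotator labels for 100 items.
annotator_labels = {f"item_{i}": ("spam" if i % 2 else "ham") for i in range(100)}

# Simulated expert verdicts: the expert disagrees on every tenth item.
expert_labels = {
    item: ("ham" if label == "spam" else "spam") if i % 10 == 0 else label
    for i, (item, label) in enumerate(annotator_labels.items())
}


def audit(labels, expert, sample_size=None, seed=0):
    """Return the share of audited labels that the expert confirms."""
    items = sorted(labels)
    if sample_size is not None:
        # Review only a random cross-section instead of every label.
        items = random.Random(seed).sample(items, sample_size)
    confirmed = sum(labels[item] == expert[item] for item in items)
    return confirmed / len(items)


spot_check = audit(annotator_labels, expert_labels, sample_size=20)
full_audit = audit(annotator_labels, expert_labels)

print(f"spot check accuracy: {spot_check:.2f}")
print(f"full audit accuracy: {full_audit:.2f}")
```

The sample size is the lever here: a larger cross-section gives a more reliable estimate but costs more expert time.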
- Multi-Layered Quality Evaluation Metrics
By routinely evaluating several health metrics of your training data, you can keep track of its accuracy. You can completely automate this process, which reduces the overall time it takes.
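An automated check of this kind might combine a handful of metrics into one report. The sketch below is only one possible selection (completeness, duplicate rate and range validity), and `quality_report`, the column names and the valid ranges are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def quality_report(df, valid_ranges):
    """Compute a few simple health metrics for a table."""
    report = {
        # Average share of non-null cells across columns.
        "completeness": float(df.notna().mean().mean()),
        # Share of rows that repeat an earlier row exactly.
        "duplicate_rate": float(df.duplicated().mean()),
    }
    # Share of non-null values that fall inside the allowed range.
    for column, (low, high) in valid_ranges.items():
        values = df[column].dropna()
        report[f"{column}_in_range"] = float(values.between(low, high).mean())
    return report


# Hypothetical table with one missing age, one duplicate row and one impossible age.
df = pd.DataFrame({
    "age": [34, 29, np.nan, 29, 150],
    "score": [0.8, 0.6, 0.9, 0.6, 0.7],
})

report = quality_report(df, {"age": (0, 120), "score": (0.0, 1.0)})
print(report)
```

Running such a report on a schedule and comparing it with earlier runs makes a drop in quality visible soon after it happens.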
- Weekly Data Deep Monitoring Process
Data quality doesn’t always mean accuracy and consistency. Sometimes you need to take a step back and see if the data is good enough to deliver value. Doing so sets a powerful precedent for all future applications that use the same training data.
You can set up a project management team to investigate data weekly to set goals and oversee results.
- Sourcing Data
Ensure that you source training data from qualified, reliable and vetted organisations. On the spectrum of data analytics, data warehousing and labelling are often seen as the bottom of the pyramid and are given the least importance. However, the quality of data has a direct impact on the quality of the machine learning model.
Oftentimes, data specific to your requirements might not be available off-the-shelf, and you might have to do the grunt work of collecting and labelling the data yourself.
Best Practices for Data Quality Management
Now that you know what high-quality data looks like, it is important to set up a process that ensures you use only high-quality data for analysis.
1. Making data quality a priority
This might seem obvious, as data quality is the ultimate goal we’re working toward. However, a mindset change may be in order. As a first step, you need to communicate to your team that there are high standards set in place for data.
Actionable steps toward this would look like:
- Designing an enterprise-wide data strategy
- Creating clear user roles with rights and accountability
- Having a dashboard to monitor data quality metrics in real time
2. Automating data entry
Data entry, like every other manual process, is prone to human errors and biases. Datasets that produce high-quality results are large, upwards of tens of thousands of rows. It is unrealistic to expect such large quantities of data to stay consistent when entered manually. Set up processes that allow systems to speak to each other so that data gathering becomes easier to automate.
3. Defining data quality thresholds and rules
It is better to prevent data errors and redundancies than to set up a process that sifts through the data to weed out invalid entries. When implementing automated data gathering, be sure to set rules for each data point, such as: customer must have purchased more than X amount of goods, product must be live for X number of days, etc.
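Rules like these can be checked at the moment a record enters the pipeline. The sketch below is one possible shape for this; the `Rule` class, the field names and the threshold values standing in for X are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical thresholds standing in for the X values above.
MIN_PURCHASES = 3
MIN_DAYS_LIVE = 30


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]


RULES = [
    Rule("min_purchases", lambda record: record["purchases"] > MIN_PURCHASES),
    Rule("min_days_live", lambda record: record["days_live"] >= MIN_DAYS_LIVE),
]


def violations(record):
    """Return the names of all rules that the record breaks."""
    return [rule.name for rule in RULES if not rule.check(record)]


# Hypothetical incoming records.
incoming = [
    {"customer": "a", "purchases": 5, "days_live": 45},
    {"customer": "b", "purchases": 1, "days_live": 60},
    {"customer": "c", "purchases": 8, "days_live": 10},
]

# Accept clean records and keep the reasons for every rejection.
accepted = [record for record in incoming if not violations(record)]
rejected = {record["customer"]: violations(record) for record in incoming if violations(record)}

print(accepted)
print(rejected)
```

Keeping the reason for each rejection makes it easier to spot a rule that is too strict or a source that keeps sending bad records.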
4. Data storage guidelines
Depending on the nature of data, you might require different data storage options. You can use a high-performance, multiple-write data storage for data you need to continually update. In some cases, you will need to keep track of different versions of data that only differ by a delta.
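To illustrate the idea of versions that differ only by a delta, the sketch below stores the difference between two snapshots and rebuilds the newer one from it. The record layout and the `compute_delta` and `apply_delta` helpers are hypothetical, and dedicated data versioning tools handle this far more thoroughly.

```python
def compute_delta(old, new):
    """Describe how `new` differs from `old`, keyed by record id."""
    return {
        "added": {key: new[key] for key in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {
            key: new[key]
            for key in old.keys() & new.keys()
            if old[key] != new[key]
        },
    }


def apply_delta(old, delta):
    """Rebuild the newer version from the older one plus the delta."""
    result = {key: value for key, value in old.items() if key not in delta["removed"]}
    result.update(delta["added"])
    result.update(delta["changed"])
    return result


# Hypothetical labeled records in two versions of a dataset.
v1 = {1: {"label": "cat"}, 2: {"label": "dog"}, 3: {"label": "bird"}}
v2 = {1: {"label": "cat"}, 2: {"label": "wolf"}, 4: {"label": "fish"}}

delta = compute_delta(v1, v2)
restored = apply_delta(v1, delta)

print(delta)
print(restored == v2)
```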
Conclusion
Data quality speaks volumes about how accurate your analysis is and how effectively you can train your model. Upholding quality standards requires a large amount of manual work, so it is often overlooked. However, as seen in this article, data quality is directly reflected in business output. It’s never too late to start, and never too hard!