Solving Statistics Problems: How to Work with Difficult Data

From histograms to scatterplot matrices, here is how you can extract value from historical statistics and other forms of difficult data.
BY:
Carol Parendo

Q: I have a difficult data set that I want to use to help solve a problem. I am finding it difficult to find something of potential value. Can you help?

A: Difficult data goes beyond what is called messy data, which typically entails inconsistent, incomplete, or improperly labeled data. A difficult data set is often not created with the problem in mind, yet it is readily available. It can come from large historical data sets or from smaller data sets generated by (undesigned) experimentation, and it may be volatile (highly variable or unstable). Because such data is created in an unplanned manner, typically involves multiple variables, and frequently requires a multi-faceted approach, this column presents a framework, along with techniques, for wrangling difficult data.

While the goal is to extract meaning, the first step is to identify the problem. Let’s take an example of low yield, where one particular defect is targeted for improvement without negatively affecting other aspects of the process. This is a common request when extracting meaning from historical data. Taking this problem one step further, you might ask:

  • Were the yields always low?
  • If not, is there an approximate time when the problem started?
  • Is it intermittent, coming and going over time?

Clarifying questions not only help further define the problem; they also ensure you look at an appropriate data set for the problem and identify the shortcomings of the available data.

Next, understanding and cleansing the data is key. This typically involves rectifying messy-data issues such as inconsistent, incomplete, or improperly labeled data. The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology can be referenced for further insight into these steps.¹
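As a small illustration of the cleansing step, here is a minimal Python sketch. The labels and the normalization rule are hypothetical; the point is simply that inconsistent labels for the same category should be reconciled before analysis:

```python
# Hypothetical cleansing step: the same supplier appears under several
# inconsistent labels, a classic "messy data" problem.
raw_labels = ["Supplier #1", "supplier 1", "SUPPLIER_1", "Supplier #2"]

def normalize(label: str) -> str:
    # Keep only the digits, then rebuild one canonical form.
    digits = "".join(ch for ch in label if ch.isdigit())
    return f"supplier #{digits}"

cleaned = [normalize(x) for x in raw_labels]
print(cleaned)  # ['supplier #1', 'supplier #1', 'supplier #1', 'supplier #2']
```

A real data set would need a more careful rule (and a review of the distinct raw values first), but even this small step prevents one supplier from being counted as three.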

Now that we understand the problem and have cleansed our dataset, a few simple techniques can be applied to gain insights that are hidden in the data.

1. View Distributions with Histograms

If the data is continuous, and using what you know about the subject, would you expect normally distributed data (Figure 1) or right-skewed data (Figure 2)? Or are you finding bimodal data – indicated by two distinct peaks – that you need to understand better? Viewing histograms may also help reveal bad data that was previously hidden.

Figure 1 A normally distributed histogram (E2586) shows the classic bell-shaped curve.

Figure 2 A right skewed histogram (E2586) is common for variables that are heavily weighted toward zero and have a lower bound of zero (cannot physically have a value lower than zero). Non-experts commonly mistake higher values from right skewed data as outliers and delete valuable data.
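For readers working in Python, here is a quick sketch of this check using simulated (hypothetical) data: bin the continuous values as a histogram would, and confirm that a right skew is plausible before treating high values as outliers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a measurement bounded below by zero, so a
# right-skewed (lognormal) shape is expected rather than a data problem.
values = rng.lognormal(mean=1.0, sigma=0.5, size=1000)

# Same bin counts a histogram plot would draw.
counts, edges = np.histogram(values, bins=20)
# To visualize: import matplotlib.pyplot as plt; plt.hist(values, bins=20)

# Quick skew check: a mean well above the median suggests a right tail,
# i.e., the high values belong to the distribution, not to "bad data."
print(values.mean() > np.median(values))
```

Only after confirming the expected shape (and investigating any surprise bimodality) should suspect points be flagged for removal.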


Other data may be discrete, which can take the following forms:

  • Nominal: without order, such as {male, female} or {supplier #1, supplier #2, supplier #3}
  • Ordinal: can be ordered, such as {high school, bachelor’s, master’s, doctoral degrees}

For each category of a discrete data feature, do all the levels have a reasonable amount of data? Or are certain levels missing?
The power of this seemingly simple technique is that it is easy to perform, and graphics provide a variety of insights.
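This level-by-level check is also easy to script. A minimal Python sketch (the counts and the 5% threshold are hypothetical) that flags sparsely populated levels of a discrete feature:

```python
from collections import Counter

# Hypothetical discrete (nominal) feature: which supplier produced each lot.
suppliers = ["supplier #1"] * 48 + ["supplier #2"] * 45 + ["supplier #3"] * 2

counts = Counter(suppliers)
total = sum(counts.values())
for level, n in counts.most_common():
    print(f"{level}: {n} ({n / total:.0%})")

# Levels with under 5% of the data may be too sparse to support conclusions.
sparse = [level for level, n in counts.items() if n / total < 0.05]
print(sparse)  # ['supplier #3']
```

A sparse level is not necessarily useless, but any conclusion drawn from it deserves extra scrutiny.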

2. Explore the influence of time

You may already be aware of the influence of time when you asked clarifying questions (e.g., When did this issue start?). Perhaps you discovered a bimodal histogram on a response variable that may be further explained by time. Perhaps you know of a special event that may influence the data, such as the COVID-19 pandemic. Creating a graph or control chart to plot the data in time order can be useful. To have this option available, dates and times must not be deleted in the cleansing step.
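One way to sketch this in Python, using simulated data with a hypothetical process shift at day 60, is to sort the records by the retained dates and apply simplified 3-sigma limits (a proper individuals control chart would base its limits on moving ranges):

```python
import random
from datetime import date, timedelta

random.seed(1)

# Hypothetical daily yield records; the dates survived the cleansing step.
start = date(2023, 1, 1)
records = [(start + timedelta(days=i),
            92 + random.gauss(0, 1.5) - (6 if i >= 60 else 0))
           for i in range(90)]

records.sort(key=lambda r: r[0])  # time order is essential for plotting
yields = [y for _, y in records]

# Rough control limits from the first 60 (presumed stable) points.
baseline = yields[:60]
mean = sum(baseline) / len(baseline)
sd = (sum((y - mean) ** 2 for y in baseline) / (len(baseline) - 1)) ** 0.5
lcl = mean - 3 * sd

# Points after the simulated shift fall below the lower control limit,
# flagging an approximate start date for the problem.
out_of_control = [d for d, y in records if y < lcl]
print(out_of_control[0] if out_of_control else "no signal")
```

The first flagged date answers the clarifying question, "When did this issue start?" and suggests where to split the data for further analysis.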

This technique is powerful because understanding known or potentially undiscovered changes over time helps to unpack the data properly. Sometimes treating all available data in the set as a homogeneous group may not be beneficial. Older data or a certain subset may not be relevant to the current issue.

3. Unpack relationships between variables

This may mean examining how predictor variables (x’s) relate to response variables (y’s), or how the response variables relate to each other. This can be performed by plotting the data. If possible, use scatterplot matrices as an efficient way to display several plots at once.

Let’s illustrate this with a manufacturing data example where one of the predictor variables is supplier and the responses are two different types of mechanical tests. The concept is taken one step further by color-coding best of best (BOB) parts and worst of worst (WOW) parts when reviewing response plots (Figure 3). Those BOBs and WOWs are then carried through to the remaining plots, which also contain predictor variables. In this example, the plot with supplier (Figure 4) indicates that material from Supplier #3 is associated with lower mechanical test results.
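The BOB/WOW tagging itself is straightforward to script. Here is a minimal Python sketch; all of the data is simulated and hypothetical, with supplier #3 deliberately shifted lower to mimic the example above:

```python
import random

random.seed(2)

# Hypothetical parts: a supplier predictor plus two correlated mechanical tests.
parts = []
for i in range(60):
    supplier = f"supplier #{i % 3 + 1}"
    base = random.gauss(100, 5) - (12 if supplier == "supplier #3" else 0)
    parts.append({"supplier": supplier,
                  "tensile": base + random.gauss(0, 2),
                  "hardness": base + random.gauss(0, 2)})

# Rank parts by combined response; tag the bottom/top 10% as WOWs/BOBs.
ranked = sorted(parts, key=lambda p: p["tensile"] + p["hardness"])
n = len(ranked) // 10
wows, bobs = ranked[:n], ranked[-n:]

# Project the tagged parts onto the predictor:
# which supplier dominates the worst parts?
print([p["supplier"] for p in wows])
```

In practice, you would color-code these tagged parts across a scatterplot matrix (for example, with seaborn’s pairplot and a hue column) rather than print them, so the same BOBs and WOWs are visible in every panel.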


Figure 3 The purpose of this example plot is to first understand the correlation of responses. Additionally, the best of best (BOBs) and worst of worst (WOWs) parts are color-coded. 


Figure 4 The best of best (BOBs) and worst of worst (WOWs) parts are carried through to plots that contain predictor variables.


What is next after you apply these techniques? Two common options for building on insights are:

  • Perform further analysis, which may point you toward narrowing the timeframe or reducing the number of variables
  • Conduct further experimentation or validation in the region (design space) of interest

Difficult data may present a variety of challenges to meaningful analysis. Before diving straight into analysis, you must first gain insights. This is key to determining the best path forward. Now go play with your data and see what it may be telling you.

References

¹  Hotz, N. “What is CRISP DM?” Data Science Process Alliance. March 26, 2024. https://www.datascience-pm.com/crisp-dm-2.


Carol Parendo is senior technical fellow for Enterprise Quality at Collins Aerospace. She has over 30 years of experience as a mechanical engineer and statistician in both the aerospace and medical device fields. Carol is a member-at-large for the committee on quality and statistics (E11).

John Carson, Ph.D., is senior statistician for Neptune and Co. and coordinator of Data Points. He is a member of the committees on quality and statistics (E11), petroleum products, liquid fuels, and lubricants (D02), air quality (D22), and more.

Published in the May/June 2024 issue.