Modern industrial equipment is being outfitted with an ever larger number of sensors. This means that gathering performance measurements is easier than ever before. This high frequency of data acquisition gives rise to “big data”—absolutely vast amounts of information. In theory, so much data should make the task of anomaly detection easier, but it doesn’t. Why not?
There are actually several reasons. First (and this is undoubtedly a good thing), most of the time equipment works. From a data science standpoint, this means the cases we’re interested in predicting (those in which equipment fails) are rare. This creates an issue when casting the problem as one of prediction: the anomalous cases are severely underrepresented in the data. With perhaps a single recorded instance of an incident, devising a sensible validation strategy for a model becomes extremely cumbersome. Second, mechanical equipment failures occur for a variety of reasons (such as machine characteristics, weather conditions, and human error, to name just a few). This means we can’t treat all incidents as instances of the same phenomenon, which further compounds the difficulty of applying a supervised learning apparatus.
In practical terms, there’s also a third issue: labeled data is often difficult to obtain. The level of data maturity varies wildly among companies interested in predictive maintenance, and clean, labeled incident data (where we know whether each measurement represents a normal or abnormal reading) is difficult to come by.
Ultimately, we want to be able to detect anomalies in the data without explicitly defining what an anomaly is. So how can we do that?
A rather direct way is to reverse the problem. Start simply by learning what normal data looks like. This allows us to relax the limitation of a typical prediction problem and use weakly labeled data (a semisupervised approach). For this, we don’t need to have a label for every single datapoint; we only need to identify the time period when system behavior was deemed to be within acceptable bounds. The system uses this window to learn a normality criterion that is then applied to the remaining data, enabling us to separate the anomalous from the normal, generate warning signals using thresholding, and extract other types of useful information.
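As a minimal sketch of this weakly supervised workflow (the function names are illustrative, the data is synthetic, and a simple per-feature z-score stands in for the normality criterion), one might fit statistics on the window deemed normal and threshold the deviation of later readings:

```python
import numpy as np

def fit_normal_window(X_normal):
    # Learn per-feature location and scale from the window deemed normal.
    return X_normal.mean(axis=0), X_normal.std(axis=0)

def anomaly_flags(X, mu, sigma, threshold=4.0):
    # Flag readings whose worst per-feature z-score exceeds the threshold.
    z = np.abs((X - mu) / sigma)
    return z.max(axis=1) > threshold

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 3))            # known-good window
X_later = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
                     rng.normal(7.0, 1.0, size=(5, 3))])  # ends with a fault

mu, sigma = fit_normal_window(X_normal)
flags = anomaly_flags(X_later, mu, sigma)
```

A per-feature z-score ignores the correlations between sensors; the multivariate criteria discussed below are what make the approach work on real equipment data.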
How do we summarize the information on the normal behavior of a multivariate time series that consists of measurements describing different aspects of the equipment of interest? From probability theory we know that if we can construct the joint cumulative distribution function of a multivariate series, we can extract all the necessary characteristics, in particular the probabilities of the different patterns observed in the data. Such a distribution can be used to detect rare events: test instances falling within a low-density region can be considered outliers. This doesn’t automatically mean they’re indicative of failure—after all, even one-in-a-million events will happen every so often—but they certainly qualify for further inspection. In particular, the probabilities can act as a warning signal: if the observed probability is shrinking prior to a critical event (less and less likely patterns are manifesting), this can serve as a useful sign that something’s going wrong.
Historically, a typical approach to this problem was to fit a parametric multivariate distribution (usually Gaussian) and use it to calculate the pattern probabilities. There are, however, issues with applying this technique at scale:
- Empirical data is often asymmetric and possesses fat tails. These characteristics can’t be captured by a Gaussian distribution. To a certain degree, this problem can be mitigated by using copulas (decoupling joint from marginal behavior), but it still requires you to make parametric assumptions.
- By construction, fitting a multivariate Gaussian distribution requires estimating the correlation matrix that parametrizes it. In the case of “wide” data (a large number of columns, which is frequently the case in a multisensor environment), numerical stability problems can arise because the estimated correlation matrix is ill-conditioned or even singular.
- In addition, the multivariate Gaussian has the property of asymptotic independence in the tails—in plain English, this means the model treats extreme responses in different inputs as occurring independently. In mechanical systems such extremes frequently happen together, but under the assumption of Gaussianity these joint occurrences are assigned vanishingly small probabilities. This can lead to an excessive proportion of false positives in the signals generated by the system.
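To make the baseline concrete, here is a small sketch of the parametric approach on synthetic data (plain NumPy, no particular library assumed): fit a mean and covariance, then use the Gaussian log-density as a rarity score. The “anti-correlated” test point illustrates the tail problem: each coordinate alone is plausible, but the fitted model treats the pair as next to impossible.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two strongly correlated "sensor" channels observed during normal operation.
X_train = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)

# Fitting the parametric model: the sample mean and covariance
# fully parametrize the multivariate Gaussian.
mu = X_train.mean(axis=0)
cov = np.cov(X_train.T)
cov_inv = np.linalg.inv(cov)

def log_density(x):
    # Gaussian log-pdf: -0.5 * (Mahalanobis distance + log det cov + d*log(2*pi)).
    d = x - mu
    maha = d @ cov_inv @ d
    return -0.5 * (maha + np.log(np.linalg.det(cov)) + len(mu) * np.log(2 * np.pi))

typical = np.array([0.1, 0.2])
# Each coordinate alone is unremarkable, but the pair runs against the
# fitted correlation, so its log-density is far lower than the typical point's.
joint_extreme = np.array([2.5, -2.5])
```

If such jointly extreme readings do occur together during normal operation, the model will keep flagging them—exactly the false-positive behavior described above.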
Fortunately, we can still apply a generative approach (in the sense of focusing on distributional properties of the time series of interest) if we combine it with dimensionality reduction techniques. If our features of interest are continuous (or we can reasonably approximate them as such), it’s not too much of a stretch to assume that the joint distribution belongs to an elliptical class and therefore a decomposition based on principal components analysis can be applied in a meaningful way.
An elegant example of that approach is the PCA scorer proposed by Shyu et al. (2003). To detect anomalous observations, we begin by estimating the principal components in a time period considered normal, project the original variables onto the PC space, and then reconstruct the original variables (perform an inverse transformation). Because the first few principal components (the ones that explain most of the variance in the data) are sufficient to reconstruct normal observations properly, the associated reconstruction error will spike for anomalous examples—yielding a directly usable anomaly score that can be assigned to new, unseen observations.
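The reconstruction-error idea can be sketched in a few lines of NumPy (synthetic data; an illustration of the general technique rather than Shyu et al.’s exact procedure):

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated "sensor" channels from a normal period: three readings
# driven by a single latent factor plus measurement noise.
factor = rng.normal(size=(1000, 1))
X_normal = factor @ np.array([[1.0, 0.8, -0.5]]) + 0.1 * rng.normal(size=(1000, 3))

def fit_pca(X, n_components):
    # Principal directions via SVD of the centered training window.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components]

def anomaly_score(X, mu, components):
    # Project onto the retained components, reconstruct, and score each
    # row by the norm of what the components failed to explain.
    Xc = np.atleast_2d(X) - mu
    X_hat = Xc @ components.T @ components
    return np.linalg.norm(Xc - X_hat, axis=1)

mu, components = fit_pca(X_normal, n_components=1)

X_new = np.array([[1.0, 0.8, -0.5],    # respects the learned correlations
                  [1.0, -0.8, 0.5]])   # same magnitudes, relations broken
scores = anomaly_score(X_new, mu, components)
```

The same steps are available off the shelf via `transform`/`inverse_transform` on `sklearn.decomposition.PCA`; the point of the scorer is that a reading can be anomalous not because any single sensor is extreme, but because the relations between sensors no longer match the normal period.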
We applied this method to a problem Semiotic Labs recently solved for a client: we were presented with data collected for two motors forming a single engine unit. The client knew that December 2015 was a period of normal operations in both motors and wanted our opinion on the motor performance from January 2016 onward. We trained a PCA scorer on the good period for motor 1 and evaluated the 2016 part of the data for both motors.
The reconstruction error (our anomaly score) was consistently low for motor 2, but it spiked in the first week of February 2016 for motor 1. Further examination of the internal performance data on the client side (based on measurements performed periodically with more specialized equipment) confirmed our discovery: there were indeed mechanical issues with motor 1 that had not yet been fully addressed.