Modern industrial equipment is being outfitted with an ever larger number of sensors. Gathering performance measurements is therefore easier than ever before, and with data acquired at such high frequency, we are dealing with very large data sets. In theory, that should make the task of anomaly detection easier, but it doesn’t. Why might this be the case?

There are several reasons. First, and this is undoubtedly a good thing, most of the time equipment works. From a data science standpoint, this means the cases we are interested in predicting (those in which equipment fails) are rare. This creates an issue with casting the problem as prediction: the anomalous cases are severely underrepresented in the data. With only a single recorded instance of an incident, devising a sensible validation strategy for a model becomes extremely cumbersome. Second, mechanical equipment failures occur for a variety of reasons (machine characteristics, weather conditions, human error, just to name a few). This means we cannot treat all incidents as being similar, which further compounds the difficulty of applying a supervised learning apparatus.

In practical terms, there is also a third issue: labeled data is often difficult to obtain. The level of data maturity varies wildly among companies interested in predictive maintenance, and clean, labeled incident data (where for each measurement point we know whether it is normal or abnormal) is rare.

Ultimately, we want to be able to detect anomalies in the data without explicitly defining what an anomaly is. So how can we go about this?

A rather direct way is to reverse the problem: start simply by learning what normal data looks like. This allows us to relax the limitation of a typical prediction problem and use weakly labeled data (a semi-supervised approach). For this, we do not need to know a label for every single data point; we only need to identify a period when system behavior was deemed to be within acceptable bounds. This period is used to learn a normality criterion, which is then applied to the remaining part of the data, allowing us to discriminate the anomalous from the normal, generate warning signals by thresholding, and extract other types of useful information.
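
To make the idea concrete, here is a deliberately simplified, single-sensor sketch: the normality criterion is just the mean and standard deviation of the period deemed normal, and a warning is raised when a later reading lies too far from that mean. The function names, the readings, and the cut-off of 3 standard deviations are illustrative assumptions, not part of any particular system.

```python
from statistics import mean, stdev

def fit_normality(normal_period):
    """Learn a (very simple) normality criterion from the period deemed normal."""
    return mean(normal_period), stdev(normal_period)

def anomaly_score(x, mu, sigma):
    """Distance from the normal-period mean, in units of normal-period std."""
    return abs(x - mu) / sigma

# Hypothetical readings: only the *period* is labeled as normal,
# not every individual data point.
normal_period = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9]
later_readings = [10.0, 10.3, 12.5, 9.9]

mu, sigma = fit_normality(normal_period)

# Assumed cut-off; in practice it would be tuned to the tolerated alarm rate.
threshold = 3.0
warnings = [anomaly_score(x, mu, sigma) > threshold for x in later_readings]
```

Only the third later reading is flagged. The multivariate methods discussed below follow the same pattern: fit on the normal period, score everything else, threshold the score.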

How do we summarise the information about the normal behaviour of a multivariate time series which consists of measurements describing different aspects of the equipment of interest? From probability theory we know that if we can construct a joint cumulative distribution function of a multivariate series, we can extract all the necessary characteristics, particularly the probabilities of the different patterns observed in the data. Such a distribution can be used to detect rare events: test instances falling within a low-density region can be considered outliers. This does not automatically mean that they are indicative of failure (after all, even one-in-a-million events are supposed to happen every so often), but they certainly qualify for further inspection. In particular, the probabilities can serve as a warning signal: a shrinking probability of the observed patterns prior to a critical event means that less and less likely patterns are manifesting.

Historically, a typical approach to this problem was to fit a parametric multivariate distribution (usually Gaussian) and use it to calculate the pattern probabilities. There are, however, issues with applying this technique at scale:

- Empirical data is often asymmetric and possesses fat tails. Such characteristics cannot be captured by a Gaussian distribution. To a certain degree, this problem can be mitigated by using copulas (decoupling joint from marginal behaviour), but it still requires that parametric assumptions be made.
- By construction, fitting a multivariate Gaussian distribution requires estimating the covariance (or correlation) matrix that parametrises the distribution. In the case of “wide” data (a large number of columns, which is frequently the case in a multisensor environment), problems can arise with numerical stability because the estimated matrix is ill-conditioned or even singular.
- In addition, the multivariate Gaussian has the property of asymptotic independence in the tails. In plain English, this means that extreme realizations of different variables occur practically independently of one another. Extreme realizations happen together quite frequently in mechanical systems, but under the assumption of Gaussianity such joint events are treated as far less likely than they really are. This can lead to an excessive proportion of false positives in the signals generated by the system.
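
As a rough sketch of the parametric baseline and of the second problem in the list above, the snippet below fits a multivariate Gaussian to synthetic "normal" data, scores two new points by their log-density, and then shows that the sample covariance of a wide data set is rank-deficient. The data are randomly generated and the helper names `fit_gaussian` and `log_density` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix from rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_density(X, mu, cov):
    """Log-density of each row of X under N(mu, cov)."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row, without inverting cov explicitly
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Synthetic "normal period": 500 observations of 3 sensors
X_normal = rng.normal(size=(500, 3))
mu, cov = fit_gaussian(X_normal)

# A typical point and an extreme one; the latter falls in a low-density region
X_new = np.array([[0.0, 0.0, 0.0], [6.0, -6.0, 6.0]])
ld = log_density(X_new, mu, cov)

# "Wide" data: 20 observations of 50 sensors. The sample covariance is
# 50 x 50 but its rank cannot exceed 19, so it is singular.
X_wide = rng.normal(size=(20, 50))
cov_wide = np.cov(X_wide, rowvar=False)
rank_wide = np.linalg.matrix_rank(cov_wide)
```

With a singular covariance matrix the density above is not even defined, which is one practical motivation for reducing the dimension first.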

Fortunately, we can still apply a generative approach (in the sense of focusing on distributional properties of the time series of interest) if we combine it with dimensionality reduction techniques. If our features of interest are continuous (or we can reasonably approximate them as such), it is not too much of a stretch to assume that the joint distribution belongs to an elliptical class and therefore a decomposition based on principal components analysis can be applied in a meaningful way.

An elegant example of that approach is a PCA scorer proposed by Shyu et al. (2003). In order to detect anomalous observations, we begin by estimating the principal components on a period considered normal, project the original variables to the PC space and then reconstruct the original variables (perform an inverse transformation). If the first few principal components (the ones that explain most of the variance in the data) suffice to reconstruct normal data well, the associated reconstruction error will spike for anomalous examples. This gives us a directly usable anomaly score that can be assigned to new, unseen observations.
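
The steps described above (fit on the normal period, project, reconstruct, measure the error) can be sketched in a few lines of NumPy. This is a minimal illustration of reconstruction-error scoring under stated assumptions, not the exact procedure of the cited paper or of any production system: the class name `PCAScorer`, the synthetic five-sensor data, the choice of two components, and the 99% threshold are all made up for the example.

```python
import numpy as np

class PCAScorer:
    """Anomaly score = squared reconstruction error from the leading PCs."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Standardize using statistics of the normal period only
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0] = 1.0  # guard against constant columns
        Z = (X - self.mean_) / self.scale_
        # Rows of Vt are principal directions, ordered by explained variance
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        return self

    def score(self, X):
        Z = (X - self.mean_) / self.scale_
        projected = Z @ self.components_.T           # to PC space
        reconstructed = projected @ self.components_  # inverse transformation
        return np.sum((Z - reconstructed) ** 2, axis=1)

# Synthetic normal period: 5 sensors driven by 2 latent factors plus noise
rng = np.random.default_rng(42)
W = np.array([[1.0, 0.5, -1.0, 0.2, 0.8],
              [0.3, -1.0, 0.5, 1.0, -0.4]])
X_normal = rng.normal(size=(1000, 2)) @ W + 0.05 * rng.normal(size=(1000, 5))

scorer = PCAScorer(n_components=2).fit(X_normal)
normal_scores = scorer.score(X_normal)
threshold = np.quantile(normal_scores, 0.99)  # assumed alarm level

# Anomalies: one sensor drifts away from what the others imply
X_anomalous = X_normal[:5].copy()
X_anomalous[:, 0] += 5.0
anomalous_scores = scorer.score(X_anomalous)
alarms = anomalous_scores > threshold
```

Because the scorer keeps only the directions that dominate normal behaviour, a reading that breaks the usual relationship between sensors cannot be reconstructed well and its score rises sharply, even if each sensor value taken alone looks plausible.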

The graph below demonstrates the application of this method to a problem Semiotic Labs recently solved for a client: we were presented with data collected for two motors forming a single engine unit. The client knew that December 2015 was a period of normal operations in both motors and wanted our opinion on the motor performance from January 2016 onward. We trained a PCA scorer on the good period for motor 1 and evaluated the 2016 part of the data for both motors.

As the graph shows, the reconstruction error (our anomaly score) is consistently low for motor 2, but it spikes in the first week of February 2016 for motor 1. Further examination of the internal performance data (based on measurements performed periodically with more specialized equipment) on the client side confirmed our discovery: there were indeed mechanical issues with motor 1 that had not been fully acted upon.