I skipped the afternoon tutorial sessions to work on my presentation and go around town a bit but I did attend a very interesting morning tutorial on time series. The tutorial was on Time Series and was led by Eamonn Keogh.
He makes some bold claims but it was a very informative tutorial and he seems to have a lot of very reasonable and interesting things to say. His basic claim is that a lot of the time people try to fit complex, fully general models to classification and prediction problems when their data is actually linear and simpler methods would be much more effective. A time series is any data set which has data points occurring in a fixed linear ordering. In response to someone’s question he did make clear that you need to make some reasonable assumptions about how much variation there is over time. So data where events can occur in arbitrary order won’t work. Also, if it’s equally likely that some event may occur every 0.1 seconds or occur every 10 hours then it probably won’t work either. But a lot of real data sets actually vary within a small time range. What time series methods can handle, apparently, is almost arbitrary variation in the magnitude of the number, in anomalies in the middle of a set of data, in outliers that don’t fit the average cases and several other types of variation.
This makes time series very powerful for sensor data and medical data of well understood processes. Keogh’s focus is on symbolic analysis of data which has obvious applications for genetic datasets but is apparently also very effective for continuous, real valued data. The basic idea is to cluster the time series into levels and label these levels. Patterns of these labels can then be used to discover motifs in the data which often have an understandable semantic meaning.
There are several questions that arise with this approach, some of which he answered as well. Basically, he argues that most data can be analysed using simple Euclidean distance and linear transformations on the time series with anomaly detection. The worry with this is that you may just find a pattern when there isn’t really one present if you stretch and shift the data enough. It is important to treat a pattern detected in this way as merely a theory which is then verified by going back to the original data or to some independent data source. He confirmed that this is what they always do but that when such anomalies arise in real datasets they very often are meaningful or can guide further exploration.