Recent advances in computing technology allow for collecting vast amount of data that arrive continuously in the form of streams. Mining data streams is challenged by the speed and volume of the arriving data. Furthermore, the underlying distribution of the data changes over the time in unpredicted scenarios. To reduce the computational cost, data streams are often studied in forms of condensed representation, e.g., Probability Density Function (PDF). This thesis aims at developing an online density estimator that builds a model called KDE-Track for characterizing the dynamic density of the data streams. KDE-Track estimates the PDF of the stream at a set of resampling points and uses interpolation to estimate the density at any given point. To reduce the interpolation error and computational complexity, we introduce adaptive resampling where more/less resampling points are used in high/low curved regions of the PDF. The PDF values at the resampling points are updated online to provide up-to-date model of the data stream. Comparing with other existing online density estimators, KDE-Track is often more accurate (as reflected by smaller error values) and more computationally efficient (as reflected by shorter running time). The anytime available PDF estimated by KDE-Track can be applied for visualizing the dynamic density of data streams, outlier detection and change detection in data streams. In this thesis work, the first application is to visualize the taxi traffic volume in New York city. Utilizing KDE-Track allows for visualizing and monitoring the traffic flow on real time without extra overhead and provides insight analysis of the pick up demand that can be utilized by service providers to improve service availability. The second application is to detect outliers in data streams from sensor networks based on the estimated PDF. The method detects outliers accurately and outperforms baseline methods designed for detecting and cleaning outliers in sensor data. The third application is to detect changes in data streams. We propose a framework based on Principal Component Analysis (PCA) that reduces the problem of detecting changes in multidimensional data into the problem of detecting changes in the projected data on the principal components. We provide a theoretical analysis, which is support by experimental results to show that utilizing PCA reflects different types of changes in data streams on the projected data over one or more principal components. Our framework is accurate in detecting changes with low computational costs and scales well for high dimensional data.
|Date made available
|KAUST Research Repository