Fast algorithm for cleaning of measurement data from outliers: search for the optimal solution with the minimum number of rejected measurement results

I.V. Bezmenov

FSUE “VNIIFTRI”, Mendeleevo, Moscow region, Russia
bezmenov@vniiftri.ru

Al’manac of Modern Metrology № 4 (36) 2023, pages 96–122

Abstract. This article addresses the automatic detection of gross measurement errors (outliers) in time series of measurement data generated by technical devices. Solving this problem is important for improving the accuracy of estimates of physical quantities in the many applications whose input data are observations. Because outliers degrade the accuracy of final results, they must be detected and excluded from further calculations at the data preprocessing and analysis stage. This can be done in various ways, since the concept of an outlier has no strict definition in statistics. The author previously formulated the problem of finding the optimal solution, namely the one that maximizes the amount of measurement data remaining after outliers are removed, and proposed a robust algorithm for finding it. The complexity of that algorithm depends on both N, the number of source data points, and Nout, the number of outliers detected. For highly noisy data, the number of outliers can be extremely large, for example comparable to N. In that case, the earlier algorithm needs about N² arithmetic operations to find the optimal solution. This article proposes a new algorithm for finding the optimal solution whose arithmetic operation count does not depend on the number of outliers detected. The algorithm's efficiency is most evident when cleaning large volumes of highly noisy measurement data that contain a great many outliers. The algorithm can be used for automated removal of outliers from observation data in information and measuring systems, in systems with artificial intelligence, and in solving scientific, applied, managerial, and other problems on modern computer systems, in order to obtain the most reliable final result promptly.
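To make the cost concern concrete, the following minimal Python sketch shows a generic one-at-a-time k-sigma rejection loop. It is not the author's algorithm (neither the earlier one nor the new one), and it does not search for the optimal solution described above. The function name `reject_outliers_iteratively`, the threshold `k`, and the k-sigma criterion itself are assumptions chosen for illustration only. Because this baseline recomputes the mean and standard deviation over all retained points after every rejection, its cost grows roughly with the product of the data size and the number of rejected points, which approaches N² when the outliers are comparable in number to N.

```python
from statistics import fmean, pstdev


def reject_outliers_iteratively(values, k=3.0):
    """Generic k-sigma rejection, one point per pass (illustrative baseline).

    Assumption: a point is an outlier if it deviates from the mean of the
    currently retained points by more than k population standard deviations.
    Returns (kept, rejected, passes).
    """
    kept = list(values)
    rejected = []
    passes = 0
    while len(kept) > 2:
        passes += 1
        # Both statistics are recomputed from scratch: O(len(kept)) per pass.
        mean = fmean(kept)
        sigma = pstdev(kept)
        # Index of the retained point that lies farthest from the mean.
        worst = max(range(len(kept)), key=lambda i: abs(kept[i] - mean))
        if abs(kept[worst] - mean) <= k * sigma:
            break  # every retained point is within k sigma: stop
        rejected.append(kept.pop(worst))
    return kept, rejected, passes
```

Calling `reject_outliers_iteratively(samples)` on a list of readings returns the retained values, the rejected values, and the number of passes made. The pass count is the quantity that drives the running time of this baseline.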

Keywords: information and measuring systems, time series, data pre-processing, outliers, data cleaning from outliers, optimal solution.

Full texts of articles are available only in Russian in printed issues of the magazine.