K-nearest neighbor (KNN)

KNN applies to numeric, enum, and Boolean records. For any interval other than none, this strategy replaces a missing value with the majority value recorded for the item’s k nearest neighbors.

The number of neighbors to consider is k. The system selects the k nearest neighbors from the records before and after the missing value and uses them to calculate the replacement. If two values tie, the algorithm selects the value with the lowest (earliest) timestamp.
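The neighbor selection can be sketched as follows. This is a minimal illustration, not the system's implementation: the function name `nearest_neighbors` and the representation of timestamps as minutes since midnight are assumptions made for the example, and the preference for the preceding record on equal distance is taken from the tie rules later in this section.

```python
def nearest_neighbors(timestamps, missing, k):
    """Return the k recorded timestamps closest to `missing`.

    Timestamps are minutes since midnight (an assumption for this sketch).
    """
    candidates = [t for t in timestamps if t != missing]
    # Sort by distance to the missing timestamp; on equal distance the
    # earlier (preceding) timestamp sorts first.
    candidates.sort(key=lambda t: (abs(t - missing), t))
    return sorted(candidates[:k])


# Readings every 15 minutes from 1:30 (90) to 2:45 (165); 2:15 (135) is missing.
recorded = [90, 105, 120, 150, 165]
print(nearest_neighbors(recorded, 135, 3))  # [105, 120, 150]
```

With k = 3 the sketch returns 1:45, 2:00, and 2:30, expressed in minutes.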

Boolean data series example

For example, the system monitors whether meter 1 is on or off, reporting a value of true (on) or false (off) every 15 minutes (the interval). If the value for 2:15 is missing and k = 3, the system finds the three timestamps nearest to 2:15, which are 1:45, 2:00, and 2:30, and takes the majority of these three values. (1:45 and 2:45 are equally distant from 2:15; the system prefers the preceding record.) The table records these values:

Timestamp    Meter 1 on state
1:30         true
1:45         false
2:00         true
2:15         false (interpolated value)
2:30         false
2:45         false

In the example, two of the three neighbors (1:45 and 2:30) are false and one (2:00) is true, so the majority value is “false.” The system assigns this value to 2:15.
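The majority vote in this example can be reproduced with a short sketch. The helper name `majority_value` is hypothetical, and returning `None` on a tie is a choice made for this illustration only; the system's own tie handling is described in the next section.

```python
from collections import Counter


def majority_value(values):
    """Return the most common value, or None when the top count is shared."""
    counts = Counter(values).most_common()
    if not counts:
        return None  # no neighbors to vote
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: a tie-breaking rule is needed
    return counts[0][0]


# The three nearest neighbors of 2:15 in the Boolean example.
neighbors = {"1:45": False, "2:00": True, "2:30": False}
print(majority_value(neighbors.values()))  # False
```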

Tie example

A tie can occur with numeric, Boolean, and enum data. The system resolves a tie by applying these rules in order:

  1. First, it gives preference to the value that occurs most frequently among the k nearest neighbors.
  2. Next, when two or more values occur with the same frequency, the system gives preference to the value from the record whose timestamp is nearest to the missing record.
  3. Finally, when those timestamps are equidistant from the missing data, the system gives preference to the record preceding the missing data.
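The three rules above can be sketched as a single function. This is an illustration under stated assumptions, not the system's code: the name `knn_fill`, the `(timestamp, value)` pair layout, and timestamps expressed as minutes since midnight are all hypothetical.

```python
from collections import Counter


def knn_fill(neighbors, missing):
    """Pick a replacement value from (timestamp, value) neighbor pairs."""
    counts = Counter(value for _, value in neighbors)
    top = max(counts.values())
    # Rule 1: keep only the values with the highest frequency.
    tied = {value for value, n in counts.items() if n == top}
    # Rules 2 and 3: among records carrying a tied value, take the one
    # nearest to the missing timestamp; on equal distance the earlier
    # (preceding) timestamp wins.
    candidates = [(abs(t - missing), t, v) for t, v in neighbors if v in tied]
    return min(candidates, key=lambda c: (c[0], c[1]))[2]


# Enum data with 2:00 (120) missing: two 0s before, two 1s after.
print(knn_fill([(90, 0), (105, 0), (135, 1), (150, 1)], 120))  # 0
# Two equidistant candidates with different values.
print(knn_fill([(105, 1), (135, 3)], 120))  # 1
```

Both calls use the enum data from the two tables that follow, and each returns the value of the preceding record.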

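In Table 1, k = 4 and the value for 2:00 is missing. The two preceding neighbors (1:30 and 1:45) both have the value 0, and the two succeeding neighbors (2:15 and 2:30) both have the value 1, so each value occurs twice. The nearest record for each value (1:45 and 2:15) is the same distance from 2:00, so, using rules 2 and 3 above, the system breaks the tie by assigning the missing record the value of the preceding timestamp (1:45), which is 0.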

Table 1. k = 4, same frequency of nearest neighbors

Timestamp    Enum data
1:30         0
1:45         0
2:00         0 (interpolated value)
2:15         1
2:30         1

In Table 2, k = 1 and the value for 2:00 is missing. The preceding nearest neighbor (1:45) has the value 1, and the succeeding nearest neighbor (2:15) has the value 3. Both records are one interval away from the missing value, so the system gives preference to the preceding timestamp and sets the missing value to 1.

Table 2. k = 1, timestamps equidistant from the missing data

Timestamp    Enum data
1:30         0
1:45         1
2:00         1 (interpolated value)
2:15         3