Anomaly Detection with Isolation Forests in R
As we mentioned in a previous post, anomaly detection (also called outlier or novelty detection) is the task of identifying rare items, events, or observations that deviate significantly from the majority of the data. It is used in various domains such as fraud detection. This post walks through the concept, implementation, and application of isolation forests, with code examples in R using the isotree package.
The Isolation Forest algorithm, introduced by Liu et al. in 2008, is a tree-based, unsupervised learning method built on a simple principle: anomalies are few and different, and are therefore easier to isolate than normal points. The algorithm constructs multiple isolation trees (iTrees), each built on a random subset of the data through recursive partitioning: a feature is selected at random, and a split value is chosen uniformly within that feature's range. This process continues until every data point is isolated or a maximum tree depth is reached.
The ease of isolating a point is reflected in its path length, the number of splits required to separate it from the rest of the data. Anomalies, being few and different, tend to be isolated after only a few splits, so shorter average path lengths translate into higher anomaly scores. Normal points, by contrast, sit in denser regions of the data and require more splits.
Isolation forests are efficient with high-dimensional data and can handle both large datasets and complex feature interactions. They are also versatile, handling numerical and categorical data directly without extensive preprocessing or transformations, and they do not require explicit standardization or normalization, which reduces preprocessing overhead and makes the algorithm more accessible. Moreover, isolation forests provide clear, interpretable results: each data point is assigned an anomaly score that quantifies its degree of abnormality, enabling straightforward decision-making. This transparency allows practitioners to understand the basis of the detection, fostering trust in the model's predictions and facilitating communication with stakeholders.
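The scoring rule described above can be sketched in a few lines of base R. Following Liu et al. (2008), the expected path length of an unsuccessful search in a binary search tree, c(n), normalizes the average path length E(h(x)), giving the score s(x, n) = 2^(-E(h(x))/c(n)). The helper names below (`c_factor`, `anomaly_score`) are illustrative, not part of isotree:

```r
# Normalization term c(n): the average path length of an unsuccessful
# search in a binary search tree built on n points (Liu et al., 2008).
# The harmonic number H(i) is approximated by ln(i) + Euler-Mascheroni constant.
c_factor <- function(n) {
  if (n <= 1) return(0)
  harmonic <- log(n - 1) + 0.5772156649
  2 * harmonic - 2 * (n - 1) / n
}

# Anomaly score s(x, n) = 2^(-E(h(x)) / c(n)), where avg_path is the
# mean path length of the point across all iTrees in the forest.
anomaly_score <- function(avg_path, n) {
  2^(-avg_path / c_factor(n))
}

# A point isolated quickly (short path) scores close to 1;
# a point needing many splits scores noticeably lower.
anomaly_score(3, 256)   # short path  -> high score
anomaly_score(12, 256)  # long path   -> lower score
```

For a sample size of 256 (the default subsample size used later in this post), c(n) is roughly 10, so average path lengths well below that produce scores approaching 1.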
Implementing Isolation Forests in R
We generate a synthetic dataset of 520 points: 500 sampled from a standard normal distribution, plus 20 outliers sampled from a distribution centered far away.
set.seed(42)

# Generate normal data
n <- 500
normal_data <- data.frame(
  x = rnorm(n, mean = 0, sd = 1),
  y = rnorm(n, mean = 0, sd = 1)
)

# Add anomalies
outliers <- data.frame(
  x = rnorm(20, mean = 5, sd = 0.5),
  y = rnorm(20, mean = 5, sd = 0.5)
)

# Combine datasets
data <- rbind(normal_data, outliers)

# Visualize the data
plot(data$x, data$y, col = "blue", pch = 20,
     main = "Synthetic Dataset", xlab = "Feature X", ylab = "Feature Y")
points(outliers$x, outliers$y, col = "red", pch = 20)
Training the Isolation Forest Model
We now train an isolation forest model using the isolation.forest function. This function builds the ensemble of isolation trees; anomaly scores for each point are then obtained with predict.
library(isotree)

# Fit a standard isolation forest
# (ndim = 1 uses single-variable splits, the original algorithm)
iso_forest <- isolation.forest(
  data,
  ndim = 1,
  ntrees = 100,
  sample_size = 256
)

# Display model summary
summary(iso_forest)
Generating Anomaly Scores
After training the model, we calculate anomaly scores for each point. These scores range from 0 to 1, with higher scores indicating greater anomaly likelihood.
# Compute anomaly scores (type = "score" returns values in [0, 1])
scores <- predict(iso_forest, data, type = "score")

# Add scores to the dataset
data$score <- scores

# Visualize the distribution of scores
hist(scores, breaks = 30, main = "Distribution of Anomaly Scores",
     xlab = "Anomaly Score", col = "lightblue")
Visualizing Anomalies
Points with high anomaly scores can be flagged as potential outliers. We highlight these points in the dataset visualization.
# Flag anomalies
threshold <- 0.7
data$is_anomaly <- data$score > threshold
# Visualize anomalies
plot(data$x, data$y, col = ifelse(data$is_anomaly, "red", "blue"), pch = 20,
main = "Anomaly Detection with Isolation Forests", xlab = "Feature X", ylab = "Feature Y")
legend("topright", legend = c("Normal", "Anomaly"), col = c("blue", "red"), pch = 20)
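A fixed cutoff such as 0.7 will not suit every dataset, since the score distribution shifts with the data and the scoring metric. A common alternative is to flag the top few percent of points by score using a quantile. The sketch below is self-contained and uses simulated stand-in scores rather than the model output above; the 0.96 level is an arbitrary illustration (roughly matching the ~4% of injected outliers), not a recommended default:

```r
set.seed(42)
# Stand-in scores: most points cluster near 0.45, a few near 0.8
# (mimicking the bimodal pattern an isolation forest often produces).
sim_scores <- c(rnorm(500, mean = 0.45, sd = 0.05),
                rnorm(20,  mean = 0.80, sd = 0.03))

# Flag everything above the 96th percentile of the observed scores.
threshold <- quantile(sim_scores, probs = 0.96)
flagged <- sim_scores > threshold

sum(flagged)  # roughly 4% of the 520 points
```

With real model output, the same pattern applies: `quantile(data$score, 0.96)` in place of a hard-coded 0.7, tuned to the share of anomalies you expect.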
Customizing the Isolation Forest Model
The isotree package provides several options to customize the isolation forest. These options include:
- ndim: Number of dimensions to consider in splits.
- ntrees: Number of isolation trees to build.
- sample_size: Subset size for building each tree.
- scoring_metric: Metric for scoring anomalies.
An extended isolation forest allows non-linear splits by projecting data onto random hyperplanes. This is useful for datasets with complex distributions.
# Extended isolation forest: ndim = 2 combines both features in each split,
# producing oblique (hyperplane) partitions. Fit on the original features only,
# since data now also carries the score and is_anomaly columns added above.
ext_iso_forest <- isolation.forest(
  data[, c("x", "y")],
  ndim = 2,
  ntrees = 100,
  sample_size = 256
)

# Compute anomaly scores
ext_scores <- predict(ext_iso_forest, data[, c("x", "y")], type = "score")
# Compare extended and standard isolation forests
data$ext_score <- ext_scores
plot(data$score, data$ext_score, xlab = "Standard IF Score", ylab = "Extended IF Score",
main = "Comparison of Anomaly Scores")
Evaluating Performance
To evaluate the performance of an isolation forest, consider metrics like the Area Under the ROC Curve (AUC) or the precision-recall curve. These metrics require labeled data, distinguishing normal points from anomalies.
Example: ROC Curve
library(pROC)
# Ground-truth labels: the first n points are normal, the last 20 are the injected outliers
labels <- c(rep(0, n), rep(1, 20)) # 0 = normal, 1 = anomaly
# ROC Curve
roc_curve <- roc(labels, scores)
plot(roc_curve, col = "darkgreen", main = "ROC Curve for Isolation Forest")
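Besides the ROC curve, precision and recall at a chosen cutoff are easy to compute by hand, and the AUC itself can be derived from score ranks without any package: it equals the probability that a randomly chosen anomaly outranks a randomly chosen normal point (the Mann-Whitney U formulation). The sketch below is self-contained, using simulated scores and labels as stand-ins for the model output:

```r
set.seed(42)
# Stand-in scores and labels mirroring the example above:
# 500 normal points (label 0), 20 anomalies (label 1).
labels <- c(rep(0, 500), rep(1, 20))
scores <- c(rnorm(500, 0.45, 0.05), rnorm(20, 0.80, 0.03))

# Precision and recall at a fixed score cutoff.
cutoff <- 0.7
pred <- as.integer(scores > cutoff)
tp <- sum(pred == 1 & labels == 1)
precision <- tp / sum(pred == 1)
recall    <- tp / sum(labels == 1)

# Rank-based AUC via the Mann-Whitney U statistic:
# P(score of a random anomaly > score of a random normal point).
r <- rank(scores)
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(precision = precision, recall = recall, auc = auc)
```

On well-separated simulated data like this, all three metrics sit near 1; on real data they trade off as the cutoff moves, which is exactly what the precision-recall curve captures.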
Conclusion
Isolation Forests are a useful anomaly detection method with applications across fields such as healthcare. Their ability to isolate anomalies effectively makes them invaluable for identifying rare but critical patterns in medical datasets. In healthcare, Isolation Forests can identify anomalous patient data that may indicate underlying issues requiring immediate attention. For example, they can detect sudden spikes in heart rate, irregularities in glucose levels, or deviations in vital signs that might signal the onset of critical conditions like sepsis or arrhythmias. In hospital settings, they are used to monitor ICU equipment, identifying anomalies in sensor readings that could suggest malfunctions or false alarms, ensuring timely interventions and reducing risks.
In public health, Isolation Forests play a role in detecting anomalies in epidemiological data, such as unusual spikes in emergency room visits or reported symptoms. These insights can help identify early signs of disease outbreaks, enabling preventive measures and resource allocation before the situation escalates. Similarly, wearable health devices equipped with anomaly detection capabilities use Isolation Forests to monitor user health, flagging irregularities like abnormal sleep patterns or activity levels and prompting users to seek medical advice.
Beyond healthcare, Isolation Forests are applied in other domains such as manufacturing and IoT. For instance, in pharmaceutical production, they can identify deviations in equipment performance that may compromise drug quality.
Isolation Forests are an efficient method for anomaly detection, capable of handling diverse datasets with minimal preprocessing. Although the algorithm is robust to varying feature scales, applying transformations such as log scaling to skewed data can enhance performance in some cases. Tuning parameters like the number of trees (ntrees), the sample size, and the number of dimensions per split (ndim) makes it possible to balance computational efficiency against detection accuracy and to adapt the model to specific use cases. The isotree package in R simplifies the implementation of isolation forests, offering a customizable framework for users. However, interpreting anomaly scores requires a contextual approach, as these scores are relative indicators rather than absolute measures of abnormality. Domain expertise still plays an important role in setting appropriate thresholds and drawing meaningful conclusions about the nature of each anomaly.
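The note on log scaling can be illustrated directly: for a right-skewed feature, a log transform pulls the long tail in, which can make genuine anomalies easier to separate from an inflated tail of ordinary points. The `skewness` helper below is a plain moment-based estimate written for this sketch, not taken from any package:

```r
set.seed(42)
# A right-skewed, strictly positive feature, e.g. transaction amounts.
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Moment-based sample skewness (no package dependency).
skewness <- function(v) {
  m <- mean(v); s <- sd(v)
  mean(((v - m) / s)^3)
}

skewness(x)       # strongly positive: a long right tail
skewness(log(x))  # close to 0 after the log transform
```

Whether such a transform helps detection is dataset-specific; the point is only that the algorithm's scale-robustness does not make distribution shape irrelevant.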
References
- Anomaly Detection Definition - TechTarget provides an overview of anomaly detection, its types, and applications across industries.
- Isolation Forest Explanation - Spot Intelligence discusses the Isolation Forest algorithm in detail, highlighting its advantages and use cases.
- GitHub Repository for Isolation Forest - A repository showcasing an implementation of Isolation Forest, including example code and documentation.
- An Introduction to Isolation Forests - The CRAN vignette for the isotree package, which explains the use of Isolation Forests in R.
- Resources on Anomaly Detection - A Reddit thread sharing valuable resources and insights on anomaly detection techniques.
- Isolation Forest Variants - An academic paper on advancements and variants of the Isolation Forest algorithm.
- Anomaly Detection with Isolation Forest - A GeeksforGeeks article providing an implementation of Isolation Forest for anomaly detection.
- DigitalOcean Guide on Isolation Forest - A tutorial on DigitalOcean covering the theory and application of Isolation Forests.
- Outlier Detection with Isolation Forest in R - A Kaggle notebook demonstrating the use of Isolation Forest for detecting anomalies in R.
- Isolation Forest Paper (Liu et al., 2008) - The foundational paper introducing the Isolation Forest algorithm, authored by Liu et al.
- CRAN isotree Package Documentation - Official documentation for the isotree R package, which implements Isolation Forests.
- Introduction to Isolation Forests in isotree - A detailed introduction to Isolation Forests using the isotree R package.
- TechTarget’s Overview on Anomaly Detection - An in-depth explanation of anomaly detection, including its methods and real-world applications.
- Spot Intelligence’s Tutorial on Isolation Forest - A guide to Isolation Forests, covering their implementation and practical examples.
- GitHub Repository for Isolation Forest in R - Source code and examples for applying Isolation Forest in R, hosted on GitHub.