quickOutlier is a comprehensive toolkit for detecting
and treating anomalies in data. It goes beyond simple statistics,
incorporating Machine Learning (Isolation Forest) and Time Series
analysis.
First, load the library:
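library(quickOutlier)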
For simple numeric vectors, use detect_outliers. You can
choose between the Z-Score method (parametric) and the
IQR method (robust).
# Create data with an obvious outlier
set.seed(123)
df <- data.frame(val = c(rnorm(50), 100))
# Detect using Z-Score (Standard Deviation)
outliers <- detect_outliers(df, "val", method = "zscore", threshold = 3)
print(head(outliers))
#> val z_score
#> 51 100 6.99

We can visualize the distribution, mean, and median with a single line of code. Detected outliers are highlighted in red.
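For comparison, the robust variant is the same call with method = "iqr". The plotting one-liner itself is not named in this excerpt, so it appears below only as a hypothetical, commented-out sketch:

# Robust alternative: flag values outside the IQR fences instead of Z-Scores
detect_outliers(df, "val", method = "iqr")
# Hypothetical plotting call (assumed helper name, not confirmed by this text):
# plot_outliers(df, "val")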
If you want a quick overview of all numeric columns and their outlier
count, use scan_data.
# Scan the entire dataframe
scan_data(mtcars, method = "iqr")
#> Column Outlier_Count Percentage
#> 1 mpg 1 3.12
#> 2 cyl 0 0.00
#> 3 disp 0 0.00
#> 4 hp 1 3.12
#> 5 drat 0 0.00
#> 6 wt 3 9.38
#> 7 qsec 1 3.12
#> 8 vs 0 0.00
#> 9 am 0 0.00
#> 10 gear 0 0.00
#> 11 carb 1 3.12

Sometimes a value is normal on its own but anomalous in combination with others (e.g., a person 1.50 m tall weighing 100 kg). Use multivariate detection for points that only stand out against the correlation structure of the data.
If you are viewing this as HTML, you can interact with the plot (zoom, hover).
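The multivariate examples use a two-column data frame, df_multi, whose construction is not included in this excerpt. A minimal reconstruction under that assumption (row 21 is the planted joint anomaly, matching the LOF output below):

# Reconstructed example data (assumed): two strongly correlated columns
set.seed(1)
x <- rnorm(30, mean = 6, sd = 1)
df_multi <- data.frame(x = x, y = 3 * x + rnorm(30, sd = 1))
# Row 21: each value is unremarkable on its own, but together
# they break the x-y correlation
df_multi[21, ] <- c(5, 20)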
For complex shapes where correlation isn’t enough, Local Outlier Factor (LOF) is powerful. It finds points that are isolated relative to their neighbors.
# Use the same multi-dimensional data
# k = number of neighbors to consider
res_lof <- detect_density(df_multi, k = 5, threshold = 1.5)
res_lof
#> x y lof_score
#> 21 5 20 3.79

For high-dimensional or complex datasets, statistical methods often fail. quickOutlier implements Isolation Forest.
# Generate a 2D blob of data
data_ml <- data.frame(
feat1 = rnorm(100),
feat2 = rnorm(100)
)
# Add an extreme outlier
data_ml[1, ] <- c(10, 10)
# Run Isolation Forest
# ntrees = 100 is standard. contamination = 0.05 means we expect ~5% outliers.
res_if <- detect_iforest(data_ml, ntrees = 100, contamination = 0.05)
# View the points flagged as outliers (If_Score ranges from 0 to 1)
head(subset(res_if, Is_Outlier == TRUE))
#> feat1 feat2 If_Score Is_Outlier
#> 1 10.000000 10.00000000 0.8781455 TRUE
#> 13 -1.018575 3.24103993 0.6560033 TRUE
#> 21 -2.309169 0.06529303 0.5828974 TRUE
#> 46 2.187333 0.60070882 0.5776194 TRUE
#> 80 1.444551 1.95529397 0.6017643 TRUE

Detecting anomalies in a time series requires first removing Seasonality (repeating patterns) and Trend, so that spikes stand out in the Remainder.
# Create a synthetic time series: Sine wave + Noise + Outlier
t <- seq(1, 10, length.out = 60)
y <- sin(t) + rnorm(60, sd = 0.1)
y[30] <- 5 # Spike (Outlier)
# Detect using STL Decomposition
res_ts <- detect_ts_outliers(y, frequency = 12)
# Check the detected outlier
subset(res_ts, Is_Outlier == TRUE)
#> Original Trend Seasonal Remainder Is_Outlier
#> 1 0.7852833 1.3139834 0.1714747 -0.7001748 TRUE
#> 30 5.0000000 -0.6050792 0.0619607 5.5431185 TRUE

Find categories that appear too infrequently (potential typos).
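The corresponding package call is not shown in this excerpt; as a minimal base-R sketch of the idea, flag levels whose relative frequency falls below a cutoff (the 2% cutoff here is an assumption):

# Base-R illustration (assumed 2% cutoff, not the package's own API)
x <- c(rep("setosa", 60), rep("versicolor", 39), "setsoa")  # "setsoa" is a typo
freq <- prop.table(table(x))
names(freq[freq < 0.02])
#> [1] "setsoa"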
Find points that have a disproportionate influence on a linear model.
# Use mtcars and create a high leverage point
cars_df <- mtcars
cars_df[1, "wt"] <- 10
cars_df[1, "mpg"] <- 50
infl <- diagnose_influence(cars_df, "mpg", "wt")
head(subset(infl, Is_Influential == TRUE))
#> mpg cyl disp hp drat wt qsec vs am gear carb Cooks_Dist
#> Mazda RX4 50 6 160 110 3.9 10 16.46 0 1 4 4 20.59087
#> Is_Influential
#> Mazda RX4           TRUE

Instead of deleting data, it is often better to “cap” extreme values at a chosen threshold (Winsorization).
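The package's capping helper is not shown in this excerpt; a minimal base-R sketch of Winsorization, assuming caps at the 5th and 95th percentiles:

# Base-R Winsorization sketch (assumed 5%/95% caps, not the package's API)
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])  # clamp values into [q_low, q_high]
}
capped <- winsorize(c(rnorm(50), 100))
max(capped)  # the planted 100 is pulled down to the 95th percentile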