Getting Started with quickOutlier

quickOutlier is a comprehensive toolkit for detecting and treating anomalies in data. It goes beyond simple statistics, incorporating Machine Learning (Isolation Forest) and Time Series analysis.

First, load the library:

library(quickOutlier)
library(ggplot2)

1. Univariate Analysis (The Basics)

For simple numeric vectors, use detect_outliers. You can choose between the Z-Score method (parametric) and the IQR method (robust).

# Create data with an obvious outlier
set.seed(123)
df <- data.frame(val = c(rnorm(50), 100))

# Detect using Z-Score (Standard Deviation)
outliers <- detect_outliers(df, "val", method = "zscore", threshold = 3)
print(head(outliers))
#>    val z_score
#> 51 100    6.99
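Under the hood, the Z-Score method is just standardization. A minimal base-R sketch of the same check (a cross-check, not the package's actual implementation):

```r
# Standardize and flag values more than 3 standard deviations from the mean
set.seed(123)
val <- c(rnorm(50), 100)
z <- (val - mean(val)) / sd(val)
which(abs(z) > 3)
#> [1] 51
```

Note that the extreme value inflates both the mean and the standard deviation, which is exactly why the robust IQR method is often preferred.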

New: Educational Visualization

We can visualize the distribution, mean, and median with a single line of code. Detected outliers are highlighted in red.

plot_outliers(df, "val", method = "zscore")
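If you want full control over the styling, a similar picture can be built by hand with ggplot2 (loaded above). This is a rough sketch of the idea, not a reproduction of plot_outliers:

```r
library(ggplot2)

set.seed(123)
df <- data.frame(val = c(rnorm(50), 100))
# Same z-score rule as detect_outliers with threshold = 3
is_out <- abs(df$val - mean(df$val)) / sd(df$val) > 3

p <- ggplot(df, aes(x = val)) +
  geom_histogram(bins = 30, fill = "grey70") +
  geom_vline(xintercept = mean(df$val), linetype = "dashed") +    # mean
  geom_vline(xintercept = median(df$val), linetype = "dotted") +  # median
  geom_point(data = df[is_out, , drop = FALSE],
             aes(x = val, y = 0), colour = "red", size = 3)       # outliers in red
p
```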

Scanning the Dataset

If you want a quick overview of all numeric columns and their outlier count, use scan_data.

# Scan the entire dataframe
scan_data(mtcars, method = "iqr")
#>    Column Outlier_Count Percentage
#> 1     mpg             1       3.12
#> 2     cyl             0       0.00
#> 3    disp             0       0.00
#> 4      hp             1       3.12
#> 5    drat             0       0.00
#> 6      wt             3       9.38
#> 7    qsec             1       3.12
#> 8      vs             0       0.00
#> 9      am             0       0.00
#> 10   gear             0       0.00
#> 11   carb             1       3.12
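The per-column counts are easy to verify by hand. A base-R sketch of the IQR rule applied column-wise (again, a cross-check rather than the package internals):

```r
# Count values outside the 1.5 * IQR fences for each column
count_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[[2]] - q[[1]])
  sum(x < q[[1]] - fence | x > q[[2]] + fence)
}
sapply(mtcars, count_iqr)
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
#>    1    0    0    1    0    3    1    0    0    0    1
```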

2. Multivariate Analysis (Two or more variables)

Sometimes a value is normal individually but anomalous in combination with others (e.g., a person 1.50m tall weighing 100kg).

Mahalanobis Distance

Use this for detecting outliers based on correlation structures.

# Create correlated data and add an outlier
df_multi <- data.frame(x = 1:20, y = 1:20)
df_multi <- rbind(df_multi, data.frame(x = 5, y = 20)) # Anomalous point

res_multi <- detect_multivariate(df_multi, c("x", "y"))
tail(res_multi, 3)
#>    x  y mahalanobis_dist
#> 21 5 20            19.05
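Base R ships the underlying computation. A quick sketch that thresholds the squared distance against a chi-squared cutoff (the package's exact thresholding may differ):

```r
df_multi <- rbind(data.frame(x = 1:20, y = 1:20),
                  data.frame(x = 5, y = 20))

# Squared Mahalanobis distance of every row from the centroid
d2 <- mahalanobis(df_multi, colMeans(df_multi), cov(df_multi))

# Under normality, d2 follows a chi-squared with df = number of variables
unname(which(d2 > qchisq(0.99, df = 2)))
#> [1] 21
```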

Interactive Plot (Plotly)

If you are viewing this as HTML, you can interact with the plot (zoom, hover).

# Lower confidence level to make it more sensitive for the demo
plot_interactive(df_multi, "x", "y", confidence_level = 0.99)

Density-based Detection (LOF)

For complex shapes where correlation isn’t enough, Local Outlier Factor (LOF) is powerful. It finds points that are isolated relative to their neighbors.

# Use the same multi-dimensional data
# k = number of neighbors to consider
res_lof <- detect_density(df_multi, k = 5, threshold = 1.5)
res_lof
#>    x  y lof_score
#> 21 5 20      3.79
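The LOF score compares each point's local density to that of its neighbors; a score well above 1 means the point sits in a sparser region than its neighborhood. A compact base-R sketch of the classic definition (simplified tie handling; lof_sketch is a hypothetical helper, not the package's implementation, so its scores will not match detect_density exactly):

```r
lof_sketch <- function(X, k) {
  D <- as.matrix(dist(X))
  n <- nrow(D)
  kdist <- numeric(n)
  nbrs  <- vector("list", n)
  for (i in seq_len(n)) {
    kdist[i]  <- sort(D[i, -i])[k]                      # distance to k-th neighbor
    nbrs[[i]] <- setdiff(which(D[i, ] <= kdist[i]), i)  # k-neighborhood
  }
  # Local reachability density: inverse mean reachability distance
  lrd <- sapply(seq_len(n), function(i) {
    1 / mean(pmax(kdist[nbrs[[i]]], D[i, nbrs[[i]]]))
  })
  # LOF: average neighbor density relative to own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

df_multi <- rbind(data.frame(x = 1:20, y = 1:20),
                  data.frame(x = 5, y = 20))
scores <- lof_sketch(df_multi, k = 5)
which(scores > 1.5)
#> [1] 21
```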

3. Advanced Methods (Machine Learning)

For high-dimensional or complex datasets, classical statistical methods often fall short. For these cases, quickOutlier implements Isolation Forest.

# Generate a 2D blob of data
data_ml <- data.frame(
  feat1 = rnorm(100),
  feat2 = rnorm(100)
)
# Add an extreme outlier
data_ml[1, ] <- c(10, 10)

# Run Isolation Forest
# ntrees = 100 is standard. contamination = 0.05 means we expect ~5% outliers.
res_if <- detect_iforest(data_ml, ntrees = 100, contamination = 0.05)

# View the flagged rows; If_Score ranges from 0 (normal) to 1 (anomalous)
head(subset(res_if, Is_Outlier == TRUE))
#>        feat1       feat2  If_Score Is_Outlier
#> 1  10.000000 10.00000000 0.8781455       TRUE
#> 13 -1.018575  3.24103993 0.6560033       TRUE
#> 21 -2.309169  0.06529303 0.5828974       TRUE
#> 46  2.187333  0.60070882 0.5776194       TRUE
#> 80  1.444551  1.95529397 0.6017643       TRUE
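The intuition behind Isolation Forest is that anomalies are separated from the rest by random splits much sooner than normal points, so their average path length is short. A toy one-dimensional sketch of that idea (path_len and the seed are additions for this illustration, not part of detect_iforest):

```r
set.seed(1)  # added for reproducibility

# Depth at which random splits isolate the value x from the rest (capped at 10)
path_len <- function(x, pts, depth = 0) {
  if (length(pts) <= 1 || depth >= 10) return(depth)
  split <- runif(1, min(pts), max(pts))
  side  <- if (x < split) pts[pts < split] else pts[pts >= split]
  path_len(x, side, depth + 1)
}

vals   <- c(rnorm(100), 10)   # 100 normal points + 1 extreme value
depths <- sapply(vals, function(v) mean(replicate(50, path_len(v, vals))))
which.min(depths)             # the extreme value isolates fastest
```

Isolation Forest averages this path length over many random trees and many random split dimensions, then rescales it into the 0-to-1 If_Score shown above.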

4. Time Series Analysis

Detecting anomalies in a time series requires first removing Seasonality (repeating patterns) and Trend, then flagging extreme values in the Remainder.

# Create a synthetic time series: Sine wave + Noise + Outlier
t <- seq(1, 10, length.out = 60)
y <- sin(t) + rnorm(60, sd = 0.1)
y[30] <- 5 # Spike (Outlier)

# Detect using STL Decomposition
res_ts <- detect_ts_outliers(y, frequency = 12)

# Check the detected outlier
subset(res_ts, Is_Outlier == TRUE)
#>     Original      Trend  Seasonal  Remainder Is_Outlier
#> 1  0.7852833  1.3139834 0.1714747 -0.7001748       TRUE
#> 30 5.0000000 -0.6050792 0.0619607  5.5431185       TRUE
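Base R's stl() performs the same decomposition. A sketch that flags large remainders with a robust MAD rule (the seed is added for reproducibility; detect_ts_outliers' exact flagging rule may differ):

```r
set.seed(7)  # added for reproducibility
t <- seq(1, 10, length.out = 60)
y <- sin(t) + rnorm(60, sd = 0.1)
y[30] <- 5  # spike

# STL decomposition: seasonal + trend + remainder
fit <- stl(ts(y, frequency = 12), s.window = "periodic")
rem <- fit$time.series[, "remainder"]

# Flag points whose remainder sits far from the bulk
which(abs(rem - median(rem)) > 3 * mad(rem))
```

The spike survives into the remainder because the smooth trend and the repeating seasonal component cannot absorb a one-off jump.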

5. Data Cleaning & Diagnostics

Categorical Outliers (Typos)

Find categories that appear too infrequently (potential typos).

cities <- c(rep("Madrid", 10), "Barcalona", "Barcelona", "MAdrid")
detect_categorical_outliers(cities, min_freq = 0.1)
#>    Category Count  Frequency Is_Outlier
#> 1 Barcalona     1 0.07692308       TRUE
#> 2 Barcelona     1 0.07692308       TRUE
#> 3    MAdrid     1 0.07692308       TRUE
#> 4    Madrid    10 0.76923077      FALSE
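Should you want to audit the result, the same frequency check is a near one-liner in base R with table and prop.table:

```r
cities <- c(rep("Madrid", 10), "Barcalona", "Barcelona", "MAdrid")
freq <- prop.table(table(cities))

# Categories below the 10% frequency floor are candidate typos
names(freq)[freq < 0.1]
#> [1] "Barcalona" "Barcelona" "MAdrid"
```

Note that "Barcelona" is flagged too: frequency alone cannot distinguish a typo from a genuinely rare category, so the output still needs human review.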

Regression Diagnostics (Cook’s Distance)

Find points that have a disproportionate influence on a linear model.

# Use mtcars and create a high leverage point
cars_df <- mtcars
cars_df[1, "wt"] <- 10; cars_df[1, "mpg"] <- 50

infl <- diagnose_influence(cars_df, "mpg", "wt")
head(subset(infl, Is_Influential == TRUE))
#>           mpg cyl disp  hp drat wt  qsec vs am gear carb Cooks_Dist
#> Mazda RX4  50   6  160 110  3.9 10 16.46  0  1    4    4   20.59087
#>           Is_Influential
#> Mazda RX4           TRUE
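Base R's influence measures give the same diagnostic directly. A quick cross-check, assuming a simple mpg ~ wt fit as the function's arguments suggest (not necessarily the package's code path):

```r
cars_df <- mtcars
cars_df[1, "wt"]  <- 10
cars_df[1, "mpg"] <- 50

fit <- lm(mpg ~ wt, data = cars_df)
cd  <- cooks.distance(fit)

# Rule of thumb: distances above 4/n deserve a closer look
names(cd)[cd > 4 / nrow(cars_df)]
```

The manipulated Mazda RX4 row dominates: it combines high leverage (wt far from the other weights) with a large residual, which is exactly the combination Cook's Distance measures.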

Treating Outliers (Winsorization)

Instead of deleting data, it is often better to “cap” extreme values at a threshold (Winsorization).

# Create data with an extreme value
df_treat <- data.frame(val = c(1, 2, 3, 2, 1, 100))

# Cap values at 1.5 * IQR
df_clean <- treat_outliers(df_treat, "val", method = "iqr", threshold = 1.5)
print(df_clean$val)
#> [1] 1 2 3 2 1 5
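The capping itself is a few lines of base R with pmin/pmax, which makes the fence arithmetic explicit (a sketch of the same IQR rule, not treat_outliers itself):

```r
val <- c(1, 2, 3, 2, 1, 100)
q <- quantile(val, c(0.25, 0.75))   # Q1 = 1.25, Q3 = 2.75
fence <- 1.5 * (q[[2]] - q[[1]])    # 1.5 * IQR = 2.25

# Clamp every value into [Q1 - fence, Q3 + fence] = [-1, 5]
capped <- pmin(pmax(val, q[[1]] - fence), q[[2]] + fence)
capped
#> [1] 1 2 3 2 1 5
```

This reproduces the result above: only the extreme 100 is pulled down to the upper fence of 5, and all other values pass through unchanged.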