Getting Started with quickOutlier

quickOutlier is a comprehensive toolkit for detecting and treating anomalies in data. It goes beyond simple statistics, incorporating Machine Learning (Isolation Forest) and Time Series analysis.

First, load the library:

library(quickOutlier)
library(ggplot2)

1. Univariate Analysis (The Basics)

For simple numeric vectors, use detect_outliers. You can choose between the Z-Score method (parametric) and the IQR method (robust).

# Create data with an obvious outlier
set.seed(123)
df <- data.frame(val = c(rnorm(50), 100))

# Detect using Z-Score (Standard Deviation)
outliers <- detect_outliers(df, "val", method = "zscore", threshold = 3)
print(head(outliers))
#>    val z_score
#> 51 100    6.99
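Under the hood, the Z-Score method is just standardization. A minimal base-R sketch of the same check (a cross-check, not the package's actual implementation):

```r
# Standardize and flag values more than 3 standard deviations from the mean
set.seed(123)
val <- c(rnorm(50), 100)
z <- (val - mean(val)) / sd(val)
which(abs(z) > 3)
#> [1] 51
```

Note that the extreme value inflates both the mean and the standard deviation, which is exactly why the robust IQR method is often preferred.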

New: Educational Visualization

We can visualize the distribution, mean, and median with a single line of code. Detected outliers are highlighted in red.

plot_outliers(df, "val", method = "zscore")
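If you want full control over the styling, a similar picture can be built by hand with ggplot2 (loaded above). This is a rough sketch of the idea, not a reproduction of plot_outliers:

```r
library(ggplot2)

set.seed(123)
df <- data.frame(val = c(rnorm(50), 100))
# Same z-score rule as detect_outliers with threshold = 3
is_out <- abs(df$val - mean(df$val)) / sd(df$val) > 3

p <- ggplot(df, aes(x = val)) +
  geom_histogram(bins = 30, fill = "grey70") +
  geom_vline(xintercept = mean(df$val), linetype = "dashed") +    # mean
  geom_vline(xintercept = median(df$val), linetype = "dotted") +  # median
  geom_point(data = df[is_out, , drop = FALSE],
             aes(x = val, y = 0), colour = "red", size = 3)       # outliers in red
p
```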

Scanning the Dataset

If you want a quick overview of all numeric columns and their outlier count, use scan_data.

# Scan the entire dataframe
scan_data(mtcars, method = "iqr")
#>    Column Outlier_Count Percentage
#> 1     mpg             1       3.12
#> 2     cyl             0       0.00
#> 3    disp             0       0.00
#> 4      hp             1       3.12
#> 5    drat             0       0.00
#> 6      wt             3       9.38
#> 7    qsec             1       3.12
#> 8      vs             0       0.00
#> 9      am             0       0.00
#> 10   gear             0       0.00
#> 11   carb             1       3.12
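The per-column counts are easy to verify by hand. A base-R sketch of the IQR rule applied column-wise (again, a cross-check rather than the package internals):

```r
# Count values outside the 1.5 * IQR fences for each column
count_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[[2]] - q[[1]])
  sum(x < q[[1]] - fence | x > q[[2]] + fence)
}
sapply(mtcars, count_iqr)
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
#>    1    0    0    1    0    3    1    0    0    0    1
```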

2. Multivariate Analysis (Two or more variables)

Sometimes a value is normal individually but anomalous in combination with others (e.g., a person 1.50m tall weighing 100kg).

Mahalanobis Distance

Use this for detecting outliers based on correlation structures.

# Create correlated data and add an outlier
df_multi <- data.frame(x = 1:20, y = 1:20)
df_multi <- rbind(df_multi, data.frame(x = 5, y = 20)) # Anomalous point

res_multi <- detect_multivariate(df_multi, c("x", "y"))
tail(res_multi, 3)
#>    x  y mahalanobis_dist
#> 21 5 20            19.05
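Base R ships the underlying computation. A quick sketch that thresholds the squared distance against a chi-squared cutoff (the package's exact thresholding may differ):

```r
df_multi <- rbind(data.frame(x = 1:20, y = 1:20),
                  data.frame(x = 5, y = 20))

# Squared Mahalanobis distance of every row from the centroid
d2 <- mahalanobis(df_multi, colMeans(df_multi), cov(df_multi))

# Under normality, d2 follows a chi-squared with df = number of variables
unname(which(d2 > qchisq(0.99, df = 2)))
#> [1] 21
```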

Interactive Plot (Plotly)

If you are viewing this as HTML, you can interact with the plot (zoom, hover).

# Lower confidence level to make it more sensitive for the demo
plot_interactive(df_multi, "x", "y", confidence_level = 0.99)

Density-based Detection (LOF)

For complex shapes where correlation isn’t enough, Local Outlier Factor (LOF) is powerful. It finds points that are isolated relative to their neighbors.

# Use the same multi-dimensional data
# k = number of neighbors to consider
res_lof <- detect_density(df_multi, k = 5, threshold = 1.5)
res_lof
#>    x  y lof_score
#> 21 5 20      3.79
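The LOF score compares each point's local density to that of its neighbors; a score well above 1 means the point sits in a sparser region than its neighborhood. A compact base-R sketch of the classic definition (simplified tie handling; lof_sketch is a hypothetical helper, not the package's implementation, so its scores will not match detect_density exactly):

```r
lof_sketch <- function(X, k) {
  D <- as.matrix(dist(X))
  n <- nrow(D)
  kdist <- numeric(n)
  nbrs  <- vector("list", n)
  for (i in seq_len(n)) {
    kdist[i]  <- sort(D[i, -i])[k]                      # distance to k-th neighbor
    nbrs[[i]] <- setdiff(which(D[i, ] <= kdist[i]), i)  # k-neighborhood
  }
  # Local reachability density: inverse mean reachability distance
  lrd <- sapply(seq_len(n), function(i) {
    1 / mean(pmax(kdist[nbrs[[i]]], D[i, nbrs[[i]]]))
  })
  # LOF: average neighbor density relative to own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

df_multi <- rbind(data.frame(x = 1:20, y = 1:20),
                  data.frame(x = 5, y = 20))
scores <- lof_sketch(df_multi, k = 5)
which(scores > 1.5)
#> [1] 21
```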

3. Advanced Methods (Machine Learning)

For high-dimensional or complex datasets, classical statistical methods often fall short. For these cases, quickOutlier implements Isolation Forest.

# Generate a 2D blob of data
data_ml <- data.frame(
  feat1 = rnorm(100),
  feat2 = rnorm(100)
)
# Add an extreme outlier
data_ml[1, ] <- c(10, 10)

# Run Isolation Forest
# ntrees = 100 is standard. contamination = 0.05 means we expect ~5% outliers.
res_if <- detect_iforest(data_ml, ntrees = 100, contamination = 0.05)

# View the flagged rows; If_Score ranges from 0 (normal) to 1 (anomalous)
head(subset(res_if, Is_Outlier == TRUE))
#>        feat1       feat2  If_Score Is_Outlier
#> 1  10.000000 10.00000000 0.8781455       TRUE
#> 13 -1.018575  3.24103993 0.6560033       TRUE
#> 21 -2.309169  0.06529303 0.5828974       TRUE
#> 46  2.187333  0.60070882 0.5776194       TRUE
#> 80  1.444551  1.95529397 0.6017643       TRUE
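The intuition behind Isolation Forest is that anomalies are separated from the rest by random splits much sooner than normal points, so their average path length is short. A toy one-dimensional sketch of that idea (path_len and the seed are additions for this illustration, not part of detect_iforest):

```r
set.seed(1)  # added for reproducibility

# Depth at which random splits isolate the value x from the rest (capped at 10)
path_len <- function(x, pts, depth = 0) {
  if (length(pts) <= 1 || depth >= 10) return(depth)
  split <- runif(1, min(pts), max(pts))
  side  <- if (x < split) pts[pts < split] else pts[pts >= split]
  path_len(x, side, depth + 1)
}

vals   <- c(rnorm(100), 10)   # 100 normal points + 1 extreme value
depths <- sapply(vals, function(v) mean(replicate(50, path_len(v, vals))))
which.min(depths)             # the extreme value isolates fastest
```

Isolation Forest averages this path length over many random trees and many random split dimensions, then rescales it into the 0-to-1 If_Score shown above.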

4. Time Series Analysis

Detecting anomalies in a time series requires first removing Seasonality (repeating patterns) and Trend, then flagging extreme values in the Remainder.

# Create a synthetic time series: Sine wave + Noise + Outlier
t <- seq(1, 10, length.out = 60)
y <- sin(t) + rnorm(60, sd = 0.1)
y[30] <- 5 # Spike (Outlier)

# Detect using STL Decomposition
res_ts <- detect_ts_outliers(y, frequency = 12)

# Check the detected outlier
subset(res_ts, Is_Outlier == TRUE)
#>     Original      Trend  Seasonal  Remainder Is_Outlier
#> 1  0.7852833  1.3139834 0.1714747 -0.7001748       TRUE
#> 30 5.0000000 -0.6050792 0.0619607  5.5431185       TRUE
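Base R's stl() performs the same decomposition. A sketch that flags large remainders with a robust MAD rule (the seed is added for reproducibility; detect_ts_outliers' exact flagging rule may differ):

```r
set.seed(7)  # added for reproducibility
t <- seq(1, 10, length.out = 60)
y <- sin(t) + rnorm(60, sd = 0.1)
y[30] <- 5  # spike

# STL decomposition: seasonal + trend + remainder
fit <- stl(ts(y, frequency = 12), s.window = "periodic")
rem <- fit$time.series[, "remainder"]

# Flag points whose remainder sits far from the bulk
which(abs(rem - median(rem)) > 3 * mad(rem))
```

The spike survives into the remainder because the smooth trend and the repeating seasonal component cannot absorb a one-off jump.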

5. Data Cleaning & Diagnostics

Categorical Outliers (Typos)

Find categories that appear too infrequently (potential typos).

cities <- c(rep("Madrid", 10), "Barcalona", "Barcelona", "MAdrid")
detect_categorical_outliers(cities, min_freq = 0.1)
#>    Category Count  Frequency Is_Outlier
#> 1 Barcalona     1 0.07692308       TRUE
#> 2 Barcelona     1 0.07692308       TRUE
#> 3    MAdrid     1 0.07692308       TRUE
#> 4    Madrid    10 0.76923077      FALSE
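Should you want to audit the result, the same frequency check is a near one-liner in base R with table and prop.table:

```r
cities <- c(rep("Madrid", 10), "Barcalona", "Barcelona", "MAdrid")
freq <- prop.table(table(cities))

# Categories below the 10% frequency floor are candidate typos
names(freq)[freq < 0.1]
#> [1] "Barcalona" "Barcelona" "MAdrid"
```

Note that "Barcelona" is flagged too: frequency alone cannot distinguish a typo from a genuinely rare category, so the output still needs human review.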

Regression Diagnostics (Cook’s Distance)

Find points that have a disproportionate influence on a linear model.

# Use mtcars and create a high leverage point
cars_df <- mtcars
cars_df[1, "wt"] <- 10; cars_df[1, "mpg"] <- 50

infl <- diagnose_influence(cars_df, "mpg", "wt")
head(subset(infl, Is_Influential == TRUE))
#>           mpg cyl disp  hp drat wt  qsec vs am gear carb Cooks_Dist
#> Mazda RX4  50   6  160 110  3.9 10 16.46  0  1    4    4   20.59087
#>           Is_Influential
#> Mazda RX4           TRUE
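Base R's influence measures give the same diagnostic directly. A quick cross-check, assuming a simple mpg ~ wt fit as the function's arguments suggest (not necessarily the package's code path):

```r
cars_df <- mtcars
cars_df[1, "wt"]  <- 10
cars_df[1, "mpg"] <- 50

fit <- lm(mpg ~ wt, data = cars_df)
cd  <- cooks.distance(fit)

# Rule of thumb: distances above 4/n deserve a closer look
names(cd)[cd > 4 / nrow(cars_df)]
```

The manipulated Mazda RX4 row dominates: it combines high leverage (wt far from the other weights) with a large residual, which is exactly the combination Cook's Distance measures.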

Treating Outliers (Winsorization)

Instead of deleting data, it is often better to “cap” extreme values at a threshold (Winsorization).

# Create data with an extreme value
df_treat <- data.frame(val = c(1, 2, 3, 2, 1, 100))

# Cap values at 1.5 * IQR
df_clean <- treat_outliers(df_treat, "val", method = "iqr", threshold = 1.5)
print(df_clean$val)
#> [1] 1 2 3 2 1 5
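The capping itself is a few lines of base R with pmin/pmax, which makes the fence arithmetic explicit (a sketch of the same IQR rule, not treat_outliers itself):

```r
val <- c(1, 2, 3, 2, 1, 100)
q <- quantile(val, c(0.25, 0.75))   # Q1 = 1.25, Q3 = 2.75
fence <- 1.5 * (q[[2]] - q[[1]])    # 1.5 * IQR = 2.25

# Clamp every value into [Q1 - fence, Q3 + fence] = [-1, 5]
capped <- pmin(pmax(val, q[[1]] - fence), q[[2]] + fence)
capped
#> [1] 1 2 3 2 1 5
```

This reproduces the result above: only the extreme 100 is pulled down to the upper fence of 5, and all other values pass through unchanged.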