gcor package for R

gcor is an R package that provides tools for multivariate data analysis based on a generalized correlation measure.
It features a generalized version of the (absolute) correlation
coefficient for arbitrary types of data, including both numerical and
categorical variables. Missing values are also handled naturally by
treating them as observations of a categorical value NA.
Note that this project is in an early stage of development, so changes may occur frequently.
You can install the development version of gcor from GitHub with pak:
# install.packages("pak")
pak::pak("r-suzuki/gcor-r")

or with devtools:
# install.packages("devtools")
devtools::install_github("r-suzuki/gcor-r")

library(gcor)

The generalized correlation measure takes values in \([0,1]\) and can capture both linear and nonlinear relations.
When the joint distribution of \((X,Y)\) is bivariate normal, its theoretical value coincides with the absolute value of the correlation coefficient.
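The bivariate normal case can be checked informally on simulated data. The following is a minimal sketch, not part of the package documentation: the seed, sample size, and correlation value are arbitrary illustrative choices, and it assumes that gcor() accepts a data frame of two numeric columns in the same way it accepts iris. The gcor call is skipped if the package is not installed.

```r
# Sketch: compare gcor with |Pearson r| on simulated bivariate normal data.
set.seed(1)    # arbitrary seed, for reproducibility
n   <- 5000    # illustrative sample size
rho <- 0.8     # illustrative population correlation

# Construct (x, y) as bivariate normal with correlation rho, using base R only.
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Absolute value of the sample Pearson correlation.
pearson_abs <- abs(cor(x, y))
print(pearson_abs)

# Assumption: gcor() takes a data frame of numeric columns, as in gcor(iris).
if (requireNamespace("gcor", quietly = TRUE)) {
  g <- gcor::gcor(data.frame(x = x, y = y))
  print(g)  # the off-diagonal entry is the estimated generalized correlation
}
```

Since the statement above concerns the theoretical value, the estimate from a finite sample should be expected to agree with the absolute Pearson correlation only approximately.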
# Generalized correlation measure
gcor(iris)
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> Sepal.Length    1.0000000   0.2349075    0.8846517   0.8741873 0.7718469
#> Sepal.Width     0.2349075   1.0000000    0.3143301   0.2669031 0.6591442
#> Petal.Length    0.8846517   0.3143301    1.0000000   0.9503289 0.8323583
#> Petal.Width     0.8741873   0.2669031    0.9503289   1.0000000 0.8339534
#> Species         0.7718469   0.6591442    0.8323583   0.8339534 1.0000000

The directed generalized correlation is another variation of the generalized correlation. It also takes values in \([0,1]\), reaching \(1\) when \(Y\) is completely dependent on \(X\) (i.e., when the conditional distribution \(f(Y \mid X)\) is a one-point distribution) and \(0\) when \(X\) and \(Y\) are independent.
# Dependency of Species on other variables
dgc <- dgcor(Species ~ ., data = iris)
dotchart(sort(dgc), main = "Dependency of Species")

With \(r_g\) as the generalized correlation between \(X\) and \(Y\), we can define a dissimilarity measure:
\[ d(X,Y) = \sqrt{1 - r^2_g} \]
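As a quick worked example, take the value reported in the gcor(iris) output for Petal.Length and Petal.Width, \(r_g = 0.9503289\). The figure below is computed by hand from the formula, not taken from gdis() output:

\[ d = \sqrt{1 - 0.9503289^2} \approx \sqrt{0.0969} \approx 0.311 \]

At the extremes, \(r_g = 1\) gives \(d = 0\) and \(r_g = 0\) gives \(d = 1\), so strongly related variables are close together and unrelated ones are far apart.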
It can be applied to cluster analysis:
# Clustering
gd <- gdis(iris)
hc <- hclust(gd, method = "ward.D2")
plot(hc)

Multidimensional scaling is another application:
# Multidimensional scaling
mds <- cmdscale(gd, k = 2)
plot(mds, type = "n", xlab = "", ylab = "", asp = 1, axes = FALSE,
     main = "cmdscale with gdis(iris)")
text(mds[,1], mds[,2], rownames(mds))