| Title: | Heterogeneous Methods for Shape and Other Multidimensional Data |
|---|---|
| Description: | Tools for geometric morphometric analyses and multidimensional data. Implements methods for morphological disparity analysis using bootstrap and rarefaction, as reviewed in Foote (1997) <doi:10.1146/annurev.ecolsys.28.1.129>. Includes integration and modularity testing, following Fruciano et al. (2013) <doi:10.1371/journal.pone.0069376>, using Escoufier's RV coefficient as test statistic as well as two-block partial least squares - PLS, Rohlf and Corti (2000) <doi:10.1080/106351500750049806>. Also includes vector angle comparisons, orthogonal projection for data correction (Burnaby (1966) <doi:10.2307/2528217>; Fruciano (2016) <doi:10.1007/s00427-016-0537-4>), and parallel analysis for dimensionality reduction (Buja and Eyuboglu (1992) <doi:10.1207/s15327906mbr2704_2>). |
| Authors: | Carmelo Fruciano [aut, cre] |
| Maintainer: | Carmelo Fruciano <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.6.1.1 |
| Built: | 2026-05-17 09:29:22 UTC |
| Source: | https://github.com/fruciano/geometricmorphometricsmix |
Permutation test of the adjusted Rand index, which quantifies the level of agreement between two partitions (e.g., two schemes of classification of the same individuals obtained with two methods)
adjRand_test(A, B, perm = 999)adjRand_test(A, B, perm = 999)
A, B
|
numerical or character vectors reflecting the assignment of individual observations to groups |
perm |
number of permutations |
The adjusted Rand index (Hubert and Arabie 1985), is an adjusted for chance version of the Rand index (Rand 1971). The adjusted Rand index has an expected value of zero in the case of random partitions, and values approaching one as the two partitions become more similar to each other (with one being perfect match of the classification). This function implements the permutation test proposed by Qannari et al. (2014) to obtain a p value against the null hypothesis of independence of the two partitions.
This function is useful in various contexts, such as in integrative taxonomy when comparing the classification of individual specimens obtained using different data (e.g., sequence data and morphometric data). For an example of the application of this technique with the classification obtained with genetic data and morphometric data for multiple traits, see Fruciano et al. 2016.
The function outputs a vector with the adjusted Rand index and the p value obtained from the permutation test
The function requires internally the package mclust.
If you use this function in the context of integrative taxonomy or similar (comparison of classification/unsupervised clustering with biological data), please cite all the papers in the references (otherwise, please use the relevant citations for the context).
Fruciano C, Franchini P, Raffini F, Fan S, Meyer A. 2016. Are sympatrically speciating Midas cichlid fish special? Patterns of morphological and genetic variation in the closely related species Archocentrus centrarchus. Ecology and Evolution 6:4102-4114.
Hubert L, Arabie P. 1985. Comparing partitions. Journal of Classification 2:193-218.
Qannari EM, Courcoux P, Faye P. 2014. Significance test of the adjusted Rand index. Application to the free sorting task. Food Quality and Preference 32, Part A:93-97.
Rand WM. 1971. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association 66:846-850.
library(mclust) set.seed(123) irisBIC = mclustBIC(iris[,-5]) mclustBIC_classification = summary(irisBIC,iris[,-5])$classification original_classification = iris[,5] # This is one of the examples in the package mclust # Here a classification algorithm is used on the iris dataset adjustedRandIndex(mclustBIC_classification, original_classification) # The mclust package allows computing the adjusted Rand index # which quantifies the agreement between the original (correct) # classification and the one obtained with the algorithm. # However, it is not clear whether the adjusted Rand index is # "large enough" compared to the null hypothesis of independence # between the two classification schemes adjRand_test(mclustBIC_classification, original_classification, perm = 999) # For that, we use the function adjRand_test, which performs # the permutation test of Qannari et al. 2014 (in this case # p<0.001, as 1000 permutations have been used). adjRand_test(original_classification, original_classification, perm = 999) # As it can be seen, in the ideal case of the exact same grouping, # the adjusted Rand index takes a value of 1 (which is obviously # significant)library(mclust) set.seed(123) irisBIC = mclustBIC(iris[,-5]) mclustBIC_classification = summary(irisBIC,iris[,-5])$classification original_classification = iris[,5] # This is one of the examples in the package mclust # Here a classification algorithm is used on the iris dataset adjustedRandIndex(mclustBIC_classification, original_classification) # The mclust package allows computing the adjusted Rand index # which quantifies the agreement between the original (correct) # classification and the one obtained with the algorithm. # However, it is not clear whether the adjusted Rand index is # "large enough" compared to the null hypothesis of independence # between the two classification schemes adjRand_test(mclustBIC_classification, original_classification, perm = 999) # For that, we use the function adjRand_test, which performs # the permutation test of Qannari et al. 2014 (in this case # p<0.001, as 1000 permutations have been used). adjRand_test(original_classification, original_classification, perm = 999) # As it can be seen, in the ideal case of the exact same grouping, # the adjusted Rand index takes a value of 1 (which is obviously # significant)
A list containing the body arching vector obtained using common principal components following the procedure described in Fruciano et al. 2020, and the GPA consensus used to align individual models when the vector was computed. The GPA consensus is provided to align new data using the same consensus), but using the consensus for downstream work is not recommended.
data(arching_vector)data(arching_vector)
A list with two components as described above.
The object is a list with the following elements:
A matrix of landmark coordinates for the GPA consensus to which the models have been aligned (brown trout dataset in this package)
A numeric vector of shape change obtained using common principal component analysis as detailed in Fruciano et al. 2020.
Fruciano C., Schmidt, I., Ramirez Sanchez, M.M., Morek, W., Avila Valle, Z.A., Talijancic, I., Pecoraro, C., Schermann Legionnet, A. 2020. Tissue preservation can affect geometric morphometric analyses: a case study using fish body shape. Zoological Journal of the Linnean Society 188:148-162.
ProjectOrthogonal for projection/removal of an effect
data(arching_vector) a = arching_vector names(a) dim(a$GPA_consensus_used) length(a$arching_vector_CPCA)data(arching_vector) a = arching_vector names(a) dim(a$GPA_consensus_used) length(a$arching_vector_CPCA)
A list containing landmark coordinates for a subset of brown trout specimens used in Fruciano et al. 2014 (Biological Journal of the Linnean Society) and Fruciano et al. 2020 (Zoological Journal of the Linnean Society). The subset originates from two rivers and was digitised by a single operator.
data(brown_trout)data(brown_trout)
A list with components described above.
Notice that the dataset has been realigned using Generalized
Procrustes Analysis (GPA) and corrected for body arching using the
arching vector in data(arching_vector).
Because this is a small subset of the fish in the original study
(Fruciano et al. 2014), and contain only data from a single
operator and treatment from Fruciano et al. 2020,
the dataset should be strictly considered as a "toy" dataset and
not used for formal statistical analyses. For conclusions about
biological variation and measurement error due to preservation,
please refer to the original studies (Fruciano et al. 2014,
Fruciano et al. 2020, respectively).
The object is a list with the following components:
Landmark coordinates for specimens without correction for body arching.
Landmark coordinates for the same specimens after correction for body arching.
A data.frame with metadata for each specimen (e.g., centroid size, sex and river).
Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.
Fruciano C., Schmidt, I., Ramirez Sanchez, M.M., Morek, W., Avila Valle, Z.A., Talijancic, I., Pecoraro, C., Schermann Legionnet, A. 2020. Tissue preservation can affect geometric morphometric analyses: a case study using fish body shape. Zoological Journal of the Linnean Society 188:148-162.
data(brown_trout) d = brown_trout names(d) str(d$Factors)data(brown_trout) d = brown_trout names(d) str(d$Factors)
Performs the BTailTest, in the same spirit as implemented in the Matlab package MDA (Navarro 2003) and used in various empirical papers (e.g., Fruciano et al. 2014, 2016).
BTailTest(Reference, Test, boot = 1000)BTailTest(Reference, Test, boot = 1000)
Reference |
Matrix or data frame containing data for the reference group (observations in rows, variables in columns). |
Test |
Matrix or data frame containing data for the test group (observations in rows, variables in columns). |
boot |
number of bootstrap replicates |
This is a test of the difference in disparity between two groups: a reference and a test group. The function proceeds by computing a bootstrapped distribution of the test statistics (multivariate variance and mean pairwise Euclidean distances in this implementation) in the reference sample and then comparing the statistics observed in the test sample to this distribution to obtain p-values.
The function outputs a list with the following elements:
Estimates of both multivariate variance and mean pairwise Euclidean distance for each of the bootstrapped samples
p values obtained for the test
If you use this function, in addition to this package, please cite Navarro (2003)
Navarro N. 2003. MDA: a MATLAB-based program for morphospace-disparity analysis. Computers & Geosciences 29:655-664.
Fruciano C, Franchini P, Raffini F, Fan S, Meyer A. 2016. Are sympatrically speciating Midas cichlid fish special? Patterns of morphological and genetic variation in the closely related species Archocentrus centrarchus. Ecology and Evolution 6:4102-4114.
Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.
disparity_resample,
disparity_test
This function computes the angle below which two vectors would be "significantly similar" using Li 2011 test.
critical_angle(dimensions, alpha = 0.05, prec = "normal")critical_angle(dimensions, alpha = 0.05, prec = "normal")
dimensions |
number of dimensions to use |
alpha |
significance level (by default 0.05) |
prec |
whether one has to use the internal R functions ("normal", default), or mpfr precision ("mpfr") |
This function considers the formulas in Li 2011 which have been used to perform a test of the angle between multivariate vectors. This is the test implemented in MorphoJ (Klingenberg 2011), in the R package Morpho (Schlager 2016), and in the function TestOfAngle of this package. The test produces a p value for the angle between two vectors (with significant values being "significantly similar").
This function reverts the logic of the formula/test and, given a number of dimensions (e.g., morphometric variables) and a level of significance (by default 0.05), it computes the angle below which the test would be significant.
The function outputs the critical angle (in degrees)
In case of very many dimensions, there are numerical problems in performing the test. If prec="normal" the function uses internally the package Morpho (which should be installed) and these problems will start occurring above about 340 variables. If prec="mpfr" the function uses internally a custom function which requires the package Rmpfr (which should be installed). This allows the computation of extremely large or small numbers, it is currently set to a 120 bit precision and allows the computation using up to about 1200 variables (over this number, there will be problems even using the option "mpfr")
Klingenberg CP. 2011. MorphoJ: an integrated software package for geometric morphometrics. Mol Ecol Resour 11:353-357.
Li S. 2011. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics 4:66-70.
Schlager S. 2016. Morpho: Calculations and Visualisations Related to Geometric Morphometrics.
critical_angle(50) # This is the angle below which two vectors (e.g., PCA axes) # of 50 variables would be would be "significantly similar"critical_angle(50) # This is the angle below which two vectors (e.g., PCA axes) # of 50 variables would be would be "significantly similar"
Provides a unified interface to obtain resampled (bootstrap or rarefied) estimates of several disparity / morphospace occupation statistics.
disparity_resample( Data, group = NULL, n_resamples = 1000, statistic = "multivariate_variance", CI = 0.95, bootstrap_rarefaction = "bootstrap", sample_size = NULL )disparity_resample( Data, group = NULL, n_resamples = 1000, statistic = "multivariate_variance", CI = 0.95, bootstrap_rarefaction = "bootstrap", sample_size = NULL )
Data |
A data frame, matrix, vector, or 3D array. Observations (specimens) must be in rows (if a 3D array is supplied, the third dimension is assumed to index specimens). |
group |
A factor or a vector indicating group membership (will be coerced to factor). If 'NULL' (default) a single analysis is performed on the full dataset. |
n_resamples |
Number of resampling replicates (default 1000). |
statistic |
Character string identifying the statistic to compute. One of '"multivariate_variance"', '"mean_pairwise_euclidean_distance"', '"convex_hull_volume"', '"claramunt_proper_variance"'. Default is '"multivariate_variance"'. Ignored for univariate data (vector input). |
CI |
Desired two-sided confidence interval level (default 0.95) used for percentile confidence intervals. Use CI=1 to display the full range (minimum to maximum) of resampled values. |
bootstrap_rarefaction |
Either '"bootstrap"' (default) for resampling with replacement or '"rarefaction"' for resampling without replacement. |
sample_size |
Either 'NULL' (default), a positive integer indicating the number of rows to sample, or the character '"smallest"' to use the size of the smallest group (all groups resampled to that size). For 'bootstrap_rarefaction=="rarefaction"', sampling is without replacement and this parameter is required (not 'NULL'). For 'bootstrap_rarefaction=="bootstrap"', sampling is with replacement; if 'NULL', uses original group sizes, otherwise uses the specified sample size. If '"smallest"' is supplied but no groups are defined, an error is returned. |
The function allows choosing among the following multivariate statistics:
Multivariate variance (trace of the covariance matrix; sum of variances)
Mean pairwise Euclidean distance
Convex hull volume (n-dimensional)
Claramunt proper variance (Claramunt 2010) based on a linear shrinkage covariance matrix
If the input 'Data' is univariate (i.e., a vector), the analysis defaults to computing the (univariate) variance within each group (the attribute 'statistic' is ignored).
If 'bootstrap_rarefaction=="bootstrap"', the function performs resampling with replacement (i.e., classical bootstrap) by sampling rows of the data matrix / data frame. Optionally, the user can specify a custom sample size via the 'sample_size' argument. This allows comparison of bootstrap confidence intervals at the same sample size (essentially, this is rarefaction sampling with replacement), which can be useful to compare bootstrapped confidence intervals across different groups for statistics which are sensitive to sample size (at the expense of broader than necessary confidence intervals for groups that are larger).
If 'bootstrap_rarefaction=="rarefaction"', the function performs resampling without replacement at the sample size indicated in 'sample_size' (numeric) or, if 'sample_size=="smallest"', at the size of the smallest group (all groups are resampled to that size). Rarefaction requires specifying a valute to the attribute 'sample_size'; an error is returned otherwise.
This function uses the future framework for parallel processing,
requiring packages future and future.apply.
Users should set up their preferred parallelization strategy using
future::plan() before calling this function.
For example:
future::plan(future::sequential) for sequential
processing
future::plan(future::multisession, workers = 4) for
parallel processing with 4 workers (works on all platforms
including Windows)
future::plan(future::multicore, workers = 4) for
forked processes (Unix-like systems)
future::plan(future::cluster, workers =
c("host1", "host2")) for cluster computing
If no plan is set or the future packages are not available, the function will use sequential processing.
A list containing:
Character vector of length 1 with the human-readable name of the statistic used.
A data frame with columns 'group', 'average', 'CI_min', 'CI_max'. One row per group. When CI=1, 'CI_min' and 'CI_max' represent the minimum and maximum of resampled values rather than confidence intervals.
If a single group: numeric vector of length 'n_resamples' with the resampled values. If multiple groups: a named list with one numeric vector (length 'n_resamples') per group.
The CI level used (between 0 and 1). When CI=1, ranges are computed instead of confidence intervals.
The returned object has class "disparity_resample" and comes with associated S3 methods for convenient display and visualization:
print.disparity_resample: Prints a formatted
summary of results including confidence interval overlap
assessment for multiple groups
plot.disparity_resample: Creates a confidence
interval plot using ggplot2
This function automatically uses parallel processing via the future
framework, when the packages future and future.apply are installed.
This is particularly useful for large datasets, large number of
resamples or computationally intensive statistics (e.g., convex
hull volume).
The parallelization strategy is determined by the user's choice of
future plan, providing flexibility across different computing
environments (local multicore, cluster, etc.).
The function performs parallelization at the level of individual
bootstrap/rarefaction replicates within each group. The future plan
should be set up by the user before calling this function using
future::plan() (see examples). If no plan is set or the
future packages are not available, the function will use sequential
processing.
For bootstrap resampling, the average estimate reported is the mean of the bootstrap resampled values (for consistency across all bootstrap scenarios). For rarefaction, because the purpose is to make groups comparable at a common (smaller) sample size, the average estimate reported is the mean of the rarefied resampled values for that group (i.e., the mean across all rarefaction replicates).
'Data' can be a data frame, a matrix, a vector, or a 3D array of landmark coordinates (e.g., p landmarks x k dimensions x n specimens). In the latter case, the array is converted internally to a 2D matrix with specimens in rows and (landmark * dimension) variables in columns.
Because of how the computation works, convex hull volume computation requires the number of observations (specimens) to be (substantially) greater than the number of variables (dimensions). In case of shape or similar, consider using the scores along the first (few/several) principal components. Sometimes errors are thrown due to near-zero components, in this case try reducing the number of principal components used. Examples of use of this statistic with geometric morphometric data include Drake & Klingenberg 2010 (American Naturalist), Fruciano et al. 2012 (Environmental Biology of Fishes) and Fruciano et al. 2014 (Biological Journal of the Linnean Society). Because of the sensitivity of this statistic to outliers, usually rarefaction is preferred to bootstrapping.
"Multivariate variance" is also called "total variance", "Procrustes variance" (in geometric morphometrics) and "sum of univariate variances". Note how the computation here does not divide variance by sample size (other than the normal division performed in the computation of variances).
Drake AG, Klingenberg CP. 2010. Large-scale diversification of skull shape in domestic dogs: disparity and modularity. American Naturalist 175:289-301.
Claramunt S. 2010. Discovering exceptional diversifications at continental scales: the case of the endemic families of Neotropical Suboscine passerines. Evolution 64:2004-2019.
Fruciano C, Tigano C, Ferrito V. 2012. Body shape variation and colour change during growth in a protogynous fish. Environmental Biology of Fishes 94:615-622.
Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.
Fruciano C, Franchini P, Raffini F, Fan S, Meyer A. 2016. Are sympatrically speciating Midas cichlid fish special? Patterns of morphological and genetic variation in the closely related species Archocentrus centrarchus. Ecology and Evolution 6:4102-4114.
disparity_test,
print.disparity_resample,
plot.disparity_resample
set.seed(123) # Simulate two groups with different means but same covariance if (requireNamespace("MASS", quietly = TRUE)) { X1 = MASS::mvrnorm(20, mu=rep(0, 10), Sigma=diag(10)) X2 = MASS::mvrnorm(35, mu=rep(2, 10), Sigma=diag(10)) Data = rbind(X1, X2) grp = factor(c(rep("A", nrow(X1)), rep("B", nrow(X2)))) # Sequential processing # future::plan(future::sequential) # Default sequential processing # Parallel processing (uncomment to use) # future::plan(future::multisession, workers = 2) # Use 2 workers # Bootstrap multivariate variance boot_res = disparity_resample( Data, group=grp, n_resamples=200, statistic="multivariate_variance", bootstrap_rarefaction="bootstrap" ) # Direct access to results table boot_res$results # Using the print method for formatted output print(boot_res) # Using the plot method to visualize results # plot(boot_res) # Uncomment to create confidence interval plot # Rarefaction (to the smallest group size) of mean pairwise # Euclidean distance rar_res = disparity_resample( Data, group=grp, n_resamples=200, statistic="mean_pairwise_euclidean_distance", bootstrap_rarefaction="rarefaction", sample_size="smallest" ) # Now simulate a third group with larger variance X3 = MASS::mvrnorm(15, mu=rep(0, 10), Sigma=diag(10)*1.5) grp2 = factor( c(rep("A", nrow(X1)), rep("B", nrow(X2)), rep("C", nrow(X3))) ) boot_res2 = disparity_resample( Data=rbind(X1, X2, X3), group=grp2, n_resamples=1000, statistic="multivariate_variance", bootstrap_rarefaction="bootstrap" ) print(boot_res2) # plot(boot_res2) # Plot of the obtained (95%) confidence intervals (uncomment to plot) # Reset to sequential processing when done (optional) # future::plan(future::sequential) }set.seed(123) # Simulate two groups with different means but same covariance if (requireNamespace("MASS", quietly = TRUE)) { X1 = MASS::mvrnorm(20, mu=rep(0, 10), Sigma=diag(10)) X2 = MASS::mvrnorm(35, mu=rep(2, 10), Sigma=diag(10)) Data = rbind(X1, X2) grp = factor(c(rep("A", nrow(X1)), rep("B", nrow(X2)))) # Sequential processing # future::plan(future::sequential) # Default sequential processing # Parallel processing (uncomment to use) # future::plan(future::multisession, workers = 2) # Use 2 workers # Bootstrap multivariate variance boot_res = disparity_resample( Data, group=grp, n_resamples=200, statistic="multivariate_variance", bootstrap_rarefaction="bootstrap" ) # Direct access to results table boot_res$results # Using the print method for formatted output print(boot_res) # Using the plot method to visualize results # plot(boot_res) # Uncomment to create confidence interval plot # Rarefaction (to the smallest group size) of mean pairwise # Euclidean distance rar_res = disparity_resample( Data, group=grp, n_resamples=200, statistic="mean_pairwise_euclidean_distance", bootstrap_rarefaction="rarefaction", sample_size="smallest" ) # Now simulate a third group with larger variance X3 = MASS::mvrnorm(15, mu=rep(0, 10), Sigma=diag(10)*1.5) grp2 = factor( c(rep("A", nrow(X1)), rep("B", nrow(X2)), rep("C", nrow(X3))) ) boot_res2 = disparity_resample( Data=rbind(X1, X2, X3), group=grp2, n_resamples=1000, statistic="multivariate_variance", bootstrap_rarefaction="bootstrap" ) print(boot_res2) # plot(boot_res2) # Plot of the obtained (95%) confidence intervals (uncomment to plot) # Reset to sequential processing when done (optional) # future::plan(future::sequential) }
Performs a permutation test of difference in disparity between two groups.
disparity_test(X1, X2, perm = 999)disparity_test(X1, X2, perm = 999)
X1, X2
|
Matrices or data frames containing data for each group (observations in rows, variables in columns). |
perm |
number of permutations |
The function employs commonly used test statistics to quantify disparity/morphospace occupation/variation in each group. The two statistics currently implemented are multivariate variance (also known as sum of variances, trace of the covariance matrix, Procrustes variance), and mean pairwise Euclidean distances. These two metrics have a long history in the quantification of disparity both in geometric morphometrics (e.g., Zelditch et al. 2004; Fruciano et al., 2014, 2016) and more in general in evolution (e.g., Foote, 1996; Willis 2001) The observed statistics are then compared to their empirical distributions obtained through permutations, to obtain a p-value.
The function outputs a dataframe containing: the observed values of the tests statistics for each group, their absolute differences, and The p values obtained through the permutational procedure
The values of the test statistics in the output are the observed
in the sample.
If they are of interest, and the two groups have different sample
size, consider computing their rarefied versions (for instance with
the function disparity_resample)
for reporting in papers and the like.
Foote M. 1997. The evolution of morphological diversity. Annual Review of Ecology and Systematics 28:129.
Wills MA. 2001. Morphological disparity: a primer. In. Fossils, phylogeny, and form: Springer. p. 55-144.
Zelditch ML, Swiderski DL, Sheets HD. 2004. Geometric morphometrics for biologists: a primer: Academic Press.
Fruciano C, Franchini P, Raffini F, Fan S, Meyer A. 2016. Are sympatrically speciating Midas cichlid fish special? Patterns of morphological and genetic variation in the closely related species Archocentrus centrarchus. Ecology and Evolution 6:4102-4114.
Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.
library(MASS) set.seed(123) X1=mvrnorm(20, mu=rep(0, 40), Sigma=diag(40)) X2=mvrnorm(100, mu=rep(5, 40), Sigma=diag(40)) # create two groups of random observations # with different means and sample sizes, # but the same covariance matrix # We expect that the two groups will have the same # variance (disparity/morphospace occupation) # and therefore the test will be non-significant disparity_test(X1, X2, perm=999) # This is, indeed, the caselibrary(MASS) set.seed(123) X1=mvrnorm(20, mu=rep(0, 40), Sigma=diag(40)) X2=mvrnorm(100, mu=rep(5, 40), Sigma=diag(40)) # create two groups of random observations # with different means and sample sizes, # but the same covariance matrix # We expect that the two groups will have the same # variance (disparity/morphospace occupation) # and therefore the test will be non-significant disparity_test(X1, X2, perm=999) # This is, indeed, the case
Computes bootstrapped estimates of the mean distance between two groups and their confidence intervals.
dist_mean_boot(A, B, boot = 1000, ci = 0.95, nA = nrow(A), nB = nrow(B))dist_mean_boot(A, B, boot = 1000, ci = 0.95, nA = nrow(A), nB = nrow(B))
A, B
|
Matrices or data frames containing data (observations in rows, variables in columns). |
boot |
number of bootstrap resamples |
ci |
width of the confidence interval |
nA, nB
|
sample sizes for each bootstrapped group (defaults to original sample size) |
This may be useful to compare whether the differences between two groups are larger or smaller than differences between two other groups.
For instance, if we wanted to quantify shape sexual dimorphism in two populations, we could run this analysis separately for the two populations and then check the confidence intervals. If the two confidence intervals are disjunct there is evidence for the two populations having different levels of sexual dimorphism.
The computation performs bootstrap by resampling with replacement within each of the two groups and at each round computing the Euclidean distance between the two groups. It is also possible to resample at a different sample size than the one in the data using the attributes nA and nB.
Notice that the confidence interval is expressed on a scale between 0 and 1 and not as a percentage (e.g., 0.95 means 95
The function outputs a named vector with the mean, median, upper and lower confidence interval bounds obtained from the bootstrapped samples
Computes the Escoufier RV coefficient
EscoufierRV(Block1, Block2)EscoufierRV(Block1, Block2)
Block1, Block2
|
Matrices or data frames containing each block of variables (observations in rows, variables in columns). |
This function computes the usual version of the Escoufier RV coefficient (Escoufier, 1973), which quantifies the level of association between two multivariate blocks of variables. The function accepts two blocks of variables, either two data frames or two matrices each of n observations (specimens) as rows. The two blocks must have the same number of rows (specimens), but can have different number of columns (variables, such as landmark coordinates). The Escoufier RV has been shown (Fruciano et al. 2013) to be affected by sample size so comparisons of groups (e.g., species, populations) with different sample size should be avoided, unless steps are taken to account for this problem
The function returns a number, corresponding to the Escoufier RV coefficient
Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.
Fruciano C, Franchini P, Meyer A. 2013. Resampling-Based Approaches to Study Variation in Morphological Modularity. PLoS ONE 8:e69376.
library(MASS) set.seed(123) A=mvrnorm(100,mu=rep(0,100), Sigma=diag(100)) # Create a sample of 100 'individuals' # as multivariate normal random data # We will consider the first 20 columns as the first # block of variables, and the following one as the second block EscoufierRV(A[,1:20],A[,21:ncol(A)]) # Compute the EscoufierRV using the two blocks of variableslibrary(MASS) set.seed(123) A=mvrnorm(100,mu=rep(0,100), Sigma=diag(100)) # Create a sample of 100 'individuals' # as multivariate normal random data # We will consider the first 20 columns as the first # block of variables, and the following one as the second block EscoufierRV(A[,1:20],A[,21:ncol(A)]) # Compute the EscoufierRV using the two blocks of variables
The function compares the relative position of a set of landmarks to the one observed in a reference specimen. It also outputs which coordinates do not much the pattern observed in the reference specimen.
LM_relativepos_check( Dataset, LM_to_check, reference_specimen = 1, only_by_landmark_order = FALSE, use_specimen_names = TRUE )LM_relativepos_check( Dataset, LM_to_check, reference_specimen = 1, only_by_landmark_order = FALSE, use_specimen_names = TRUE )
Dataset |
k x p x n array of k landmarks and p dimensions for n observations (specimens). |
LM_to_check |
Vector with the indices of which landmarks of Dataset should be tested |
reference_specimen |
The index of the specimen to use as reference (by default, the first). |
only_by_landmark_order |
If TRUE, each landmark in LM_to_check will be tested relative to the next in LM_to_check; otherwise all the pairwise relative positions between landmarks will be run |
use_specimen_names |
whether the names of specimens in Dataset should be used for the output |
Compare relative positions of landmarks to the one observed in a reference specimen. This function is useful to identify specimens with switched landmark positions and similar problems when a dataset has been collected using consistent criteria (e.g., a set of fish body pictures, with the mouth-tail axis approximately horizontal for all specimens).
For instance, if we want to check that landmarks 1, 2, and 3 are all aligned along the y coordinate in 2D, we will use a specimen (usually the first), checking that it has been landmarked correctly. Then, we will run the function on landmarks 1, 2, and 3 (by setting LM_to_check=c(1, 2, 3)). The function will output a list of which specimens seem to be problematic and which landmarks and coordinates for these specimens seem problematic. The putatively problematic coordinates will be indicated with "0". Clearly, if we are only interested in 1, 2, and 3 being along y, it will not matter if in some case we will find "0" under Coord1, but only if we find a "0" under Coord2.
The function outputs a list with the following elements:
Indices of potentially problematic specimens
List of potentially problematic landmarks and coordinates for the specimens in potentially_problematic_idx
Generally, it is better to use this function with a small number of carefully selected landmarks, instead of running it on all landmarks at the same time.
The parameter only_by_landmark_order allows to set whether all combinations of landmarks should be tested (default), or not.
If only_by_landmark_order is set to TRUE, each landmark in LM_to_check will be only tested with the next one. For instance if LM_to_check=c(1,2,3) the function will only compare the first landmark with the second, and the second with the third.
If the function cannot find potentially problematic specimens, a message to this effect will be presented.
Parallel analysis based on permutations
parallel_analysis(X, perm = 999, fun = c("prcomp", "fastSVD", "shrink"))parallel_analysis(X, perm = 999, fun = c("prcomp", "fastSVD", "shrink"))
X |
Matrix or data frame containing the original data (observations in rows, variables in columns). |
perm |
number of permutations |
fun |
function to use internally to obtain eigenvalues (see Details) |
The function allows performing parallel analysis, which is a way to test for the number of significant eigenvalues/axes in a PCA. In this implementation, a null distribution of eigenvalues is obtained by randomly permuting observations independently for each of the starting variables. To compute p values, the observed eigenvalues are compared to the corresponding eigenvalues from this null distribution.
Parallel analysis may be used for dimensionality reduction, retaining only the first block of consecutive significant axes. That is, if for example the first 3 axes were significant, then the fourth not significant, one would keep only the first 3 axes (regardless of significance of the axes from the fifth on). Similarly, if the first axis is not significant, this may suggest lack of a clear structure in the data.
The function internally employs three possible strategies to obtain eigenvalues (argument of fun):
"prcomp" - the function prcomp (default)
"fastSVD" - an approach based on the function fast.svd (requires the package corpcor)
"shrink" - a decomposition of the covariance matrix estimated using linear shrinkage (much slower, requires the package nlshrink; Ledoit & Wolf 2004)
This choice should not make much difference in terms of the final result. However, for consistency, it is a good idea to use for parallel analysis the same function used for the actual PCA (this is why these three options are provided).
Vector of class "parallel_analysis" containing the p values for each of the axes of a PCA on the data provided
The object of class parallel_analysis returned by the function has a summary() method associated to it. This means that using summary() on an object created by this function, a suggestion on the number of significant axes (if any) is provided (see examples).
The most appropriate citation for this approach to parallel analysis (using permutations) is Buja & Eyuboglu (1992).
Ledoit O, Wolf M. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365-411.
Buja A, Eyuboglu N. 1992. Remarks on parallel analysis. Multivariate Behavioral Research 27(4):509-540.
set.seed(666) X=MASS::mvrnorm(100, mu=rep(0, 50), Sigma=diag(50)) # Simulate a multivariate random normal dataset # with 100 observations and 50 indipendent variables PA=parallel_analysis(X, perm = 999, fun = "fastSVD") # Perform parallel analysis summary(PA) # Look at a summary of the results from parallel analysis # Notice that no axis is significant # This is correct, as we had simulated data with no structure print(PA) # Look at the p values for each axisset.seed(666) X=MASS::mvrnorm(100, mu=rep(0, 50), Sigma=diag(50)) # Simulate a multivariate random normal dataset # with 100 observations and 50 indipendent variables PA=parallel_analysis(X, perm = 999, fun = "fastSVD") # Perform parallel analysis summary(PA) # Look at a summary of the results from parallel analysis # Notice that no axis is significant # This is correct, as we had simulated data with no structure print(PA) # Look at the p values for each axis
Creates a confidence interval plot for disparity resample results
## S3 method for class 'disparity_resample' plot( x, point_color = "darkblue", errorbar_color = "darkred", title = NULL, ... )## S3 method for class 'disparity_resample' plot( x, point_color = "darkblue", errorbar_color = "darkred", title = NULL, ... )
x |
An object of class "disparity_resample" |
point_color |
A single color or a vector of colors for point estimates. If length 1, the same color is used for all points. If length equals the number of groups, colors are assigned per group. (default "darkblue") |
errorbar_color |
A single color or a vector of colors for error bars. Follows the same recycling rules as 'point_color'. (default "darkred") |
title |
Optional character string for plot title (default NULL) |
... |
Additional arguments (currently unused) |
A ggplot object
Creates a confidence interval plot for RV rarefaction results
## S3 method for class 'EscoufierRVrarefy' plot(x, point_color = "darkblue", errorbar_color = "darkred", ...)## S3 method for class 'EscoufierRVrarefy' plot(x, point_color = "darkblue", errorbar_color = "darkred", ...)
x |
An object of class "EscoufierRVrarefy" |
point_color |
A single color or a vector of colors for point estimates. If length 1, the same color is used for all points. If length equals the number of groups, colors are assigned per group. (default "darkblue") |
errorbar_color |
A single color or a vector of colors for error bars. Follows the same recycling rules as 'point_color'. (default "darkred") |
... |
Additional arguments passed to the underlying plotting function |
A ggplot object
Performs a two-block PLS analysis, optionally allowing for tests of significance using permutations
pls(X, Y, perm = 999, global_RV_test = TRUE)pls(X, Y, perm = 999, global_RV_test = TRUE)
X, Y
|
Matrices or data frames containing each block of variables (observations in rows, variables in columns). |
perm |
number of permutations to use for hypothesis testing |
global_RV_test |
logical - whether global significance of the association should be tested |
This function performs a PLS analysis (sensu Rohlf & Corti 2000). Given two blocks of variables (shape or other variables) scored on the same observations (specimens), this analysis finds a series of pairs of axis accounting for maximal covariance between the two blocks. If tests of significance with permutations are selected, three different significance tests are performed:
Tested using Escoufier RV
Same test described in Rohlf & Corti 2000
For each pair of PLS (singular) axes, tests the correlation between scores of the first and second block
The object of class pls_fit returned by the function has print() and summary() methods associated to it. This means that using these generic functions on an object created by this function (see examples), it is possible to obtain information on the results. In particular, print() returns a more basic set of results on the global association, whereas summary() returns (only if permutation tests are used) results for each pair of singular axes.
Optionally, analysis using multiple cores to speedup permutation testin can be performed using the future framework (requires package future.apply and setting a plan with multiple workers)
The function outputs an object of class "pls_fit" and "list" with the following elements:
Scores along each singular (PLS) axis for the first block of variables (X)
Scores along each singular (PLS) axis for the second block of variables (Y)
Left singular axes
Right singular axes
Singular values
Percented squared covariance accounted by each pair of axes
(only if perm>0 and global_RV_test is TRUE) Observed value of Escoufier RV coefficient and p value obtained from the permutation test
(only if perm>0) For each pair of singular (PLS) axis, the singular value, the correlation between scores, their significance level based on permutation, and the proportion of squared covariance accounted are reported
Data used in the analysis
Values used to center data in the X block
Values used to center data in the Y block
The function does NOT perform GPA when applied to separate configurations of points.
When using the Escoufier RV, notice that the value reported is the observed value without rarefaction. For a description of the problem, please see Fruciano et al 2013. To obtain rarefied estimates of Escoufier RV and their confidence interval, use the function RVrarefied.
In the permutation test, rows of Y are permuted, so using the block with fewer variables as Y may speed up computations and substantially reduce memory usage.
When using the print() and summary() on the pls_fit objects obtained with this function, some of the values are rounded for ease of interpretation. The non-rounded values can be obtained accessing individual elements of the object (see examples).
If you use this function to perform the PLS analysis and test for significance, cite Rohlf & Corti 2000 (or earlier references outside of geometric morphometrics). If you report the test of significance based on the Escoufier RV coefficient, also cite Escoufier 1975. If you also use the major axis approach to obtain estimates of the shape (or other variable) predicted by each pair of axes, please cite Fruciano et al. 2020
Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.
Fruciano C, Colangelo P, Castiglia R, Franchini P. 2020. Does divergence from normal patterns of integration increase as chromosomal fusions increase in number? A test on a house mouse hybrid zone. Current Zoology 66:527–538.
Rohlf FJ, Corti M. 2000. Use of Two-Block Partial Least-Squares to Study Covariation in Shape. Systematic Biology 49:740-753.
############################## ### Example 1: random data ### ############################## library(MASS) set.seed(123) A=as.data.frame(mvrnorm(100,mu=rep(0,20), Sigma=diag(20))) B=as.data.frame(mvrnorm(100,mu=rep(0,10), Sigma=diag(10))) # Create two blocks of, respectively, 20 and 10 variables # for 100 observations. # This simulates two different blocks of data (shape or otherwise) # measured on the same individuals. # Note that, as we are simulating them independently, # we don't expect substantial covariation between blocks PLS_AB=pls(A, B, perm=99) # Perform PLS analysis and use 99 permutations for testing # (notice that in a real analysis, normally one uses more permutations) print(PLS_AB) # As expected, we do not find significant covariation between the # two blocks summary(PLS_AB) # The same happens when we look at the results for each of the axes # Notice that both for print() and summary() some values are # rounded for ease of visualization # However, the correct values can be always obtained from the # object created by the function # e.g., PLS_AB$singular_axis_significance ###################################### ### Example 2: using the classical ### ### iris data set as a toy example ### ###################################### data(iris) # Import the iris dataset set.seed(123) versicolor_data=iris[iris$Species=="versicolor",] # Select only the specimens belonging to the species Iris versicolor versicolor_sepal=versicolor_data[,grep( "Sepal", colnames(versicolor_data) )] versicolor_petal=versicolor_data[,grep( "Petal", colnames(versicolor_data) )] # Separate sepal and petal data PLS_sepal_petal=pls(versicolor_sepal, versicolor_petal, perm=99) # Perform PLS with permutation test # (again, chosen few permutations) print(PLS_sepal_petal) summary(PLS_sepal_petal) # Global results and results for each axis # (suggesting significant association)############################## ### Example 1: random data ### ############################## library(MASS) set.seed(123) A=as.data.frame(mvrnorm(100,mu=rep(0,20), Sigma=diag(20))) B=as.data.frame(mvrnorm(100,mu=rep(0,10), Sigma=diag(10))) # Create two blocks of, respectively, 20 and 10 variables # for 100 observations. # This simulates two different blocks of data (shape or otherwise) # measured on the same individuals. # Note that, as we are simulating them independently, # we don't expect substantial covariation between blocks PLS_AB=pls(A, B, perm=99) # Perform PLS analysis and use 99 permutations for testing # (notice that in a real analysis, normally one uses more permutations) print(PLS_AB) # As expected, we do not find significant covariation between the # two blocks summary(PLS_AB) # The same happens when we look at the results for each of the axes # Notice that both for print() and summary() some values are # rounded for ease of visualization # However, the correct values can be always obtained from the # object created by the function # e.g., PLS_AB$singular_axis_significance ###################################### ### Example 2: using the classical ### ### iris data set as a toy example ### ###################################### data(iris) # Import the iris dataset set.seed(123) versicolor_data=iris[iris$Species=="versicolor",] # Select only the specimens belonging to the species Iris versicolor versicolor_sepal=versicolor_data[,grep( "Sepal", colnames(versicolor_data) )] versicolor_petal=versicolor_data[,grep( "Petal", colnames(versicolor_data) )] # Separate sepal and petal data PLS_sepal_petal=pls(versicolor_sepal, versicolor_petal, perm=99) # Perform PLS with permutation test # (again, chosen few permutations) print(PLS_sepal_petal) summary(PLS_sepal_petal) # Global results and results for each axis # (suggesting significant association)
Project data on the major axis of PLS scores and obtain associated predictions
pls_major_axis( pls_object, new_data_x = NULL, new_data_y = NULL, axes_to_use = 1, scale_PLS = TRUE )pls_major_axis( pls_object, new_data_x = NULL, new_data_y = NULL, axes_to_use = 1, scale_PLS = TRUE )
pls_object |
object of class "pls_fit" obtained from the function pls |
new_data_x, new_data_y
|
(optional) matrices or data frames containing new data |
axes_to_use |
number of pairs of PLS axes to use in the computation (by default, this is performed only on the first axis) |
scale_PLS |
logical indicating whether PLS scores for different blocks should be scaled prior to computing the major axis |
This function acts on a pls_fit object obtained from the function pls. More in detail, the function:
Projects the original data onto the major axis for each pair of PLS axes (obtaining for each observation of the original data a score along this axis).
For each observation (specimen) of the original data, obtains the shape predicted by its score along the major axis.
(Optionally) if new data is provided, these data are first projected in the original PLS space and then the two operations above are performed on the new data.
A more in-depth explanation with a figure which allows for a more intuitive understanding is provided in Fruciano et al 2020 The idea is to obtain individual-level estimates of the shape predicted by a PLS model. This can be useful, for instance, to quantify to which extent the shape of a given individual from one group resembles the shape that individual would have according to the model computed in another group. This can be done by obtaining predictions with this function and then computing the distance between the actual shape observed for each individual and its prediction obtained from this function. This is, indeed, how this approach has been used in Fruciano et al 2020.
The function returns a list with two or three main elements which are themselves lists. The most useful elements for the final user are highlighted in boldface.
original_major_axis_projection is a list containing as many elements as specified in axes_to_use (default 1). Each of this elements contains the details of the computation of the major axis (as a PCA of PLS scores for a pair of axes), and in particular:
Eigenvector
Mean scores for that axis pair used in the computation
Scaling factor used
Scores of the original data on the major axis
original_major_axis_predictions_reversed contains the predictions of the PLS model for the original data, back-transformed to the original space (i.e., if the original data was shape, this will be shape). If axes_to_use > 1, these predictions will be based on the major axis computed for all pairs of axes considered. This element has two sub-elements:
Prediction for block 1
Prediction for block 2
new_data_results is only returned when new data is provided and contains the results of the analyses obtained using a previous PLS model on new data and, in particular:
PLS scores of the new data using the old model for the first block
PLS scores of the new data using the old model for the second block
Scores of the new data on the major axis computed using the PLS model provided in pls_object. If axes_to_use > 1, each column corresponds to a separate major axis
Predictions for Block 1 of the new data obtained by first computing the major axis projections (element new_data_major_axis_proj) and then back-transforming these projections to the original space (e.g., shape)
Predictions for Block 2 of the new data obtained by first computing the major axis projections (element new_data_major_axis_proj) and then back-transforming these projections to the original space (e.g., shape)
The function outputs a list with the following elements (please, see the Details section for explanations on their sub-elements):
For each PLS axis pair, results of the computation of major axis and projection of the original data on each axis
Data obtained back-transforming the scores on the major axis into the original space (e.g., shape)
(only if new data has been provided) PLS scores for the new data, scores of the new data on the major axis, preditions for the new data back-transformed into the original space (e.g., shape)
If you use this function, please cite Fruciano et al. 2020.
If new data is provided, this is first centered to the same average as in the original analysis, then it is translated back to the original scale.
Fruciano C, Colangelo P, Castiglia R, Franchini P. 2020. Does divergence from normal patterns of integration increase as chromosomal fusions increase in number? A test on a house mouse hybrid zone. Current Zoology 66:527–538.
###################################### ### Example using the classical ### ### iris data set as a toy example ### ###################################### data(iris) # Import the iris dataset versicolor_data=iris[iris$Species=="versicolor",] # Select only the specimens belonging to the species Iris versicolor versicolor_sepal=versicolor_data[,grep("Sepal", colnames(versicolor_data))] versicolor_petal=versicolor_data[,grep("Petal", colnames(versicolor_data))] # Separate sepal and petal data for I. versicolor PLS_sepal_petal_versicolor=pls(versicolor_sepal, versicolor_petal, perm=99) summary(PLS_sepal_petal_versicolor) # Compute the PLS for I. versicolor plot(PLS_sepal_petal_versicolor$XScores[,1], PLS_sepal_petal_versicolor$YScores[,1], asp = 1, xlab = "PLS 1 Block 1 scores", ylab = "PLS 1 Block 2 scores") # Plot the scores for the original data on the first pair of PLS # axes (one axis per block) # This is the data based on which we will compute the major axis # direction # Imagine fitting a line through those point, that is the major axis Pred_major_axis_versicolor=pls_major_axis(PLS_sepal_petal_versicolor, axes_to_use=1) # Compute for I. versicolor the projections to the major axis # using only the first pair of PLS axes (and scaling the scores # prior to the computation) hist(Pred_major_axis_versicolor$original_major_axis_projection[[1]]$original_data_PLS_projection, main="Original data - projections on the major axis - based on the first pair of PLS axes", xlab="Major axis score") # Plot distribution of PLS scores for each individual in the # original data (I. versicolor) # projected on the major axis for the first pair of PLS axis Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block1 Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block2 # Shape for each individual of the original data (I. versicolor) # predicted by its position along the major axis # Now we will use the data from new species (I. setosa and I # virginica) and obtain predictions from the PLS model obtained for # I. versicolor # The easiest is to use the data for all three species # as if they were both new data # (using versicolor as new data is not going to affect the model) all_sepal=iris[,grep("Sepal", colnames(iris))] all_petal=iris[,grep("Petal", colnames(iris))] # Separate sepals and petals (they are the two blocks) Pred_major_axis_versicolor_newdata=pls_major_axis( pls_object=PLS_sepal_petal_versicolor, new_data_x = all_sepal, new_data_y = all_petal, axes_to_use=1) # Perform the major axis computation using new data # Notice that: # - we are using the old PLS model (computed on versicolor only) # - we are adding the new data in the same order as in the original # model (i.e., new_data_x is sepal data, new_data_y is petal data) plot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_Xscores[,1], Pred_major_axis_versicolor_newdata$new_data_results$new_data_Yscores[,1], col=iris$Species, asp=1, xlab = "Old PLS, Axis 1, Block 1", ylab = "Old PLS, Axis 1, Block 2") # Plot the new data (both versicolor and setosa) # in the space of the first pair of PLS axes computed only on # versicolor # The three species follow a quite similar trajectories # but they have different average value on the major axis # To visualize this better, we can plot the scores along the major # axis for the three species boxplot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_major_axis_proj[,1]~ iris$Species, xlab="Species", ylab="Score on the major axis") # We can also visualize the deviations from the major axis # For instance by putting the predictions of the two blocks together # Computing differences and then performing a PCA predictions_all_data=cbind( Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block1_proj_prediction_revert, Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block2_proj_prediction_revert) # Get the predictions for the two blocks (sepals and petals) # and put them back together Euc_dist_from_predictions=unlist(lapply(seq(nrow(iris)), function(i) dist(rbind(iris[i,1:4],predictions_all_data[i,])))) # for each flower, compute the Euclidean distance between # the original values and what is predicted by the model boxplot(Euc_dist_from_predictions~iris$Species, xlab="Species", ylab="Euclidean distance from prediction") # I. setosa is the one which deviates the most from the prediction###################################### ### Example using the classical ### ### iris data set as a toy example ### ###################################### data(iris) # Import the iris dataset versicolor_data=iris[iris$Species=="versicolor",] # Select only the specimens belonging to the species Iris versicolor versicolor_sepal=versicolor_data[,grep("Sepal", colnames(versicolor_data))] versicolor_petal=versicolor_data[,grep("Petal", colnames(versicolor_data))] # Separate sepal and petal data for I. versicolor PLS_sepal_petal_versicolor=pls(versicolor_sepal, versicolor_petal, perm=99) summary(PLS_sepal_petal_versicolor) # Compute the PLS for I. versicolor plot(PLS_sepal_petal_versicolor$XScores[,1], PLS_sepal_petal_versicolor$YScores[,1], asp = 1, xlab = "PLS 1 Block 1 scores", ylab = "PLS 1 Block 2 scores") # Plot the scores for the original data on the first pair of PLS # axes (one axis per block) # This is the data based on which we will compute the major axis # direction # Imagine fitting a line through those point, that is the major axis Pred_major_axis_versicolor=pls_major_axis(PLS_sepal_petal_versicolor, axes_to_use=1) # Compute for I. versicolor the projections to the major axis # using only the first pair of PLS axes (and scaling the scores # prior to the computation) hist(Pred_major_axis_versicolor$original_major_axis_projection[[1]]$original_data_PLS_projection, main="Original data - projections on the major axis - based on the first pair of PLS axes", xlab="Major axis score") # Plot distribution of PLS scores for each individual in the # original data (I. versicolor) # projected on the major axis for the first pair of PLS axis Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block1 Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block2 # Shape for each individual of the original data (I. versicolor) # predicted by its position along the major axis # Now we will use the data from new species (I. setosa and I # virginica) and obtain predictions from the PLS model obtained for # I. versicolor # The easiest is to use the data for all three species # as if they were both new data # (using versicolor as new data is not going to affect the model) all_sepal=iris[,grep("Sepal", colnames(iris))] all_petal=iris[,grep("Petal", colnames(iris))] # Separate sepals and petals (they are the two blocks) Pred_major_axis_versicolor_newdata=pls_major_axis( pls_object=PLS_sepal_petal_versicolor, new_data_x = all_sepal, new_data_y = all_petal, axes_to_use=1) # Perform the major axis computation using new data # Notice that: # - we are using the old PLS model (computed on versicolor only) # - we are adding the new data in the same order as in the original # model (i.e., new_data_x is sepal data, new_data_y is petal data) plot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_Xscores[,1], Pred_major_axis_versicolor_newdata$new_data_results$new_data_Yscores[,1], col=iris$Species, asp=1, xlab = "Old PLS, Axis 1, Block 1", ylab = "Old PLS, Axis 1, Block 2") # Plot the new data (both versicolor and setosa) # in the space of the first pair of PLS axes computed only on # versicolor # The three species follow a quite similar trajectories # but they have different average value on the major axis # To visualize this better, we can plot the scores along the major # axis for the three species boxplot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_major_axis_proj[,1]~ iris$Species, xlab="Species", ylab="Score on the major axis") # We can also visualize the deviations from the major axis # For instance by putting the predictions of the two blocks together # Computing differences and then performing a PCA predictions_all_data=cbind( Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block1_proj_prediction_revert, Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block2_proj_prediction_revert) # Get the predictions for the two blocks (sepals and petals) # and put them back together Euc_dist_from_predictions=unlist(lapply(seq(nrow(iris)), function(i) dist(rbind(iris[i,1:4],predictions_all_data[i,])))) # for each flower, compute the Euclidean distance between # the original values and what is predicted by the model boxplot(Euc_dist_from_predictions~iris$Species, xlab="Species", ylab="Euclidean distance from prediction") # I. setosa is the one which deviates the most from the prediction
Prints results table and checks for CI overlap among groups
## S3 method for class 'disparity_resample' print(x, ...)## S3 method for class 'disparity_resample' print(x, ...)
x |
An object of class "disparity_resample" |
... |
Additional arguments (not used) |
Invisibly returns the input object
Prints results table and checks for CI overlap among groups
## S3 method for class 'EscoufierRVrarefy' print(x, ...)## S3 method for class 'EscoufierRVrarefy' print(x, ...)
x |
An object of class "EscoufierRVrarefy" |
... |
Additional arguments (not used) |
Invisibly returns the input object
This function projects data to the subspace orthogonal to a multivariate column vector
ProjectOrthogonal(Data, vector)ProjectOrthogonal(Data, vector)
Data |
matrix n x p of n observation for p variables, |
vector |
column vector (matrix p x 1 of p variables) |
This function is useful to remove from a dataset the variation along a specific direction (e.g., a principal component). It has been used extensively for many applications, such as
'Size Correction' (removal of an allometric vector), also called 'Burnaby's method (Burnaby 1966; see also Rohlf & Bookstein 1987)
Remove body arching from fish (Valentin et al. 2008; see also Fruciano 2016 for a discussion and other examples of usage in the context of measurement error in geometric morphometrics)
Removing variation due to sexual dimorphism on a set of individuals with unknown sex (Fruciano et al. 2014)
Optionally, vector can also be a matrix with more than one column (in this case the Data is projected to the subspace orthogonal to the space spanned by all dimensions in vector)
The function outputs a matrix n x p of the original data projected to the subspace orthogonal to the vector
Burnaby T. 1966. Growth-invariant discriminant functions and generalized distances. Biometrics:96-110.
Fruciano C. 2016. Measurement error in geometric morphometrics. Development Genes and Evolution 226:139-158.
Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.
Rohlf FJ, Bookstein FL. 1987. A Comment on Shearing as a Method for' Size Correction'. Systematic Zoology:356-367.
Valentin AE, Penin X, Chanut JP, Sévigny JM, Rohlf FJ. 2008. Arching effect on fish body shape in geometric morphometric studies. Journal of Fish Biology 73:623-638.
library(MASS) A=mvrnorm(50,mu=rep(1,50),Sigma=diag(50)) B=mvrnorm(50,mu=rep(0,50),Sigma=diag(50)) AB=rbind(A,B) Group=as.factor(c(rep(1,50),rep(2,50))) # Create two groups of observations (e.g., specimens) # one centered at 0 and the other at 1 # and combine them in a single sample PCA=prcomp(AB) # Combine the two groups and perform a PCA plot(PCA$x[,1],PCA$x[,2], asp=1, col=Group) # Plot the scores along the first two principal components # The two groups are clearly distinct (red and black) ABproj=ProjectOrthogonal(AB,cbind(PCA$rotation[,1])) # Project the original data (both groups) # to the subspace orthogonal to the first principal component # (which is the direction along which there is most of variation # among groups) PCAproj=prcomp(ABproj) # Perform a new PCA on the 'corrected' dataset plot(PCAproj$x[,1], PCAproj$x[,2], asp=1, col=Group) # Plot the scores along the first two principal components # of the 'corrected' data # Notice how the two groups are now pretty much # indistinguishablelibrary(MASS) A=mvrnorm(50,mu=rep(1,50),Sigma=diag(50)) B=mvrnorm(50,mu=rep(0,50),Sigma=diag(50)) AB=rbind(A,B) Group=as.factor(c(rep(1,50),rep(2,50))) # Create two groups of observations (e.g., specimens) # one centered at 0 and the other at 1 # and combine them in a single sample PCA=prcomp(AB) # Combine the two groups and perform a PCA plot(PCA$x[,1],PCA$x[,2], asp=1, col=Group) # Plot the scores along the first two principal components # The two groups are clearly distinct (red and black) ABproj=ProjectOrthogonal(AB,cbind(PCA$rotation[,1])) # Project the original data (both groups) # to the subspace orthogonal to the first principal component # (which is the direction along which there is most of variation # among groups) PCAproj=prcomp(ABproj) # Perform a new PCA on the 'corrected' dataset plot(PCAproj$x[,1], PCAproj$x[,2], asp=1, col=Group) # Plot the scores along the first two principal components # of the 'corrected' data # Notice how the two groups are now pretty much # indistinguishable
Test based on Hotelling's T squared for the null hypothesis of no effect between two repeated measures (e.g., treatment/control)
repeated_measures_test(T1, T2, rnames = TRUE, shrink = FALSE)repeated_measures_test(T1, T2, rnames = TRUE, shrink = FALSE)
T1, T2
|
matrices (n x p of n observation for p variables) or arrays (t x p x n of n observations, t landmarks in p dimensions), |
rnames |
if TRUE (default) the rownames of the matrix or the names along the 3rd dimension (for arrays) will be used to match the order |
shrink |
if TRUE, a shrinkage estimator of covariance is used internally |
The function assumes that each individual observation (e.g., specimen) has been measured two times (e.g., at two time points, or between two treatments).
If rnames is TRUE (default), the rownames of the matrix or the names along the 3rd dimension (for arrays) will be used to match the order of observations (e.g., specimens) between the two datasets. Otherwise, the function will assume that the observations in T1 and T2 are in the same order.
This function is useful in various contexts, such as:
testing the effect of preservation (Fruciano et al. 2020)
testing for variation through time
For instance, in the context of the effect of preservation on geometric morphometrics, it has been argued (Fruciano, 2016) that various studies have improperly used on repeated measures data methods developed for independent observations, and this can lead to incorrect inference.
The function outputs a matrix n x p of the original data projected to the subspace orthogonal to the vector
The function requires internally non-singular matrices (for instance, number of cases should be larger than the number of variables). One solution can be to perform a principal component analysis and use the scores for all the axes with non-zero and non-near-zero eigenvalues. To overcome some situations where a singular matrix can occurr, the function can use internally a shrinkage estimator of the covariance matrix (Ledoit & Wolf 2004). This is called setting shrink = TRUE. However, in this case, the package nlshrink should have been installed. Also, notice that if the matrices T1 and T2 are provided as arrays, this requires the package Morpho to be installed.
If you use this function please cite Fruciano et al. 2020
Fruciano C. 2016. Measurement error in geometric morphometrics. Development Genes and Evolution 226:139-158.
Fruciano C., Schmidt, I., Ramirez Sanchez, M.M., Morek, W., Avila Valle, Z.A., Talijancic, I., Pecoraro, C., Schermann Legionnet, A. 2020. Tissue preservation can affect geometric morphometric analyses: a case study using fish body shape. Zoological Journal of the Linnean Society 188:148-162.
Ledoit O, Wolf M. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365-411.
Convenience function which rescales a dataset of landmarks based on a vector of distances between two landmarks
rescale_by_landmark_distance(Data, lm1, lm2, lengths)rescale_by_landmark_distance(Data, lm1, lm2, lengths)
Data |
array k x p x n of k landmarks and p dimensions for n observations (specimens) |
lm1, lm2
|
index of the two landmarks whose inter-landmark distance is known |
lengths |
vector of n lengths (distances between lm1 and lm2 in the appropriate scale) |
This function can be useful when one has the distance between two landmarks (e.g., obtained with a caliper), but a scale has not been set when acquiring the data (for instance, a scale bar was missing on photos, so configuration of landmarks are scaled in pixels).
The function outputs an array k x p x n of rescaled landmark coordinates
Obtain data in the original variables starting from PCA scores
reversePCA(Scores, Eigenvectors, Mean)reversePCA(Scores, Eigenvectors, Mean)
Scores |
matrix n x s of n observation for s scores. Can be a single column (for instance, if one is interested in PC1 only) |
Eigenvectors |
matrix p x s of eigenvectors (as column eigenvectors); notice that s (number of column eigenvectors) must be the same as the number of columns of PC scores in Scores |
Mean |
vector of length p with the multivariate mean computed on the original dataset (the dataset on which the PCA was carried out) |
Given a set of PCA scores, a set of eigenvectors and the mean of the data on which the PCA was originally computed, this function returns a set of observations in the original data space. This simple function can have many applications (not only in morphometrics) such as
Producing predicted shapes for extreme, unobserved, values of PC scores (e.g., the shape, or other set of variables, corresponding to 3 times the most extreme positive PC1 score), or their combinations (e.g., a value of 0.4 on PC1 and 0.3 on PC2)
Interpret results of analyses carried out on PC scores by converting these back to shapes
The function outputs a matrix n x p of the scores 'back-transformed' into the original variables
library(MASS) set.seed(123) A=mvrnorm(10,mu=rep(1,2),Sigma=diag(2)) # Create a small dataset of 10 observations and 2 variables Mean=colMeans(A) PCA=prcomp(A) # Compute mean and perform PCA B=reversePCA(Scores=PCA$x, Eigenvectors=PCA$rotation, Mean=Mean) # 'Revert' the PCA (in this case using all scores and all axes # to get the same results as the starting data) round(A,10)==round(B,10) # Check that all the results are the same # (rounding to the 10th decimal because of numerical precision)library(MASS) set.seed(123) A=mvrnorm(10,mu=rep(1,2),Sigma=diag(2)) # Create a small dataset of 10 observations and 2 variables Mean=colMeans(A) PCA=prcomp(A) # Compute mean and perform PCA B=reversePCA(Scores=PCA$x, Eigenvectors=PCA$rotation, Mean=Mean) # 'Revert' the PCA (in this case using all scores and all axes # to get the same results as the starting data) round(A,10)==round(B,10) # Check that all the results are the same # (rounding to the 10th decimal because of numerical precision)
Rotates a single 2D or 3D landmark configuration about the origin of the coordinate system.
rotate_landmarks(LMdata, rotation, radians = TRUE, center = TRUE)rotate_landmarks(LMdata, rotation, radians = TRUE, center = TRUE)
LMdata |
matrix k x p of k landmarks and p dimensions |
rotation |
vector containing the rotation angle(s) (a single rotation for 2D data, three rotations for 3D data) |
radians |
a logical on whether the angle(s) are provided in radians (default) or not |
center |
a logical on whether the rotation should be on the centered configuration |
This function can be useful to change the orientation of data for display purposes It could also be used to empirically check the rotation-invariance of an analysis. Notice that this function works on a single configuration of landmarks provided as a k x p matrix of k landmarks in p dimensions (i.e., p is 2 for 2D data and 3 for 3D data) As user-supplied rotation, the function expects a single number in the case of 2D data (rotation on the plane), a vector with three values (corresponding to rotation relative to the three axes) in 3D. If center=TRUE (default), first the configuration of landmarks is centered, then the rotation is performed, and finally the landmark coordinated are translated back to the original position. This accomplishes rotating the landmark configuration around its center
The function outputs a matrix k x p of the original configuration of landmarks rotated according to the user-supplied rotation
library(ggplot2) Poly1=scale(t(matrix(c(4,1,3,1,1,2,2,3),nrow=2,ncol=4)), center=TRUE, scale=FALSE) # Create a polygon centered at the origin Poly2=rotate_landmarks(LMdata=Poly1, rotation=10, radians=FALSE) # Create a new polygon which is the rotated version of the first # with respect to the origin (rotation of 10 degrees) BothPolys4Plotting=as.data.frame(rbind(Poly1,Poly2)) BothPolys4Plotting[,3]=c(rep("Original",4),rep("Rotated",4)) BothPolys4Plotting[,4]=rep(1:4,2) colnames(BothPolys4Plotting)=c("X","Y","Polygon","Landmark") # Put them together in a way that is easy to plot in ggplot2 GraphLims=range(BothPolys4Plotting[,1:2]) # limits of the plot ggplot() + geom_polygon(data=BothPolys4Plotting, mapping=aes(x=X, y=Y, group=Polygon, fill=Polygon), alpha=0.5) + geom_point(data=BothPolys4Plotting, aes(x=X, y=Y, color=Polygon)) + geom_text(data=BothPolys4Plotting, aes(x=X, y=Y, label=Landmark), hjust=1, vjust=1, size=4)+ coord_fixed(ratio=1, xlim=GraphLims, ylim=GraphLims)+ theme_classic() # Plot the two polygons (landmarks are numbered for ease of # visualization)library(ggplot2) Poly1=scale(t(matrix(c(4,1,3,1,1,2,2,3),nrow=2,ncol=4)), center=TRUE, scale=FALSE) # Create a polygon centered at the origin Poly2=rotate_landmarks(LMdata=Poly1, rotation=10, radians=FALSE) # Create a new polygon which is the rotated version of the first # with respect to the origin (rotation of 10 degrees) BothPolys4Plotting=as.data.frame(rbind(Poly1,Poly2)) BothPolys4Plotting[,3]=c(rep("Original",4),rep("Rotated",4)) BothPolys4Plotting[,4]=rep(1:4,2) colnames(BothPolys4Plotting)=c("X","Y","Polygon","Landmark") # Put them together in a way that is easy to plot in ggplot2 GraphLims=range(BothPolys4Plotting[,1:2]) # limits of the plot ggplot() + geom_polygon(data=BothPolys4Plotting, mapping=aes(x=X, y=Y, group=Polygon, fill=Polygon), alpha=0.5) + geom_point(data=BothPolys4Plotting, aes(x=X, y=Y, color=Polygon)) + geom_text(data=BothPolys4Plotting, aes(x=X, y=Y, label=Landmark), hjust=1, vjust=1, size=4)+ coord_fixed(ratio=1, xlim=GraphLims, ylim=GraphLims)+ theme_classic() # Plot the two polygons (landmarks are numbered for ease of # visualization)
Performs permutation tests for the difference in Escoufier RV between groups. For multiple groups, performs pairwise comparisons between all pairs of groups.
RVcomparison(Block1, Block2, group, perm = 999, center = TRUE)RVcomparison(Block1, Block2, group, perm = 999, center = TRUE)
Block1, Block2
|
Matrices or data frames containing each block of variables (observations in rows, variables in columns). |
group |
factor or vector indicating group membership for observations. |
perm |
number of permutations for the test |
center |
whether the groups should be mean-centered prior to the test |
This function is one of the solutions proposed by Fruciano et al. 2013 to deal with the fact that values of Escoufier RV coefficient (Escoufier 1973), which is routinely used to quantify the levels of association between multivariate blocks of variables (landmark coordinates in the case of morphometric data, but it might be any multivariate dataset) are not comparable across groups with different number of observations/individuals (Smilde et al. 2009; Fruciano et al. 2013). The solution is a permutation test. This test was originally implemented by Adriano Franchini in the Java program RVcomparator, of which this function is an updated and improved version
A data frame with one row per pairwise comparison and the following columns:
Name of the first group in the comparison
Name of the second group in the comparison
Observed Escoufier RV for the first group in the comparison
Observed Escoufier RV for the second group in the comparison
Absolute difference in the observed Escoufier RV between the two groups
p value of the permutation test
For multiple groups, the data frame includes additional columns identifying the groups being compared.
The values of RV for each of the groups the function outputs are the observed ones, and can be reported but their value should not be compared. If one wants to obtain comparable values one solution (Fruciano et al 2013) is to obtain rarefied estimates, which can be done with the function RVrarefied in this package
If you use this function please cite both Fruciano et al. 2013 (for using the rarefaction procedure) and Escoufier 1973 (because the procedure is based on Escoufier RV)
Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.
Fruciano C, Franchini P, Meyer A. 2013. Resampling-Based Approaches to Study Variation in Morphological Modularity. PLoS ONE 8:e69376.
Smilde AK, Kiers HA, Bijlsma S, Rubingh CM, van Erk MJ. 2009. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 25:401-405.
library(MASS) set.seed(123) # Create sample data with different correlation structures S1 = diag(50) # Uncorrelated variables for group 1 S2 = diag(50) S2[11:50, 11:50] = 0.3 # Some correlation in second block for group 2 S2 = (S2 + t(S2)) / 2 # Ensure symmetry diag(S2) = 1 # Generate data for two groups A = mvrnorm(30, mu = rep(0, 50), Sigma = S1) B = mvrnorm(40, mu = rep(0, 50), Sigma = S2) # Combine data and create group labels combined_data1 = A[, 1:20] combined_data2 = A[, 21:50] combined_data1 = rbind(combined_data1, B[, 1:20]) combined_data2 = rbind(combined_data2, B[, 21:50]) groups = c(rep("GroupA", 30), rep("GroupB", 40)) # Perform RV comparison result = RVcomparison(combined_data1, combined_data2, group = groups, perm = 99) print(result) # Example with three groups for pairwise comparisons C = mvrnorm(25, mu = rep(0, 50), Sigma = diag(50)) combined_data1_3grp = rbind(combined_data1, C[, 1:20]) combined_data2_3grp = rbind(combined_data2, C[, 21:50]) groups_3 = c(groups, rep("GroupC", 25)) result_3grp = RVcomparison(combined_data1_3grp, combined_data2_3grp, group = groups_3, perm = 99) print(result_3grp)library(MASS) set.seed(123) # Create sample data with different correlation structures S1 = diag(50) # Uncorrelated variables for group 1 S2 = diag(50) S2[11:50, 11:50] = 0.3 # Some correlation in second block for group 2 S2 = (S2 + t(S2)) / 2 # Ensure symmetry diag(S2) = 1 # Generate data for two groups A = mvrnorm(30, mu = rep(0, 50), Sigma = S1) B = mvrnorm(40, mu = rep(0, 50), Sigma = S2) # Combine data and create group labels combined_data1 = A[, 1:20] combined_data2 = A[, 21:50] combined_data1 = rbind(combined_data1, B[, 1:20]) combined_data2 = rbind(combined_data2, B[, 21:50]) groups = c(rep("GroupA", 30), rep("GroupB", 40)) # Perform RV comparison result = RVcomparison(combined_data1, combined_data2, group = groups, perm = 99) print(result) # Example with three groups for pairwise comparisons C = mvrnorm(25, mu = rep(0, 50), Sigma = diag(50)) combined_data1_3grp = rbind(combined_data1, C[, 1:20]) combined_data2_3grp = rbind(combined_data2, C[, 21:50]) groups_3 = c(groups, rep("GroupC", 25)) result_3grp = RVcomparison(combined_data1_3grp, combined_data2_3grp, group = groups_3, perm = 99) print(result_3grp)
Computes a rarefied estimate of the Escoufier RV coefficient to account for the dependence on sample size of RV, so that RV can be compared among groups with different number of observations (sample sizes)
RVrarefied(Block1, Block2, reps = 1000, samplesize, group = NULL, CI = 0.95)RVrarefied(Block1, Block2, reps = 1000, samplesize, group = NULL, CI = 0.95)
Block1, Block2
|
Matrices or data frames containing each block of variables (observations in rows, variables in columns). |
reps |
number of resamplings to obtain the rarefied estimate |
samplesize |
sample size to which the rarefaction procedure is carried out |
group |
factor or vector indicating group membership for observations. If NULL (default), a single-group analysis is performed. If provided, the analysis is performed separately for each group. |
CI |
confidence interval level (default 0.95). Must be between 0 and 1. |
This function computes a rarefied estimate of Escoufier RV coefficient as suggested by Fruciano et al 2013 - Plos One This can be useful to compare RV among groups with the same variables but different sample sizes (as RV depends on sample size, see Fruciano et al 2013, where this procedure is described). The idea is the one rarefies the two groups at the same sample size. Following the approach in Fruciano et al. 2013 - Plos One, "rarefaction" is meant resampling to a smaller sample size with replacement (like in bootstrap).
The function outputs a list with the following elements:
A data frame with columns 'group', 'Mean', 'Median', 'CI_min', 'CI_max'. One row per group. For single-group analysis, group will be "All".
A list with all RV values obtained using the rarefaction procedure for each group. For single-group analysis, this will be a list with one element named "All".
The returned object has class "EscoufierRVrarefy" and comes with associated S3 methods for convenient display and visualization:
print.EscoufierRVrarefy: Prints a formatted summary of results including confidence interval overlap assessment for multiple groups
plot.EscoufierRVrarefy: Creates a confidence interval plot using ggplot2
the function does NOT perform GPA on each rarefied sample this may or may not make a difference in estimates. In many cases, it will probably not make much difference (e.g., Fig. 2 in Fruciano et al 2013)
If you use this function please cite both Fruciano et al. 2013 (for using the rarefaction procedure) and Escoufier 1973 (because the procedure is based on Escoufier RV)
Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.
Fruciano C, Franchini P, Meyer A. 2013. Resampling-Based Approaches to Study Variation in Morphological Modularity. PLoS ONE 8:e69376.
library(MASS) set.seed(123) Pop=mvrnorm(100000,mu=rep(0,100), Sigma=diag(100)) # Create a population of 100,000 'individuals' # as multivariate normal random data # We will consider the first 20 columns as the first # block of variables, and the following one as the second block A=Pop[1:50,] B=Pop[501:700,] # Take two groups (A and B) # from the same population (there should be no difference # between them) EscoufierRV(A[,1:20],A[,21:ncol(A)]) EscoufierRV(B[,1:20],B[,21:ncol(B)]) # Notice how we obtain very different values of Escoufier RV # (this is because they two groups have very different # sample sizes, one 50 observations, the other 200) RarA=RVrarefied(A[,1:20],A[,21:ncol(A)],reps=1000,samplesize=30) RarB=RVrarefied(B[,1:20],B[,21:ncol(A)],reps=1000,samplesize=30) RarA$results # Data frame with Mean, Median, CI_min, CI_max RarB$results # Data frame with Mean, Median, CI_min, CI_max # Rarefying both groups at the same sample size # (in this case 30) # it is clear that the two groups have very similar levels # of association between blocks # Multi-group analysis with custom CI combined_data = rbind(A, B) group_labels = c(rep("GroupA", nrow(A)), rep("GroupB", nrow(B))) multi_result = RVrarefied(combined_data[,1:20], combined_data[,21:ncol(combined_data)], reps=1000, samplesize=30, group=group_labels, CI=0.90) print(multi_result$results) # Data frame with results for each group # Columns: group, Mean, Median, CI_min, CI_maxlibrary(MASS) set.seed(123) Pop=mvrnorm(100000,mu=rep(0,100), Sigma=diag(100)) # Create a population of 100,000 'individuals' # as multivariate normal random data # We will consider the first 20 columns as the first # block of variables, and the following one as the second block A=Pop[1:50,] B=Pop[501:700,] # Take two groups (A and B) # from the same population (there should be no difference # between them) EscoufierRV(A[,1:20],A[,21:ncol(A)]) EscoufierRV(B[,1:20],B[,21:ncol(B)]) # Notice how we obtain very different values of Escoufier RV # (this is because they two groups have very different # sample sizes, one 50 observations, the other 200) RarA=RVrarefied(A[,1:20],A[,21:ncol(A)],reps=1000,samplesize=30) RarB=RVrarefied(B[,1:20],B[,21:ncol(A)],reps=1000,samplesize=30) RarA$results # Data frame with Mean, Median, CI_min, CI_max RarB$results # Data frame with Mean, Median, CI_min, CI_max # Rarefying both groups at the same sample size # (in this case 30) # it is clear that the two groups have very similar levels # of association between blocks # Multi-group analysis with custom CI combined_data = rbind(A, B) group_labels = c(rep("GroupA", nrow(A)), rep("GroupB", nrow(B))) multi_result = RVrarefied(combined_data[,1:20], combined_data[,21:ncol(combined_data)], reps=1000, samplesize=30, group=group_labels, CI=0.90) print(multi_result$results) # Data frame with results for each group # Columns: group, Mean, Median, CI_min, CI_max
Compute estimates of the scaled variance of eigenvalues using only the positive eigenvalues
scaled_variance_of_eigenvalues( data_matrix, boot = 999, rarefy = FALSE, shrinkage = FALSE )scaled_variance_of_eigenvalues( data_matrix, boot = 999, rarefy = FALSE, shrinkage = FALSE )
data_matrix |
Matrix or data frame containing the original data (observations in rows, variables in columns). |
boot |
number of bootstrap resamples (no bootstrap if 0) |
rarefy |
either a logical to determine whether rarefaction will be performed or a number indicating the sample size at which the samples will be rarefied |
shrinkage |
logical on whether the analysis should be based on a covariance matrix obtained through linear shrinkage |
The function allows computing the scaled variance of eigenvalues (Pavlicev et al. 2009) under a variety of settings. The scaled variance of eigenvalues is a commonly used index of morphological integration. Its value is comprised between 0 and 1, with larger values suggesting stronger integration.
Only positive eigenvalues are used in the computations used in this function.
The function employes two possible strategies to obtain eigenvalues:
a singular value decomposition of the data matrix (default)
an eigenvalue decomposition of the covariance matrix estimated using linear shrinkage (option shrinkage=TRUE; Ledoit & Wolf 2004)
Further, the function allows obtaining bootstrapped estimates by setting boot to the number of bootstrap replicates (resampling with replacement)
It is also possible to obtain rarefied estimates by setting rarefy to the desired sample size. This is useful when comparing the scaled variance of eigenvalues across multiple groups with different sample sizes. In this case, the suggestion is to use either the smallest sample size or less
Using a bootstrap estimate with the singular value decomposition approach represents a good compromise between computation time and accuracy. This should be complemented by rarefaction to the smallest sample size (or lower) in case one wants to compare the value obtained across different groups.
If boot=0 the function outputs a vector containing the scaled variance of eigenvalues and the number of dimensions used in the computations. If, instead, boot>0 (recommended) the function outputs a list containing
the mean scaled variance of eigenvalues across all bootstrap samples
the median number of dimensions used across all bootstrap samples
the 95 eigenvalues and dimensions
the scaled variance of eigenvalues and dimensions used for each of the bootstrap replicates
When boot>0 the rarefied estimates are based on sampling with replacement (bootstrap). However, if boot=0, then a single rarefied estimate is obtained by sampling without replacement. In this case, the user should repeat the same operation multiple times (e.g., 100) and take the average of the scaled variance of eigenvalues obtained.
Also notice that using the shrinkage-based estimation of the covariance matrix requires longer computational time and memory. This option requires the package nlshrink
For the computation, this function uses only positive eigenvalues (which are also used to identify data dimensionality). The eigenvalues are first scaled by dividing them by their sum (Young 2006), then their variance is computed as population variance (rather than sample variance; see Haber 2011). Finally, this value is standardized to a scale between 0 and 1 by dividing it by its maximum theoretical value of (p-1)/p^2 (where p is the number of dimensions) - this is the same scaling used in the software MorphoJ (Klingenberg 2011).
Ledoit O, Wolf M. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365-411.
Young NM. 2006. Function, ontogeny and canalization of shape variance in the primate scapula. Journal of Anatomy 209:623-636.
Pavlicev M, Cheverud JM, Wagner GP. Measuring Morphological Integration Using Eigenvalue Variance. Evolutionary Biology 36(1):157–170.
Haber A. 2011. A Comparative Analysis of Integration Indices. Evolutionary Biology 38:476-488.
Klingenberg CP. 2011. MorphoJ: an integrated software package for geometric morphometrics. Molecolar Ecology Resources 11:353-357.
This function performs a test for the angle between two multivariate vector, optionally allowing for "flipping" one of the axes
TestOfAngle(x, y, flip = TRUE)TestOfAngle(x, y, flip = TRUE)
x, y
|
numerical vectors |
flip |
logical stating whether (TRUE, default) axes should be "flipped" in case the angle is larger than 90 degrees (see Details) |
This function is useful to perform a test of the angle between two multivariate vectors (e.g., principal component eigenvectors computed from two covariance matrices, such as covariance matrices of two different species). The function uses internally angleTest from the package Morpho by Stefan Schlager. That function, in turn, uses the formulas in Li 2011 to compute significance, rather than using permutations. This is the same approach also implemented in MorphoJ (Klingenberg 2011). There is no special advantage in using this function relative to the original in the package Morpho, except that this function outputs angles both in radians and degrees and that it optionally allow to "flip" one of the two axes. This is useful in the cases where the direction of one axis is arbitrary, such as in PCA (where positive and negative values are interchangable and recomputing PCA might result in positive scores for the observations which were negative (and vice versa). In this situation, angles larger than 90 degrees are not meaningful and one way of dealing with this is, by choosing the option flip=TRUE (default), to change the sign of one of the two vectors ("flip").
The function outputs a list with
angle |
angle between vectors in radians |
angle_deg |
angle between vectors in degrees |
p.value |
p value |
critical_angle |
angle below which two vectors of the same number of dimensions as the ones being tested would be significant |
The p value is for the probability that the angle between two random vectors is smaller or equal to the one computed from the vectors provided (x, y). This means that significant values indicate that the two provided vectors are "significantly similar", whereas non-significant values means that the two vectors are substantially different
Klingenberg CP. 2011. MorphoJ: an integrated software package for geometric morphometrics. Mol Ecol Resour 11:353-357.
Li S. 2011. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics 4:66-70.
Schlager S. 2016. Morpho: Calculations and Visualisations Related to Geometric Morphometrics.
library(MASS) library(clusterGeneration) set.seed(123) Data=mvrnorm(500,mu=rep(1,50),Sigma=genPositiveDefMat(dim=50)$Sigma) A=Data[1:250,] B=Data[251:500,] # Create two groups of observations (e.g., specimens) # from the same multivariate normal distribution # with the same starting covariance matrix PCA_A=prcomp(A) PCA_B=prcomp(B) # Perform separate PCAs for each of the two groups of observations TestOfAngle(PCA_A$rotation[,1],PCA_B$rotation[,1], flip=TRUE) # Test for the angle between the two first eigenvectorslibrary(MASS) library(clusterGeneration) set.seed(123) Data=mvrnorm(500,mu=rep(1,50),Sigma=genPositiveDefMat(dim=50)$Sigma) A=Data[1:250,] B=Data[251:500,] # Create two groups of observations (e.g., specimens) # from the same multivariate normal distribution # with the same starting covariance matrix PCA_A=prcomp(A) PCA_B=prcomp(B) # Perform separate PCAs for each of the two groups of observations TestOfAngle(PCA_A$rotation[,1],PCA_B$rotation[,1], flip=TRUE) # Test for the angle between the two first eigenvectors