Package 'GeometricMorphometricsMix' reference manual

Title:	Heterogeneous Methods for Shape and Other Multidimensional Data
Description:	Tools for geometric morphometric analyses and multidimensional data. Implements methods for morphological disparity analysis using bootstrap and rarefaction, as reviewed in Foote (1997) <doi:10.1146/annurev.ecolsys.28.1.129>. Includes integration and modularity testing, following Fruciano et al. (2013) <doi:10.1371/journal.pone.0069376>, using Escoufier's RV coefficient as test statistic as well as two-block partial least squares - PLS, Rohlf and Corti (2000) <doi:10.1080/106351500750049806>. Also includes vector angle comparisons, orthogonal projection for data correction (Burnaby (1966) <doi:10.2307/2528217>; Fruciano (2016) <doi:10.1007/s00427-016-0537-4>), and parallel analysis for dimensionality reduction (Buja and Eyuboglu (1992) <doi:10.1207/s15327906mbr2704_2>).
Authors:	Carmelo Fruciano [aut, cre]
Maintainer:	Carmelo Fruciano <[email protected]>
License:	MIT + file LICENSE
Version:	0.6.1.1
Built:	2026-05-17 09:29:22 UTC
Source:	https://github.com/fruciano/geometricmorphometricsmix

Test the significance of the adjusted Rand index

Description

Permutation test of the adjusted Rand index, which quantifies the level of agreement between two partitions (e.g., two schemes of classification of the same individuals obtained with two methods)

Usage

adjRand_test(A, B, perm = 999)
adjRand_test(A, B, perm = 999)

Arguments

A, B

numerical or character vectors reflecting the assignment of individual observations to groups

perm

number of permutations

Details

The adjusted Rand index (Hubert and Arabie 1985), is an adjusted for chance version of the Rand index (Rand 1971). The adjusted Rand index has an expected value of zero in the case of random partitions, and values approaching one as the two partitions become more similar to each other (with one being perfect match of the classification). This function implements the permutation test proposed by Qannari et al. (2014) to obtain a p value against the null hypothesis of independence of the two partitions.

This function is useful in various contexts, such as in integrative taxonomy when comparing the classification of individual specimens obtained using different data (e.g., sequence data and morphometric data). For an example of the application of this technique with the classification obtained with genetic data and morphometric data for multiple traits, see Fruciano et al. 2016.

Value

The function outputs a vector with the adjusted Rand index and the p value obtained from the permutation test

Notice

The function requires internally the package mclust.

Citation

If you use this function in the context of integrative taxonomy or similar (comparison of classification/unsupervised clustering with biological data), please cite all the papers in the references (otherwise, please use the relevant citations for the context).

References

Fruciano C, Franchini P, Raffini F, Fan S, Meyer A. 2016. Are sympatrically speciating Midas cichlid fish special? Patterns of morphological and genetic variation in the closely related species Archocentrus centrarchus. Ecology and Evolution 6:4102-4114.

Hubert L, Arabie P. 1985. Comparing partitions. Journal of Classification 2:193-218.

Qannari EM, Courcoux P, Faye P. 2014. Significance test of the adjusted Rand index. Application to the free sorting task. Food Quality and Preference 32, Part A:93-97.

Rand WM. 1971. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association 66:846-850.

Examples

library(mclust)
set.seed(123)

irisBIC = mclustBIC(iris[,-5])
mclustBIC_classification = summary(irisBIC,iris[,-5])$classification
original_classification = iris[,5]
# This is one of the examples in the package mclust
# Here a classification algorithm is used on the iris dataset

adjustedRandIndex(mclustBIC_classification, original_classification)
# The mclust package allows computing the adjusted Rand index
# which quantifies the agreement between the original (correct)
# classification and the one obtained with the algorithm.
# However, it is not clear whether the adjusted Rand index is
# "large enough" compared to the null hypothesis of independence
# between the two classification schemes

adjRand_test(mclustBIC_classification, original_classification,
             perm = 999)
# For that, we use the function adjRand_test, which performs
# the permutation test of Qannari et al. 2014 (in this case
# p<0.001, as 1000 permutations have been used).

adjRand_test(original_classification, original_classification,
             perm = 999)
# As it can be seen, in the ideal case of the exact same grouping,
# the adjusted Rand index takes a value of 1 (which is obviously
# significant)


library(mclust)
set.seed(123)

irisBIC = mclustBIC(iris[,-5])
mclustBIC_classification = summary(irisBIC,iris[,-5])$classification
original_classification = iris[,5]
# This is one of the examples in the package mclust
# Here a classification algorithm is used on the iris dataset

adjustedRandIndex(mclustBIC_classification, original_classification)
# The mclust package allows computing the adjusted Rand index
# which quantifies the agreement between the original (correct)
# classification and the one obtained with the algorithm.
# However, it is not clear whether the adjusted Rand index is
# "large enough" compared to the null hypothesis of independence
# between the two classification schemes

adjRand_test(mclustBIC_classification, original_classification,
             perm = 999)
# For that, we use the function adjRand_test, which performs
# the permutation test of Qannari et al. 2014 (in this case
# p<0.001, as 1000 permutations have been used).

adjRand_test(original_classification, original_classification,
             perm = 999)
# As it can be seen, in the ideal case of the exact same grouping,
# the adjusted Rand index takes a value of 1 (which is obviously
# significant)

Body arching vector from brown trout study

Description

A list containing the body arching vector obtained using common principal components following the procedure described in Fruciano et al. 2020, and the GPA consensus used to align individual models when the vector was computed. The GPA consensus is provided to align new data using the same consensus), but using the consensus for downstream work is not recommended.

Usage

data(arching_vector)
data(arching_vector)

Format

A list with two components as described above.

Details

The object is a list with the following elements:

GPA_consensus_used: A matrix of landmark coordinates for the GPA consensus to which the models have been aligned (brown trout dataset in this package)
arching_vector_CPCA: A numeric vector of shape change obtained using common principal component analysis as detailed in Fruciano et al. 2020.

References

Fruciano C., Schmidt, I., Ramirez Sanchez, M.M., Morek, W., Avila Valle, Z.A., Talijancic, I., Pecoraro, C., Schermann Legionnet, A. 2020. Tissue preservation can affect geometric morphometric analyses: a case study using fish body shape. Zoological Journal of the Linnean Society 188:148-162.

Examples

data(arching_vector)
a = arching_vector
names(a)
dim(a$GPA_consensus_used)
length(a$arching_vector_CPCA)
data(arching_vector)
a = arching_vector
names(a)
dim(a$GPA_consensus_used)
length(a$arching_vector_CPCA)

Brown trout landmark data

Description

A list containing landmark coordinates for a subset of brown trout specimens used in Fruciano et al. 2014 (Biological Journal of the Linnean Society) and Fruciano et al. 2020 (Zoological Journal of the Linnean Society). The subset originates from two rivers and was digitised by a single operator.

Usage

data(brown_trout)
data(brown_trout)

Format

A list with components described above.

Details

Notice that the dataset has been realigned using Generalized Procrustes Analysis (GPA) and corrected for body arching using the arching vector in data(arching_vector). Because this is a small subset of the fish in the original study (Fruciano et al. 2014), and contain only data from a single operator and treatment from Fruciano et al. 2020, the dataset should be strictly considered as a "toy" dataset and not used for formal statistical analyses. For conclusions about biological variation and measurement error due to preservation, please refer to the original studies (Fruciano et al. 2014, Fruciano et al. 2020, respectively).

The object is a list with the following components:

with_arching: Landmark coordinates for specimens without correction for body arching.
without_arching: Landmark coordinates for the same specimens after correction for body arching.
Factors: A data.frame with metadata for each specimen (e.g., centroid size, sex and river).

References

Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.

Examples

data(brown_trout)
d = brown_trout
names(d)
str(d$Factors)
data(brown_trout)
d = brown_trout
names(d)
str(d$Factors)

BTailTest for difference in disparity/morphospace occupation

Description

Performs the BTailTest, in the same spirit as implemented in the Matlab package MDA (Navarro 2003) and used in various empirical papers (e.g., Fruciano et al. 2014, 2016).

Usage

BTailTest(Reference, Test, boot = 1000)
BTailTest(Reference, Test, boot = 1000)

Arguments

Reference

Matrix or data frame containing data for the reference group (observations in rows, variables in columns).

Test

Matrix or data frame containing data for the test group (observations in rows, variables in columns).

boot

number of bootstrap replicates

Details

This is a test of the difference in disparity between two groups: a reference and a test group. The function proceeds by computing a bootstrapped distribution of the test statistics (multivariate variance and mean pairwise Euclidean distances in this implementation) in the reference sample and then comparing the statistics observed in the test sample to this distribution to obtain p-values.

Value

The function outputs a list with the following elements:

BootstrappedSamplesEstimates: Estimates of both multivariate variance and mean pairwise Euclidean distance for each of the bootstrapped samples
pvalues: p values obtained for the test

Citation

If you use this function, in addition to this package, please cite Navarro (2003)

References

Navarro N. 2003. MDA: a MATLAB-based program for morphospace-disparity analysis. Computers & Geosciences 29:655-664.

Compute the critical angle for the test of the angle between two multivariate vectors

Description

This function computes the angle below which two vectors would be "significantly similar" using Li 2011 test.

Usage

critical_angle(dimensions, alpha = 0.05, prec = "normal")
critical_angle(dimensions, alpha = 0.05, prec = "normal")

Arguments

dimensions

number of dimensions to use

alpha

significance level (by default 0.05)

prec

whether one has to use the internal R functions ("normal", default), or mpfr precision ("mpfr")

Details

This function considers the formulas in Li 2011 which have been used to perform a test of the angle between multivariate vectors. This is the test implemented in MorphoJ (Klingenberg 2011), in the R package Morpho (Schlager 2016), and in the function TestOfAngle of this package. The test produces a p value for the angle between two vectors (with significant values being "significantly similar").

This function reverts the logic of the formula/test and, given a number of dimensions (e.g., morphometric variables) and a level of significance (by default 0.05), it computes the angle below which the test would be significant.

Value

The function outputs the critical angle (in degrees)

Notice

In case of very many dimensions, there are numerical problems in performing the test. If prec="normal" the function uses internally the package Morpho (which should be installed) and these problems will start occurring above about 340 variables. If prec="mpfr" the function uses internally a custom function which requires the package Rmpfr (which should be installed). This allows the computation of extremely large or small numbers, it is currently set to a 120 bit precision and allows the computation using up to about 1200 variables (over this number, there will be problems even using the option "mpfr")

References

Klingenberg CP. 2011. MorphoJ: an integrated software package for geometric morphometrics. Mol Ecol Resour 11:353-357.

Li S. 2011. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics 4:66-70.

Schlager S. 2016. Morpho: Calculations and Visualisations Related to Geometric Morphometrics.

Examples


critical_angle(50)
# This is the angle below which two vectors (e.g., PCA axes)
# of 50 variables would be would be "significantly similar"

critical_angle(50)
# This is the angle below which two vectors (e.g., PCA axes)
# of 50 variables would be would be "significantly similar"

Resampling-based estimates (bootstrap or rarefaction) of disparity / morphospace occupation

Description

Provides a unified interface to obtain resampled (bootstrap or rarefied) estimates of several disparity / morphospace occupation statistics.

Usage

disparity_resample(
  Data,
  group = NULL,
  n_resamples = 1000,
  statistic = "multivariate_variance",
  CI = 0.95,
  bootstrap_rarefaction = "bootstrap",
  sample_size = NULL
)
disparity_resample(
  Data,
  group = NULL,
  n_resamples = 1000,
  statistic = "multivariate_variance",
  CI = 0.95,
  bootstrap_rarefaction = "bootstrap",
  sample_size = NULL
)

Arguments

Data

A data frame, matrix, vector, or 3D array. Observations (specimens) must be in rows (if a 3D array is supplied, the third dimension is assumed to index specimens).

group

A factor or a vector indicating group membership (will be coerced to factor). If 'NULL' (default) a single analysis is performed on the full dataset.

n_resamples

Number of resampling replicates (default 1000).

statistic

Character string identifying the statistic to compute. One of '"multivariate_variance"', '"mean_pairwise_euclidean_distance"', '"convex_hull_volume"', '"claramunt_proper_variance"'. Default is '"multivariate_variance"'. Ignored for univariate data (vector input).

CI

Desired two-sided confidence interval level (default 0.95) used for percentile confidence intervals. Use CI=1 to display the full range (minimum to maximum) of resampled values.

bootstrap_rarefaction

Either '"bootstrap"' (default) for resampling with replacement or '"rarefaction"' for resampling without replacement.

sample_size

Either 'NULL' (default), a positive integer indicating the number of rows to sample, or the character '"smallest"' to use the size of the smallest group (all groups resampled to that size). For 'bootstrap_rarefaction=="rarefaction"', sampling is without replacement and this parameter is required (not 'NULL'). For 'bootstrap_rarefaction=="bootstrap"', sampling is with replacement; if 'NULL', uses original group sizes, otherwise uses the specified sample size. If '"smallest"' is supplied but no groups are defined, an error is returned.

Details

The function allows choosing among the following multivariate statistics:

Multivariate variance (trace of the covariance matrix; sum of variances)
Mean pairwise Euclidean distance
Convex hull volume (n-dimensional)
Claramunt proper variance (Claramunt 2010) based on a linear shrinkage covariance matrix

If the input 'Data' is univariate (i.e., a vector), the analysis defaults to computing the (univariate) variance within each group (the attribute 'statistic' is ignored).

If 'bootstrap_rarefaction=="bootstrap"', the function performs resampling with replacement (i.e., classical bootstrap) by sampling rows of the data matrix / data frame. Optionally, the user can specify a custom sample size via the 'sample_size' argument. This allows comparison of bootstrap confidence intervals at the same sample size (essentially, this is rarefaction sampling with replacement), which can be useful to compare bootstrapped confidence intervals across different groups for statistics which are sensitive to sample size (at the expense of broader than necessary confidence intervals for groups that are larger).

If 'bootstrap_rarefaction=="rarefaction"', the function performs resampling without replacement at the sample size indicated in 'sample_size' (numeric) or, if 'sample_size=="smallest"', at the size of the smallest group (all groups are resampled to that size). Rarefaction requires specifying a valute to the attribute 'sample_size'; an error is returned otherwise.

This function uses the future framework for parallel processing, requiring packages future and future.apply. Users should set up their preferred parallelization strategy using future::plan() before calling this function. For example:

future::plan(future::sequential) for sequential processing
future::plan(future::multisession, workers = 4) for parallel processing with 4 workers (works on all platforms including Windows)
future::plan(future::multicore, workers = 4) for forked processes (Unix-like systems)
future::plan(future::cluster, workers = c("host1", "host2")) for cluster computing

If no plan is set or the future packages are not available, the function will use sequential processing.

Value

A list containing:

chosen_statistic: Character vector of length 1 with the human-readable name of the statistic used.
results: A data frame with columns 'group', 'average', 'CI_min', 'CI_max'. One row per group. When CI=1, 'CI_min' and 'CI_max' represent the minimum and maximum of resampled values rather than confidence intervals.
resampled_values: If a single group: numeric vector of length 'n_resamples' with the resampled values. If multiple groups: a named list with one numeric vector (length 'n_resamples') per group.
CI_level: The CI level used (between 0 and 1). When CI=1, ranges are computed instead of confidence intervals.

The returned object has class "disparity_resample" and comes with associated S3 methods for convenient display and visualization:

print.disparity_resample: Prints a formatted summary of results including confidence interval overlap assessment for multiple groups
plot.disparity_resample: Creates a confidence interval plot using ggplot2

Parallelization

This function automatically uses parallel processing via the future framework, when the packages future and future.apply are installed. This is particularly useful for large datasets, large number of resamples or computationally intensive statistics (e.g., convex hull volume). The parallelization strategy is determined by the user's choice of future plan, providing flexibility across different computing environments (local multicore, cluster, etc.). The function performs parallelization at the level of individual bootstrap/rarefaction replicates within each group. The future plan should be set up by the user before calling this function using future::plan() (see examples). If no plan is set or the future packages are not available, the function will use sequential processing.

Average estimate

For bootstrap resampling, the average estimate reported is the mean of the bootstrap resampled values (for consistency across all bootstrap scenarios). For rarefaction, because the purpose is to make groups comparable at a common (smaller) sample size, the average estimate reported is the mean of the rarefied resampled values for that group (i.e., the mean across all rarefaction replicates).

Input data types

'Data' can be a data frame, a matrix, a vector, or a 3D array of landmark coordinates (e.g., p landmarks x k dimensions x n specimens). In the latter case, the array is converted internally to a 2D matrix with specimens in rows and (landmark * dimension) variables in columns.

Note

Because of how the computation works, convex hull volume computation requires the number of observations (specimens) to be (substantially) greater than the number of variables (dimensions). In case of shape or similar, consider using the scores along the first (few/several) principal components. Sometimes errors are thrown due to near-zero components, in this case try reducing the number of principal components used. Examples of use of this statistic with geometric morphometric data include Drake & Klingenberg 2010 (American Naturalist), Fruciano et al. 2012 (Environmental Biology of Fishes) and Fruciano et al. 2014 (Biological Journal of the Linnean Society). Because of the sensitivity of this statistic to outliers, usually rarefaction is preferred to bootstrapping.

"Multivariate variance" is also called "total variance", "Procrustes variance" (in geometric morphometrics) and "sum of univariate variances". Note how the computation here does not divide variance by sample size (other than the normal division performed in the computation of variances).

References

Drake AG, Klingenberg CP. 2010. Large-scale diversification of skull shape in domestic dogs: disparity and modularity. American Naturalist 175:289-301.

Claramunt S. 2010. Discovering exceptional diversifications at continental scales: the case of the endemic families of Neotropical Suboscine passerines. Evolution 64:2004-2019.

Fruciano C, Tigano C, Ferrito V. 2012. Body shape variation and colour change during growth in a protogynous fish. Environmental Biology of Fishes 94:615-622.

Examples

set.seed(123)
# Simulate two groups with different means but same covariance
if (requireNamespace("MASS", quietly = TRUE)) {
  X1 = MASS::mvrnorm(20, mu=rep(0, 10), Sigma=diag(10))
  X2 = MASS::mvrnorm(35, mu=rep(2, 10), Sigma=diag(10))
  Data = rbind(X1, X2)
  grp = factor(c(rep("A", nrow(X1)), rep("B", nrow(X2))))

  # Sequential processing
  # future::plan(future::sequential)  # Default sequential processing
  
  # Parallel processing (uncomment to use)
  # future::plan(future::multisession, workers = 2)  # Use 2 workers

  # Bootstrap multivariate variance
  boot_res = disparity_resample(
    Data, group=grp, n_resamples=200,
    statistic="multivariate_variance",
    bootstrap_rarefaction="bootstrap"
  )
  # Direct access to results table
  boot_res$results
  
  # Using the print method for formatted output
  print(boot_res)
  
  # Using the plot method to visualize results
  # plot(boot_res)  # Uncomment to create confidence interval plot

  # Rarefaction (to the smallest group size) of mean pairwise
  # Euclidean distance
  rar_res = disparity_resample(
    Data, group=grp, n_resamples=200,
    statistic="mean_pairwise_euclidean_distance",
    bootstrap_rarefaction="rarefaction", sample_size="smallest"
  )
# Now simulate a third group with larger variance
 X3 = MASS::mvrnorm(15, mu=rep(0, 10), Sigma=diag(10)*1.5)
 grp2 = factor(
   c(rep("A", nrow(X1)), rep("B", nrow(X2)), rep("C", nrow(X3)))
 )
 boot_res2 = disparity_resample(
   Data=rbind(X1, X2, X3), group=grp2, n_resamples=1000,
   statistic="multivariate_variance",
   bootstrap_rarefaction="bootstrap"
 )
 print(boot_res2)
 # plot(boot_res2)
 # Plot of the obtained (95%) confidence intervals (uncomment to plot)
 
 # Reset to sequential processing when done (optional)
 # future::plan(future::sequential)
}

set.seed(123)
# Simulate two groups with different means but same covariance
if (requireNamespace("MASS", quietly = TRUE)) {
  X1 = MASS::mvrnorm(20, mu=rep(0, 10), Sigma=diag(10))
  X2 = MASS::mvrnorm(35, mu=rep(2, 10), Sigma=diag(10))
  Data = rbind(X1, X2)
  grp = factor(c(rep("A", nrow(X1)), rep("B", nrow(X2))))

  # Sequential processing
  # future::plan(future::sequential)  # Default sequential processing
  
  # Parallel processing (uncomment to use)
  # future::plan(future::multisession, workers = 2)  # Use 2 workers

  # Bootstrap multivariate variance
  boot_res = disparity_resample(
    Data, group=grp, n_resamples=200,
    statistic="multivariate_variance",
    bootstrap_rarefaction="bootstrap"
  )
  # Direct access to results table
  boot_res$results
  
  # Using the print method for formatted output
  print(boot_res)
  
  # Using the plot method to visualize results
  # plot(boot_res)  # Uncomment to create confidence interval plot

  # Rarefaction (to the smallest group size) of mean pairwise
  # Euclidean distance
  rar_res = disparity_resample(
    Data, group=grp, n_resamples=200,
    statistic="mean_pairwise_euclidean_distance",
    bootstrap_rarefaction="rarefaction", sample_size="smallest"
  )
# Now simulate a third group with larger variance
 X3 = MASS::mvrnorm(15, mu=rep(0, 10), Sigma=diag(10)*1.5)
 grp2 = factor(
   c(rep("A", nrow(X1)), rep("B", nrow(X2)), rep("C", nrow(X3)))
 )
 boot_res2 = disparity_resample(
   Data=rbind(X1, X2, X3), group=grp2, n_resamples=1000,
   statistic="multivariate_variance",
   bootstrap_rarefaction="bootstrap"
 )
 print(boot_res2)
 # plot(boot_res2)
 # Plot of the obtained (95%) confidence intervals (uncomment to plot)
 
 # Reset to sequential processing when done (optional)
 # future::plan(future::sequential)
}

Permutation test of difference in disparity/morphospace occupation

Description

Performs a permutation test of difference in disparity between two groups.

Usage

disparity_test(X1, X2, perm = 999)
disparity_test(X1, X2, perm = 999)

Arguments

X1, X2

Matrices or data frames containing data for each group (observations in rows, variables in columns).

perm

number of permutations

Details

The function employs commonly used test statistics to quantify disparity/morphospace occupation/variation in each group. The two statistics currently implemented are multivariate variance (also known as sum of variances, trace of the covariance matrix, Procrustes variance), and mean pairwise Euclidean distances. These two metrics have a long history in the quantification of disparity both in geometric morphometrics (e.g., Zelditch et al. 2004; Fruciano et al., 2014, 2016) and more in general in evolution (e.g., Foote, 1996; Willis 2001) The observed statistics are then compared to their empirical distributions obtained through permutations, to obtain a p-value.

Value

The function outputs a dataframe containing: the observed values of the tests statistics for each group, their absolute differences, and The p values obtained through the permutational procedure

Notice

The values of the test statistics in the output are the observed in the sample. If they are of interest, and the two groups have different sample size, consider computing their rarefied versions (for instance with the function disparity_resample) for reporting in papers and the like.

References

Foote M. 1997. The evolution of morphological diversity. Annual Review of Ecology and Systematics 28:129.

Wills MA. 2001. Morphological disparity: a primer. In. Fossils, phylogeny, and form: Springer. p. 55-144.

Zelditch ML, Swiderski DL, Sheets HD. 2004. Geometric morphometrics for biologists: a primer: Academic Press.

Examples

library(MASS)
set.seed(123)

X1=mvrnorm(20, mu=rep(0, 40), Sigma=diag(40))
X2=mvrnorm(100, mu=rep(5, 40), Sigma=diag(40))
# create two groups of random observations
# with different means and sample sizes,
# but the same covariance matrix

# We expect that the two groups will have the same
# variance (disparity/morphospace occupation)
# and therefore the test will be non-significant

disparity_test(X1, X2, perm=999)
# This is, indeed, the case

library(MASS)
set.seed(123)

X1=mvrnorm(20, mu=rep(0, 40), Sigma=diag(40))
X2=mvrnorm(100, mu=rep(5, 40), Sigma=diag(40))
# create two groups of random observations
# with different means and sample sizes,
# but the same covariance matrix

# We expect that the two groups will have the same
# variance (disparity/morphospace occupation)
# and therefore the test will be non-significant

disparity_test(X1, X2, perm=999)
# This is, indeed, the case

Bootstrapped distance between two arrays

Description

Computes bootstrapped estimates of the mean distance between two groups and their confidence intervals.

Usage

dist_mean_boot(A, B, boot = 1000, ci = 0.95, nA = nrow(A), nB = nrow(B))
dist_mean_boot(A, B, boot = 1000, ci = 0.95, nA = nrow(A), nB = nrow(B))

Arguments

A, B

Matrices or data frames containing data (observations in rows, variables in columns).

boot

number of bootstrap resamples

ci

width of the confidence interval

nA, nB

sample sizes for each bootstrapped group (defaults to original sample size)

Details

This may be useful to compare whether the differences between two groups are larger or smaller than differences between two other groups.

For instance, if we wanted to quantify shape sexual dimorphism in two populations, we could run this analysis separately for the two populations and then check the confidence intervals. If the two confidence intervals are disjunct there is evidence for the two populations having different levels of sexual dimorphism.

The computation performs bootstrap by resampling with replacement within each of the two groups and at each round computing the Euclidean distance between the two groups. It is also possible to resample at a different sample size than the one in the data using the attributes nA and nB.

Notice that the confidence interval is expressed on a scale between 0 and 1 and not as a percentage (e.g., 0.95 means 95

Value

The function outputs a named vector with the mean, median, upper and lower confidence interval bounds obtained from the bootstrapped samples

Escoufier RV coefficient

Description

Computes the Escoufier RV coefficient

Usage

EscoufierRV(Block1, Block2)
EscoufierRV(Block1, Block2)

Arguments

Block1, Block2

Matrices or data frames containing each block of variables (observations in rows, variables in columns).

Details

This function computes the usual version of the Escoufier RV coefficient (Escoufier, 1973), which quantifies the level of association between two multivariate blocks of variables. The function accepts two blocks of variables, either two data frames or two matrices each of n observations (specimens) as rows. The two blocks must have the same number of rows (specimens), but can have different number of columns (variables, such as landmark coordinates). The Escoufier RV has been shown (Fruciano et al. 2013) to be affected by sample size so comparisons of groups (e.g., species, populations) with different sample size should be avoided, unless steps are taken to account for this problem

Value

The function returns a number, corresponding to the Escoufier RV coefficient

References

Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.

Fruciano C, Franchini P, Meyer A. 2013. Resampling-Based Approaches to Study Variation in Morphological Modularity. PLoS ONE 8:e69376.

Examples

library(MASS)
set.seed(123)
A=mvrnorm(100,mu=rep(0,100), Sigma=diag(100))
# Create a sample of 100 'individuals'
# as multivariate normal random data
# We will consider the first 20 columns as the first
# block of variables, and the following one as the second block

EscoufierRV(A[,1:20],A[,21:ncol(A)])
# Compute the EscoufierRV using the two blocks of variables

library(MASS)
set.seed(123)
A=mvrnorm(100,mu=rep(0,100), Sigma=diag(100))
# Create a sample of 100 'individuals'
# as multivariate normal random data
# We will consider the first 20 columns as the first
# block of variables, and the following one as the second block

EscoufierRV(A[,1:20],A[,21:ncol(A)])
# Compute the EscoufierRV using the two blocks of variables

Check the relative positions for a set of landmarks, compared to a reference specimen

Description

The function compares the relative position of a set of landmarks to the one observed in a reference specimen. It also outputs which coordinates do not much the pattern observed in the reference specimen.

Usage

LM_relativepos_check(
  Dataset,
  LM_to_check,
  reference_specimen = 1,
  only_by_landmark_order = FALSE,
  use_specimen_names = TRUE
)
LM_relativepos_check(
  Dataset,
  LM_to_check,
  reference_specimen = 1,
  only_by_landmark_order = FALSE,
  use_specimen_names = TRUE
)

Arguments

Dataset

k x p x n array of k landmarks and p dimensions for n observations (specimens).

LM_to_check

Vector with the indices of which landmarks of Dataset should be tested

reference_specimen

The index of the specimen to use as reference (by default, the first).

only_by_landmark_order

If TRUE, each landmark in LM_to_check will be tested relative to the next in LM_to_check; otherwise all the pairwise relative positions between landmarks will be run

use_specimen_names

whether the names of specimens in Dataset should be used for the output

Details

Compare relative positions of landmarks to the one observed in a reference specimen. This function is useful to identify specimens with switched landmark positions and similar problems when a dataset has been collected using consistent criteria (e.g., a set of fish body pictures, with the mouth-tail axis approximately horizontal for all specimens).

For instance, if we want to check that landmarks 1, 2, and 3 are all aligned along the y coordinate in 2D, we will use a specimen (usually the first), checking that it has been landmarked correctly. Then, we will run the function on landmarks 1, 2, and 3 (by setting LM_to_check=c(1, 2, 3)). The function will output a list of which specimens seem to be problematic and which landmarks and coordinates for these specimens seem problematic. The putatively problematic coordinates will be indicated with "0". Clearly, if we are only interested in 1, 2, and 3 being along y, it will not matter if in some case we will find "0" under Coord1, but only if we find a "0" under Coord2.

Value

The function outputs a list with the following elements:

potentially_problematic_idx: Indices of potentially problematic specimens
potentially_problematic_LMs: List of potentially problematic landmarks and coordinates for the specimens in potentially_problematic_idx

Notice

Generally, it is better to use this function with a small number of carefully selected landmarks, instead of running it on all landmarks at the same time.

The parameter only_by_landmark_order allows to set whether all combinations of landmarks should be tested (default), or not.

If only_by_landmark_order is set to TRUE, each landmark in LM_to_check will be only tested with the next one. For instance if LM_to_check=c(1,2,3) the function will only compare the first landmark with the second, and the second with the third.

If the function cannot find potentially problematic specimens, a message to this effect will be presented.

Perform parallel analysis

Description

Parallel analysis based on permutations

Usage

parallel_analysis(X, perm = 999, fun = c("prcomp", "fastSVD", "shrink"))
parallel_analysis(X, perm = 999, fun = c("prcomp", "fastSVD", "shrink"))

Arguments

X

Matrix or data frame containing the original data (observations in rows, variables in columns).

perm

number of permutations

fun

function to use internally to obtain eigenvalues (see Details)

Details

The function allows performing parallel analysis, which is a way to test for the number of significant eigenvalues/axes in a PCA. In this implementation, a null distribution of eigenvalues is obtained by randomly permuting observations independently for each of the starting variables. To compute p values, the observed eigenvalues are compared to the corresponding eigenvalues from this null distribution.

Parallel analysis may be used for dimensionality reduction, retaining only the first block of consecutive significant axes. That is, if for example the first 3 axes were significant, then the fourth not significant, one would keep only the first 3 axes (regardless of significance of the axes from the fifth on). Similarly, if the first axis is not significant, this may suggest lack of a clear structure in the data.

The function internally employs three possible strategies to obtain eigenvalues (argument of fun):

"prcomp" - the function prcomp (default)
"fastSVD" - an approach based on the function fast.svd (requires the package corpcor)
"shrink" - a decomposition of the covariance matrix estimated using linear shrinkage (much slower, requires the package nlshrink; Ledoit & Wolf 2004)

This choice should not make much difference in terms of the final result. However, for consistency, it is a good idea to use for parallel analysis the same function used for the actual PCA (this is why these three options are provided).

Value

Vector of class "parallel_analysis" containing the p values for each of the axes of a PCA on the data provided

The object of class parallel_analysis returned by the function has a summary() method associated to it. This means that using summary() on an object created by this function, a suggestion on the number of significant axes (if any) is provided (see examples).

Citation

The most appropriate citation for this approach to parallel analysis (using permutations) is Buja & Eyuboglu (1992).

References

Ledoit O, Wolf M. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365-411.

Buja A, Eyuboglu N. 1992. Remarks on parallel analysis. Multivariate Behavioral Research 27(4):509-540.

Examples

set.seed(666)
X=MASS::mvrnorm(100, mu=rep(0, 50), Sigma=diag(50))
# Simulate a multivariate random normal dataset
# with 100 observations and 50 indipendent variables

PA=parallel_analysis(X, perm = 999, fun = "fastSVD")
# Perform parallel analysis

summary(PA)
# Look at a summary of the results from parallel analysis
# Notice that no axis is significant
# This is correct, as we had simulated data with no structure

print(PA)
# Look at the p values for each axis


set.seed(666)
X=MASS::mvrnorm(100, mu=rep(0, 50), Sigma=diag(50))
# Simulate a multivariate random normal dataset
# with 100 observations and 50 indipendent variables

PA=parallel_analysis(X, perm = 999, fun = "fastSVD")
# Perform parallel analysis

summary(PA)
# Look at a summary of the results from parallel analysis
# Notice that no axis is significant
# This is correct, as we had simulated data with no structure

print(PA)
# Look at the p values for each axis

Plot method for disparity_resample objects

Description

Creates a confidence interval plot for disparity resample results

Usage

## S3 method for class 'disparity_resample'
plot(
  x,
  point_color = "darkblue",
  errorbar_color = "darkred",
  title = NULL,
  ...
)
## S3 method for class 'disparity_resample'
plot(
  x,
  point_color = "darkblue",
  errorbar_color = "darkred",
  title = NULL,
  ...
)

Arguments

x

An object of class "disparity_resample"

point_color

A single color or a vector of colors for point estimates. If length 1, the same color is used for all points. If length equals the number of groups, colors are assigned per group. (default "darkblue")

errorbar_color

A single color or a vector of colors for error bars. Follows the same recycling rules as 'point_color'. (default "darkred")

title

Optional character string for plot title (default NULL)

...

Additional arguments (currently unused)

Value

A ggplot object

Plot method for EscoufierRVrarefy objects

Description

Creates a confidence interval plot for RV rarefaction results

Usage

## S3 method for class 'EscoufierRVrarefy'
plot(x, point_color = "darkblue", errorbar_color = "darkred", ...)
## S3 method for class 'EscoufierRVrarefy'
plot(x, point_color = "darkblue", errorbar_color = "darkred", ...)

Arguments

x

An object of class "EscoufierRVrarefy"

point_color

errorbar_color

A single color or a vector of colors for error bars. Follows the same recycling rules as 'point_color'. (default "darkred")

...

Additional arguments passed to the underlying plotting function

Value

A ggplot object

Partial least squares (PLS) analysis

Description

Performs a two-block PLS analysis, optionally allowing for tests of significance using permutations

Usage

pls(X, Y, perm = 999, global_RV_test = TRUE)
pls(X, Y, perm = 999, global_RV_test = TRUE)

Arguments

X, Y

Matrices or data frames containing each block of variables (observations in rows, variables in columns).

perm

number of permutations to use for hypothesis testing

global_RV_test

logical - whether global significance of the association should be tested

Details

This function performs a PLS analysis (sensu Rohlf & Corti 2000). Given two blocks of variables (shape or other variables) scored on the same observations (specimens), this analysis finds a series of pairs of axis accounting for maximal covariance between the two blocks. If tests of significance with permutations are selected, three different significance tests are performed:

Global significance: Tested using Escoufier RV
Axis-specific significance (singular value): Same test described in Rohlf & Corti 2000
Axis-specific significance (correlation of PLS scores): For each pair of PLS (singular) axes, tests the correlation between scores of the first and second block

The object of class pls_fit returned by the function has print() and summary() methods associated to it. This means that using these generic functions on an object created by this function (see examples), it is possible to obtain information on the results. In particular, print() returns a more basic set of results on the global association, whereas summary() returns (only if permutation tests are used) results for each pair of singular axes.

Optionally, analysis using multiple cores to speedup permutation testin can be performed using the future framework (requires package future.apply and setting a plan with multiple workers)

Value

The function outputs an object of class "pls_fit" and "list" with the following elements:

XScores: Scores along each singular (PLS) axis for the first block of variables (X)
YScores: Scores along each singular (PLS) axis for the second block of variables (Y)
U: Left singular axes
V: Right singular axes
D: Singular values
percentage_squared_covariance: Percented squared covariance accounted by each pair of axes
global_significance_RV: (only if perm>0 and global_RV_test is TRUE) Observed value of Escoufier RV coefficient and p value obtained from the permutation test
singular_axis_significance: (only if perm>0) For each pair of singular (PLS) axis, the singular value, the correlation between scores, their significance level based on permutation, and the proportion of squared covariance accounted are reported
OriginalData: Data used in the analysis
x_center: Values used to center data in the X block
y_center: Values used to center data in the Y block

Notice

The function does NOT perform GPA when applied to separate configurations of points.
When using the Escoufier RV, notice that the value reported is the observed value without rarefaction. For a description of the problem, please see Fruciano et al 2013. To obtain rarefied estimates of Escoufier RV and their confidence interval, use the function RVrarefied.
In the permutation test, rows of Y are permuted, so using the block with fewer variables as Y may speed up computations and substantially reduce memory usage.
When using the print() and summary() on the pls_fit objects obtained with this function, some of the values are rounded for ease of interpretation. The non-rounded values can be obtained accessing individual elements of the object (see examples).

Citation

If you use this function to perform the PLS analysis and test for significance, cite Rohlf & Corti 2000 (or earlier references outside of geometric morphometrics). If you report the test of significance based on the Escoufier RV coefficient, also cite Escoufier 1975. If you also use the major axis approach to obtain estimates of the shape (or other variable) predicted by each pair of axes, please cite Fruciano et al. 2020

References

Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.

Fruciano C, Colangelo P, Castiglia R, Franchini P. 2020. Does divergence from normal patterns of integration increase as chromosomal fusions increase in number? A test on a house mouse hybrid zone. Current Zoology 66:527–538.

Rohlf FJ, Corti M. 2000. Use of Two-Block Partial Least-Squares to Study Covariation in Shape. Systematic Biology 49:740-753.

Examples


##############################
### Example 1: random data ###
##############################

library(MASS)
set.seed(123)
A=as.data.frame(mvrnorm(100,mu=rep(0,20), Sigma=diag(20)))
B=as.data.frame(mvrnorm(100,mu=rep(0,10), Sigma=diag(10)))
# Create two blocks of, respectively, 20 and 10 variables
# for 100 observations.
# This simulates two different blocks of data (shape or otherwise)
# measured on the same individuals.
# Note that, as we are simulating them independently,
# we don't expect substantial covariation between blocks

PLS_AB=pls(A, B, perm=99)
# Perform PLS analysis and use 99 permutations for testing
# (notice that in a real analysis, normally one uses more permutations)
print(PLS_AB)
# As expected, we do not find significant covariation between the
# two blocks

summary(PLS_AB)
# The same happens when we look at the results for each of the axes

# Notice that both for print() and summary() some values are
# rounded for ease of visualization
# However, the correct values can be always obtained from the
# object created by the function
# e.g.,
PLS_AB$singular_axis_significance


######################################
### Example 2: using the classical ###
### iris data set as a toy example ###
######################################

data(iris)
# Import the iris dataset
set.seed(123)

versicolor_data=iris[iris$Species=="versicolor",]
# Select only the specimens belonging to the species Iris versicolor
versicolor_sepal=versicolor_data[,grep(
 "Sepal", colnames(versicolor_data)
)]
versicolor_petal=versicolor_data[,grep(
 "Petal", colnames(versicolor_data)
)]
# Separate sepal and petal data
PLS_sepal_petal=pls(versicolor_sepal, versicolor_petal, perm=99)
# Perform PLS with permutation test
# (again, chosen few permutations)

print(PLS_sepal_petal)
summary(PLS_sepal_petal)
# Global results and results for each axis
# (suggesting significant association)


##############################
### Example 1: random data ###
##############################

library(MASS)
set.seed(123)
A=as.data.frame(mvrnorm(100,mu=rep(0,20), Sigma=diag(20)))
B=as.data.frame(mvrnorm(100,mu=rep(0,10), Sigma=diag(10)))
# Create two blocks of, respectively, 20 and 10 variables
# for 100 observations.
# This simulates two different blocks of data (shape or otherwise)
# measured on the same individuals.
# Note that, as we are simulating them independently,
# we don't expect substantial covariation between blocks

PLS_AB=pls(A, B, perm=99)
# Perform PLS analysis and use 99 permutations for testing
# (notice that in a real analysis, normally one uses more permutations)
print(PLS_AB)
# As expected, we do not find significant covariation between the
# two blocks

summary(PLS_AB)
# The same happens when we look at the results for each of the axes

# Notice that both for print() and summary() some values are
# rounded for ease of visualization
# However, the correct values can be always obtained from the
# object created by the function
# e.g.,
PLS_AB$singular_axis_significance


######################################
### Example 2: using the classical ###
### iris data set as a toy example ###
######################################

data(iris)
# Import the iris dataset
set.seed(123)

versicolor_data=iris[iris$Species=="versicolor",]
# Select only the specimens belonging to the species Iris versicolor
versicolor_sepal=versicolor_data[,grep(
 "Sepal", colnames(versicolor_data)
)]
versicolor_petal=versicolor_data[,grep(
 "Petal", colnames(versicolor_data)
)]
# Separate sepal and petal data
PLS_sepal_petal=pls(versicolor_sepal, versicolor_petal, perm=99)
# Perform PLS with permutation test
# (again, chosen few permutations)

print(PLS_sepal_petal)
summary(PLS_sepal_petal)
# Global results and results for each axis
# (suggesting significant association)

Major axis predictions for partial least squares (PLS) analysis

Description

Project data on the major axis of PLS scores and obtain associated predictions

Usage

pls_major_axis(
  pls_object,
  new_data_x = NULL,
  new_data_y = NULL,
  axes_to_use = 1,
  scale_PLS = TRUE
)
pls_major_axis(
  pls_object,
  new_data_x = NULL,
  new_data_y = NULL,
  axes_to_use = 1,
  scale_PLS = TRUE
)

Arguments

pls_object

object of class "pls_fit" obtained from the function pls

new_data_x, new_data_y

(optional) matrices or data frames containing new data

axes_to_use

number of pairs of PLS axes to use in the computation (by default, this is performed only on the first axis)

scale_PLS

logical indicating whether PLS scores for different blocks should be scaled prior to computing the major axis

Details

This function acts on a pls_fit object obtained from the function pls. More in detail, the function:

Projects the original data onto the major axis for each pair of PLS axes (obtaining for each observation of the original data a score along this axis).
For each observation (specimen) of the original data, obtains the shape predicted by its score along the major axis.
(Optionally) if new data is provided, these data are first projected in the original PLS space and then the two operations above are performed on the new data.

A more in-depth explanation with a figure which allows for a more intuitive understanding is provided in Fruciano et al 2020 The idea is to obtain individual-level estimates of the shape predicted by a PLS model. This can be useful, for instance, to quantify to which extent the shape of a given individual from one group resembles the shape that individual would have according to the model computed in another group. This can be done by obtaining predictions with this function and then computing the distance between the actual shape observed for each individual and its prediction obtained from this function. This is, indeed, how this approach has been used in Fruciano et al 2020.

The function returns a list with two or three main elements which are themselves lists. The most useful elements for the final user are highlighted in boldface.

original_major_axis_projection is a list containing as many elements as specified in axes_to_use (default 1). Each of this elements contains the details of the computation of the major axis (as a PCA of PLS scores for a pair of axes), and in particular:

major_axis_rotation: Eigenvector
mean_pls_scores: Mean scores for that axis pair used in the computation
pls_scale: Scaling factor used
original_data_PLS_projection: Scores of the original data on the major axis

original_major_axis_predictions_reversed contains the predictions of the PLS model for the original data, back-transformed to the original space (i.e., if the original data was shape, this will be shape). If axes_to_use > 1, these predictions will be based on the major axis computed for all pairs of axes considered. This element has two sub-elements:

Block1: Prediction for block 1
Block2: Prediction for block 2

new_data_results is only returned when new data is provided and contains the results of the analyses obtained using a previous PLS model on new data and, in particular:

new_data_Xscores: PLS scores of the new data using the old model for the first block
new_data_Yscores: PLS scores of the new data using the old model for the second block
new_data_major_axis_proj: Scores of the new data on the major axis computed using the PLS model provided in pls_object. If axes_to_use > 1, each column corresponds to a separate major axis
new_data_Block1_proj_prediction_revert: Predictions for Block 1 of the new data obtained by first computing the major axis projections (element new_data_major_axis_proj) and then back-transforming these projections to the original space (e.g., shape)
new_data_Block2_proj_prediction_revert: Predictions for Block 2 of the new data obtained by first computing the major axis projections (element new_data_major_axis_proj) and then back-transforming these projections to the original space (e.g., shape)

Value

The function outputs a list with the following elements (please, see the Details section for explanations on their sub-elements):

original_major_axis_projection: For each PLS axis pair, results of the computation of major axis and projection of the original data on each axis
original_major_axis_predictions_reversed: Data obtained back-transforming the scores on the major axis into the original space (e.g., shape)
new_data_results: (only if new data has been provided) PLS scores for the new data, scores of the new data on the major axis, preditions for the new data back-transformed into the original space (e.g., shape)

Citation

If you use this function, please cite Fruciano et al. 2020.

Notice

If new data is provided, this is first centered to the same average as in the original analysis, then it is translated back to the original scale.

References

Examples




######################################
### Example using the classical    ###
### iris data set as a toy example ###
######################################

data(iris)
# Import the iris dataset
versicolor_data=iris[iris$Species=="versicolor",]
# Select only the specimens belonging to the species Iris versicolor
versicolor_sepal=versicolor_data[,grep("Sepal",
                                       colnames(versicolor_data))]
versicolor_petal=versicolor_data[,grep("Petal",
                                       colnames(versicolor_data))]
# Separate sepal and petal data for I. versicolor


PLS_sepal_petal_versicolor=pls(versicolor_sepal,
                               versicolor_petal,
                               perm=99)
summary(PLS_sepal_petal_versicolor)
# Compute the PLS for I. versicolor


plot(PLS_sepal_petal_versicolor$XScores[,1],
     PLS_sepal_petal_versicolor$YScores[,1],
     asp = 1,
     xlab = "PLS 1 Block 1 scores",
     ylab = "PLS 1 Block 2 scores")
# Plot the scores for the original data on the first pair of PLS
# axes (one axis per block)
# This is the data based on which we will compute the major axis
# direction
# Imagine fitting a line through those point, that is the major axis

Pred_major_axis_versicolor=pls_major_axis(PLS_sepal_petal_versicolor,
                                          axes_to_use=1)
# Compute for I. versicolor the projections to the major axis
# using only the first pair of PLS axes (and scaling the scores
# prior to the computation)

hist(Pred_major_axis_versicolor$original_major_axis_projection[[1]]$original_data_PLS_projection,
     main="Original data - projections on the major axis - based on the first pair of PLS axes",
     xlab="Major axis score")
# Plot distribution of PLS scores for each individual in the
# original data (I. versicolor)
# projected on the major axis for the first pair of PLS axis

Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block1
Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block2
# Shape for each individual of the original data (I. versicolor)
# predicted by its position along the major axis

# Now we will use the data from new species (I. setosa and I
# virginica) and obtain predictions from the PLS model obtained for
# I. versicolor

# The easiest is to use the data for all three species
# as if they were both new data
# (using versicolor as new data is not going to affect the model)


all_sepal=iris[,grep("Sepal", colnames(iris))]
all_petal=iris[,grep("Petal", colnames(iris))]
# Separate sepals and petals (they are the two blocks)

Pred_major_axis_versicolor_newdata=pls_major_axis(
  pls_object=PLS_sepal_petal_versicolor,
  new_data_x = all_sepal,
  new_data_y = all_petal,
  axes_to_use=1)
# Perform the major axis computation using new data
# Notice that:
# - we are using the old PLS model (computed on versicolor only)
# - we are adding the new data in the same order as in the original
#   model (i.e., new_data_x is sepal data, new_data_y is petal data)


plot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_Xscores[,1],
     Pred_major_axis_versicolor_newdata$new_data_results$new_data_Yscores[,1],
     col=iris$Species, asp=1,
     xlab = "Old PLS, Axis 1, Block 1",
     ylab = "Old PLS, Axis 1, Block 2")
# Plot the new data (both versicolor and setosa)
# in the space of the first pair of PLS axes computed only on
# versicolor
# The three species follow a quite similar trajectories
# but they have different average value on the major axis

# To visualize this better, we can plot the scores along the major
# axis for the three species
boxplot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_major_axis_proj[,1]~
        iris$Species,
        xlab="Species",
        ylab="Score on the major axis")

# We can also visualize the deviations from the major axis
# For instance by putting the predictions of the two blocks together
# Computing differences and then performing a PCA
predictions_all_data=cbind(
  Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block1_proj_prediction_revert,
  Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block2_proj_prediction_revert)
# Get the predictions for the two blocks (sepals and petals)
# and put them back together

Euc_dist_from_predictions=unlist(lapply(seq(nrow(iris)),
                                         function(i)
  dist(rbind(iris[i,1:4],predictions_all_data[i,]))))
# for each flower, compute the Euclidean distance between
# the original values and what is predicted by the model

boxplot(Euc_dist_from_predictions~iris$Species,
        xlab="Species",
        ylab="Euclidean distance from prediction")
# I. setosa is the one which deviates the most from the prediction



######################################
### Example using the classical    ###
### iris data set as a toy example ###
######################################

data(iris)
# Import the iris dataset
versicolor_data=iris[iris$Species=="versicolor",]
# Select only the specimens belonging to the species Iris versicolor
versicolor_sepal=versicolor_data[,grep("Sepal",
                                       colnames(versicolor_data))]
versicolor_petal=versicolor_data[,grep("Petal",
                                       colnames(versicolor_data))]
# Separate sepal and petal data for I. versicolor


PLS_sepal_petal_versicolor=pls(versicolor_sepal,
                               versicolor_petal,
                               perm=99)
summary(PLS_sepal_petal_versicolor)
# Compute the PLS for I. versicolor


plot(PLS_sepal_petal_versicolor$XScores[,1],
     PLS_sepal_petal_versicolor$YScores[,1],
     asp = 1,
     xlab = "PLS 1 Block 1 scores",
     ylab = "PLS 1 Block 2 scores")
# Plot the scores for the original data on the first pair of PLS
# axes (one axis per block)
# This is the data based on which we will compute the major axis
# direction
# Imagine fitting a line through those point, that is the major axis

Pred_major_axis_versicolor=pls_major_axis(PLS_sepal_petal_versicolor,
                                          axes_to_use=1)
# Compute for I. versicolor the projections to the major axis
# using only the first pair of PLS axes (and scaling the scores
# prior to the computation)

hist(Pred_major_axis_versicolor$original_major_axis_projection[[1]]$original_data_PLS_projection,
     main="Original data - projections on the major axis - based on the first pair of PLS axes",
     xlab="Major axis score")
# Plot distribution of PLS scores for each individual in the
# original data (I. versicolor)
# projected on the major axis for the first pair of PLS axis

Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block1
Pred_major_axis_versicolor$original_major_axis_predictions_reversed$Block2
# Shape for each individual of the original data (I. versicolor)
# predicted by its position along the major axis

# Now we will use the data from new species (I. setosa and I
# virginica) and obtain predictions from the PLS model obtained for
# I. versicolor

# The easiest is to use the data for all three species
# as if they were both new data
# (using versicolor as new data is not going to affect the model)


all_sepal=iris[,grep("Sepal", colnames(iris))]
all_petal=iris[,grep("Petal", colnames(iris))]
# Separate sepals and petals (they are the two blocks)

Pred_major_axis_versicolor_newdata=pls_major_axis(
  pls_object=PLS_sepal_petal_versicolor,
  new_data_x = all_sepal,
  new_data_y = all_petal,
  axes_to_use=1)
# Perform the major axis computation using new data
# Notice that:
# - we are using the old PLS model (computed on versicolor only)
# - we are adding the new data in the same order as in the original
#   model (i.e., new_data_x is sepal data, new_data_y is petal data)


plot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_Xscores[,1],
     Pred_major_axis_versicolor_newdata$new_data_results$new_data_Yscores[,1],
     col=iris$Species, asp=1,
     xlab = "Old PLS, Axis 1, Block 1",
     ylab = "Old PLS, Axis 1, Block 2")
# Plot the new data (both versicolor and setosa)
# in the space of the first pair of PLS axes computed only on
# versicolor
# The three species follow a quite similar trajectories
# but they have different average value on the major axis

# To visualize this better, we can plot the scores along the major
# axis for the three species
boxplot(Pred_major_axis_versicolor_newdata$new_data_results$new_data_major_axis_proj[,1]~
        iris$Species,
        xlab="Species",
        ylab="Score on the major axis")

# We can also visualize the deviations from the major axis
# For instance by putting the predictions of the two blocks together
# Computing differences and then performing a PCA
predictions_all_data=cbind(
  Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block1_proj_prediction_revert,
  Pred_major_axis_versicolor_newdata$new_data_results$new_data_Block2_proj_prediction_revert)
# Get the predictions for the two blocks (sepals and petals)
# and put them back together

Euc_dist_from_predictions=unlist(lapply(seq(nrow(iris)),
                                         function(i)
  dist(rbind(iris[i,1:4],predictions_all_data[i,]))))
# for each flower, compute the Euclidean distance between
# the original values and what is predicted by the model

boxplot(Euc_dist_from_predictions~iris$Species,
        xlab="Species",
        ylab="Euclidean distance from prediction")
# I. setosa is the one which deviates the most from the prediction

Print method for disparity_resample objects

Description

Prints results table and checks for CI overlap among groups

Usage

## S3 method for class 'disparity_resample'
print(x, ...)
## S3 method for class 'disparity_resample'
print(x, ...)

Arguments

x

An object of class "disparity_resample"

...

Additional arguments (not used)

Value

Invisibly returns the input object

Print method for EscoufierRVrarefy objects

Description

Prints results table and checks for CI overlap among groups

Usage

## S3 method for class 'EscoufierRVrarefy'
print(x, ...)
## S3 method for class 'EscoufierRVrarefy'
print(x, ...)

Arguments

x

An object of class "EscoufierRVrarefy"

...

Additional arguments (not used)

Value

Invisibly returns the input object

Project to subspace orthogonal to a vector

Description

This function projects data to the subspace orthogonal to a multivariate column vector

Usage

ProjectOrthogonal(Data, vector)
ProjectOrthogonal(Data, vector)

Arguments

Data

matrix n x p of n observation for p variables,

vector

column vector (matrix p x 1 of p variables)

Details

This function is useful to remove from a dataset the variation along a specific direction (e.g., a principal component). It has been used extensively for many applications, such as

'Size Correction' (removal of an allometric vector), also called 'Burnaby's method (Burnaby 1966; see also Rohlf & Bookstein 1987)
Remove body arching from fish (Valentin et al. 2008; see also Fruciano 2016 for a discussion and other examples of usage in the context of measurement error in geometric morphometrics)
Removing variation due to sexual dimorphism on a set of individuals with unknown sex (Fruciano et al. 2014)

Optionally, vector can also be a matrix with more than one column (in this case the Data is projected to the subspace orthogonal to the space spanned by all dimensions in vector)

Value

The function outputs a matrix n x p of the original data projected to the subspace orthogonal to the vector

References

Burnaby T. 1966. Growth-invariant discriminant functions and generalized distances. Biometrics:96-110.

Fruciano C. 2016. Measurement error in geometric morphometrics. Development Genes and Evolution 226:139-158.

Rohlf FJ, Bookstein FL. 1987. A Comment on Shearing as a Method for' Size Correction'. Systematic Zoology:356-367.

Valentin AE, Penin X, Chanut JP, Sévigny JM, Rohlf FJ. 2008. Arching effect on fish body shape in geometric morphometric studies. Journal of Fish Biology 73:623-638.

Examples

library(MASS)
A=mvrnorm(50,mu=rep(1,50),Sigma=diag(50))
B=mvrnorm(50,mu=rep(0,50),Sigma=diag(50))
AB=rbind(A,B)
Group=as.factor(c(rep(1,50),rep(2,50)))
# Create two groups of observations (e.g., specimens)
# one centered at 0 and the other at 1
# and combine them in a single sample

PCA=prcomp(AB)
# Combine the two groups and perform a PCA

plot(PCA$x[,1],PCA$x[,2], asp=1, col=Group)
# Plot the scores along the first two principal components
# The two groups are clearly distinct (red and black)

ABproj=ProjectOrthogonal(AB,cbind(PCA$rotation[,1]))
# Project the original data (both groups)
# to the subspace orthogonal to the first principal component
# (which is the direction along which there is most of variation
# among groups)

PCAproj=prcomp(ABproj)
# Perform a new PCA on the 'corrected' dataset

plot(PCAproj$x[,1], PCAproj$x[,2], asp=1, col=Group)
# Plot the scores along the first two principal components
# of the 'corrected' data
# Notice how the two groups are now pretty much
# indistinguishable



library(MASS)
A=mvrnorm(50,mu=rep(1,50),Sigma=diag(50))
B=mvrnorm(50,mu=rep(0,50),Sigma=diag(50))
AB=rbind(A,B)
Group=as.factor(c(rep(1,50),rep(2,50)))
# Create two groups of observations (e.g., specimens)
# one centered at 0 and the other at 1
# and combine them in a single sample

PCA=prcomp(AB)
# Combine the two groups and perform a PCA

plot(PCA$x[,1],PCA$x[,2], asp=1, col=Group)
# Plot the scores along the first two principal components
# The two groups are clearly distinct (red and black)

ABproj=ProjectOrthogonal(AB,cbind(PCA$rotation[,1]))
# Project the original data (both groups)
# to the subspace orthogonal to the first principal component
# (which is the direction along which there is most of variation
# among groups)

PCAproj=prcomp(ABproj)
# Perform a new PCA on the 'corrected' dataset

plot(PCAproj$x[,1], PCAproj$x[,2], asp=1, col=Group)
# Plot the scores along the first two principal components
# of the 'corrected' data
# Notice how the two groups are now pretty much
# indistinguishable

Perform test on two repeated measures

Description

Test based on Hotelling's T squared for the null hypothesis of no effect between two repeated measures (e.g., treatment/control)

Usage

repeated_measures_test(T1, T2, rnames = TRUE, shrink = FALSE)
repeated_measures_test(T1, T2, rnames = TRUE, shrink = FALSE)

Arguments

T1, T2

matrices (n x p of n observation for p variables) or arrays (t x p x n of n observations, t landmarks in p dimensions),

rnames

if TRUE (default) the rownames of the matrix or the names along the 3rd dimension (for arrays) will be used to match the order

shrink

if TRUE, a shrinkage estimator of covariance is used internally

Details

The function assumes that each individual observation (e.g., specimen) has been measured two times (e.g., at two time points, or between two treatments).

If rnames is TRUE (default), the rownames of the matrix or the names along the 3rd dimension (for arrays) will be used to match the order of observations (e.g., specimens) between the two datasets. Otherwise, the function will assume that the observations in T1 and T2 are in the same order.

This function is useful in various contexts, such as:

testing the effect of preservation (Fruciano et al. 2020)
testing for variation through time

For instance, in the context of the effect of preservation on geometric morphometrics, it has been argued (Fruciano, 2016) that various studies have improperly used on repeated measures data methods developed for independent observations, and this can lead to incorrect inference.

Value

The function outputs a matrix n x p of the original data projected to the subspace orthogonal to the vector

Notice

The function requires internally non-singular matrices (for instance, number of cases should be larger than the number of variables). One solution can be to perform a principal component analysis and use the scores for all the axes with non-zero and non-near-zero eigenvalues. To overcome some situations where a singular matrix can occurr, the function can use internally a shrinkage estimator of the covariance matrix (Ledoit & Wolf 2004). This is called setting shrink = TRUE. However, in this case, the package nlshrink should have been installed. Also, notice that if the matrices T1 and T2 are provided as arrays, this requires the package Morpho to be installed.

Citation

If you use this function please cite Fruciano et al. 2020

References

Fruciano C. 2016. Measurement error in geometric morphometrics. Development Genes and Evolution 226:139-158.

Ledoit O, Wolf M. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365-411.

Rescale landmark data based on interlandmark distances

Description

Convenience function which rescales a dataset of landmarks based on a vector of distances between two landmarks

Usage

rescale_by_landmark_distance(Data, lm1, lm2, lengths)
rescale_by_landmark_distance(Data, lm1, lm2, lengths)

Arguments

Data

array k x p x n of k landmarks and p dimensions for n observations (specimens)

lm1, lm2

index of the two landmarks whose inter-landmark distance is known

lengths

vector of n lengths (distances between lm1 and lm2 in the appropriate scale)

Details

This function can be useful when one has the distance between two landmarks (e.g., obtained with a caliper), but a scale has not been set when acquiring the data (for instance, a scale bar was missing on photos, so configuration of landmarks are scaled in pixels).

Value

The function outputs an array k x p x n of rescaled landmark coordinates

'Reverse' PCA

Description

Obtain data in the original variables starting from PCA scores

Usage

reversePCA(Scores, Eigenvectors, Mean)
reversePCA(Scores, Eigenvectors, Mean)

Arguments

Scores

matrix n x s of n observation for s scores. Can be a single column (for instance, if one is interested in PC1 only)

Eigenvectors

matrix p x s of eigenvectors (as column eigenvectors); notice that s (number of column eigenvectors) must be the same as the number of columns of PC scores in Scores

Mean

vector of length p with the multivariate mean computed on the original dataset (the dataset on which the PCA was carried out)

Details

Given a set of PCA scores, a set of eigenvectors and the mean of the data on which the PCA was originally computed, this function returns a set of observations in the original data space. This simple function can have many applications (not only in morphometrics) such as

Producing predicted shapes for extreme, unobserved, values of PC scores (e.g., the shape, or other set of variables, corresponding to 3 times the most extreme positive PC1 score), or their combinations (e.g., a value of 0.4 on PC1 and 0.3 on PC2)
Interpret results of analyses carried out on PC scores by converting these back to shapes

Value

The function outputs a matrix n x p of the scores 'back-transformed' into the original variables

Examples


library(MASS)
set.seed(123)
A=mvrnorm(10,mu=rep(1,2),Sigma=diag(2))
# Create a small dataset of 10 observations and 2 variables

Mean=colMeans(A)
PCA=prcomp(A)
# Compute mean and perform PCA

B=reversePCA(Scores=PCA$x, Eigenvectors=PCA$rotation, Mean=Mean)
# 'Revert' the PCA (in this case using all scores and all axes
# to get the same results as the starting data)

round(A,10)==round(B,10)
# Check that all the results are the same
# (rounding to the 10th decimal because of numerical precision)


library(MASS)
set.seed(123)
A=mvrnorm(10,mu=rep(1,2),Sigma=diag(2))
# Create a small dataset of 10 observations and 2 variables

Mean=colMeans(A)
PCA=prcomp(A)
# Compute mean and perform PCA

B=reversePCA(Scores=PCA$x, Eigenvectors=PCA$rotation, Mean=Mean)
# 'Revert' the PCA (in this case using all scores and all axes
# to get the same results as the starting data)

round(A,10)==round(B,10)
# Check that all the results are the same
# (rounding to the 10th decimal because of numerical precision)

User-defined rotation of a landmark configuration

Description

Rotates a single 2D or 3D landmark configuration about the origin of the coordinate system.

Usage

rotate_landmarks(LMdata, rotation, radians = TRUE, center = TRUE)
rotate_landmarks(LMdata, rotation, radians = TRUE, center = TRUE)

Arguments

LMdata

matrix k x p of k landmarks and p dimensions

rotation

vector containing the rotation angle(s) (a single rotation for 2D data, three rotations for 3D data)

radians

a logical on whether the angle(s) are provided in radians (default) or not

center

a logical on whether the rotation should be on the centered configuration

Details

This function can be useful to change the orientation of data for display purposes It could also be used to empirically check the rotation-invariance of an analysis. Notice that this function works on a single configuration of landmarks provided as a k x p matrix of k landmarks in p dimensions (i.e., p is 2 for 2D data and 3 for 3D data) As user-supplied rotation, the function expects a single number in the case of 2D data (rotation on the plane), a vector with three values (corresponding to rotation relative to the three axes) in 3D. If center=TRUE (default), first the configuration of landmarks is centered, then the rotation is performed, and finally the landmark coordinated are translated back to the original position. This accomplishes rotating the landmark configuration around its center

Value

The function outputs a matrix k x p of the original configuration of landmarks rotated according to the user-supplied rotation

Examples

library(ggplot2)

Poly1=scale(t(matrix(c(4,1,3,1,1,2,2,3),nrow=2,ncol=4)),
center=TRUE, scale=FALSE)
# Create a polygon centered at the origin
Poly2=rotate_landmarks(LMdata=Poly1, rotation=10, radians=FALSE)
# Create a new polygon which is the rotated version of the first
# with respect to the origin (rotation of 10 degrees)

BothPolys4Plotting=as.data.frame(rbind(Poly1,Poly2))
BothPolys4Plotting[,3]=c(rep("Original",4),rep("Rotated",4))
BothPolys4Plotting[,4]=rep(1:4,2)
colnames(BothPolys4Plotting)=c("X","Y","Polygon","Landmark")
# Put them together in a way that is easy to plot in ggplot2
GraphLims=range(BothPolys4Plotting[,1:2])
# limits of the plot

ggplot() +
geom_polygon(data=BothPolys4Plotting,
             mapping=aes(x=X, y=Y, group=Polygon, fill=Polygon),
             alpha=0.5) +
geom_point(data=BothPolys4Plotting, aes(x=X, y=Y, color=Polygon)) +
geom_text(data=BothPolys4Plotting, aes(x=X, y=Y, label=Landmark),
          hjust=1, vjust=1, size=4)+
coord_fixed(ratio=1, xlim=GraphLims, ylim=GraphLims)+
theme_classic()
# Plot the two polygons (landmarks are numbered for ease of
# visualization)


library(ggplot2)

Poly1=scale(t(matrix(c(4,1,3,1,1,2,2,3),nrow=2,ncol=4)),
center=TRUE, scale=FALSE)
# Create a polygon centered at the origin
Poly2=rotate_landmarks(LMdata=Poly1, rotation=10, radians=FALSE)
# Create a new polygon which is the rotated version of the first
# with respect to the origin (rotation of 10 degrees)

BothPolys4Plotting=as.data.frame(rbind(Poly1,Poly2))
BothPolys4Plotting[,3]=c(rep("Original",4),rep("Rotated",4))
BothPolys4Plotting[,4]=rep(1:4,2)
colnames(BothPolys4Plotting)=c("X","Y","Polygon","Landmark")
# Put them together in a way that is easy to plot in ggplot2
GraphLims=range(BothPolys4Plotting[,1:2])
# limits of the plot

ggplot() +
geom_polygon(data=BothPolys4Plotting,
             mapping=aes(x=X, y=Y, group=Polygon, fill=Polygon),
             alpha=0.5) +
geom_point(data=BothPolys4Plotting, aes(x=X, y=Y, color=Polygon)) +
geom_text(data=BothPolys4Plotting, aes(x=X, y=Y, label=Landmark),
          hjust=1, vjust=1, size=4)+
coord_fixed(ratio=1, xlim=GraphLims, ylim=GraphLims)+
theme_classic()
# Plot the two polygons (landmarks are numbered for ease of
# visualization)

Compare Escoufier RV coefficient between groups

Description

Performs permutation tests for the difference in Escoufier RV between groups. For multiple groups, performs pairwise comparisons between all pairs of groups.

Usage

RVcomparison(Block1, Block2, group, perm = 999, center = TRUE)
RVcomparison(Block1, Block2, group, perm = 999, center = TRUE)

Arguments

Block1, Block2

Matrices or data frames containing each block of variables (observations in rows, variables in columns).

group

factor or vector indicating group membership for observations.

perm

number of permutations for the test

center

whether the groups should be mean-centered prior to the test

Details

This function is one of the solutions proposed by Fruciano et al. 2013 to deal with the fact that values of Escoufier RV coefficient (Escoufier 1973), which is routinely used to quantify the levels of association between multivariate blocks of variables (landmark coordinates in the case of morphometric data, but it might be any multivariate dataset) are not comparable across groups with different number of observations/individuals (Smilde et al. 2009; Fruciano et al. 2013). The solution is a permutation test. This test was originally implemented by Adriano Franchini in the Java program RVcomparator, of which this function is an updated and improved version

Value

A data frame with one row per pairwise comparison and the following columns:

group1: Name of the first group in the comparison
group2: Name of the second group in the comparison
Observed_RV_group1: Observed Escoufier RV for the first group in the comparison
Observed_RV_group2: Observed Escoufier RV for the second group in the comparison
Absolute_difference_in_RV: Absolute difference in the observed Escoufier RV between the two groups
p_value: p value of the permutation test

For multiple groups, the data frame includes additional columns identifying the groups being compared.

Notice

The values of RV for each of the groups the function outputs are the observed ones, and can be reported but their value should not be compared. If one wants to obtain comparable values one solution (Fruciano et al 2013) is to obtain rarefied estimates, which can be done with the function RVrarefied in this package

Citation

If you use this function please cite both Fruciano et al. 2013 (for using the rarefaction procedure) and Escoufier 1973 (because the procedure is based on Escoufier RV)

References

Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.

Fruciano C, Franchini P, Meyer A. 2013. Resampling-Based Approaches to Study Variation in Morphological Modularity. PLoS ONE 8:e69376.

Smilde AK, Kiers HA, Bijlsma S, Rubingh CM, van Erk MJ. 2009. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 25:401-405.

Examples


library(MASS)
set.seed(123)

# Create sample data with different correlation structures
S1 = diag(50)  # Uncorrelated variables for group 1
S2 = diag(50)
S2[11:50, 11:50] = 0.3  # Some correlation in second block for group 2
S2 = (S2 + t(S2)) / 2  # Ensure symmetry
diag(S2) = 1

# Generate data for two groups
A = mvrnorm(30, mu = rep(0, 50), Sigma = S1)
B = mvrnorm(40, mu = rep(0, 50), Sigma = S2)

# Combine data and create group labels
combined_data1 = A[, 1:20]
combined_data2 = A[, 21:50]
combined_data1 = rbind(combined_data1, B[, 1:20])
combined_data2 = rbind(combined_data2, B[, 21:50])
groups = c(rep("GroupA", 30), rep("GroupB", 40))

# Perform RV comparison
result = RVcomparison(combined_data1, combined_data2,
                      group = groups, perm = 99)
print(result)

# Example with three groups for pairwise comparisons
C = mvrnorm(25, mu = rep(0, 50), Sigma = diag(50))
combined_data1_3grp = rbind(combined_data1, C[, 1:20])
combined_data2_3grp = rbind(combined_data2, C[, 21:50])
groups_3 = c(groups, rep("GroupC", 25))

result_3grp = RVcomparison(combined_data1_3grp, combined_data2_3grp, 
                          group = groups_3, perm = 99)
print(result_3grp)

library(MASS)
set.seed(123)

# Create sample data with different correlation structures
S1 = diag(50)  # Uncorrelated variables for group 1
S2 = diag(50)
S2[11:50, 11:50] = 0.3  # Some correlation in second block for group 2
S2 = (S2 + t(S2)) / 2  # Ensure symmetry
diag(S2) = 1

# Generate data for two groups
A = mvrnorm(30, mu = rep(0, 50), Sigma = S1)
B = mvrnorm(40, mu = rep(0, 50), Sigma = S2)

# Combine data and create group labels
combined_data1 = A[, 1:20]
combined_data2 = A[, 21:50]
combined_data1 = rbind(combined_data1, B[, 1:20])
combined_data2 = rbind(combined_data2, B[, 21:50])
groups = c(rep("GroupA", 30), rep("GroupB", 40))

# Perform RV comparison
result = RVcomparison(combined_data1, combined_data2,
                      group = groups, perm = 99)
print(result)

# Example with three groups for pairwise comparisons
C = mvrnorm(25, mu = rep(0, 50), Sigma = diag(50))
combined_data1_3grp = rbind(combined_data1, C[, 1:20])
combined_data2_3grp = rbind(combined_data2, C[, 21:50])
groups_3 = c(groups, rep("GroupC", 25))

result_3grp = RVcomparison(combined_data1_3grp, combined_data2_3grp, 
                          group = groups_3, perm = 99)
print(result_3grp)

Rarefied version of Escoufier RV coefficient

Description

Computes a rarefied estimate of the Escoufier RV coefficient to account for the dependence on sample size of RV, so that RV can be compared among groups with different number of observations (sample sizes)

Usage

RVrarefied(Block1, Block2, reps = 1000, samplesize, group = NULL, CI = 0.95)
RVrarefied(Block1, Block2, reps = 1000, samplesize, group = NULL, CI = 0.95)

Arguments

Block1, Block2

Matrices or data frames containing each block of variables (observations in rows, variables in columns).

reps

number of resamplings to obtain the rarefied estimate

samplesize

sample size to which the rarefaction procedure is carried out

group

factor or vector indicating group membership for observations. If NULL (default), a single-group analysis is performed. If provided, the analysis is performed separately for each group.

CI

confidence interval level (default 0.95). Must be between 0 and 1.

Details

This function computes a rarefied estimate of Escoufier RV coefficient as suggested by Fruciano et al 2013 - Plos One This can be useful to compare RV among groups with the same variables but different sample sizes (as RV depends on sample size, see Fruciano et al 2013, where this procedure is described). The idea is the one rarefies the two groups at the same sample size. Following the approach in Fruciano et al. 2013 - Plos One, "rarefaction" is meant resampling to a smaller sample size with replacement (like in bootstrap).

Value

The function outputs a list with the following elements:

results: A data frame with columns 'group', 'Mean', 'Median', 'CI_min', 'CI_max'. One row per group. For single-group analysis, group will be "All".
AllRarefiedSamples: A list with all RV values obtained using the rarefaction procedure for each group. For single-group analysis, this will be a list with one element named "All".

The returned object has class "EscoufierRVrarefy" and comes with associated S3 methods for convenient display and visualization:

print.EscoufierRVrarefy: Prints a formatted summary of results including confidence interval overlap assessment for multiple groups
plot.EscoufierRVrarefy: Creates a confidence interval plot using ggplot2

Notice

the function does NOT perform GPA on each rarefied sample this may or may not make a difference in estimates. In many cases, it will probably not make much difference (e.g., Fig. 2 in Fruciano et al 2013)

Citation

If you use this function please cite both Fruciano et al. 2013 (for using the rarefaction procedure) and Escoufier 1973 (because the procedure is based on Escoufier RV)

References

Escoufier Y. 1973. Le Traitement des Variables Vectorielles. Biometrics 29:751-760.

Fruciano C, Franchini P, Meyer A. 2013. Resampling-Based Approaches to Study Variation in Morphological Modularity. PLoS ONE 8:e69376.

Examples

library(MASS)
set.seed(123)
Pop=mvrnorm(100000,mu=rep(0,100), Sigma=diag(100))
# Create a population of 100,000 'individuals'
# as multivariate normal random data
# We will consider the first 20 columns as the first
# block of variables, and the following one as the second block

A=Pop[1:50,]
B=Pop[501:700,]
# Take two groups (A and B)
# from the same population (there should be no difference
# between them)

EscoufierRV(A[,1:20],A[,21:ncol(A)])
EscoufierRV(B[,1:20],B[,21:ncol(B)])
# Notice how we obtain very different values of Escoufier RV
# (this is because they two groups have very different
# sample sizes, one 50 observations, the other 200)

RarA=RVrarefied(A[,1:20],A[,21:ncol(A)],reps=1000,samplesize=30)
RarB=RVrarefied(B[,1:20],B[,21:ncol(A)],reps=1000,samplesize=30)
RarA$results  # Data frame with Mean, Median, CI_min, CI_max
RarB$results  # Data frame with Mean, Median, CI_min, CI_max
# Rarefying both groups at the same sample size
# (in this case 30)
# it is clear that the two groups have very similar levels
# of association between blocks

# Multi-group analysis with custom CI
combined_data = rbind(A, B)
group_labels = c(rep("GroupA", nrow(A)), rep("GroupB", nrow(B)))
multi_result = RVrarefied(combined_data[,1:20], combined_data[,21:ncol(combined_data)], 
                         reps=1000, samplesize=30, group=group_labels, CI=0.90)
print(multi_result$results)  # Data frame with results for each group
# Columns: group, Mean, Median, CI_min, CI_max

library(MASS)
set.seed(123)
Pop=mvrnorm(100000,mu=rep(0,100), Sigma=diag(100))
# Create a population of 100,000 'individuals'
# as multivariate normal random data
# We will consider the first 20 columns as the first
# block of variables, and the following one as the second block

A=Pop[1:50,]
B=Pop[501:700,]
# Take two groups (A and B)
# from the same population (there should be no difference
# between them)

EscoufierRV(A[,1:20],A[,21:ncol(A)])
EscoufierRV(B[,1:20],B[,21:ncol(B)])
# Notice how we obtain very different values of Escoufier RV
# (this is because they two groups have very different
# sample sizes, one 50 observations, the other 200)

RarA=RVrarefied(A[,1:20],A[,21:ncol(A)],reps=1000,samplesize=30)
RarB=RVrarefied(B[,1:20],B[,21:ncol(A)],reps=1000,samplesize=30)
RarA$results  # Data frame with Mean, Median, CI_min, CI_max
RarB$results  # Data frame with Mean, Median, CI_min, CI_max
# Rarefying both groups at the same sample size
# (in this case 30)
# it is clear that the two groups have very similar levels
# of association between blocks

# Multi-group analysis with custom CI
combined_data = rbind(A, B)
group_labels = c(rep("GroupA", nrow(A)), rep("GroupB", nrow(B)))
multi_result = RVrarefied(combined_data[,1:20], combined_data[,21:ncol(combined_data)], 
                         reps=1000, samplesize=30, group=group_labels, CI=0.90)
print(multi_result$results)  # Data frame with results for each group
# Columns: group, Mean, Median, CI_min, CI_max

Compute scaled variance of eigenvalues

Description

Compute estimates of the scaled variance of eigenvalues using only the positive eigenvalues

Usage

scaled_variance_of_eigenvalues(
  data_matrix,
  boot = 999,
  rarefy = FALSE,
  shrinkage = FALSE
)
scaled_variance_of_eigenvalues(
  data_matrix,
  boot = 999,
  rarefy = FALSE,
  shrinkage = FALSE
)

Arguments

data_matrix

Matrix or data frame containing the original data (observations in rows, variables in columns).

boot

number of bootstrap resamples (no bootstrap if 0)

rarefy

either a logical to determine whether rarefaction will be performed or a number indicating the sample size at which the samples will be rarefied

shrinkage

logical on whether the analysis should be based on a covariance matrix obtained through linear shrinkage

Details

The function allows computing the scaled variance of eigenvalues (Pavlicev et al. 2009) under a variety of settings. The scaled variance of eigenvalues is a commonly used index of morphological integration. Its value is comprised between 0 and 1, with larger values suggesting stronger integration.

Only positive eigenvalues are used in the computations used in this function.

The function employes two possible strategies to obtain eigenvalues:

a singular value decomposition of the data matrix (default)
an eigenvalue decomposition of the covariance matrix estimated using linear shrinkage (option shrinkage=TRUE; Ledoit & Wolf 2004)

Further, the function allows obtaining bootstrapped estimates by setting boot to the number of bootstrap replicates (resampling with replacement)

It is also possible to obtain rarefied estimates by setting rarefy to the desired sample size. This is useful when comparing the scaled variance of eigenvalues across multiple groups with different sample sizes. In this case, the suggestion is to use either the smallest sample size or less

Using a bootstrap estimate with the singular value decomposition approach represents a good compromise between computation time and accuracy. This should be complemented by rarefaction to the smallest sample size (or lower) in case one wants to compare the value obtained across different groups.

Value

If boot=0 the function outputs a vector containing the scaled variance of eigenvalues and the number of dimensions used in the computations. If, instead, boot>0 (recommended) the function outputs a list containing

the mean scaled variance of eigenvalues across all bootstrap samples
the median number of dimensions used across all bootstrap samples
the 95 eigenvalues and dimensions
the scaled variance of eigenvalues and dimensions used for each of the bootstrap replicates

Notice

When boot>0 the rarefied estimates are based on sampling with replacement (bootstrap). However, if boot=0, then a single rarefied estimate is obtained by sampling without replacement. In this case, the user should repeat the same operation multiple times (e.g., 100) and take the average of the scaled variance of eigenvalues obtained.

Also notice that using the shrinkage-based estimation of the covariance matrix requires longer computational time and memory. This option requires the package nlshrink

Computational details

For the computation, this function uses only positive eigenvalues (which are also used to identify data dimensionality). The eigenvalues are first scaled by dividing them by their sum (Young 2006), then their variance is computed as population variance (rather than sample variance; see Haber 2011). Finally, this value is standardized to a scale between 0 and 1 by dividing it by its maximum theoretical value of (p-1)/p^2 (where p is the number of dimensions) - this is the same scaling used in the software MorphoJ (Klingenberg 2011).

References

Ledoit O, Wolf M. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365-411.

Young NM. 2006. Function, ontogeny and canalization of shape variance in the primate scapula. Journal of Anatomy 209:623-636.

Pavlicev M, Cheverud JM, Wagner GP. Measuring Morphological Integration Using Eigenvalue Variance. Evolutionary Biology 36(1):157–170.

Haber A. 2011. A Comparative Analysis of Integration Indices. Evolutionary Biology 38:476-488.

Klingenberg CP. 2011. MorphoJ: an integrated software package for geometric morphometrics. Molecolar Ecology Resources 11:353-357.

Perform a test of the angle between two multivariate vectors

Description

This function performs a test for the angle between two multivariate vector, optionally allowing for "flipping" one of the axes

Usage

TestOfAngle(x, y, flip = TRUE)
TestOfAngle(x, y, flip = TRUE)

Arguments

x, y

numerical vectors

flip

logical stating whether (TRUE, default) axes should be "flipped" in case the angle is larger than 90 degrees (see Details)

Details

This function is useful to perform a test of the angle between two multivariate vectors (e.g., principal component eigenvectors computed from two covariance matrices, such as covariance matrices of two different species). The function uses internally angleTest from the package Morpho by Stefan Schlager. That function, in turn, uses the formulas in Li 2011 to compute significance, rather than using permutations. This is the same approach also implemented in MorphoJ (Klingenberg 2011). There is no special advantage in using this function relative to the original in the package Morpho, except that this function outputs angles both in radians and degrees and that it optionally allow to "flip" one of the two axes. This is useful in the cases where the direction of one axis is arbitrary, such as in PCA (where positive and negative values are interchangable and recomputing PCA might result in positive scores for the observations which were negative (and vice versa). In this situation, angles larger than 90 degrees are not meaningful and one way of dealing with this is, by choosing the option flip=TRUE (default), to change the sign of one of the two vectors ("flip").

Value

The function outputs a list with

angle

angle between vectors in radians

angle_deg

angle between vectors in degrees

p.value

p value

critical_angle

angle below which two vectors of the same number of dimensions as the ones being tested would be significant

Notice

The p value is for the probability that the angle between two random vectors is smaller or equal to the one computed from the vectors provided (x, y). This means that significant values indicate that the two provided vectors are "significantly similar", whereas non-significant values means that the two vectors are substantially different

References

Klingenberg CP. 2011. MorphoJ: an integrated software package for geometric morphometrics. Mol Ecol Resour 11:353-357.

Li S. 2011. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics 4:66-70.

Schlager S. 2016. Morpho: Calculations and Visualisations Related to Geometric Morphometrics.

Examples

library(MASS)
library(clusterGeneration)
set.seed(123)
Data=mvrnorm(500,mu=rep(1,50),Sigma=genPositiveDefMat(dim=50)$Sigma)
A=Data[1:250,]
B=Data[251:500,]
# Create two groups of observations (e.g., specimens)
# from the same multivariate normal distribution
# with the same starting covariance matrix

PCA_A=prcomp(A)
PCA_B=prcomp(B)
# Perform separate PCAs for each of the two groups of observations

TestOfAngle(PCA_A$rotation[,1],PCA_B$rotation[,1], flip=TRUE)
# Test for the angle between the two first eigenvectors

library(MASS)
library(clusterGeneration)
set.seed(123)
Data=mvrnorm(500,mu=rep(1,50),Sigma=genPositiveDefMat(dim=50)$Sigma)
A=Data[1:250,]
B=Data[251:500,]
# Create two groups of observations (e.g., specimens)
# from the same multivariate normal distribution
# with the same starting covariance matrix

PCA_A=prcomp(A)
PCA_B=prcomp(B)
# Perform separate PCAs for each of the two groups of observations

TestOfAngle(PCA_A$rotation[,1],PCA_B$rotation[,1], flip=TRUE)
# Test for the angle between the two first eigenvectors

Package 'GeometricMorphometricsMix'

Help Index

Test the significance of the adjusted Rand index

Description

Usage

Arguments

Details

Value

Notice

Citation

References

Examples

Body arching vector from brown trout study

Description

Usage

Format

Details

References

See Also

Examples

Brown trout landmark data

Description

Usage

Format

Details

References

Examples

BTailTest for difference in disparity/morphospace occupation

Description

Usage

Arguments

Details

Value

Citation

References

See Also

Compute the critical angle for the test of the angle between two multivariate vectors

Description

Usage

Arguments

Details

Value

Notice

References

See Also

Examples

Resampling-based estimates (bootstrap or rarefaction) of disparity / morphospace occupation

Description

Usage

Arguments

Details

Value

Parallelization

Average estimate

Input data types

Note

References

See Also

Examples

Permutation test of difference in disparity/morphospace occupation

Description

Usage

Arguments

Details

Value

Notice

References

See Also

Examples

Bootstrapped distance between two arrays

Description

Usage

Arguments

Details

Value

Escoufier RV coefficient

Description

Usage

Arguments

Details