Comment on “Beyond principal component analysis: Nonparametric and nonlinear approaches for robust analysis of Gafsa basin groundwater” by Kumada and Takefuji (2025)

Boschetti, Tiziano; Rossi, Mattia

doi:10.1016/j.gexplo.2026.108007

We thank the authors for their interest in our work and for fostering constructive discussion on multivariate methods applied to hydrogeochemical datasets from the Gafsa Basin, Tunisia (Kumada and Takefuji, 2025). In their abstract, Kumada and Takefuji (2025) state that “complementary methodologies—Feature Agglomeration, Independent Component Analysis, and High Variance Gene Selection—create a more comprehensive analytical framework capable of capturing nonlinear relationships, hierarchical structures, and statistically independent variation sources that PCA might overlook.” We encourage the authors to substantiate this claim with quantitative evidence, such as figures or tables derived from the Gafsa Basin dataset, which is publicly available as supplementary material in Boschetti et al. (2025).

We further encourage a more detailed discussion of the assumptions and limitations inherent to the proposed methods. For example, Independent Component Analysis (ICA) assumes non-Gaussian source signals and requires sufficiently large sample sizes to yield stable and interpretable components. Miki et al. (2025) applied ICA to a dataset with 135 samples and 19 variables (sample-to-variable ratio n/p ≈ 7), whereas the Gafsa dataset comprises 33 samples and 19 variables (n/p ≈ 1.7). Under such conditions, ICA is more appropriately regarded as an exploratory rather than an inferential tool.

We acknowledge that Principal Component Analysis (PCA) has well-known limitations. However, in hydrogeochemical studies characterized by relatively small sample sizes and moderate-to-high dimensionality, alternative probabilistic approaches such as Exploratory Factor Analysis (EFA) are often impractical. EFA relies on iterative parameter estimation procedures (e.g., maximum likelihood), which in low n/p or n < p settings frequently result in non-convergence, unstable solutions, or improper estimates such as Heywood cases (Widaman, 1993).
Moreover, EFA requires adequate sampling conditions, commonly assessed through measures such as the Kaiser–Meyer–Olkin index and Bartlett's test of sphericity, which are rarely satisfied in sparse environmental datasets. In contrast, PCA provides a mathematically well-defined and computationally stable framework for dimensionality reduction. By applying a deterministic algebraic transformation of the variance–covariance (or correlation) matrix, PCA yields a unique solution without requiring assumptions about latent variable structures or iterative model fitting (Jolliffe and Cadima, 2016; Velicer and Jackson, 1990). For these reasons, PCA is often preferred for feature extraction and exploratory analysis in contexts where computational robustness and reproducibility take precedence over formal inference (e.g., identification of latent constructs).

Accordingly, in Boschetti et al. (2025), PCA was intentionally employed for exploratory and descriptive purposes rather than for inferential or causal ones. Component interpretation was constrained by independent hydrogeological and geochemical evidence, including the a posteriori use of a categorical variable related to predetermined pollution sources.

PCA has a long-standing and well-established tradition in hydrogeochemical research (e.g., Vriend, 1990; Boschetti et al., 2003, 2005; Cortecci et al., 2008; Golla, 2018; Barbieri et al., 2021). Notably, the combined use of PCA and hierarchical cluster analysis (HCA) has consistently proven effective in identifying distinct hydrochemical water types from springs and wells, even in highly complex settings, such as hydrothermal systems in arid climates and under conditions of limited sample availability (e.g., Awaleh et al., 2015, 2017, 2018, 2024). Conversely, we are not aware of any applications of High Variance Gene Selection (HVGS) to hydrogeochemical datasets.
HVGS was developed primarily for high-dimensional omics data, where the number of variables is orders of magnitude greater than the number of observations. In hydrogeochemistry, sample sizes are typically constrained by the availability of wells or springs, potentially limiting both the applicability and statistical stability of HVGS-based approaches.

In summary, while PCA, like any multivariate technique, has limitations, it remains a valuable exploratory tool when applied with appropriate caution, particularly in low n/p settings. Other approaches, including ICA and Feature Agglomeration, may provide complementary perspectives but are likewise subject to methodological constraints and stability issues when sample sizes are limited. More broadly, we acknowledge that all multivariate approaches are sensitive to sample size and data structure, and that no single method can fully compensate for instability in low n/p datasets. Consistent with Yang and Cheng (2015), we regard multivariate variable-reduction techniques as complementary rather than mutually exclusive, with their suitability depending on data structure, sample size, and study objectives.
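
The description of PCA given above, as a deterministic algebraic transformation of the correlation matrix, can be illustrated with a minimal Python sketch. It is not the workflow of Boschetti et al. (2025): the data are synthetic random numbers that only share the shape of the Gafsa dataset (33 samples, 19 variables), NumPy is assumed to be available, and the function names `pca_correlation` and `bartlett_sphericity_statistic` are hypothetical names chosen for this illustration. The second function computes only the standard Bartlett sphericity statistic and its degrees of freedom, not a p-value.

```python
import numpy as np


def pca_correlation(X):
    """PCA via eigendecomposition of the correlation matrix.

    X has shape (n samples, p variables). No iterative fitting or random
    initialisation is involved, so the same X always gives the same result
    (eigenvector signs are arbitrary but fixed by the solver).
    """
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardise each variable
    R = (Z.T @ Z) / (n - 1)                           # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)              # symmetric solver, ascending order
    order = np.argsort(eigvals)[::-1]                 # re-sort by decreasing variance
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    scores = Z @ eigvecs                              # sample coordinates on the PCs
    return eigvals, eigvecs, scores


def bartlett_sphericity_statistic(X):
    """Bartlett's sphericity statistic and its degrees of freedom."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)                  # log-determinant of R
    chi2_stat = -((n - 1) - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return chi2_stat, dof


# Synthetic stand-in data (assumption): NOT the published Gafsa measurements.
rng = np.random.default_rng(0)
n_samples, n_variables = 33, 19
X = rng.normal(size=(n_samples, n_variables))
X[:, :6] += 2.0 * rng.normal(size=(n_samples, 1))     # give six variables a shared signal

eigvals, eigvecs, scores = pca_correlation(X)
explained = eigvals / eigvals.sum()
chi2_stat, dof = bartlett_sphericity_statistic(X)

print(f"n/p = {n_samples / n_variables:.1f}")
print("explained variance ratio, first 3 PCs:", np.round(explained[:3], 3))
print(f"Bartlett statistic = {chi2_stat:.1f} on {dof} degrees of freedom")
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the explained-variance ratios follow directly from the decomposition. The sketch does not address how stable the components are under resampling, which is the separate concern raised above for low n/p datasets.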

Comment on “Beyond principal component analysis: Nonparametric and nonlinear approaches for robust analysis of Gafsa basin groundwater” by Kumada and Takefuji (2025) / Boschetti, T., Rossi, M.. - In: JOURNAL OF GEOCHEMICAL EXPLORATION. - ISSN 0375-6742. - 284:(2026). [10.1016/j.gexplo.2026.108007]