This book is about using graphs to explore and model continuous multivariate data. Such data are often modelled using the multivariate normal distribution and, indeed, there is a literature of weighty statistical tomes presenting the mathematical theory of this activity. Our book is very different. Although we use the methods described in these books, we focus on ways of exploring whether the data do indeed have a normal distribution. We emphasize outlier detection, transformations to normality and the detection of clusters and unsuspected influential subsets. We then quantify the effect of these departures from normality on procedures such as discrimination and cluster analysis.
The normal distribution is central to our book because, subject to our exploration of departures, it provides useful models for many sets of data. However, the standard estimates of the parameters, especially the covariance matrix of the observations, are highly sensitive to the presence of outliers. This is both a blessing and a curse. It is a blessing because, if we estimate the parameters with the outliers excluded, their effect is appreciable and apparent if we then include them for estimation. It is, however, a curse because it can be hard to detect which observations are outliers. We use the forward search for this purpose. The search starts from a small, robustly chosen subset of the data that excludes outliers. We then move forward through the data, adding observations to the subset used for parameter estimation. As we move forward we monitor statistical quantities such as parameter estimates, Mahalanobis distances and test statistics. In this way we can immediately detect the presence of outliers and clusters of observations and determine their effect on inferences drawn from the data. We can then improve our models.
This book is a companion to "Robust Diagnostic Regression Analysis" by Atkinson and Riani, published by Springer in 2000. In the preface to that book we wrote "This bald statement masks the excitement we feel about the methods we have developed based on the forward search. We are continuously amazed, each time we analyze a new set of data, by the amount of information the plots generate and the insights they provide". Although more years have passed than we intended before the completion of our new book, in which process we have become three authors rather than two, this statement of our enthusiasm still holds.
The first chapter of this book introduces the forward search and contains four examples of its use for multivariate data analysis. We show how outliers and groups in the data can be identified and introduce some important plots. The second chapter, on theory, is in two parts. The first gives the distributional theory for a single sample from a multivariate normal distribution, with particular emphasis on the distributions of various Mahalanobis distances. The second part of the chapter contains a detailed description of the forward search and its properties. An understanding of all details of this chapter is not essential for an appreciation of the uses of the forward search in the later chapters. If you feel you know enough statistical theory for your present purposes, continue to Chapter 3.
The next three chapters describe methods for a sample believed to be from a single multivariate normal distribution. Chapter 3 continues, extends and amplifies the analyses of the four examples from Chapter 1. In Chapter 4 we apply the forward search to multivariate transformations to normality. Analyses of three of the examples from earlier chapters are supplemented by the analysis of three new examples. Chapter 5 contains our first use of the forward search in a procedure depending on multivariate normality, that of principal components analysis. We are particularly interested in how the components are affected by outliers and other unsuspected structure in the data.
The two following chapters describe the forward search for data in several groups rather than one. In Chapter 6 the subject is discriminant analysis and in Chapter 7 cluster analysis, where the number of groups, as well as their composition, is unknown. Here the forward search enables us to see how individual observations are distorting the boundaries between our putative clusters. Finally, in Chapter 8 we consider the analysis of spatial data, which has something in common with the regression analysis of our earlier book.
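The forward search outlined above (start from a small outlier-free subset, grow it one observation at a time, and monitor Mahalanobis distances along the way) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `forward_search` and `mahalanobis_sq`, the parameter `m0`, and in particular the rule for choosing the initial subset (the observations closest to the coordinate-wise median in MAD-scaled units) are assumptions made here for illustration, since the abstract does not specify how the robust starting subset is chosen.

```python
import numpy as np


def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distance of every row of X from (mean, cov)."""
    diff = X - mean
    # pinv guards against a singular covariance when the subset is very small
    return np.einsum("ij,ij->i", diff @ np.linalg.pinv(cov), diff)


def forward_search(X, m0=None):
    """Illustrative forward search (a sketch, not the book's algorithm).

    Returns
    -------
    entered_at : array of length n; subset size at which each unit first joined
    min_outside : list of (m, distance); for each subset size m, the smallest
        Mahalanobis distance among units not yet in the subset
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if m0 is None:
        m0 = p + 1

    # ASSUMPTION: a simple stand-in for a "robustly chosen" starting subset,
    # namely the m0 units closest to the coordinate-wise median (MAD-scaled).
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad[mad == 0] = 1.0
    start_score = (((X - med) / mad) ** 2).sum(axis=1)
    subset = np.argsort(start_score)[:m0]

    entered_at = np.full(n, -1)
    entered_at[subset] = m0
    min_outside = []

    for m in range(m0, n):
        # Estimate parameters from the current subset only
        mean = X[subset].mean(axis=0)
        cov = np.atleast_2d(np.cov(X[subset], rowvar=False))
        d2 = np.maximum(mahalanobis_sq(X, mean, cov), 0.0)

        # Monitor the closest unit that is still outside the subset
        outside = np.ones(n, dtype=bool)
        outside[subset] = False
        min_outside.append((m, float(np.sqrt(d2[outside].min()))))

        # Move forward: the next subset is the m + 1 units with smallest distances
        subset = np.argsort(d2)[: m + 1]
        newly_in = subset[entered_at[subset] == -1]
        entered_at[newly_in] = m + 1

    return entered_at, min_outside
```

Plotting the second element of each pair in `min_outside` against `m` gives one kind of monitoring plot: a sharp rise in the minimum distance outside the subset suggests that the remaining units are outliers relative to the model fitted so far. The distances here are not rescaled for the fact that the estimates come from a subset, so the sketch shows the mechanics only, not calibrated inference.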
Exploring Multivariate Data with the Forward Search / Atkinson, A. C.; Riani, Marco; Cerioli, Andrea. (2004), pp. 1-621.
Exploring Multivariate Data with the Forward Search
RIANI, Marco; CERIOLI, Andrea
2004-01-01