Cluster analysis is the generic name for techniques that aggregate n units into k groups, where k is usually much smaller than n. Classification is useful in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics and market research. More generally, cluster analysis is appropriate whenever we need to identify groups of units that behave similarly. The main objective of this work is to find an effective cluster analysis method that can be applied in different frameworks, and in particular in market research. To this end, we compare different methods in order to identify the strongest classification method, if one exists, given the data structure, so as to obtain an optimal allocation for each data set. We compare existing methods with new ones based on robust approaches, which have shown high efficiency in many simulations performed so far. The computations were carried out in MATLAB. The thesis is structured as follows. The first chapter focuses on the problem of identifying outliers and on how they affect the different classification techniques. In particular we consider: a) k-means, which serves as the reference benchmark given its widespread use in the economic sciences; b) trimmed k-means, a robust version of k-means developed in the late 1990s; c) TCLUST, one of the robust methods currently attracting most research effort in the statistical literature; d) the Forward Search, a robust method developed in large part within the Department of Economics of the University of Parma and the London School of Economics, whose potential for classification purposes is still largely unexplored.
The second chapter tests these methods on simulated data sets generated from various types of distributions with different degrees of overlap among observations. The purpose is to understand which method, and which calibration of its parameters, yields the best classification. The classification results are then measured through performance indices of correct allocation, which make it possible to compare the different methods. In the third chapter we test the methods on a real data set of marketing interest. Finally, the thesis concludes with an appendix describing the computational contributions of the work.
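To make the contrast between k-means and trimmed k-means concrete, here is a minimal sketch in Python. It is not the thesis's MATLAB implementation. It assumes the usual formulation in which a fixed fraction `alpha` of the units farthest from their nearest centre is discarded at each iteration, so that `alpha=0` reduces to ordinary k-means. The function names (`trimmed_kmeans`, `allocation_rate`), the simulated data and the simple correct-allocation index are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np


def _assign(X, centers, n_keep):
    """Assign units to the nearest centre; trim the farthest ones (label -1)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    dmin = d2[np.arange(len(X)), nearest]
    keep = np.argsort(dmin)[:n_keep]  # units with the smallest distances
    labels = np.full(len(X), -1)
    labels[keep] = nearest[keep]
    return labels, dmin[keep].sum()


def trimmed_kmeans(X, k, alpha=0.0, n_iter=100, n_starts=10, seed=0):
    """Lloyd-style trimmed k-means (illustrative); alpha=0 gives plain k-means."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_keep = n - int(np.floor(alpha * n))  # number of untrimmed units
    best_obj, best_labels, best_centers = np.inf, None, None
    for _start in range(n_starts):
        centers = X[rng.choice(n, size=k, replace=False)].astype(float)
        for _it in range(n_iter):
            labels, _obj = _assign(X, centers, n_keep)
            new_centers = centers.copy()
            for j in range(k):
                if np.any(labels == j):  # leave an empty cluster's centre unchanged
                    new_centers[j] = X[labels == j].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels, obj = _assign(X, centers, n_keep)
        if obj < best_obj:  # keep the start with the smallest trimmed objective
            best_obj, best_labels, best_centers = obj, labels, centers
    return best_labels, best_centers


def allocation_rate(true, pred, k):
    """Share of correctly allocated units, maximised over cluster relabellings."""
    best = 0.0
    for perm in itertools.permutations(range(k)):
        mapping = np.array(list(perm) + [-1])  # index -1 keeps trimmed units at -1
        best = max(best, float(np.mean(mapping[pred] == true)))
    return best


# Simulated example: two Gaussian groups plus 10 remote outliers (true label -1).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(10.0, 1.0, size=(100, 2)),
    rng.uniform(50.0, 100.0, size=(10, 2)),
])
true = np.repeat([0, 1, -1], [100, 100, 10])

labels, centers = trimmed_kmeans(X, k=2, alpha=0.05)
print("trimmed k-means allocation rate:", allocation_rate(true, labels, k=2))
```

Calling `trimmed_kmeans(X, k=2, alpha=0.0)` on the same data gives the untrimmed benchmark, in which the outliers are free to pull a centre away from the two groups. The trimming level `alpha` is the calibration parameter in this sketch.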
A comparison of different classification methods / Morelli, G. - (2013 Apr 11).
A comparison of different classification methods
MORELLI, Gianluca
2013-04-11
File: G_Morelli_phd_thesis.pdf (Adobe PDF, 6.09 MB), under embargo until 01/01/2101. License: not specified.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


