In this paper we introduce new similarity indices for variables with multiple categories. The proposed measures are conceptually simple and straightforward to compute. Unlike traditional similarity indices, they also take into account the sample frequency of the modalities of each attribute. This feature is useful when dealing with rare categories, since a pair of observations sharing a rare category arguably deserves a different evaluation from a pair sharing a widespread one. It also helps to identify under-represented groups in cluster analysis. The weighted index comes in two versions: one for independent categorical variables and one for dependent variables. The suitability of the proposed indices is demonstrated using both simulated and real-world data sets.
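To make the idea of frequency weighting concrete, the following is a minimal sketch and not the index proposed in the paper, whose formula is not given in this abstract. It assumes a simple matching scheme in which a shared category contributes `1 - f`, where `f` is that category's relative frequency in the sample, so that a match on a rare category counts more than a match on a widespread one. The function names, the weighting rule, and the toy data are all illustrative assumptions, and the sketch treats variables as independent.

```python
from collections import Counter


def relative_frequencies(rows):
    """Return, for each variable, a dict mapping category -> relative frequency."""
    n = len(rows)
    n_vars = len(rows[0])
    return [
        {cat: count / n for cat, count in Counter(row[j] for row in rows).items()}
        for j in range(n_vars)
    ]


def weighted_similarity(a, b, freqs):
    """Average over variables of (1 - f) for each shared category, 0 for mismatches.

    Assumed weighting, for illustration only: rarer shared categories
    contribute more than common ones.
    """
    total = 0.0
    for j, (cat_a, cat_b) in enumerate(zip(a, b)):
        if cat_a == cat_b:
            total += 1.0 - freqs[j][cat_a]
    return total / len(freqs)


# Toy sample with one variable: 8 observations in a common category, 2 in a rare one.
rows = [("common",)] * 8 + [("rare",)] * 2
freqs = relative_frequencies(rows)

sim_common = weighted_similarity(rows[0], rows[1], freqs)  # both "common"
sim_rare = weighted_similarity(rows[8], rows[9], freqs)    # both "rare"
sim_mixed = weighted_similarity(rows[0], rows[8], freqs)   # mismatch
```

Under this assumed weighting, the pair sharing the rare category scores higher than the pair sharing the common one, whereas an unweighted simple matching coefficient would score both pairs identically.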
|Ministerial classification:||Journal article|
|Appears in categories:||1.1 Journal article|