Robustness issues in text mining / Turchi, M.; Perrotta, D.; Riani, Marco; Cerioli, Andrea. - Print. - (2013), pp. 263-272.

Robustness issues in text mining

RIANI, Marco; CERIOLI, Andrea
2013-01-01

Abstract

We extend the Forward Search approach for robust data analysis to address problems in text mining. In this domain, datasets are collections of an arbitrary number of documents, which are represented as vectors of thousands of elements according to the vector space model. When the number of variables v is this large and the dataset size n is smaller, perhaps by orders of magnitude, the traditional Mahalanobis metric cannot be used as a similarity distance between documents. We show that by using the Forward Search to monitor the cosine similarity, a metric for vector space models widely used in text mining, it is possible to robustly estimate a centroid for a document collection and to order the documents so that the most dissimilar ones (possibly outliers for that collection) are left at the end. We also show that the presence of several groups of documents in the collection is clearly detected with multiple starts of the Forward Search.
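The abstract does not give the algorithmic details, so the following is only a minimal sketch of how a forward search driven by cosine similarity might look. Everything in it is an assumption made for illustration: the function names `cosine_to_centroid` and `forward_search_cosine`, the median-based starting subset, the choice of monitored statistic, and the toy data. For sparse term vectors the coordinate-wise median can be all zeros, in which case a starting subset would have to be passed explicitly through `start`; this is also the hook for trying multiple starts.

```python
import numpy as np


def cosine_to_centroid(X, subset):
    """Cosine similarity of every row of X to the centroid of X[subset]."""
    centroid = X[subset].mean(axis=0)
    denom = np.linalg.norm(X, axis=1) * np.linalg.norm(centroid)
    denom = np.where(denom == 0, 1.0, denom)  # guard against zero vectors
    return (X @ centroid) / denom


def forward_search_cosine(X, m0=3, start=None):
    """Hypothetical forward search on document vectors (illustrative only).

    Returns the order in which documents first enter the subset and, for
    each step, the lowest similarity to the previous centroid among the
    documents selected for the new subset.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if start is None:
        # Assumed start: the m0 documents closest to the coordinate-wise median.
        med = np.median(X, axis=0)
        denom = np.linalg.norm(X, axis=1) * np.linalg.norm(med)
        sims0 = (X @ med) / np.where(denom == 0, 1.0, denom)
        subset = np.argsort(-sims0, kind="stable")[:m0]
    else:
        subset = np.asarray(start)

    order = [int(i) for i in subset]
    seen = set(order)
    monitored = []
    for m in range(len(subset) + 1, n + 1):
        sims = cosine_to_centroid(X, subset)
        ranked = np.argsort(-sims, kind="stable")  # most similar first
        subset = ranked[:m]                        # grow the subset by one
        monitored.append(float(sims[ranked[m - 1]]))
        for i in subset:
            if int(i) not in seen:
                seen.add(int(i))
                order.append(int(i))
    return order, monitored


# Toy term-count matrix: eight similar documents and one unrelated one (row 8).
X = np.array([
    [5, 1, 0], [4, 1, 0], [6, 2, 0], [5, 2, 0],
    [4, 2, 1], [6, 1, 0], [5, 1, 1], [3, 1, 0],
    [0, 0, 7],
])
order, monitored = forward_search_cosine(X, m0=3)
print(order)
print(monitored)
```

In this toy collection the unrelated document enters last and the monitored similarity drops sharply at the final step, which is the kind of signal the abstract describes for documents left at the end of the search.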
2013
9783642330414

Use this identifier to cite or link to this document: https://hdl.handle.net/11381/2684286