The effect of article characteristics on citation number in a diachronic dataset of the biomedical literature on chronic inflammation: An analysis by ensemble machines / Galli, C.; Guizzardi, S. - In: PUBLICATIONS. - ISSN 2304-6775. - 9:2(2021), p. 15. [10.3390/publications9020015]
The effect of article characteristics on citation number in a diachronic dataset of the biomedical literature on chronic inflammation: An analysis by ensemble machines
Galli C.;Guizzardi S.
2021-01-01
Abstract
Citation counts are a core metric for gauging the relevance of scientific literature, so identifying features that can predict a high citation count is of primary importance. For the present study, we generated a dataset of 121,640 publications on chronic inflammation from the Scopus database, published between 1951 and 2021, containing data such as titles, authors, journal, publication date, document type, access type and citation count. We then computed title length, author count, title sentiment score, and the number of colons, semicolons and question marks in each title, and used these features as predictors in Gradient Boosting, Bagging and Random Forest regressors and classifiers. We were able to train these ensemble models on the data, and Gradient Boosting achieved an F1 score of 0.552 on classification. The models agreed that document type, access type and number of authors were the best predictors, followed by title length.
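The title-derived predictors named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `extract_features`, the `TitleFeatures` container, the semicolon-separated author string, and the choice to report title length in both characters and words are all assumptions made for illustration. The title sentiment score is left out because the abstract does not say which tool was used to compute it.

```python
from dataclasses import dataclass


@dataclass
class TitleFeatures:
    """Hypothetical container for the simple per-article predictors."""
    title_length_chars: int
    title_length_words: int
    author_count: int
    colons: int
    semicolons: int
    question_marks: int


def extract_features(title: str, authors: str) -> TitleFeatures:
    """Compute simple predictors from a title and an author string.

    Assumption: authors are separated by semicolons, as in
    "Galli C.;Guizzardi S."; the original export format may differ.
    """
    # Drop empty fragments so a trailing semicolon does not add an author.
    author_list = [a.strip() for a in authors.split(";") if a.strip()]
    return TitleFeatures(
        title_length_chars=len(title),
        title_length_words=len(title.split()),
        author_count=len(author_list),
        colons=title.count(":"),
        semicolons=title.count(";"),
        question_marks=title.count("?"),
    )


if __name__ == "__main__":
    features = extract_features(
        "The effect of article characteristics on citation number in a "
        "diachronic dataset of the biomedical literature on chronic "
        "inflammation: An analysis by ensemble machines",
        "Galli C.;Guizzardi S.",
    )
    print(features)
```

Features of this kind, together with categorical fields such as document type and access type, would then form the input matrix for the ensemble regressors and classifiers.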