Abstract
Preprocessing is an important step in information retrieval and text mining. In this study, we examine the impact of stemming on the clustering of Turkish texts. We performed extensive experiments on two datasets compiled from the websites of Turkish news agencies. The experiments provide no significant evidence that stemming always improves clustering quality for Turkish texts. However, stemming dramatically reduces the dimensionality of the document-term matrix without adversely affecting clustering performance. We therefore highly recommend applying stemming when clustering Turkish texts.
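The dimensionality effect described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the suffix list, the `toy_stem` function, and the three sample documents are hypothetical and chosen only for illustration, and a real study would use a proper Turkish stemmer or morphological analyzer.

```python
from collections import Counter

# Hypothetical, illustrative suffix list; NOT a real Turkish stemmer.
SUFFIXES = ("lerin", "ların", "ler", "lar", "de", "da")


def toy_stem(token, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


def build_matrix(docs, stem=None):
    """Return (vocabulary, rows) for a plain term-count document-term matrix."""
    tokenized = [
        [stem(tok) if stem else tok for tok in doc.split()]
        for doc in docs
    ]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


# Assumed sample documents (already lowercased), one per line.
docs = [
    "evler evlerin evde",
    "kitap kitaplar kitapların",
    "okul okullar okulda",
]

raw_vocab, raw_rows = build_matrix(docs)
stemmed_vocab, stemmed_rows = build_matrix(docs, stem=toy_stem)

print("columns without stemming:", len(raw_vocab))
print("columns with stemming:   ", len(stemmed_vocab))
```

In this toy corpus, the inflected surface forms of each noun collapse onto a single stem, so the matrix has fewer columns while each document keeps the same total term count. Whether clustering quality changes is a separate, empirical question that this sketch does not address.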
Original language | English |
---|---|
Title of host publication | 2012 International Symposium on Innovations in Intelligent Systems and Applications |
Publisher | IEEE |
Number of pages | 4 |
ISBN (Electronic) | 9781467314480 |
ISBN (Print) | 9781467314466 |
Publication status | Published - 23 Jul 2012 |
Externally published | Yes |
Event | 2012 International Symposium on Innovations in Intelligent Systems and Applications - Trabzon, Turkey; Duration: 2 Jul 2012 → 4 Jul 2012 |
Conference
Conference | 2012 International Symposium on Innovations in Intelligent Systems and Applications |
---|---|
Abbreviated title | INISTA |
Country/Territory | Turkey |
City | Trabzon |
Period | 2/07/12 → 4/07/12 |
Keywords
- data mining
- text mining
- document clustering
- preprocessing
- stemming