Abstract
To apply data mining techniques to texts written in natural language, a preprocessing step is used to transform the unstructured data into a structured format. Stemming is an important text preprocessing technique. In this study, we empirically examined the impact of essentially three stemming methods on clustering Turkish texts: Zemberek, Affix Stripping, and Fixed Prefix. Although all of these methods yielded very similar clustering quality, the Zemberek and Fixed Prefix 5 methods produced better results than the others. Beyond clustering quality, Zemberek and Fixed Prefix 5 are also preferable for Turkish text clustering applications because of the high dimensionality reduction rate they provide.
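As an illustration of the simplest of these methods, the sketch below assumes that "Fixed Prefix 5" means truncating each word to its first five characters; the abstract does not define the method, so this reading is an assumption. The function names `fixed_prefix_stem` and `stem_tokens` and the sample words are hypothetical and are not taken from the paper. The tokens are assumed to be lowercased already.

```python
def fixed_prefix_stem(word: str, n: int = 5) -> str:
    """Assumed Fixed Prefix stemmer: keep the first n characters.

    Words shorter than n characters are returned unchanged.
    """
    return word[:n]


def stem_tokens(tokens, n=5):
    """Apply the fixed-prefix stemmer to every token in a list."""
    return [fixed_prefix_stem(token, n) for token in tokens]


# Hypothetical sample: inflected forms of "kitap" (book) and "ev" (house).
tokens = ["kitaplar", "kitaplarda", "kitap", "evlerimizden"]
stems = stem_tokens(tokens)

# Distinct terms before and after stemming, i.e. the vocabulary size
# that determines the dimensionality of a bag-of-words representation.
vocab_before = len(set(tokens))
vocab_after = len(set(stems))
print(stems, vocab_before, vocab_after)
```

In this toy sample the three inflected forms of "kitap" collapse into one term, which is the kind of dimensionality reduction the abstract refers to; the rates reported in the study come from its own corpus, not from this sketch.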
Translated title of the contribution | Examining the impact of different stemming methods on clustering Turkish texts |
---|---|
Original language | Turkish |
Title of host publication | ELECO'2012 Elektrik - Elektronik ve Bilgisayar Mühendisliği Sempozyumu |
Pages | 598-602 |
Number of pages | 5 |
Publication status | Published - 29 Nov 2012 |
Externally published | Yes |
Event | ELECO'2012 Electric - Electronic and Computer Engineering Symposium - Bursa, Turkey. Duration: 29 Nov 2012 → 1 Dec 2012 |
Conference
Conference | ELECO'2012 Electric - Electronic and Computer Engineering Symposium |
---|---|
Abbreviated title | ELECO'2012 |
Country/Territory | Turkey |
City | Bursa |
Period | 29/11/12 → 1/12/12 |
Keywords
- text mining
- stemming
- document clustering
- natural language processing