Automatic genre classification for Afrikaans

  • Dirk Snyman
  • Gerhard van Huyssteen
  • Walter Daelemans
Keywords: Genre classification, Resource-scarce languages, Machine learning, Human language technology, Natural language processing

Abstract

In text processing, metadata about a particular text plays an important role. Metadata is often generated by automatic text classification systems, which assign a text to one or more predefined classes or categories based on its contents. One of the dimensions along which a text can be classified is its genre. This study proposes the development of an automatic genre classification system in a resource-scarce environment. It aimed to investigate the techniques and approaches generally used for automatic genre classification and to identify the best approach for Afrikaans (a resource-scarce language). When developing an automatic genre classification system, a set of variables must be considered because they influence the performance of machine learning approaches: the algorithm used, the amount of training data, and the representation of the data as features. If these variables are handled correctly, an optimal combination of them can be identified to develop a successful genre classification system. In this article, a genre classification system is developed using a multinomial naive Bayes (MNB) algorithm with a bag-of-words feature set. This system achieves an f-score (a performance measure) of 0.929.
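To illustrate the kind of approach the abstract names, the following is a minimal sketch of a multinomial naive Bayes classifier over bag-of-words counts, together with an f-score computation. It is not the authors' implementation: the whitespace tokenizer, the Laplace (add-one) smoothing, the class and function names, and the tiny English toy corpus are all assumptions made for illustration, and the sketch does not reproduce the reported result.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


class MultinomialNB:
    """Multinomial naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)        # number of documents per genre
        self.word_counts = defaultdict(Counter)  # genre -> word -> count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {g: sum(c.values()) for g, c in self.word_counts.items()}
        self.n_docs = len(labels)
        return self

    def predict(self, text):
        scores = {}
        for genre in self.class_docs:
            # Start from the log prior of the genre.
            score = math.log(self.class_docs[genre] / self.n_docs)
            denom = self.total_words[genre] + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[genre][word] + 1) / denom)
            scores[genre] = score
        return max(scores, key=scores.get)


def f_score(gold, predicted, positive):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus, invented for this sketch only.
train_texts = [
    "the court ruled on the appeal",
    "the minister announced the budget",
    "she walked into the dark forest",
    "the dragon guarded the old castle",
]
train_labels = ["news", "news", "fiction", "fiction"]
model = MultinomialNB().fit(train_texts, train_labels)
```

Calling `model.predict(...)` on a new text returns the genre with the highest smoothed log probability. A real system would be trained on a much larger labelled corpus and evaluated on held-out data.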

Author Biography

Dirk Snyman
B.A. Language Technology
Published
2014-11-24
How to Cite
Snyman, D., van Huyssteen, G., & Daelemans, W. (2014). Automatic genre classification for Afrikaans. Suid-Afrikaans Tydskrif Vir Natuurwetenskap En Tegnologie / South African Journal of Science and Technology, 33(1), 12 pages. https://doi.org/10.4102/satnt.v33i1.759
Section
Original Research