Data as of 17 April 2024



ISBN 978-3-8439-1774-2

€84.00 incl. VAT, plus shipping


Series: Informatik (Computer Science)

Dominique Ziegelmayer
Character n-gram-based sentiment analysis

199 pages, dissertation, University of Cologne (2014), softcover, A5

Abstract

With the growing availability and popularity of user-generated content, the automatic analysis and aggregation of such information is becoming increasingly important. Sentiment polarity classification, one of the main tasks in sentiment analysis, aims to analyze and classify documents according to the opinions stated therein. Existing work has mainly focused on standard machine learning techniques. We investigate a novel approach that has proven successful in conventional text classification tasks such as authorship attribution and topic categorization.

This thesis examines classifiers based on adaptive statistical data compression models or, more generally, on statistics about variable- or fixed-length character sequences, i.e. character n-grams. We define a classifier using the prediction by partial matching (PPM) compression algorithm and introduce the p2-Measure as a simple abstraction of PPM, motivated by information theory. Coupled with feature weighting and feature selection schemes, the p2-Measure consistently outperforms the far more sophisticated SVM.
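The general idea of classifying by compressibility can be illustrated with a minimal sketch. The code below is an assumption-laden simplification and not the method of the thesis: it uses a single fixed-order character n-gram model with add-one smoothing, whereas PPM blends several context lengths, and it does not implement the p2-Measure, which the abstract does not define. All names (`char_ngrams`, `CharNgramModel`, `classify`) and the toy training sentences are hypothetical. Each class gets its own model, and a document is assigned to the class whose model would encode it in the fewest bits per character.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramModel:
    """Fixed-order character model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of context + next character
        self.context_counts = Counter()  # counts of the (n-1)-character context
        self.alphabet = set()

    def train(self, documents):
        for doc in documents:
            self.alphabet.update(doc)
            for gram in char_ngrams(doc, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def cross_entropy(self, text):
        """Average number of bits per n-gram needed to encode `text`."""
        grams = char_ngrams(text, self.n)
        if not grams:
            return float("inf")
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        bits = 0.0
        for gram in grams:
            p = (self.ngram_counts[gram] + 1) / (self.context_counts[gram[:-1]] + vocab)
            bits -= math.log2(p)
        return bits / len(grams)


def classify(text, models):
    """Pick the label whose model encodes `text` most cheaply."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy usage with made-up training data (far too small for real conclusions).
positive = CharNgramModel(n=3)
positive.train(["great movie, loved it", "wonderful and lovely acting"])
negative = CharNgramModel(n=3)
negative.train(["terrible movie, hated it", "awful and boring plot"])
print(classify("loved the acting", {"positive": positive, "negative": negative}))
```

Because the model works on raw characters, it needs no tokenizer or stemmer, which is one plausible reason such approaches are attractive for noisy or non-English text.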

In the course of this work, we analyze the advantages of the p2-Measure and of character n-gram-based approaches in detail. Besides the transfer performance between different source and target domains, i.e. cross-domain sentiment analysis, we are also interested in the potential benefits of our method on foreign-language datasets. Moreover, we investigate to what extent the p2-Measure can be used to determine not only the polarity but also the strength and even the original rating of a document. Altogether, our results show that the p2-Measure is a serious alternative to the standard word-based approach and that it is especially suitable for noisy or foreign-language datasets.