
IERI Procedia 10 (2014) 38 - 44

2014 International Conference on Future Information Engineering

Distributionally Extended Network-Based Word Sense Disambiguation in Semantic Clustering of Polish Texts

Paweł Kędzia*, Maciej Piasecki, Jan Kocoń, Agnieszka Indyka-Piasecka

Wrocław University of Technology, ul. Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland

Abstract

In the paper we present an extended version of the graph-based unsupervised Word Sense Disambiguation algorithm. The algorithm is based on the spreading activation scheme applied to the graphs dynamically built on the basis of the text words and a large wordnet. The algorithm, originally proposed for English and Princeton WordNet, was adapted to Polish and plWordNet. An extension based on the knowledge acquired from the corpus-derived Measure of Semantic Relatedness was proposed. The extended algorithm was evaluated against the manually disambiguated corpus. We observed improvement in the case of the disambiguation performed for shorter text contexts. In addition the algorithm application expressed improvement in document clustering task.

© 2014 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Selection and peer review under responsibility of Information Engineering Research Institute.

Keywords: Word Sense Disambiguation; wordnet; text classification; plWordNet

1. Introduction

Documents are commonly represented in Information Retrieval as bags of words, i.e. collections of words (words with the number of their occurrences). Linguistic structure of the text is very rarely taken into account, mostly due to the limited robustness of the natural language processing (i.e. limited precision and speed of processing). However, the bag of words model causes the loss of information even on the word level. Many

* Corresponding author. Tel.: +48 71 320 42 24. E-mail address: pawel.kedzia@pwr.wroc.pl.

2212-6678. doi:10.1016/j.ieri.2014.09.073

words are polysemous and can express several meanings, e.g. the word car corresponds to 5 noun meanings in Princeton WordNet 3.1 (PWN) (Fellbaum, 1998): car 1 'a motor vehicle', car 2 'a wheeled vehicle adapted to the rails of railroad', car 3 'the compartment that is suspended from an airship', car 4 'an elevator car' and car 5 'a cable car'. Improper matching of the words in the user query against their use in the documents can lead to incorrect retrieval or ranking of the results.
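The bag-of-words representation discussed above can be sketched in a few lines; note that both occurrences of car below collapse into a single feature, whatever their senses:

```python
from collections import Counter

def bag_of_words(tokens):
    """Represent a document as word counts, discarding order and senses."""
    return Counter(tokens)

doc = ["the", "car", "stopped", "near", "the", "cable", "car"]
bow = bag_of_words(doc)
# The motor-vehicle and the cable-car occurrence of "car" are
# conflated into one feature with count 2.
print(bow["car"])  # 2
```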

Word Sense Disambiguation (WSD) methods can potentially help in several Information Retrieval (IR) tasks in which the user information need specification is longer than a couple of words, e.g. in Question Answering, document classification and document clustering. Supervised WSD tools (developed by applying supervised machine learning algorithms to a text corpus manually annotated with word senses) achieve relatively good accuracy, but their coverage is limited to the words annotated in the corpus. Such corpus annotation is very laborious and costly, so the typical coverage is from one hundred to several thousand words at most. Unsupervised WSD methods are often based on sense induction from text corpora, but their accuracy is much lower than that of the supervised methods and the coverage is still far from perfect (not all word senses are represented well enough). However, there is yet another group of unsupervised WSD methods: algorithms that use the wordnet relation graph (see Sec. 2) and the spreading activation scheme to find senses matching the surrounding text passages.

Our goal was to adapt a spreading activation based WSD algorithm to a wordnet different from PWN, namely the Polish plWordNet, and to a language different from English. Moreover, we wanted to extend the wordnet with knowledge resources acquired from a text corpus and to build a WSD tool for practical applications.

2. plWordNet - repository of lexical senses

In spreading activation based WSD (SA-WSD) a wordnet is used in two roles: as a repository of senses, defining all senses of each word, and as a knowledge base describing senses by lexico-semantic relations.

A wordnet consists of synsets, lexical units and lexico-semantic relations. A lexical unit is a pair: a word plus a sense number, e.g. car 2. A synset is a set of near synonyms and consists of one or more lexical units. Each synset represents a unique lexical meaning, and synset identifiers can represent lexical meanings. Lexico-semantic relations represent binary meaning associations between lexical units observed in the lexical system. Lexico-semantic relations are encoded in the wordnet as relations between synsets (the basic ones) or between lexical units, e.g. the synset from PWN 3.1 {car 1, auto 1, automobile 1, ...} is linked by the hypernymy relation with the synset {motor vehicle 1, automotive vehicle 1} and by holonymy with the synset {bumper 2}.
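The wordnet data model just described (lexical units, synsets, relations) can be sketched as follows; identifiers such as car.n.01 are illustrative, not actual plWN or PWN ids:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LexicalUnit:
    """A word plus a sense number, e.g. car 2."""
    lemma: str
    sense: int

@dataclass
class Synset:
    """A set of near-synonymous lexical units sharing one lexical meaning."""
    ident: str
    units: frozenset
    # synset-to-synset lexico-semantic relations, e.g. {"hypernym": {...}}
    relations: dict = field(default_factory=dict)

car = Synset("car.n.01", frozenset({LexicalUnit("car", 1), LexicalUnit("auto", 1)}))
vehicle = Synset("motor_vehicle.n.01", frozenset({LexicalUnit("motor vehicle", 1)}))
# encode hypernymy as a relation from the car synset to the vehicle synset
car.relations.setdefault("hypernym", set()).add(vehicle.ident)
print(car.relations["hypernym"])  # {'motor_vehicle.n.01'}
```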

plWordNet 2.1 (plWN) is a large wordnet for Polish, close in size to PWN: ~161 000 lexical units, ~118 000 synsets and ~108 000 unique words. The lexical units are described by more than 40 different lexico-semantic relations. plWN was developed on the basis of Polish corpora and provides better coverage of the corpus vocabulary and a higher relation density than PWN. However, plWN almost completely lacks glosses for synsets (short textual sense descriptions), which are intensively used in SA-WSD methods.

3. Distributionally extended WSD

Many approaches to graph-based WSD have been proposed, e.g. [Gutiérrez et al. 2012, Tsatsaronis et al. 2010, Mihalcea et al. 2004, Agirre and Soroa 2009, Navigli 2006, Sinha and Mihalcea 2007, Agirre and Soroa 2008]. In our work we follow the PageRank-based approach of Agirre and Soroa [Agirre and Soroa 2008, Agirre et al. 2009, Agirre et al. 2010]. The key concepts are: the Lexical Knowledge Base (LKB), i.e. a set of concepts (PWN synsets) together with the relations between them, and the dictionary of lemmas (word types) describing associations between lemmas and concepts. The LKB can be represented as a graph G = (V, E), where V is the set of concepts and E is the set of relations between them, i.e. e_ij ∈ E is an undirected edge (v_i, v_j) with v_i, v_j ∈ V. The relations correspond to the wordnet lexico-semantic relations, and weights are assigned to them. Graph-based WSD builds a graph related to the words included in a text fragment, next applies a ranking function to the graph, chooses the concepts with the highest final ranking values from the graph and finally assigns them to the words in the text.

Agirre and Soroa [Agirre and Soroa 2008, Agirre et al. 2009, Agirre et al. 2010] used the PWN relation graph as the basis for the LKB. Agirre and Soroa 2009 proposed the idea of applying the PageRank approach to WSD. The basic PR algorithm can be described by the state change equation: Pr = cMPr + (1 - c)v, where v is a vector of N dimensions (N being the number of graph nodes) and Pr is the vector of ranking values. The initial values of the elements are all set to 1/N. M is an N x N probability matrix with M_ij = 1/d_j if there is an edge linking v_i and v_j and 0 otherwise, where d_j is the degree of node v_j, and c is a damping factor (typically set within [0.85, 0.95]). PR is run iteratively for a predefined number of iterations.
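The state change equation can be implemented as a short power iteration (a minimal sketch; the graph and parameter values are illustrative, not from the paper):

```python
import numpy as np

def page_rank(adj, c=0.85, iters=30):
    """Basic PageRank: Pr = c*M*Pr + (1-c)*v on an undirected graph.

    adj: symmetric 0/1 adjacency matrix of shape (N, N).
    """
    n = adj.shape[0]
    deg = adj.sum(axis=0)
    deg[deg == 0] = 1            # guard against isolated nodes
    M = adj / deg                # column j scaled by 1/d_j (column-stochastic)
    v = np.full(n, 1.0 / n)      # uniform teleport vector
    pr = np.full(n, 1.0 / n)     # initial ranking values, all 1/N
    for _ in range(iters):
        pr = c * (M @ pr) + (1 - c) * v
    return pr

# Tiny triangle graph: all nodes are symmetric, so the ranks come out equal.
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
pr = page_rank(adj)
```

Because M is column-stochastic and v sums to one, the ranking vector keeps summing to one across iterations.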

Disambiguation of the words from a text context is done by: selecting the synsets associated with the lemmas from the text context, building a subgraph of the wordnet relation graph on the basis of the selected synsets, running PR on the subgraph, and choosing the appropriate synsets on the basis of the final node values.

This basic scheme can be extended with the use of a personalized vector v prior to the disambiguation. In both cases the result is an assignment of the chosen synsets to the text words. The context can be one or many sentences. Agirre and Soroa 2009 [Agirre and Soroa 2008] proposed a slightly modified PR:

Pr = αMPr + (1 - α)v

where α is the damping factor. They also introduced a modified estimation of the initial vector v, according to which the values are concentrated only on the nodes whose lemmas are identical to the text words being disambiguated. As a result, the probability mass is not distributed over the whole graph; this approach is called Personalized PageRank (PPR). The subgraph concept links come first from the wordnet relations linking the synsets selected by the text words. Next, links to synsets corresponding to words from the glosses of the previously selected synsets are added. It was assumed that the glosses were sense disambiguated. In this way new concepts from the glosses are added to the subgraph, which becomes bigger and provides more information.
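The personalization step only changes the teleport vector v: its mass is concentrated on the nodes whose lemmas occur in the context being disambiguated. A minimal sketch (graph and node indices are illustrative, not from the paper):

```python
import numpy as np

def personalized_page_rank(adj, context_nodes, alpha=0.85, iters=30):
    """PPR: Pr = alpha*M*Pr + (1-alpha)*v, with v focused on context nodes."""
    n = adj.shape[0]
    deg = adj.sum(axis=0)
    deg[deg == 0] = 1
    M = adj / deg
    v = np.zeros(n)
    v[context_nodes] = 1.0 / len(context_nodes)  # mass only on context lemmas
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        pr = alpha * (M @ pr) + (1 - alpha) * v
    return pr

# Path graph 0-1-2-3; the context word sits at node 0,
# so the final ranks skew toward that end of the graph.
adj = np.zeros((4, 4))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
pr = personalized_page_rank(adj, [0])
```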

In our case plWN provides almost no glosses, and the existing glosses are not sense disambiguated. This is a serious limitation, since the additional links derived from glosses were shown to bring a significant improvement in the SA-WSD algorithm.

To supplement the missing information, we used a Measure of Semantic Relatedness (MSR) as an additional source of links for the subgraphs. An MSR assigns a numerical value to pairs of words such that words that are semantically close receive higher values than those that are not related. The MSR is built on the basis of a statistical analysis of word distributions in a large corpus, i.e. words that occur in similar linguistic contexts are described by similar feature vectors and receive higher values of relatedness. For the work presented here, we used the MSR proposed by Piasecki et al. 2009, in which word occurrences are described by a set of basic syntactic relations with other words, co-occurrence frequencies are weighted by PMI, and the relatedness is calculated by the cosine measure. Using the MSR, we constructed an extended LKB in which additional concepts were added on the basis of the most related words from the MSR. For each word w_i we acquired from the MSR a list SW_i of the k = 20 words most related to w_i. For each word w_j from the list SW_i we added to the disambiguation subgraphs all synsets including w_j as additional concepts, together with all links between those synsets and the synsets already present in the subgraph; e.g. the initial list of 385 370 synset relations in the text-based subgraph was expanded with 2 346 834 new edges in the output graph.
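The subgraph extension described above can be sketched as follows; all data structures and identifiers here are hypothetical stand-ins for plWN synsets and MSR output, not the actual resources:

```python
def extend_subgraph(edges, synsets_of, related, msr_neighbours, context_words, k=20):
    """Add synsets of the k most MSR-related words, plus their links.

    edges: set of (synset, synset) pairs already in the subgraph.
    synsets_of: lemma -> set of synset ids (the wordnet dictionary).
    related: synset -> set of synsets linked by lexico-semantic relations.
    msr_neighbours: lemma -> list of related lemmas, most similar first.
    """
    nodes = {s for pair in edges for s in pair}
    extended = set(edges)
    for w in context_words:
        for neighbour in msr_neighbours.get(w, [])[:k]:
            for syn in synsets_of.get(neighbour, set()):
                # link each new synset to every related synset already present
                for other in related.get(syn, set()) & nodes:
                    extended.add((syn, other))
    return extended

# Toy data (hypothetical ids): "bóg" is the top MSR neighbour of "opatrzność".
synsets_of = {"bóg": {"bóg.1"}}
related = {"bóg.1": {"opatrzność.1"}}
edges = {("opatrzność.1", "niebo.1")}
msr = {"opatrzność": ["bóg"]}
out = extend_subgraph(edges, synsets_of, related, msr, ["opatrzność"])
```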

4. Evaluation

A WSD algorithm can be evaluated directly, by comparing its decisions with human judgments, and indirectly, by applying the algorithm as a tool in a text processing task.

4.1. Corpus based evaluation

For comparison we used a part of the KPWr corpus of Polish [Broda et al. 2012] which is sense disambiguated. This part includes 1996 documents, 14 022 words and 60 unique lemmas manually sense disambiguated. The WSD annotation contains information about the appropriate plWN synset, e.g.:

<tok><orth>dni</orth>
  <lex><base>dzień</base><ctag>subst:pl:nom:m3</ctag></lex>
  <prop key="sense:wsd_dzień">dzień-2</prop>
</tok>

where dzień 2 'a day' annotates the sense of the word dni 'days' (lemma dzień). The precision and recall of our algorithm with respect to the annotation are presented in Table 1, where the plWordNet data source means that the subgraph included only relations added from plWN on the basis of the text words, MSRkbest20 means that the subgraph was built on the basis of the synsets corresponding to the 20 most related words, and plWordNet+MSRkbest20 means that the full extended subgraph was built. The context type describes the size of the text context used for the subgraph construction: sentence by sentence (in PPR the initial vector v has values assigned to the nodes with lemmas from the sentence) or the whole document at once.
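Tokens in this annotation format can be read with a standard XML parser; a minimal sketch assuming the markup shown above:

```python
import xml.etree.ElementTree as ET

def read_sense(tok_xml):
    """Extract (orth, lemma, sense) from one KPWr <tok> element."""
    tok = ET.fromstring(tok_xml)
    orth = tok.findtext("orth")
    base = tok.findtext("lex/base")
    sense = None
    for prop in tok.findall("prop"):
        if prop.get("key", "").startswith("sense:"):
            sense = prop.text
    return orth, base, sense

xml = ('<tok><orth>dni</orth>'
       '<lex><base>dzień</base><ctag>subst:pl:nom:m3</ctag></lex>'
       '<prop key="sense:wsd_dzień">dzień-2</prop></tok>')
print(read_sense(xml))  # ('dni', 'dzień', 'dzień-2')
```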

Table 1. Precision on the annotated KPWr corpus for Personalized PageRank with 30 iterations.

Data source              Context type     Noun precision  Verb precision
plWordNet                Sentence         0.34            0.24
plWordNet                Whole document   0.43            0.28
MSRkbest20               Sentence         0.38            0.0
MSRkbest20               Whole document   0.37            0.0
plWordNet + MSRkbest20   Sentence         0.39            0.22
plWordNet + MSRkbest20   Whole document   0.37            0.27

The best result is highlighted. In all configurations the recall was equal to 0.88. The precision for verbs in the case of the MSRkbest20 data source is equal to 0, because the MSR used covered only nouns. The use of the 20 most related words improved the results only in the case of sentence contexts. A potential cause is that the top-ranked related words have high similarity, while the further ones do not, e.g. for the word subst:opatrzność 'providence' the most related word is subst:bóg 'God' (~0.25), but one of the last is subst:ojciec 'father' (~0.096).

4.2. Evaluation by application

Evaluation by application was performed for document clustering, due to the available document collection. We used two different data sources with two different representations, TF and TF/IDF, and two types of features: bag-of-words of orthographic forms, and plWN synsets corresponding to the text words. The first set of documents was derived from the Polish Wikipedia (http://pl.wikipedia.org). It contains 40 documents from 4 categories (10 documents per category): Planets (P in Table 2), Cities (C), Felidae (F) and Canidae animals (A). During evaluation we used the CLUTO system [Tagarelli and Karypis 2008, Zhao and Karypis 2005] for clustering. The results presented in Table 2 were achieved for the RBR clustering method with the G1' criterion function, 4 clusters, the cosine similarity function and the TF document representation.
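CLUTO is a standalone clustering package; as a minimal illustration of the TF representation and cosine similarity it operates on (toy documents, not drawn from the datasets):

```python
from collections import Counter
import math

def tf_vector(tokens):
    """Term-frequency representation of a document."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity of two sparse TF vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = tf_vector(["planet", "orbit", "planet"])
d2 = tf_vector(["planet", "orbit"])
d3 = tf_vector(["cat", "lion"])
# d1 and d2 share vocabulary and are similar; d1 and d3 are orthogonal.
```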

Table 2. Clustering results using the RBR method, G1' criterion function, cosine similarity and 4 clusters for the Wiki dataset and synset features.

Class  Size  ISim   ISdev  ESim   ESdev  Entropy  Purity  P   C   F   A
0      10    0.341  0.060  0.019  0.004  0.000    1.000   0   10  0   0
1      10    0.299  0.045  0.021  0.006  0.000    1.000   10  0   0   0
2      9     0.210  0.023  0.033  0.010  0.000    1.000   0   0   0   9
3      11    0.182  0.020  0.029  0.008  0.220    0.909   0   0   10  1

G1' = 510, Entropy: 0.062, Purity: 0.974, Accuracy = 97.5%

The purity and entropy indicators (overall and per cluster) are almost perfect. Only one document from Canidae was erroneously clustered with Felidae. It is about hyenas, which resemble Canidae but taxonomically belong with the Felidae branch, so the clustering can arguably be treated as correct. Table 3 shows the results with bag-of-words features for the same dataset. They are worse than those for the WSD synset features.
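Purity and weighted normalized entropy can be computed directly from the per-cluster class counts; a sketch using the Table 2 rows, which closely reproduces the reported overall figures (CLUTO's exact normalization and rounding may differ slightly):

```python
import math

def purity_entropy(cluster_rows):
    """cluster_rows: per-cluster class-count lists, e.g. the Table 2 rows."""
    total = sum(sum(r) for r in cluster_rows)
    purity = sum(max(r) for r in cluster_rows) / total
    ent = 0.0
    for r in cluster_rows:
        n = sum(r)
        h = -sum((c / n) * math.log2(c / n) for c in r if c)
        # normalize per-cluster entropy by log2 of the number of classes
        ent += (n / total) * (h / math.log2(len(r)))
    return purity, ent

# Counts in the order P, C, F, A, taken from Table 2.
rows = [[0, 10, 0, 0], [10, 0, 0, 0], [0, 0, 0, 9], [0, 0, 10, 1]]
purity, entropy = purity_entropy(rows)
# purity ≈ 0.975, entropy ≈ 0.06
```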

Table 3. Clustering results using the RBR method, G1' criterion function, cosine similarity and 4 clusters for the Wiki dataset and bag-of-words features.

Class  Size  ISim   ISdev  ESim   ESdev  Entropy  Purity  P   C   F   A
0      9     0.226  0.020  0.013  0.002  0.217    0.889   0   10  0   0
1      10    0.184  0.023  0.010  0.002  0.000    1.000   10  0   0   0
2      9     0.151  0.012  0.015  0.006  0.395    0.667   0   0   3   6
3      11    0.143  0.010  0.015  0.003  0.407    0.636   0   0   7   4

G1' = 470, Entropy: 0.256, Purity: 0.795, Accuracy = 82.5%

The second dataset comes from KPWr and includes 30 documents from 3 categories: Government (G in Table 4), Science (S) and Technical (T), with 10 documents in each category. The results for the best configuration from the Wikipedia dataset are presented in Table 4.

Table 4. Clustering results using the RBR method, G1' criterion function, cosine similarity and 3 clusters, KPWr dataset and synset features.

Class  Size  ISim   ISdev  ESim   ESdev  Entropy  Purity  G  S  T
0      13    0.268  0.061  0.065  0.025  0.829    0.462   1  6  6
1      8     0.418  0.079  0.037  0.020  0.000    1.000   8  0  0
2      8     0.192  0.027  0.078  0.039  0.500    0.500   1  4  3

G1' = 430, Entropy: 0.616, Purity: 0.621, Accuracy = 58%

On this dataset the obtained results were worse than those on the Wikipedia dataset (also for the bag-of-words features in Table 5). Only the government category could be separated well. This may be due to the fact that the documents in KPWr are more complex and longer than those in the Wikipedia.

Table 5. Clustering results using the RBR method, G1' criterion function, cosine similarity and 3 clusters, KPWr dataset and bag-of-words features.

Class  Size  ISim   ISdev  ESim   ESdev  Entropy  Purity  G  S  T
0      13    0.497  0.111  0.105  0.019  0.648    0.538   1  7  5
1      10    0.320  0.091  0.048  0.047  0.234    0.900   9  1  0
2      6     0.315  0.087  0.111  0.041  0.730    0.500   0  2  4

G1' = 392, Entropy: 0.447, Purity: 0.655, Accuracy = 71.5%

5. Conclusions

The application of the large-lexicon WSD algorithm brought very encouraging results. The WSD, despite its limited accuracy in tests on the disambiguated corpus, showed improvement in the document clustering task. The WSD algorithm based on spreading activation was easily adapted to the new language and the new wordnet. So, from this perspective, even the evaluation by comparison with human judgments can be treated as positive. In order to supplement the information missing from plWordNet (i.e. glosses), we used a corpus-derived Measure of Semantic Relatedness which provides automatically extracted semantic associations between words. The extension of the WSD algorithm with the information from MSR brought improvement for shorter contexts, but decreased the accuracy for larger contexts. There is a clear need for further research on defining MSR parameters appropriate for WSD and on further exploration of the corpus-derived information in SA-WSD.

Acknowledgements. Polish National Centre for Research and Development project SyNaT and the European Union within the European Innovative Economy Programme, project NEKST POIG.01.01.02-14-013/09.

References

[1] Yoan Gutiérrez, Sonia Vázquez, Andrés Montoyo. A Graph-Based Approach to WSD Using Relevant Semantic Trees and N-Cliques Model. Computational Linguistics and Intelligent Text Processing, LNCS Volume 7181, 2012, pp. 225-237.

[2] George Tsatsaronis, Iraklis Varlamis, and Kjetil Nørvåg. An Experimental Study on Unsupervised Graph-based Word Sense Disambiguation. LNCS Volume 6008, 2010, pp. 184-198.

[3] R. Mihalcea, P. Tarau, and E. Figa. Pagerank on semantic networks with application to word sense disambiguation. In Proc. of COLING, 2004.

[4] E. Agirre and A. Soroa. Personalizing PageRank for word sense disambiguation. In Proc. of EACL, pages 33-41, 2009.

[5] R. Navigli. Online word sense disambiguation with structural semantic interconnections. In Proc. of EACL, 2006.

[6] R. Sinha and R. Mihalcea. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proc. of ICSC, 2007.

[7] Eneko Agirre and Aitor Soroa. Using the multilingual central repository for graph-based word sense disambiguation. In Proc. of the Language Resources and Evaluation, 2008. ELRA.

[8] Eneko Agirre, Oier Lopez De Lacalle, and Aitor Soroa. Knowledge-based WSD on specific domains: Performing better than generic supervised WSD. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, pages 1501-1506, San Francisco, CA, USA, 2009.

[9] Eneko Agirre, Aitor Soroa, and Mark Stevenson. Graph-based word sense disambiguation of biomedical documents. Bioinformatics, 26(22):2889-2896, November 2010.

[10] Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski and Adam Wardyński. KPWr: Towards a Free Corpus of Polish. LREC 2012.

[11] Andrea Tagarelli and George Karypis. A Segment-based Approach To Clustering Multi-Topic Documents. Text Mining Workshop, SIAM Datamining Conference, 2008.

[12] Ying Zhao and George Karypis. Hierarchical Clustering Algorithms for Document Datasets. Data Mining and Knowledge Discovery, Vol. 10, No. 2, pp. 141 - 168, 2005.

[13] Christiane Fellbaum, editor. 1998. WordNet - An Electronic Lexical Database. The MIT Press.

[14] Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, and Stan Szpakowicz. 2013. Beyond the transfer-and-merge wordnet construction: plWordNet and a comparison with WordNet. In Proc. of Recent Advances in Natural Language Processing RANLP 2013, Hissar, Bulgaria. ACL, 2013.

[15] Maciej Piasecki, Stanisław Szpakowicz, and Bartosz Broda. 2009. A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej.