

IERI Procedia 10 (2014) 245 - 251

2014 International Conference on Future Information Engineering

A Redundancy Elimination Approach towards Summary Refinement

Esther Hannah M. a, Saswati Mukherjee b, Sakthi Balaramar c *

a Associate Professor, St. Joseph's College of Engg, Chennai, India
b Professor, DIST, Anna University, Chennai, India
c Student, St. Joseph's College of Engg, Chennai, India

Abstract

A summary generated by a machine, in contrast to a human-generated summary, is produced in less time, is unbiased, does not depend on the time or mood of the summarizer, and is reliable. However, many commonly used approaches are feature-based methods that look for important sentences or phrases by observing features or cues. Because such methods depend on sentence scoring, they may end up producing summaries containing sentences that are similar in meaning, which is not desirable. The proposed work takes a machine-generated summary as a rough summary and uses the binomial distribution to identify the importance of every sentence in it. The semantic similarity between sentences is measured and redundant sentences are removed, thereby refining the summary so that only informative sentences remain. The proposed redundancy elimination approach is applied to summaries obtained from an existing summarization system, using a fuzzy-based summarization model as a case study. The summary refinement approach is evaluated on the DUC 2002 dataset and the results are promising.

© 2014 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Selection and peer review under responsibility of Information Engineering Research Institute.

Keywords: Text Summarization; Summary Refinement; Redundancy Elimination; Binomial Distribution.

* Esther Hannah M. Tel.: +919841627565; E-mail address: author@institute.xxx.

doi: 10.1016/j.ieri.2014.09.084

1. Introduction

The amount of data on the Internet increases every day, making the task of selecting and classifying relevant information all the more difficult. Text summarization systems can automate the task of generating a summary from a large text in a reasonable amount of time. Traditionally, researchers designed statistical models to achieve this; more recently, attention has turned to a variety of machine learning algorithms that can build such models automatically.

A summary presents the main topics of one or more documents as a short, concise and readable text. Summaries are generated from a single document [1] or from multiple documents; the latter task is called multi-document summarization. Query-biased summarization [6] provides the user with information relevant to a query, while topic summarization generates topics along with the most informative sentences for each. The present work focuses on summary refinement. Summarization makes a document more readable by providing only its informative content [9] to the user. Human-generated summaries are expensive, while machine-generated summaries are often not up to the mark, and researchers have therefore made several efforts to generate good, informative summaries.

The rest of this paper is organized as follows. Section 2 reviews the background and related work. Section 3 provides an overview of the proposed summary refinement model. Section 4 describes the experimental setup, Section 5 presents the evaluation and results, and Section 6 concludes the paper.

2. Background work

One of the very first works in automatic text summarization was done by Luhn in 1958 at IBM and focused on technical documents [11]. Luhn proposed that word frequency is a useful measure in determining the significance of sentences. Many approaches to text summarization have since been proposed [4], and their results vary with the model used. Some assign numeric weights to index terms based on the frequency with which the term occurs in the document. The automatic extracting system of [2] assigns numerical weights to sentences based on weights given to certain machine-recognizable characteristics or clues, while other approaches assign weights based on semantic similarity measures.

Extracting key sentences from a document to form a summary can also be done by measuring the relevance of sentences using fuzzy-rough sets [15]. A text summarization system has likewise been proposed that produces extractive summaries using a well-defined set of features representing the sentences in a text.

3. A Summary refinement model

System-generated summaries are prone to contain sentences that convey similar meanings. Such similar sentences become candidates for the summary because of the importance of their features; this leads to redundancy in summaries and thereby increases their length. Some researchers have proposed to refine a system-generated summary by filtering sentences or phrases before they become part of the summary. We have taken up this challenge and suggest a way of refining extractive summaries by removing redundant sentences. The proposed approach makes use of the binomial distribution for measuring context-based indexing [10]. By assigning weights to the topical terms, a sentence similarity weight can be computed, and a graph constructed from these similarity values, showing the connections between sentences, helps to eliminate redundant sentences and yield a refined summary. Sentences with more connections are retained in the summary, as they contain more informative text. Figure 1 shows the architecture of the proposed summary refinement model.

[Figure 1 depicts the pipeline: the rough summary is pre-processed into clean text; lexical association is measured and used for term indexing; the term indexing weights yield sentence similarity values; and redundancy elimination produces the refined summary.]

Fig. 1 Summary Refinement Model

3.1 A Summary and its Lexical Association

The lexical association between a pair of topical terms is greater than that between a pair of non-topical terms or between a topical and a non-topical term. In a given rough summary, each term may be a topical or a non-topical term, and identifying the topical terms is an important part of any information retrieval system. In this work we use term co-occurrence knowledge to differentiate between topical and non-topical terms; the lexical association value is measured using the term co-occurrence pattern.

Let the given rough summary contain N sentences and s unique words, called "index terms". Let T = {t1, t2, ..., ts} be the set of index terms and S = {S1, S2, ..., SN} the set of N sentences. Let fij be the frequency with which term tj occurs in sentence Si, let Nj be the number of sentences in which term tj occurs at least once, and let Nij denote the number of sentences in which terms ti and tj co-occur. Consider the distribution of terms ti and tj in the input document. As per the binomial distribution, the probability Pi of term ti appearing in a sentence is given by equation (1):

Pi = Ni / N    (1)
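For illustration, the following minimal sketch (an assumption, not the authors' implementation) shows one way of gathering these counts, N, Ni and Nij, from the sentences of a rough summary:

```python
# Illustrative counting of N (sentences), N_i (sentences containing term t_i)
# and N_ij (sentences containing both t_i and t_j).
from itertools import combinations
from collections import Counter

def cooccurrence_counts(sentences):
    sent_terms = [set(s.lower().split()) for s in sentences]
    n_i, n_ij = Counter(), Counter()
    for terms in sent_terms:
        n_i.update(terms)                                         # per-sentence presence counts
        n_ij.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return len(sent_terms), n_i, n_ij
```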

Consider the Nj sentences in which term tj occurs. Term ti occurs in Nij of these Nj sentences and does not occur in the remaining Nj − Nij sentences. Therefore, the probability of Nij co-occurrences in Nj sentences is given by equation (2), where B denotes the binomial probability:

Prob(Nij) = B(N, Nj, Nij)    (2)

Self-information measures the information content associated with the probability given in equation (2). Therefore, the information content of the Nij co-occurrences of term ti in the Nj sentences is given by equation (3):

Inf(Nij) = −log2(Prob(Nij))    (3)
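As a worked illustration, the sketch below (not the authors' code) computes equations (1)–(3), reading B(·) in equation (2) as the binomial probability of Nij successes in Nj trials with per-trial probability Pi; the counts in the example call are invented.

```python
from math import comb, log2

def self_information(N, N_i, N_j, N_ij):
    """Information content (eq. 3) of seeing t_i in N_ij of the N_j sentences containing t_j."""
    p_i = N_i / N                                    # eq. (1): P_i = N_i / N
    # eq. (2), read here as a binomial pmf with N_j trials and success probability P_i
    prob = comb(N_j, N_ij) * p_i ** N_ij * (1 - p_i) ** (N_j - N_ij)
    return -log2(prob)                               # eq. (3): Inf(N_ij) = -log2(Prob(N_ij))

# Toy example: 100 sentences, t_i in 20 of them, t_j in 10, co-occurring in 6.
print(self_information(100, 20, 10, 6))
```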

3.2 Term Weighting and Sentence Indexing

The lexical association measure and the context-sensitive indexing weight of each term in a document are estimated as explained in the previous section. The lexical association value between a term and itself is initialized to 0. The term weight of each term tj based on its lexical association, denoted TermWeight(tj), is computed iteratively by summing the lexical association of tj with the other unique terms of the rough summary, as given in equation (4); it is initialized to 1 for all terms in the document. The cumulative sentence weight SentWeight(Si) is given by equation (5):

TermWeight(tj) = Σ_i LexAsso(tj, ti)    (4)

SentWeight(Si) = Σ_{tj ∈ Si} TermWeight(tj)    (5)
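A hedged sketch of equations (4) and (5) follows; the dictionary lex_asso, holding the pairwise lexical association values of Section 3.1, is a hypothetical data structure rather than something specified in the paper.

```python
# Term weight (eq. 4): sum of a term's lexical association with every other unique term.
# Sentence weight (eq. 5): sum of the term weights of the terms in the sentence.
def term_weight(term, vocabulary, lex_asso):
    # the association of a term with itself is initialized to 0, so it adds nothing
    return sum(lex_asso.get(frozenset((term, other)), 0.0)
               for other in vocabulary if other != term)

def sentence_weight(sentence_terms, vocabulary, lex_asso):
    return sum(term_weight(t, vocabulary, lex_asso) for t in sentence_terms)
```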

3.3 Sentence Similarity Measurement

Estimating how similar one sentence is to another is an important task with a wide impact on many text mining applications. A popular way to compute the similarity between a pair of texts is the cosine similarity measure: the inner product of two vectors is divided by the product of their vector lengths, so the vectors are normalized to unit length and only the cosine of the angle between them counts towards their similarity. The similarity between two sentences si and sj is computed using this normalized dot product.
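A minimal sketch of this cosine measure over plain term-frequency vectors (the paper's own sentence vectors may instead carry the term indexing weights of Section 3.2):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(sent_i, sent_j):
    """Normalized dot product of the term-frequency vectors of two sentences."""
    vi, vj = Counter(sent_i.lower().split()), Counter(sent_j.lower().split())
    dot = sum(vi[t] * vj[t] for t in vi.keys() & vj.keys())
    norm = sqrt(sum(c * c for c in vi.values())) * sqrt(sum(c * c for c in vj.values()))
    return dot / norm if norm else 0.0
```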

3.4 Redundancy Elimination

Once the similarities between sentences have been estimated, an undirected graph G = (V, E) is constructed, where V is the set of nodes denoting the sentences in a document and each edge in E carries the similarity value between two nodes. The number of connections for each sentence is calculated. Pairs of nodes with a high similarity value represent redundant sentences, and hence one of the nodes in each such pair is removed.
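One possible sketch of this step is given below; the similarity threshold of 0.7 and the rule of keeping the higher-weighted sentence of each redundant pair are assumptions, as the paper does not specify these details.

```python
def eliminate_redundancy(sentences, similarity, sent_weight, threshold=0.7):
    """Drop, from each highly similar pair of sentences, the one with the lower weight."""
    removed = set()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if i in removed or j in removed:
                continue
            if similarity(sentences[i], sentences[j]) >= threshold:
                # keep the more informative (higher-weight) sentence of the pair
                removed.add(j if sent_weight(sentences[j]) <= sent_weight(sentences[i]) else i)
    return [s for k, s in enumerate(sentences) if k not in removed]
```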

4. Experimental setup

For evaluating the summary refinement model, we used a fuzzy-based summarization system to provide the rough summary. The fuzzy-based summarization system [3] makes use of feature extraction for each sentence of the input text document. Since the context-based indexing scheme appeared to be a good candidate for refining summaries, we applied the method on the DUC 2002 dataset, choosing for the experiment a text summarization system based on fuzzy logic [3].

4.1 Fuzzy Based Summarization (FBS) Systems — A Case Study

In FBS, sentence scoring is based on a set of fuzzy rules. The trapezoidal membership function, being simple and commonly used, is employed for fuzzification in the fuzzy inference system (FIS) of FBS. The input given to the FIS is fuzzified and then defuzzified so that the importance of every sentence is known. The selected and extracted features are used as input to the FIS, and each feature score lies between 0 and 1. The feature values are manually categorized as very low (vl), low (l), average (a), high (h) and very high (vh), and these input linguistic variables are fuzzified into three fuzzy sets, namely less, average and more.
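An illustrative trapezoidal membership function of the kind FBS uses for fuzzification is sketched below; the breakpoints in the example call are arbitrary placeholders, not the paper's parameters.

```python
def trapezoid(x, a, b, c, d):
    """Degree of membership of x in a trapezoidal fuzzy set rising over [a, b],
    flat over [b, c] and falling over [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# e.g. membership of a feature score 0.35 in an assumed "average" fuzzy set
print(trapezoid(0.35, 0.2, 0.4, 0.6, 0.8))
```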

5. Results and Evaluation

The datasets provided by the Document Understanding Conference (DUC) 2002 are used to evaluate the proposed summary refinement model. Evaluating a summary is perhaps the most difficult task, since there is no universal standard for measuring and judging a summary, and because human judgment is expensive in terms of resources and can be inconsistent, an evaluation standard is a necessity. The summary refinement model is evaluated with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. The Precision, Recall and F-score values of the refined summary are generated using ROUGE-1, which is considered apt for extractive text summarization systems. Figure 2 compares the average lexical association of fifteen documents with that of their summaries; the analysis shows that the lexical association value of the summary is greater than that of the document in most cases.

[Figure 2 plots the average lexical association for documents 1–15, comparing each document against its summary.]

Fig. 2 Average Lexical Association
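ROUGE-1 measures unigram overlap between a candidate summary and a reference summary. The sketch below approximates how its precision, recall and F-score are computed; the scores reported in this paper come from the official ROUGE toolkit, so this is only an illustration.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F-score between candidate and reference texts."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand.keys() & ref.keys())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_score
```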

Table 1 shows the ROUGE-1 results for the rough summary produced by the FBS system, the baseline summary given by DUC 2002, Microsoft Word 2007, and the refined FBS summary produced by the proposed context-based summary refinement approach. A graphical representation of the results is shown in Figure 3.

Table 1 ROUGE-1 results of the refined FBS system

ROUGE-1     Refined FBS Summary   FBS System   DUC 2002   MS Word 2007
Recall      0.4635                0.4918       0.44147    0.39306
Precision   0.5065                0.4734       0.44766    0.46718
F-score     0.4894                0.4824       0.44132    0.42691

Fig. 3 Results of Refined FBS

6. Conclusion and future scope

In this work, the information content of each term is calculated using the binomial distribution, and a graph-based algorithm is used to eliminate redundancy: redundant sentences are removed to obtain a refined summary. The performance shows that summarization is improved over various methods; DUC 2002 is used for evaluation and the ROUGE results are promising. Further work would be to apply this context-sensitive document indexing approach to other information retrieval tasks. It would also be interesting to use the proposed summary refinement model in other natural language applications such as document classification, text categorization and knowledge-based information acquisition. The proposed summary refinement model was applied to an existing fuzzy-based summarization system with promising results, and can thus serve as an effective tool for summary refinement.

References

[1] X. Wan and J. Xiao (2010), "Exploiting neighborhood knowledge for single document summarization and key phrase extraction", ACM Trans. Inf. Syst., vol. 28, pp. 8:1-8:34.

[2] K. S. Jones (1998), "Automatic summarizing: Factors and directions", MIT Press, pp. 1-12.

[3] Esther Hannah M., T. V. Geetha and Saswati Mukherjee (2011), "Automatic Extractive Text Summarization Based on Fuzzy Logic: A Sentence Oriented Approach", Swarm, Evolutionary, and Memetic Computing, Part I, LNCS 7076, pp. 530-538.

[4] M. A. Fattah and Fuji Ren (2008), "Automatic Text Summarization", Proceedings of World Academy of Science, Engineering and Technology, vol. 27, pp. 192-195.

[5] L.-W. Ku, L.-Y. Lee, T.-H. Wu, and H.-H. Chen (2005), "Major topic detection and its application to opinion summarization", Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 627-628.

[6] J. G. Conrad, J. L. Leidner, F. Schilder, and R. Kondadadi (2009), "Query-based opinion summarization for legal blog entries", Proceedings of the 12th International Conference on Artificial Intelligence and Law, pp. 167-176.

[7] K. Cai, C. Chen, and J. Bu (2009), "Exploration of term relationship for bayesian network based sentence retrieval," Pattern Recognition Letters, vol. 30, no. 9, pp. 805 - 811.

[8] H. Li (2002), "Word clustering and disambiguation based on co-occurrence data," Nat. Lang. Eng., vol. 8, pp. 25-42.

[9] G. Amati and C. J. Van Rijsbergen (2002), "Probabilistic models of information retrieval based on measuring the divergence from randomness," ACM Trans. Inf. Syst., vol. 20, pp. 357- 389.

[10] Pawan Goyal and Laxmidhar Behera (2012), "A Context based Word Indexing Model for Document Summarization", IEEE Transactions on Knowledge and Data Engineering, pp. 1-11.

[11] H. P. Luhn (1958), "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development, vol. 2, pp. 159-165.

[12] J. M. Conroy and D. P. O'Leary (2001), "Text summarization via hidden Markov models", Proceedings of SIGIR '01, pp. 406-407.

[13] Khosrow Kaikhah (2004), "Text Summarization Using Neural Networks", Faculty Publications, Texas State University.

[14] S. P. Yong, Ahmad I. Z. Abidin and Y. Y. Chen (2005), "A Neural Based Text Summarization System", Proceedings of the 6th International Conference on Data Mining.

[15] Hsun-Hui Huang, Yau-Hwang Kuo and Horng-Chang Yang (2006), "Fuzzy-Rough Set Aided Sentence Extraction Summarization", Proceedings of the First International Conference on Innovative Computing, Information and Control.