Procedia - Social and Behavioral Sciences 27 (2011) 122-130

Pacific Association For Computational Linguistics (PACLING 2011)

Verb Tense Generation

John Lee*

Halliday Centre for Intelligent Applications of Language Studies, Department of Chinese, Translation and Linguistics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Abstract

Correct usage of verb tenses is important because they encode the temporal order of events in a text. However, tense systems vary from one language to another, and are difficult to master for machines and non-native speakers alike. We present a method to predict verb tenses based on syntactic and lexical features, as well as temporal expressions in the context. A statistical model based on Conditional Random Fields significantly outperforms the baseline. This model may be used in post-editing verbs in machine translation output and texts written by non-native speakers.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of PACLING Organizing Committee.


1. Introduction

Almost every sentence contains a verb. A verb describes a situation --- an event or a state of being; in many languages, the time and nature of this situation are marked by tenses. Tenses relate the time of the utterance of the verb to the time of occurrence of the situation. Indeed, from verb tenses, one can infer temporal relationships within the sentence [1], and also determine the temporal ordering of events in a document [2,3]. The various tenses of the English verb 'eat' are shown in Table 1.

The inventory of tenses varies from one language to another. The German present tense, for example, may be translated into either the English present or future. Spanish uses two different tenses, the simple past (preterit) and the ongoing past (imperfect), depending on whether the verb is stative or non-stative [4], while English makes no such distinction. In Chinese, which lacks overt tense markers altogether, only the context dictates whether the tense should be rendered as past, present, or future. The lack of one-to-one mappings between tenses in the source and target languages obviously poses a challenge to machine translation (MT) [4,5,6]. Some interlingua-based MT systems have incorporated rules for tense generation in restricted domains [7]; however, we are not aware of any statistical MT system, designed for the general domain, that has tackled this issue explicitly.

If machines have difficulty in choosing the appropriate verb tense, it is also no easy task for non-native speakers. Japanese learners of English, for example, tend to prefer the root form of an English verb, underusing its various inflected forms [8]. Recently, many writing assistance tools have been developed to provide automatic feedback to students writing in a foreign language [9,10], but there has been no attempt to detect and correct tense errors.

Our goal is to automatically assign correct tenses to verbs in English texts, by statistically modeling the distribution of tense usage in real, coherent texts. Drawing on linguistic insights (see section 3), we view verb tenses in a document as a chain, taking into account local lexical and syntactic contexts, including temporal expressions. This model can be applied towards post-editing machine translation output and texts written by non-native speakers.

The rest of this paper is organized as follows. In section 2, we define our research question. In section 3, we survey two linguistic models that underpin our computational model for verb tense generation, to be described in section 4. We train and evaluate this model in section 5, before concluding the paper.

Table 1. Reichenbach's analysis of verb tenses consists of two dimensions, TIME and ASPECT [11]. From his six "base tenses" (the Simple and Perfect columns of the Past, Present and Future rows), the system is extended to cover all verb forms seen in our data. The symbol "<" means "precedes", and ">" means "follows". In the progressive aspect, the event time (ET) is understood as a period rather than a point.

TIME \ ASPECT            Simple (RT=ET)   Perfect (RT>ET)      Progressive (RT contained in ET)   Perfect Progressive (RT>ET)

Past (RT<ST)             it ate           it had eaten         it was eating                      it had been eating
Present (RT=ST)          it eats          it has eaten         it is eating                       it has been eating
Future (RT>ST)           it will eat      it will have eaten   it will be eating                  it will have been eating
Infinitive (undefined)   to eat           to have eaten        to be eating                       to have been eating
Participle (undefined)   eating           having eaten         eating                             having been eating

2. Research Question

In this section, we first present the verb tense classification system to be adopted in this paper, together with its terminology; we then define the research question.

2.1. Background

Many linguistic theories about verb tenses have been proposed [12,13,14,15]. Among the most well-known analyses is the Base Tense Structure [11], which classifies verb tenses along two dimensions, TIME and ASPECT.

The TIME dimension relates the reference time (RT) and speech time (ST) of the verb, and can take on the values "past", "present" or "future". The ASPECT dimension relates reference time and event time (ET), and can take on the values "simple" or "perfect". These two dimensions yield the six so-called "base tenses" shown in Table 1. Consider the following sentences, adapted from [16]:

(1) John is a colleague of Mary's.

(2) He went over to her house yesterday.

(3) On the way, he had stopped by the flower shop for some roses.

In sentence (1), the verb 'is' is in the "simple present" tense, indicating that the event time is identical to the time of utterance (ET=RT=ST). In sentence (2), in contrast, the verb 'went' is in "simple past", so the speaker is referring to an event from an earlier time (ET=RT<ST). The use of the "past perfect" in sentence (3), 'had stopped', signals that the event time precedes the reference time of 'went' in sentence (2), as will be discussed in section 3.1; in turn, that reference time precedes the time of utterance (ET<RT<ST).

To cover all tenses found in the Penn Treebank, the Base Tense Structure is extended in both dimensions. To the TIME dimension are added two categories for the non-finite verbs, "infinitive" and "participle", whose RTs are undefined. To the ASPECT dimension are added the categories "progressive", whose ET is an interval that contains the RT, and "perfect progressive", whose ET precedes RT.
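For concreteness, the extended inventory can be written down as a simple lookup table. The sketch below (illustrative only, not part of the paper's system) encodes Table 1 in Python, mapping each (TIME, ASPECT) pair to the corresponding form of 'eat':

```python
# Illustrative encoding of Table 1: each verb tense label is a
# (TIME, ASPECT) pair, shown here with the surface form of 'eat'.
TENSE_TABLE = {
    ("past",       "simple"):              "it ate",
    ("past",       "perfect"):             "it had eaten",
    ("past",       "progressive"):         "it was eating",
    ("past",       "perfect progressive"): "it had been eating",
    ("present",    "simple"):              "it eats",
    ("present",    "perfect"):             "it has eaten",
    ("present",    "progressive"):         "it is eating",
    ("present",    "perfect progressive"): "it has been eating",
    ("future",     "simple"):              "it will eat",
    ("future",     "perfect"):             "it will have eaten",
    ("future",     "progressive"):         "it will be eating",
    ("future",     "perfect progressive"): "it will have been eating",
    ("infinitive", "simple"):              "to eat",
    ("infinitive", "perfect"):             "to have eaten",
    ("infinitive", "progressive"):         "to be eating",
    ("infinitive", "perfect progressive"): "to have been eating",
    ("participle", "simple"):              "eating",
    ("participle", "perfect"):             "having eaten",
    ("participle", "progressive"):         "eating",
    ("participle", "perfect progressive"): "having been eating",
}

# The generation task of section 2.2 amounts to predicting one of these
# twenty (TIME, ASPECT) labels for each verb.
assert len(TENSE_TABLE) == 20
```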

Even with these extensions, the system in Table 1 still lacks some other dimensions of verb tenses, including MODALITY ("it could eat", "it could have eaten"), NEGATION ("it does not eat", "it did not eat") and VOICE ("it is eaten", "it was eaten"). These dimensions are overtly marked in a wider range of languages, which makes them less problematic both for machine translation and for non-native speakers; they will therefore not be addressed in this paper.

2.2. Problem Definition

Most previous work in verb tense generation assumes as input some symbolic representation of the relevant time intervals. A set of manually crafted rules then infers the relationship between the event, reference and speech times, and finally selects the appropriate tense [17,18].

However, in many applications, such prior information can hardly be expected to be available. For example, in MT, the source language may have no tense markers; in writing assistance tools, the tenses used by the non-native speakers are unreliable. For these applications, it is more realistic to take only the raw text as input, assuming only knowledge of the infinitive form of the verb. Given the infinitive form, then, the system is required to assign to the verb its appropriate tense, based on its context. No attempt will be made to further determine the sense of the tense [19].

Thus, this task can be viewed as the classification of each verb as one of the categories in Table 1. For example, the verbs in the following sentence, given in their root forms,

Lorillard Inc., the unit of New York-based Loews Corp. that make Kent cigarettes, stop use crocidolite in its Micronite cigarette filters in 1956.

are to be classified as:

Lorillard Inc., the unit of New York-based Loews Corp. that make[present, simple] Kent cigarettes, stop[past, simple] use[participle, simple] crocidolite in its Micronite cigarette filters in 1956.

In this paper, for reasons to be discussed in section 3.2, we will restrict our attention to the verb in the main clause (henceforth the 'main verb') in each sentence. Table 2 shows the distribution of the tenses in our experimental data.

Table 2. Breakdown of verb tenses in the training set. The categories for the nonfinite verbs, "Infinitive" and "Participle", are collapsed into "Other".

Tense                  Percentage
Past simple
Present simple
Present perfective
Present progressive
Future simple
Other
Past perfective        less than 0.5%
All other categories   less than 0.5% each

3. Linguistic Models

Insights from two linguistic studies have informed the design of our computational model for verb tense generation. The first (section 3.1) establishes verb tense as an anaphor, drawing an analogy with nouns; the second (section 3.2) provides guidance on how to resolve anaphoricity.

3.1. Tense as Anaphor

When the reference time (RT) and event time (ET) are not explicitly provided in a sentence, a listener needs to reconstruct them from context. It has been argued that the listener does so by performing anaphor resolution [16].

A noun may be anaphoric, referring to a previously mentioned entity, or non-anaphoric. A verb has similar properties: if it is anaphoric, then its RT is the same as the RT previously established by a preceding verb; typically, the tense of this preceding verb has the same TIME value. In contrast, if a verb is non-anaphoric, then it establishes a new RT. A tense whose TIME value differs from those of the preceding verbs is a strong indication of non-anaphoricity.

Under this framework, the three sentences in section 2.1 may be re-analyzed as follows. The text starts with the speech time identical to the reference time, as indicated by the "present" verb 'is' in sentence (1). The "past" verb 'went' in (2) is non-anaphoric, introducing a new RT, 'yesterday'. This RT also serves as the antecedent for the "past" verb 'had stopped' in (3), whose RT is the same as that of 'went'. The ET of 'had stopped', however, is placed before the RT, due to its "perfect" aspect.

In summary, tense generation may be understood as a kind of anaphor resolution. If a verb is anaphoric to a preceding verb, then its RT remains unchanged; if it is non-anaphoric, then it shifts to a different RT. In this paper, we will not attempt to pinpoint the exact RT; rather, for the TIME dimension, all verbs are assigned one of three "generic" RTs, namely "present", "past" or "future". In other words, the verb's anaphoricity decides whether there is a shift to a different TIME category.

3.2. Anaphoricity Resolution

Each of the sentences considered in sections 2.1 and 3.1 has only one reference time (RT). In general, however, a sentence can contain multiple verbs and RTs. Consider the following three sentences, taken from the Penn Treebank [20]:

(1) A form of asbestos once used to make Kent Cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported.

(2) The asbestos fiber, crocidolite, is unusually resilient once it enters the lungs, with brief exposures to it causing symptoms that show up decades later, researchers said.

(3) Lorillard Inc., the unit of New York-based Loews Corp. that makes Kent Cigarettes, stopped using crocidolite in its Micronite cigarette filters in 1956.

Both sentence (1) and sentence (2) contain at least two RTs: one for the researchers talking to the author of the text ('reported' and 'said', respectively); another for the verb within the reported speech ('has caused' and 'is', respectively). Intuitively, the verb 'said' in (2) has the same RT as the verb 'reported' in (1). It is therefore anaphoric, with 'reported' as its antecedent, rather than other verbs, such as 'show', which are located closer to it. This phenomenon can be explained by Temporal Centering theory [21], which is analogous to Centering Theory [22].

According to Centering Theory, in an utterance Ui, the nouns constitute the forward-looking centers, denoted as Cf(Ui). These centers are ranked in terms of salience. The most highly ranked one, usually the subject, is called the backward-looking center, denoted as Cb(Ui). By the principle of "center retention", it is conjectured that, in a coherent document, Cb(Ui) is most likely to be the same as Cb(Ui-1), the backward-looking center of the preceding sentence. Although less likely, by a process of "center shift", it may also be the same as one of the elements of Cf(Ui-1). Finally, and even less likely, it may be non-anaphoric.

In Temporal Centering theory, the forward-looking centers are the verbs, rather than the nouns. The most salient verb, or the backward-looking center, is argued to be the one that is in the main clause. Based on a small-scale study on the Brown Corpus, this definition of the backward-looking center is found to favor center retention over center shift [21].

Applying Temporal Centering to the example above, the verb 'reported' is in the main clause of sentence (1), and is thus the backward-looking center. The theory correctly predicts that it is more likely to be the antecedent of the verb 'said' in (2) than other verbs in the subordinate clauses. Following this assumption, in our experiments, we will consider only the verbs in the main clauses, but not those in the subordinate clauses.

4. Computational Model

The linguistic theories described in section 3 form the basis of our computational model. We now approach the verb tense generation task as a sequence labeling problem. From a document with N sentences, we extract its sequence of N main verbs, and predict their tenses.
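As a rough illustration of this extraction step, the sketch below walks a Penn Treebank-style parse with NLTK and returns the main-clause verb. The traversal heuristic (follow the chain of directly embedded VPs, never entering subordinate clauses) is our own simplification for exposition, not the paper's exact procedure:

```python
import nltk

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}

def main_verb(tree):
    """Heuristically locate the main-clause verb of a Penn Treebank parse:
    follow the chain of directly embedded VPs down from the root, never
    entering SBAR or other subordinate clauses, and return the deepest
    verb token found. A simplified stand-in for the paper's extraction."""
    verb = None
    vp = next((c for c in tree
               if isinstance(c, nltk.Tree) and c.label() == "VP"), None)
    while vp is not None:
        for child in vp:
            if isinstance(child, nltk.Tree) and child.label() in VERB_TAGS:
                verb = child.leaves()[0]   # remember the deepest verb so far
        vp = next((c for c in vp
                   if isinstance(c, nltk.Tree) and c.label() == "VP"), None)
    return verb

tree = nltk.Tree.fromstring(
    "(S (NP (PRP He)) (VP (VBD went) (PP (TO to) (NP (PRP$ her) (NN house)))))")
print(main_verb(tree))   # -> went
```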

4.1. Conditional Random Fields

Conditional Random Fields (CRF) are a form of undirected graphical model [23], with two characteristics: (1) the model is conditionally trained on observed variables; and (2) the joint probability function is log-linear in the parameters. Restricting consideration for anaphoricity to the immediately preceding verb, we use a linear-chain, order-1 CRF, similar to the architecture in [24], to represent our sequence of verbs. Our training data take the form {(y1, x1), ..., (ym, xm)}, where yi is the tense label (see Table 1) of the main verb in the ith sentence in the document, and xi consists of the features associated with that verb.
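The experiments use the CRF implementation in MALLET (section 5.1). Purely as an illustrative sketch of the same linear-chain, order-1 setup, the snippet below uses the sklearn-crfsuite package instead; the toy feature dictionaries and the regularization constant c2 are assumptions for exposition, not the actual training data or settings:

```python
import sklearn_crfsuite

# One instance per document: the sequence of feature dicts for its main
# verbs (x_i) and the corresponding tense labels (y_i).
X_train = [
    [{"verb": "report", "agent": "researchers", "time_value": "past_ref"},
     {"verb": "say", "verb-1": "report", "agent": "researchers"}],
]
y_train = [
    ["past simple", "past simple"],
]

# A linear-chain, order-1 CRF: each label is conditioned on the observed
# features and on the immediately preceding label, mirroring the
# restriction of anaphoricity to the immediately preceding verb.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```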

4.2. Features

Our features include lexical and syntactic features, drawn from the Penn Treebank [20]; and temporal features, extracted via the TempEx tagger [25].

The lexical and syntactic features include:

• Verb: The infinitive (root) form of the verb.

• Verb-1: The verb in the main clause of the preceding sentence.

• Agent: The subject of the verb.

• Preposition: The preposition, if any, that dominates a temporal expression, such as 'during' for "during summer". Some prepositions are indicative of the ASPECT of the verb.

The temporal tags ("-TMP") in the Penn Treebank could have been used to generate the temporal features; to avoid dependence on this annotation, however, we used instead the TempEx tagger [25] to extract temporal expressions from the sentences. This tagger takes the reference time, i.e., the date on which the document was written, as an input parameter. It then identifies temporal expressions and gives each one a TimeType and (possibly) a TimeValue:

• TimeType: The type may be "date" or "time".

• TimeValue: The value is an ISO-standard time-stamp. It is compared with the reference time and rewritten as "present_ref" or "past_ref". Future references were found to be relatively uninformative, and were excluded. In the training set, 89.4% of the expressions labeled "past_ref" do in fact correspond to a verb in the past tense, while "present_ref" is less reliable, at 55.3%.

The main verb obtains these two feature values from the closest temporal expressions, if any, in the parse tree.
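To summarize how these features come together, the sketch below assembles the feature dictionary for one main verb. The helpers main_verb_node, subject_of, governing_preposition and closest_temporal_expression are hypothetical stand-ins for the parse-tree traversal and TempEx output described above, as is the bucketing of TimeValue against the document date:

```python
from datetime import date

def bucket_time_value(iso_value, reference_date):
    """Rewrite an ISO time-stamp as 'past_ref' or 'present_ref' relative to
    the document date; future references, being relatively uninformative,
    are dropped."""
    value = date.fromisoformat(iso_value)
    if value < reference_date:
        return "past_ref"
    if value == reference_date:
        return "present_ref"
    return None   # future reference: excluded

def verb_features(sentence, prev_sentence, doc_date):
    """Feature dict for the main verb of one sentence. The helpers called
    here (main_verb_node, subject_of, governing_preposition,
    closest_temporal_expression) are hypothetical stand-ins for the
    machinery described in section 4.2."""
    verb = main_verb_node(sentence)
    features = {
        "verb": verb.root_form,                      # Verb
        "agent": subject_of(verb),                   # Agent
        "preposition": governing_preposition(verb),  # Preposition
    }
    if prev_sentence is not None:                    # Verb-1
        features["verb-1"] = main_verb_node(prev_sentence).root_form
    timex = closest_temporal_expression(verb)        # nearest in parse tree
    if timex is not None:
        features["time_type"] = timex.time_type      # "date" or "time"
        if timex.time_value is not None:
            features["time_value"] = bucket_time_value(timex.time_value,
                                                       doc_date)
    return {k: v for k, v in features.items() if v is not None}
```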

5. Experiments

5.1. Setup

The CRF implementation in the MALLET package [26] was used to train an order-1 CRF. The variance in the regularization term was kept at its default value, σ = 10. We trained on verb tense sequences from sections 0 to 21 of the Penn Treebank [20]. There are 45,650 main verbs in a total of 2,053 articles, and hence 2,053 tense sequences, one per article. The breakdown of the tenses of these verbs is shown in Table 2.

Section 23 served as the test data. Features for this section were obtained from the parse trees produced by a statistical natural language parser [27].

5.2. Results

The experimental results are tabulated in Table 3. A simple baseline of always predicting the majority class, i.e., "past simple", yielded 45.5% accuracy.

The order-1 CRF achieved 58.3% accuracy. Although it significantly outperformed the baseline, it still frequently over-predicted the "past simple" tense. This error is partially caused by the strong influence of the previous tense on the prediction of the current one; in the absence of explicit temporal expressions, the model is reluctant to predict a shift in tense, i.e., a shift in reference time. Temporal expressions, unfortunately, are sparse: in the training set, only 1.2% of the verbs in the "present" tense, and 6.1% of those in the "past", have the benefit of an explicit temporal expression elsewhere in the sentence.

To gauge the extent to which knowledge of the previous tense is helpful, we ran the HISTORY ORACLE experiment. In this model, one additional feature --- the correct tense label of the previous verb --- was added to the CRF model. This feature improved the accuracy by an absolute 4%. While this oracle information improved the performance on the TIME dimension, the ASPECT dimension remained difficult to predict. An examination of the errors suggests that, in many cases, real-world knowledge would be needed to determine the most likely relation between the event and reference times; in other cases, this relation is hardly recoverable from the context.

Table 3. Accuracy in verb tense generation. MAJORITY is the baseline of choosing the most frequent tense. CRF is the order-1 conditional random field described in section 5.1. HISTORY ORACLE is the CRF model augmented with the tense label of the preceding verb. Please see section 5.2 for a discussion of these results.

Model             Accuracy

MAJORITY          45.5%
CRF               58.3%
HISTORY ORACLE    62.2%
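The accuracies in Table 3 are per-verb comparisons against the gold labels. A minimal sketch, assuming the gold and predicted labels have been flattened across all test documents:

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of main verbs whose predicted tense matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def majority_baseline(train_labels, n_test):
    """MAJORITY: always predict the most frequent training tense,
    which in the paper's data is 'past simple'."""
    most_frequent = Counter(train_labels).most_common(1)[0][0]
    return [most_frequent] * n_test

# e.g. accuracy(gold, majority_baseline(train_labels, len(gold)))
# reproduces the MAJORITY row of Table 3.
```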

6. Conclusion and Future Work

We have presented a method to predict verb tenses based on syntactic and lexical features, as well as temporal expressions in the sentence. A statistical model, trained as a linear-chain Conditional Random Field, significantly outperforms the majority baseline.

Looking forward, we plan to incorporate semantic features for the verbs, possibly drawn from FrameNet [28] or Levin's verb classes [29], and to consider verbs outside the main clauses, making fuller use of Temporal Centering theory [21]. We would also like to explore whether domain knowledge about tense usage in particular kinds of texts, such as scientific publications, can be incorporated.

Acknowledgements

This work grew out of two graduate-level courses, "Computational Models of Discourse" and "Machine Learning", taken by the author at the Massachusetts Institute of Technology. The author would like to thank Professor Regina Barzilay and Professor Tommi Jaakkola, who taught these two courses.

References

[1] M. Lapata and A. Lascarides. Learning sentence-internal temporal relations. Journal of Artificial Intelligence Research, 27:85-117, 2006.

[2] I. Mani, M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky. Machine learning of temporal relations. In Proc. COLING/ACL, pages 753-760, Sydney, Australia, 2006.

[3] N. Chambers and D. Jurafsky. Jointly combining implicit constraints improves temporal ordering. In Proc. EMNLP, 2008.

[4] B. J. Dorr, P. W. Jordan, and J. W. Benoit. A survey of current paradigms in machine translation. Advances in Computers, 49:1-68, 1999.

[5] M. Schiehlen. Granularity in tense translation. In Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany, 2000.

[6] Y. Ye and Z. Zhang. Tense tagging for verbs in cross-lingual context: A case study. In Proc. IJCNLP, 2005.

[7] C. Wang and S. Seneff. High-quality speech-to-speech translation for computer-aided language learning. ACM Transactions on Speech and Language Processing, 3(2), 2006.

[8] J. Lee and S. Seneff. An analysis of grammatical errors in non-native speech in English. In Proc. Spoken Language Technology (SLT) Workshop, pages 89-92, Goa, India, 2008.

[9] O. Knutsson, T. C. Pargman, K. S. Eklundh, and S. Westlund. Designing and developing a language environment for second language writers. Computers & Education, 49(4):1122-1146, 2007.

[10] C. Leacock, M. Gamon, and C. Brockett. User input and interactions on Microsoft Research ESL Assistant. In Proc. 4th Workshop on Innovative Use of NLP for Building Educational Applications, 2009.

[11] H. Reichenbach. Elements of Symbolic Logic. MacMillan, London, England, 1947.

[12] V. Ehrich. The generation of tense. In G. Kempen, editor, Natural Language Generation. Dordrecht, The Netherlands, 1987.

[13] A. E. Goldberg. Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, 1995.

[14] C. M. Matthiessen. Tense in English seen through systemic functional theory. In M. Berry, C. Butler, R. Fawcett, and G. Huang, editors, Meaning and Form: Systemic Functional Interpretations, pages 431-499. Ablex, Norwood, NJ, 1996.

[15] J. Nerbonne. Reference time and time in narration. Linguistics and Philosophy, 9(1):83-95, 1986.

[16] B. L. Webber. Tense as discourse anaphor. Computational Linguistics, 14(2):61-72, 1988.

[17] D. K. Elson and K. R. McKeown. Tense and aspect assignment in narrative discourse. In Proc. 6th International Natural Language Generation Conference, 2010.

[18] D. Fum, P. Giangrandi, and C. Tasso. Tense generation in an intelligent tutor for foreign language teaching: some issues in the design of the verb expert. In Proc. 4th conference on European chapter of the Association for Computational Linguistics (EACL), 1989.

[19] R. Reichart and A. Rappoport. Tense sense disambiguation: a new syntactic polysemy task. In Proc. EMNLP, Cambridge, MA, 2010.

[20] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 1993.

[21] M. Kameyama, R. Passonneau, and M. Poesio. Temporal centering. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL-93), pages 70-77, Columbus, OH, 1993.

[22] B. J. Grosz, S. Weinstein, and A. K. Joshi. Centering: a framework for modeling the local coherence of discourse. Computational Linguistics, 21(2), 1995.

[23] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.

[24] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. HLT-NAACL, 2003.

[25] I. Mani and G. Wilson. Robust temporal processing of news. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-00), pages 69-76, Hong Kong, China, 2000.

[26] A. K. McCallum. MALLET: a machine learning for language toolkit, 2002. http://mallet.cs.umass.edu.

[27] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, PA, 1999.

[28] C. F. Baker and H. Sato. The FrameNet data and software. In Proc. ACL, Sapporo, Japan, 2003.

[29] B. Levin. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press, Chicago, IL, 1993.