
Procedia - Social and Behavioral Sciences 95 (2013) 20-32

5th International Conference on Corpus Linguistics (CILC2013)

Corpus Methods for Descriptive Translation Studies

Federico Zanettin*

Dipartimento di Scienze Politiche, Università di Perugia, Via Pascoli 1, 06123 Perugia

Abstract

Over the last 20 years corpus resources and tools have considerably affected translation research and practice. This paper focuses on the intersection of corpus linguistics and descriptive translation studies. It first provides an overview of several types of descriptive investigation based on corpora, to focus then on research on so-called translation universals, offering a survey of some 20 studies and making a distinction among descriptive features, linguistic indicators and computational operators. I briefly discuss research on universal features, and describe the corpora surveyed in terms of a typology of translation-driven corpora. I then consider some methodological issues concerning corpus design and annotation and, finally, I suggest that a combination of quantitative and qualitative approaches is crucial to further research in corpus-based translation studies.

© 2013 The Authors. Published by Elsevier Ltd. Selection and peer-review under responsibility of CILC2013.

Keywords: Corpus linguistics; translation studies; translation universals; corpus methods; corpus design; corpus encoding

1. Introduction

Corpus linguistics has made a significant contribution both to translation practice and to translation theory. In translation practice, corpora have had a decisive impact on the work of translation professionals, learners and users. Most professional translators today rely to a large extent on computer-assisted methodologies to carry out their work, and translation memories, which are a specific type of dynamic parallel corpus, are a standard tool of the trade. Translators often compile their terminologies from corpora, and corpus management and analysis skills have become part of translational competence. Millions of people in their everyday life use automatic machine translation systems which rely largely on corpus-based statistical machine translation techniques. This article looks instead at how corpora and corpus linguistics techniques have been used in descriptive translation research, allowing for the investigation and empirical testing of a number of theoretical hypotheses.

* Corresponding author. Tel.: +39 0755855415; fax: +39 07563062222 E-mail address:

1877-0428 © 2013 The Authors. Published by Elsevier Ltd. Selection and peer-review under responsibility of CILC2013. doi:10.1016/j.sbspro.2013.10.618

Perhaps the first computer-assisted study of translated texts is Gellerstam (1986), who set out to investigate the features of 'translationese', that is "all forms of translation which can in some form be viewed as having been influenced by the original text, without the term implying any value judgment" (Gellerstam, 2005: 202). Corpus-based translation studies (CTS or CBTS) is now an established subfield of the descriptive branch of the discipline, and includes a number of different lines of inquiry.

The main research strand is perhaps that which investigates the hypothesis of translation universals, i.e. supposedly invariant features which characterize all translated texts independently of source language and translation direction (Baker, 1993). A second line of research focuses on individual variation rather than on universal properties. Its aim is to investigate translator style, i.e. coherent and motivated patterns of choice "recognizable across a range of translations by the same translator", which "distinguish that translator's work from that of other translators" and which "cannot be explained as directly reproducing the source text's style or as the inevitable result of linguistic constraints" (Saldanha, 2011: 240). A third area, whose level of generality stands in between these two, concerns translation norms and conventions. Like universals, norms are interpersonal and above the individual, but they refer to variant rather than invariant traits of translation since they refer to features which characterize translations produced in specific social and historical settings. The concept of translation norms is at the basis of empirical descriptive studies. However, though in her influential study Baker (1993) predicted that corpora would be used to investigate both "universal and norm-oriented features of translational behaviour" (Baker, 1993: 247), this latter area of research is seemingly under-developed.

A more recent, though connected, research strand looks at translation in relation to language change, that is, at how evolving translation styles and norms relate to evolving language norms, and how translation affects and is in its turn affected by language change (House, 2008; Kranich et al., 2011, 2012). Other areas of research include corpus-based interpreting studies, contrastive linguistics and research using translation learner corpora. In interpreting studies, some investigations have been conducted in the framework of universals, while others have focused on specific features of spoken language, such as hesitations and disfluencies, and on linguistic indicators of social and discursive identity such as modality and interaction markers (Setton, 2011; Straniero Sergio and Falbo, 2012). Contrastive linguistics provides a basis for assessing translation-specific and source language-specific constraints. Translation learner corpora, which typically contain multiple translations of the same source texts, are used to identify patterns in student translations for pedagogical or descriptive purposes.

2. Theory, description, indicators and operators

About 20 studies which have used corpora and corpus linguistics methods to investigate the hypothesis of translation universals were surveyed for this article. Some of the studies surveyed are among the best known and most quoted in the literature, while others were chosen more or less at random among those published in the last 10 years. The overview aims to illustrate the distinction among different tiers of abstraction and levels of linguistic analysis rather than to provide a map of corpus-assisted studies of universal features of translation. Thus, it should be possible to apply the distinction between theory and description, indicators and operators to other areas of investigation, such as research on translator style, translation norms, or on the contribution of translation to language change.

In order to provide an interpretative grid for the studies used to prove or disprove theories and hypotheses in corpus-based translation studies research, a distinction can be made among four tiers of abstraction. The first is the tier of theory, in this case the general hypothesis that, as a result of the process of translation, all translated or interpreted texts share certain properties which distinguish them from similar non-translated texts. The second tier concerns the descriptive features which support the theory. Four descriptive features were initially posited by Baker (1993), namely simplification, explicitation, normalization and levelling out, and subsequent research suggested others such as transfer, translation of unique items, asymmetry, shining-through, etc. The third tier is represented by the linguistic indicators which realize a certain feature at different levels of linguistic analysis. Finally, the fourth tier involves the computational implementation of these indicators, that is, the way abstract linguistic features are instantiated through formal computational operators.
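The four tiers can be pictured as a simple record linking a theory to a feature, an indicator and an operator. The following Python sketch is only an illustration: the class name `Operationalization`, the helper `operators_for` and the choice of example entries are hypothetical conveniences, although the entries themselves are drawn from the studies discussed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operationalization:
    """One path through the four tiers (illustrative naming, not from any study)."""
    theory: str      # tier 1: general hypothesis
    feature: str     # tier 2: descriptive feature
    level: str       # level of linguistic analysis
    indicator: str   # tier 3: linguistic indicator
    operator: str    # tier 4: computational operator


# A few example paths; the selection is arbitrary.
EXAMPLES = [
    Operationalization("translation universals", "simplification", "lexis",
                       "lexical variety", "standardized type/token ratio"),
    Operationalization("translation universals", "simplification", "syntax",
                       "readability", "average sentence length"),
    Operationalization("translation universals", "explicitation", "syntax",
                       "explicitation of optional syntactic choices",
                       "optional complementizer 'that' (English)"),
]


def operators_for(feature):
    """Return the operators recorded for a given descriptive feature."""
    return [e.operator for e in EXAMPLES if e.feature == feature]
```

Calling `operators_for("simplification")` returns the two operators recorded for that feature, which makes explicit that one descriptive feature may be reached through several indicator and operator paths.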

Table 1 maps the linguistic indicators of some descriptive categories as put forward in the studies surveyed. The first column lists features which have been posited as translation universals. The indicators which realize each descriptive feature are subdivided into four levels of linguistic analysis, that is, lexis, syntax, semantics and discourse. This subdivision is somewhat arbitrary, inasmuch as different labels could be used (e.g. "grammar" or "pragmatics") and the distinction between the different levels is not always clear-cut.

Table 1: Descriptive Features and Linguistic Indicators

Features, with linguistic indicators by level of analysis (lexis, syntax, semantics, discourse)

Simplification. Lexis: lexical variety; lexical density. Syntax: readability; speakability.

Explicitation. Lexis: explicit signals of clausal relations. Syntax: explicitation of optional syntactic choices. Discourse: explicitating shifts in lexical cohesion; conjunctive explicitness; explicative reformulation.

Normalization. Lexis: lexical creativity; collocational creativity; formality degree. Syntax: distribution of typical and atypical register features. Semantics: range of terms used to represent a conceptual domain.

Transfer. Lexis: distribution of most frequent words. Syntax: distribution of typical and atypical register features.

Translation of unique items. Lexis: distribution of TL specific lexis. Syntax: distribution of TL specific structures. Semantics: distribution of TL lexicogrammatical realizations of a concept.

Asymmetry. Distribution of TL lexicogrammatical realizations of a concept; implicitating vs explicitating shifts.

In studies looking at "simplification", the hypothesis to be proved is that the language contained in a corpus of translations is simpler than that contained in a corpus of comparable texts in the same target language. Laviosa (1997, 1998) first proposed as indicators of lexical simplification the two indicators listed under lexis in the first row of Table 1, that is, lexical variety (range of vocabulary) and lexical density (information load). Her study of lexical simplification was replicated in a variety of languages, for instance by Xiao et al. (2010) for Chinese and by Corpas-Pastor (2008) for Spanish. Indicators of syntactical simplification include readability (also proposed by Laviosa, 1997, 1998) and speakability, that is, "the ease of reading aloud" (Puurtinen, 2003: 395).

Explicitation is the idea that translators consciously or unconsciously tend to make their translations more explicit than the source texts. This feature has been investigated mostly at the levels of syntax and discourse, though Puurtinen (2004) also looks for linguistic indicators of explicitation at the level of lexis. At the level of syntax, indicators include the distribution in translated and non-translated texts of devices explicitating optional choices (Olohan and Baker, 2000; Kenny, 2004; Jiménez-Crespo, 2011). At the level of discourse, proposed indicators include explicitating shifts in lexical cohesion in translated texts as compared to their sources (Øverås, 1998), conjunctive explicitness (Pápai, 2004) and explicative reformulation (Xiao, 2011).

Normalization, also sometimes referred to as "conventionalization", "standardization", "conservatism" and "sanitization", is the (alleged) tendency of translated texts to conform to target language rather than source language patterns and norms, producing conventional rather than unusual target strings. Sanitization, for instance, has been defined as the conservative rendering of creative source language features. Indicators of lexical normalization include degree of lexical and collocational creativity (Kenny, 2001; Olohan, 2004; Puurtinen, 2003), and degree of formality (De Sutter et al., 2012). Indicators of syntactical normalization include the distribution of typical and atypical register features (Hansen-Schirra, 2011), while indicators of semantic normalization include the range of terms used to represent the conceptual domain of colours (Olohan, 2004).

According to the fourth translation universal originally proposed by Mona Baker in her seminal 1993 paper, "levelling out" (also called "convergence" in Laviosa, 2002: 72), translations should be less idiosyncratic and more similar to each other than original texts are. None of the studies surveyed here, however, investigates linguistic indicators of levelling out or the way to implement them through computational operators. Some of the studies surveyed focused instead on the idea that all translations bear traces of the source language, a feature called "transfer", or "SL interference", or "shining through". Mauranen (2004), for instance, looked at the distribution of most frequent words as an indicator of lexical interference, while Hansen-Schirra (2011) and Teich (2003) looked at the distribution of typical and atypical register features as indicators of syntactic interference.

Unique items are features which tend to be "untranslatable" (unique to the target language) and which therefore should proportionally be under-represented in translated texts. Linguistic indicators include the distribution of TL specific lexical items (Tirkkonen-Condit, 2002), syntactic structures (Eskola, 2004) and lexicogrammatical realizations of the concept of manner-of-motion in verbs (Cappelle, 2012). Finally, the asymmetry hypothesis states that explicitations in one translation direction are more frequent than their corresponding implicitations in the opposite translation direction (Klaudy and Károly, 2005; Becher, 2010).

For each descriptive feature, the linguistic indicators at the level of lexis, syntax, semantics and discourse are implemented computationally by a number of operators, that is through computer-assisted procedures of analysis and interpretation. In Table 2, the linguistic indicators of Table 1 have been replaced by the operators through which they were implemented.

Table 2: Descriptive Features and Operators

Features, with formal operators by level of analysis (lexis, syntax, semantics, discourse)

Simplification. Lexis: standardized type/token ratio; ratio of function to content words; ratio of listhead to full list frequency; range/variety of synonymous amplifiers. Syntax: average sentence length; nonfinite structures (e.g. constructions, pre-modified nominalizations).

Explicitation. Lexis: clause connectives (conjunctions, adverbs, pronouns). Syntax: optional complementizer that (English); optional subject pronouns (Spanish). Discourse: addition, specification; discourse particles; word clusters, reformulation.

Normalization. Lexis: hapax legomena; creative collocations; derivational suffix -ish; colloquial words; set of near synonyms. Syntax: lexicogrammatical register features (e.g. distribution of verb tense and voice, phrase and clause structure). Semantics: terms of colour.

Transfer. Lexis: word frequency bands. Syntax: lexicogrammatical register features.

Translation of unique items. Lexis: Finnish clitics; Finnish "sufficiency" verbs. Syntax: nonfinite structures. Semantics: motion verbs (manner).

Asymmetry. Reporting verbs; omissions, substitutions, additions.

The indicator of lexical simplification "lexical variety" is instantiated through the computation of the standardized type/token ratio; "lexical density" is instantiated by computing the ratio of function to content words, the ratio of high to low frequency words, the ratio of the listhead to the full list frequency (Laviosa, 1997, 1998), as well as by looking at the range and variety of synonymous amplifiers in translated and non-translated comparable corpora (Jantunen, 2002). Readability is assessed by computing average sentence length (Laviosa, 1997, 1998; Corpas-Pastor, 2008), and by looking at the frequency of some nonfinite constructions and pre-modified nominalizations in translated as opposed to non-translated children's fiction (Puurtinen, 2003).
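As a rough illustration of how such operators might be computed, the following Python sketch implements a standardized type/token ratio, a function-to-content word ratio, a listhead ratio and an average sentence length. It is a minimal sketch under stated assumptions, not the procedure of any study cited: the tokenizer is deliberately naive, `FUNCTION_WORDS` is a tiny hypothetical stop list standing in for a proper function-word lexicon, and the chunk and listhead sizes are arbitrary defaults.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); a real study would use a full lexicon.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or", "but",
                  "is", "are", "was", "were", "it", "that", "this", "he", "she"}


def tokenize(text):
    """Lowercase word tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def standardized_ttr(tokens, chunk_size=1000):
    """Mean type/token ratio over consecutive full chunks of chunk_size tokens."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks)


def function_to_content_ratio(tokens):
    """Function words divided by content words (assumes at least one content word)."""
    function = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return function / (len(tokens) - function)


def listhead_ratio(tokens, head_size=100):
    """Share of all tokens accounted for by the head_size most frequent types."""
    counts = Counter(tokens)
    return sum(freq for _, freq in counts.most_common(head_size)) / len(tokens)


def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)
```

In a comparable-corpus design each function would be run on the translated and on the non-translated subcorpus, and the two values compared; the standardization by chunks is what makes type/token ratios comparable across subcorpora of different sizes.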

The operators used to assess lexical explicitation (Puurtinen, 2004) are clause connectives which signal the type of relation between clauses, such as conjunctions, adverbs and relative pronouns. At the level of syntax, operators include the optional complementizer 'that' after reporting verbs in English (Olohan and Baker, 2000; Kenny, 2004) and optional subject pronouns in Spanish (Jiménez-Crespo, 2011). In order to assess explicitating shifts in lexical cohesion, Øverås (1998) compared translated texts with their sources, looking for and classifying addition and specification shifts. Pápai (2004) looked at discourse particles as indicators of conjunctive explicitness, while Xiao (2011) used word clusters and reformulation markers as indicators of explicative reformulation.
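The optional complementizer operator lends itself to a simple count. The Python sketch below computes the proportion of reporting-verb tokens that are immediately followed by 'that'. It is a crude approximation offered under explicit assumptions: the verb list `REPORTING_FORMS` is a hypothetical sample, no parsing is done to check that a complement clause actually follows, and intervening material (as in "told me that") is not handled. It should not be read as the procedure used in the studies cited.

```python
import re

# Hypothetical sample of reporting-verb forms; a real study would use a fuller list.
REPORTING_FORMS = ("say", "says", "said", "tell", "tells", "told")


def that_rate(text, forms=REPORTING_FORMS):
    """Proportion of reporting-verb tokens immediately followed by 'that'.

    Returns None when the text contains none of the verb forms.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    verb_hits = 0
    that_hits = 0
    for i, tok in enumerate(tokens):
        if tok in forms:
            verb_hits += 1
            if i + 1 < len(tokens) and tokens[i + 1] == "that":
                that_hits += 1
    if verb_hits == 0:
        return None
    return that_hits / verb_hits
```

A higher rate in a translated subcorpus than in a comparable non-translated subcorpus would be read, within this framework, as a sign of syntactic explicitation.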

The corpus-based studies of normalization surveyed rely on bundles of operators such as sets of rarely occurring words, pairs of near synonyms expressing different degrees of formality, or lexicogrammatical features whose co-occurrence characterizes register dimensions. For instance, it is argued that a distinctive, more restricted distribution of lexical items or collocations in translated texts would point to a preference by translators towards more conservative forms of language. Kenny (2001) investigated lexical creativity as indicated by hapax legomena and creative collocations. Dayrell (2007) looked at the distribution of collocations in translated and non-translated texts as expressed by MI (mutual information) scores for collocates of frequent nouns. Both studies found that translated texts contain a narrower range of collocations. Olohan (2004) found that the English morpheme -ish is less productive in translated texts, while Puurtinen (2003) found that colloquial words are more frequent in translated texts. De Sutter et al. (2012) investigated the degree of formality in different genres in both translated and non-translated texts by looking at the distribution of pairs of near synonyms. Hansen-Schirra (2011) looked at the distribution of typical and atypical dimensional features of fiction such as past or present tense, agentless passives, nominalization, etc. Finally, Olohan (2004) focused on the different range of terms used to represent the conceptual domain of colours, finding that the variety of terms used in translations is more standard than that of non-translations.
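Two of the operators mentioned above, hapax legomena and MI scores, can be sketched in a few lines of Python. The MI function below uses the simple variant log2(O × N / (f(node) × f(collocate))), where O is the number of co-occurrences within a window, N the corpus size and f the word frequencies; it applies no correction for window size. The window size, the function names and this particular variant are assumptions made for illustration, since the settings used in the studies cited are not given here.

```python
import math
from collections import Counter


def hapax_legomena(tokens):
    """Word forms that occur exactly once, in alphabetical order."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c == 1)


def mutual_information(tokens, node, collocate, span=1):
    """Simple MI score: log2(O * N / (f(node) * f(collocate))).

    O counts occurrences of the collocate within +/- span tokens of the node.
    Returns -inf when the two words never co-occur in the window.
    """
    n = len(tokens)
    counts = Counter(tokens)
    observed = 0
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed += window.count(collocate)
    if observed == 0:
        return float("-inf")
    return math.log2(observed * n / (counts[node] * counts[collocate]))
```

Comparing the number of hapax legomena, or the range of collocates above a given MI threshold, across a translated and a non-translated subcorpus is one way of giving the notion of a "narrower range" an empirical reading.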

According to the transfer hypothesis, the linguistic make-up of translations is affected by that of their source language, which "shines through". Thus, Mauranen found that the distribution of very frequent words in translated texts is different from their distribution in non-translated texts in the same language, their profiles of deviation being correlated to specific source languages. Hansen-Schirra (2011) showed how some lexicogrammatical register features in translated texts have frequencies closer to those of the SL than to those of a corpus of non-translated TL texts. The hypothesis regarding the translation of unique items has been tested at the level of lexis, syntax and semantics. Tirkkonen-Condit (2002) found that language-specific lexical items such as Finnish clitics and "sufficiency" verbs occur less frequently in translated than in original Finnish texts. Eskola (2004) tested this hypothesis for Finnish-specific syntactic structures, while Cappelle (2012) looked at lexicogrammatical realizations of the concept of manner-of-motion in verbs in texts translated into English from French and German, respectively.
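A comparison of very frequent words across subcorpora can be approximated as follows. The Python sketch takes the most frequent words of a non-translated reference subcorpus and reports how far their relative frequency (per 1,000 words) deviates in a translated subcorpus. The function names and the per-thousand normalization are assumptions made for the sake of the example; the sketch is a generic illustration and not a reconstruction of the method used in the studies cited.

```python
from collections import Counter


def per_thousand(tokens):
    """Relative frequency of each word form per 1,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000 * c / total for w, c in counts.items()}


def frequency_deviation(translated, non_translated, top_n=10):
    """Deviation of the reference subcorpus's top_n words in the translated subcorpus.

    Positive values indicate over-use in translations, negative values under-use.
    """
    ref = per_thousand(non_translated)
    tr = per_thousand(translated)
    # Rank by descending reference frequency; ties broken alphabetically.
    top = sorted(ref, key=lambda w: (-ref[w], w))[:top_n]
    return {w: tr.get(w, 0.0) - ref[w] for w in top}
```

Running the function separately on subcorpora translated from different source languages would yield one deviation profile per source language, which is the kind of evidence the transfer hypothesis calls for.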

Finally, the last two studies in the sample investigate the asymmetry hypothesis, concerning the over- and under-representation of explicitation and implicitation shifts in translation. Klaudy and Károly (2005) found that Hungarian translations resort to a wider range of reporting verbs than the English originals, while English translations do not use a narrower range of reporting verbs than the corresponding Hungarian source texts. Becher (2010) found that different types of explicitating shifts are more frequent than the corresponding implicitating shifts in the opposite translation direction for the language pair English-German.

Research on translation universals is the area in which most studies have been conducted using corpus linguistics methods, but it is also quite controversial. First of all, not all scholars are convinced that the concept of translation universals is theoretically justified. According to some, features such as simplification, explicitation etc. are more likely to be features of language mediation, common to all situations of linguistic or cultural contact (House, 2008; Lanstyák and Heltai, 2012). Others contend that it is not clear whether regularities in translated texts should be regarded as cognitively constrained universals or socially constrained norms (Malmkjaer, 2008). Second, the explicative categories used to operationalize hypothesized universals are not always homogeneous, and the mapping of formal operators onto linguistic indicators and of these onto descriptive features is not always clear. Descriptive categories, linguistic indicators, and formal operators thus may overlap to some extent, and some of the studies cited may be, and in fact have been, classified in a different way. While there is little dispute over normalization (and its near-synonymous features conservatism, standardization, sanitization, etc.) and transfer as universal features, other hypotheses have been explicitly challenged. Becher (2010), for instance, contends that the choice of using a more explicit syntactic structure, for instance that of not omitting the complementizer 'that' in English, could derive from SL interference, insofar as this option is triggered by an underlying non-optional syntactic requirement in the SL, or be an instance of normalization, insofar as the choice between competing syntactic structures is correlated to level of formality (a higher level of formality being the default normalizing option for translated texts).

Similarly, Øverås's (1998) study, dealing with implicitating as well as explicitating shifts, could be interpreted as substantiating the asymmetry hypothesis rather than the explicitation hypothesis. Explicitation in itself can be taken to be an aspect of simplification, insofar as by making something explicit translators resolve potential ambiguities and therefore produce simpler texts. The hypothesis concerning the different distribution of unique items, namely that lexicogrammatical and syntactic items which are specific to the target language (that is, which do not have a straightforward equivalent in the source language or languages) are under-represented in translated texts, can be seen as an indirect form of SL interference, which affects the TL by subtraction, that is, not in terms of direct transfer or 'shining-through', but by causing under-use of TL features. Furthermore, the results of some of these studies can be interpreted as mixed or contradictory: for instance, Jantunen's (2002) results do not confirm the lexical simplification hypothesis, while Puurtinen's (2003) results do not fully support the hypothesis of syntactic simplification.

This very partial survey has focused on corpus-based studies framed within the context of research on translation universals. However, the scope of many of these studies could be restricted to that of studies of socially and historically constrained translation norms and conventions, which, as opposed to universal features, are not independent of language, text type, and translation direction. Thus, many of these (partial) studies of translation universals could perhaps be thought of and interpreted as studies of translation norms, for instance those pertaining to contemporary English fiction, or to contemporary Finnish fiction translated from Russian, or to contemporary Finnish translated children's fiction, or to contemporary German and English business writing, and so on.

3. Corpus design

A crucial issue in descriptive translation studies, as in all corpus-based research, is that of corpus design. As should be clear from the examples presented, corpus-based translation research is always based on a comparison between corpora of different types so that, in translation studies, a corpus is actually always a combination of at least two subcorpora, whose features are compared and contrasted.

"Translation-driven" corpora, that is corpora created for the purpose of studying translation (Zanettin, 2012), can be monolingual or bilingual as well as comparable or parallel. Table 3 provides a categorization of the corpora used in the studies overviewed. For each study it gives the acronym for the corpus (parallel corpora are underlined), the size in millions of words for each subcorpus, and an indication of the text type or types considered in each study (though some studies used only part of a corpus).

Table 3: Corpus Type and Size

Features Corpus type and size

Lexis Syntax Semantics Discourse

Simplification: ECC, 1 + 0.7, fiction; ZCTC+LCMC, 1 + 1, 'general'; CTF, 1 + 1, academic prose, fiction; CEET+CEE, 3.6 + 4.6, medical + technical; ECC, 1 + 0.7, fiction; ZCTC+LCMC, 1 + 1, 'general'; CTF, 0.5 + 0.5, children fiction; CEET+CEE, 3.6 + 4.6, medical + technical

Explicitation: CTF, 0.5 + 0.5, children fiction; ECC, 3.5 + 3.5, fiction; GEPCOLT, 1 + 1, fiction; SWCC, 8.7 + 12.6, corporate websites; ENPC, 1000 sts x 4, fiction; ARRABONA, 0.015 + 0.015 + 0.015, fiction, technical writing; ZCTC+LCMC, 1 + 1, 'general'

Normalization: GEPCOLT, 1 + 1, fiction; BPCC, 0.5 + 0.5, fiction; ECC, 5 + 5, fiction; CTF, 0.5 + 0.5, children fiction; DPC, 1.8 + 0.8 + 1, 5 text types, 2 SL; ECC, 10 + 10, fiction; ECC, 5 + 5, fiction

Transfer: CTF, 1 + 1 + 1, fiction, 2 SL; CroCo, 0.25 + 0.25 + 0.25 + 0.25, 10 registers; Teich 2003, 0.01 + 0.01 + 0.01 + 0.01, scientific writing

Translation of unique items: CTF, 2 + 2, academic prose, fiction; CTF, 0.6 + 0.6 + 0.6, fiction; TEC+BNC, 0.54 + 0.4 + 19, fiction

Asymmetry: Klaudy & Károly 2005, 3 parallel texts, fiction; Becher 2010, 0.02 + 0.02 + 0.02 + 0.02, business writing

Much research on translation universals has been carried out using monolingual comparable corpora, comprising a subcorpus of translated texts and a subcorpus of the same size of non-translated texts in the target language selected according to similar design criteria. Several of the case studies mentioned were carried out using the English Comparable Corpus (ECC) and the Corpus of Translated Finnish (CTF). The former comprises the Translational English Corpus (TEC, Laviosa, 1997, 1998) and a comparable corpus of non-translated English texts selected from the British National Corpus; the latter comprises two subcorpora of Finnish written prose containing translated and non-translated texts, respectively. Other case studies which are based on monolingual comparable corpora are Xiao (2011), using a combination of the ZJU Corpus of Translational Chinese (ZCTC) and of the comparable Lancaster Corpus of Mandarin Chinese (LCMC), Corpas-Pastor (2008), using two subcorpora of translated and non-translated specialized Spanish (Corpus especializado de español traducido, CEET and Corpus especializado de español, CEE), and Dayrell (2007), using the Brazilian Portuguese Comparable Corpus (BPCC), comprising two subcorpora of translated and non-translated fiction. De Sutter et al. (2012) use subcorpora of translated and non-translated Dutch texts from the Dutch Parallel Corpus (DPC). The study by Jiménez-Crespo (2011) used the 20 million word Spanish Web Comparable Corpus (SWCC) containing two subcorpora of translated and non-translated Spanish texts downloaded from Spanish and US corporate websites.

These corpora also differ according to the number of languages and directions of translation considered. For instance, the TEC contains texts translated into English from a variety of languages in order to minimize the influence of the source language, while other corpora, both comparable and parallel, contain texts translated only from one or two languages, in order to better account for source language induced variation. Mauranen (2000), for instance, considers texts translated into Finnish from English and Russian, Cappelle (2012) texts translated into English from French and German, and De Sutter et al. (2012) texts translated into Dutch from English and French. Research based primarily on parallel corpora, comprising one subcorpus of translations and one containing the source texts of these translations, usually involves only one language pair. Pápai's (2004) ARRABONA corpus contains parallel source texts and translations from English into Hungarian, together with a comparable subcorpus of non-translated Hungarian texts. The corpus used by Kenny (2000, 2004), GEPCOLT, contains translations from German into English, while the English-German CroCo corpus used by Hansen-Schirra (2011) and the English-Norwegian Parallel Corpus (ENPC) used by Øverås (1998) contain translations and source texts in both directions of translation, and thus four subcorpora. Similarly, Klaudy and Károly (2005) and Becher (2010) consider both translation directions between English and, respectively, Hungarian and German.

These corpora are certainly small if compared to current monolingual corpora, and contain texts belonging to only one or a few text types or genres, with fiction getting the lion's share. The largest corpora are the English Comparable Corpus and the Spanish Web Comparable Corpus, which currently contain approximately 20 million words each, considering both subcorpora. Other studies used the English Comparable Corpus at previous stages of development (at 10, 7, 2 million words altogether, respectively), while the subcorpora used by Corpas-Pastor amount to 8 million words. Smaller corpora include the three million word monolingual comparable corpus extracted from the Dutch Parallel Corpus by De Sutter et al. (2012), the two million word comparable corpus of Chinese used by Xiao (2011) and Xiao et al. (2010), the one million word Brazilian Portuguese Comparable Corpus used by Dayrell (2007), and the one million word CroCo corpus used by Hansen-Schirra (2011). Studies based on parallel corpora often use rather tiny collections of texts, consisting sometimes of only a few text pairs (Klaudy and Károly, 2005) or a few thousand words (Becher, 2010).

Some of the studies surveyed compare data from three or four different subcorpora, in some cases distinguishing between source languages and text types or genres. This is because subcorpora containing texts translated from different languages may allow for controlling interference from specific source language systems, while subcorpora containing texts belonging to different text types may make it possible to distinguish between translation-induced and genre-related variation. Other studies involve a combination of comparable and parallel corpora, which may make it possible to distinguish between variation related to the translation process and that triggered by the source language.

A bidirectional parallel, or reciprocal, corpus allows, in principle, for all types of comparisons, as it seemingly contains a combination of two types of corpora, bilingual parallel and monolingual comparable, in two directions of translation. Figure 1 is a graphical representation of this ideal configuration. A subcorpus of translations (CT1) in language A is linked to a parallel source subcorpus of non-translations (CNT1) in language B as well as to a comparable subcorpus of non-translations in language A (CNT2). At the same time, the subcorpus of non-translations in language B (CNT1) allows a comparison with a subcorpus of translated texts in the same language (CT2), the sources of which are the texts contained in the non-translational subcorpus in language A (CNT2). Additionally, the non-translational subcorpora in the two languages (CNT1 and CNT2) seem to allow for cross-language comparison.

Figure 1. A Reciprocal Corpus
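The ideal configuration in Figure 1 can also be written down as a small data structure. The following Python fragment is only an illustrative sketch: the class `Subcorpus`, the list `PARALLEL` and the function `comparison_type` are hypothetical names introduced here, not part of any of the corpora surveyed, and the languages "A" and "B" are the placeholders used in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subcorpus:
    label: str        # label used in the text, e.g. "CT1"
    language: str     # placeholder language, "A" or "B"
    translated: bool  # True for translations, False for non-translations


# The four subcorpora of the ideal reciprocal corpus.
CT1 = Subcorpus("CT1", "A", True)
CNT1 = Subcorpus("CNT1", "B", False)
CT2 = Subcorpus("CT2", "B", True)
CNT2 = Subcorpus("CNT2", "A", False)

# Parallel links, as (source texts, their translations).
PARALLEL = [(CNT1, CT1), (CNT2, CT2)]


def comparison_type(x: Subcorpus, y: Subcorpus) -> str:
    """Name the comparison that a pair of subcorpora supports in the ideal design."""
    if (x, y) in PARALLEL or (y, x) in PARALLEL:
        return "bilingual parallel"
    if x.language == y.language and x.translated != y.translated:
        return "monolingual comparable"
    if x.language != y.language and not x.translated and not y.translated:
        return "cross-language"
    return "not foreseen in Figure 1"


for pair in [(CT1, CNT1), (CT1, CNT2), (CT2, CNT1), (CNT1, CNT2), (CT1, CT2)]:
    print(pair[0].label, pair[1].label, "->", comparison_type(*pair))
```

The sketch encodes only the idealized design; it says nothing about whether the subcorpora involved are in fact comparable.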

However, this ideal corpus composition assumes that languages and translation practices are symmetrical, and is achieved by superimposing an abstraction on what may be the actual reality. Not all languages contain the same genres and text types (in fact, it is through translation that some of them are introduced into cultures), and within each genre or text type different languages may be characterized by a different distribution of internal categories. Furthermore, not everything is translated into and out of a language pair, and not in the same proportion. The more a subcorpus of translations is designed to be maximally representative of what is translated into a language, genre or text type, the less it may be comparable to the source texts of a subcorpus of translations designed to be maximally representative of what is translated into the other language. As suggested by Leech (2007: 142), representativeness and comparability are conflicting goals: "an attempt to achieve greater comparability may actually impede representativity and vice versa" since "as one nears to perfection in comparability, one meets with distortion in terms of representativeness, and vice versa".

Figure 2 may thus in some cases be a more accurate representation of the situation which might arise when trying to combine the demands of comparability and those of representativeness using two parallel corpora in reciprocal directions of translation. First of all, the different sizes of, and the overlap between, language A and language B in Figure 2 highlight that languages are not self-contained entities and may have different extensions. In Figure 2, as in Figure 1, a subcorpus of translations (CT1) in language A, supposedly representative of what is translated in that language combination and direction of translation, is paired with a parallel subcorpus of non-translations (CNT1) in language B (the source texts). The same subcorpus is also paired with a comparable subcorpus of non-translations (CNT2) in language A, supposedly selected according to design criteria similar to those for the texts in CT1.

However, in this case, the translations of the texts in CNT2, i.e. the parallel subcorpus (CT2) in language B, may not be comparable to the subcorpus of non-translations in language B (CNT1), since the only design criterion for the latter is that its texts are the sources of the subcorpus of translations (CT1) in language A. A different subcorpus of non-translated texts (CNT3) may thus be needed for purposes of intra-linguistic comparison in language B. For much the same reasons, the subcorpus of source texts in language B (CNT1) may also not be comparable to the subcorpus of non-translations (CNT2) in language A, and a still different subcorpus of non-translated texts (CNT4) may be needed for purposes of cross-language comparison. Finally, both Figure 1 and Figure 2 assume that cross-linguistic comparison is carried out between two subcorpora of non-translated texts, i.e. translated texts are not included in the comparison in either language. However, this criterion for exclusion may run against representativeness, inasmuch as translation may represent a very large section of what is written and published in some languages (Zanettin, 2011).

Figure 2. Directionality, Representativeness and Comparability in Parallel Corpora

Since in a "reciprocal corpus" translated and non-translated texts in the same language and across languages may not be comparable, when using both comparable and parallel data it may be advisable to consider each translation direction as bringing with it its own requirements in terms of corpus design.

4. Corpus annotation

Another important issue in corpus compilation is annotation and, in the case of parallel corpora, corpus alignment. As regards annotation, a distinction can be drawn among four different types of information encoded in text corpora, namely metadata, structural annotation, linguistic annotation and interpretative annotation.

Metatextual data are bibliographical and other documentary information necessary to classify and select texts according to features relevant to translation research. Corpora to be used in descriptive translation studies require their own specific metadata, which may be different from those used for other types of corpus research or applications. Thus, while it may seem intuitive to signal whether a text is or is not a translation in a parallel corpus, this descriptive feature is overlooked in one of the largest multilingual parallel corpora available, Europarl, which was created for use in machine translation. In the annotated and aligned version of Europarl freely available on the Internet (Koehn, 2005) it is in fact not specified whether a text was originally written in one language or, if not, from which language it was translated. Translation studies-specific metadata also include bibliographical information about translators and not just original authors, about the dates of publication of translations together with those of source texts, etc. The classification of texts according to a typology is also crucial in order to be able to compare textual varieties within and across subcorpora of translations and non-translations.
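By way of illustration, the kind of translation-specific metadata discussed above could be recorded along the following lines. This is a minimal sketch in Python: the class `TextMetadata`, its field names and the helper `select` are assumptions made for this example and do not reproduce the header format of any existing corpus; the sample records are invented placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextMetadata:
    title: str
    language: str
    text_type: str
    author: str
    year_published: int
    is_translation: bool = False
    # Fields that only make sense for translations (hypothetical names).
    source_language: Optional[str] = None
    translator: Optional[str] = None
    source_year_published: Optional[int] = None

    def missing_translation_fields(self) -> List[str]:
        """List translation-specific fields left empty in a translation record."""
        if not self.is_translation:
            return []
        required = ("source_language", "translator", "source_year_published")
        return [name for name in required if getattr(self, name) is None]


def select(records, **criteria):
    """Return the records whose metadata match all the given field values."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


# Invented placeholder records, for illustration only.
records = [
    TextMetadata("Text 1", "English", "fiction", "Author 1", 1995,
                 is_translation=True, source_language="French",
                 translator="Translator 1", source_year_published=1990),
    TextMetadata("Text 2", "English", "fiction", "Author 2", 1996),
    TextMetadata("Text 3", "English", "news", "Author 3", 2001,
                 is_translation=True, source_language="German"),
]

fiction_from_french = select(records, is_translation=True,
                             source_language="French", text_type="fiction")
print([r.title for r in fiction_from_french])
print(records[2].missing_translation_fields())
```

A record of this kind makes it possible both to select subcorpora by source language or text type and to check that translations are not missing the information needed for such a selection.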

Structural annotation refers to segmentation and tokenization, which is of course a preliminary step before further annotation and alignment. By linguistic annotation I refer to automated or semi-automated POS tagging, lemmatization, parsing, semantic annotation such as the grouping of words into categories based on meaning, co-reference and named entity reference annotation. Compared with the time when the first translation corpus projects were envisaged, tools and resources for the linguistic annotation of corpora are now available for many languages, making it possible to add layers of linguistic annotation even to previously created raw-text corpora. Linguistic annotation may not be necessary or even useful for many types of investigations, but it can be extremely practical and sometimes necessary when looking for regularities beyond the lexical level. Finally, by interpretative annotation I refer to all other layers of annotation based on non-linguistic categories which can be superimposed on a text and which require close human supervision and manual coding. These include the classification and annotation of translation shifts, additions, omissions etc. in parallel corpora, as well as metaphorical annotation, error-tagging, etc.

Most of the investigations surveyed are based on corpora of running text rather than on annotated corpora. This is the case of almost all studies carried out at the level of lexis, except for Corpas-Pastor (2008), which is based on lemmatized corpora, on the grounds that lemmas are better than word forms as indicators of lexical density for a highly inflected language like Spanish. Studies concerning syntactic structures are based either on corpora of running text or on linguistically annotated corpora. In the former case, the constructions analysed are retrieved manually by looking at concordance lines for lexical strings likely to appear in the context of a given structure. For instance, in Olohan and Baker's (2000) study the forms of the verbs say and tell are used to retrieve 'that' clauses. By contrast, Hansen-Schirra's (2011) study of normalization, which is based on grammatical categories, could not have been carried out without linguistic annotation.
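The retrieval of candidate constructions from running text through lexical strings can be sketched as follows. The Python fragment below is a simplified illustration and not the procedure actually followed in the studies cited: the list of verb forms, the function names `concordance` and `followed_by_that`, and the sample sentences are assumptions made for this example. The check for an immediately following 'that' is deliberately crude, and the remaining lines would still have to be inspected manually.

```python
import re

# Word forms of SAY and TELL used as search strings (illustrative list).
REPORTING_FORMS = r"\b(say|says|said|saying|tell|tells|told|telling)\b"
PATTERN = re.compile(REPORTING_FORMS, re.IGNORECASE)


def concordance(text, width=30):
    """Return (left context, node, right context) for each reporting verb form."""
    lines = []
    for match in PATTERN.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


def followed_by_that(right_context):
    """Crude check: is the node immediately followed by 'that'?"""
    return re.match(r"\s+that\b", right_context, re.IGNORECASE) is not None


sample = ("She said that he was late. He said he was tired. "
          "They told us that it rained.")

for left, node, right in concordance(sample):
    flag = "that" if followed_by_that(right) else "check manually"
    print(f"{left:>30} [{node}] {right:<30} {flag}")
```

In the sample, the third line ("told us that") is not caught by the automatic check because an object intervenes between the verb and 'that', which is precisely the kind of case that requires manual examination of concordance lines.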

Studies looking at the lexical realization of semantic domains such as that of colours (Olohan, 2004) or manner of motion (Cappelle, 2012) were performed by searching for lexical sets derived from external sources such as thesauruses. Arguably, this type of study could be greatly facilitated by linguistic annotation.

Finally, studies looking at translation shifts in parallel corpora were based on manual examination of aligned segment pairs, which were coded according to a predefined classification. In this case the operators are procedural rather than linguistic, though they depend on the examination of linguistic features. I would like to stress in this respect that while electronic corpora greatly facilitate translation research, such research still remains largely grounded in extensive manual analysis.

This brings me to my next point, namely parallel corpus alignment. The creation of robust and reliable parallel corpora for descriptive translation studies is demanding and laborious work. The high quality needed for descriptive translation research can only be obtained through manual alignment editing, as opposed to corpus-based machine (assisted) translation, which relies on automation and data quantity and for whose purposes automatic alignment techniques provide viable results. This is especially true for corpus-based studies of genres such as fiction and news writing, whose language is often rather "noisy", that is, resistant to automatic alignment.

5. Conclusions

I would like to summarize the issues at hand under the headings of quantity and quality. As regards quantity, I think translation studies need more and larger corpora, in more languages, in more directions of translation and covering a wider range of internal varieties. Also needed are more corpora containing texts collected at different times, which can be used to gain insights into how evolving translation styles and norms relate to evolving language norms, and into how translation affects and is in its turn affected by language change. Their study therefore adds a temporal dimension to considerations of corpus design. The isolation of translation universals and norms may be demanding in terms of corpus resources, since several translation and reference subcorpora are needed in order to disentangle source language, genre-related and diachronic variables. The complexity of corpus design required by rigorous corpus-based translation studies also calls for the implementation of elaborate methods of statistical testing. Most descriptive translation studies submit their results to tests of statistical significance, and an understanding of basic statistical techniques and principles can generally be taken for granted. However, in order to assess the validity of theoretical claims based on quantitative corpus data involving a large number of variables, advanced techniques involving multidimensional analysis, clustering, scaling, and regression are sometimes called for. Thus, it seems likely that translation studies researchers will increasingly have to become familiar with tools and methods for the analysis and visualization of multiple data sets, which are often not included in standard corpus analysis software.
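As an example of a basic significance test of the kind referred to above, the following Python sketch computes the log-likelihood statistic (G2) often used in corpus linguistics to compare the frequency of an item in two corpora, here a translational and a non-translational subcorpus. The choice of this particular test, the function name `log_likelihood` and the frequencies in the usage lines are assumptions made for illustration; the studies surveyed may have relied on different tests.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """G2 statistic for an item occurring freq1 times in a corpus of size1
    tokens and freq2 times in a corpus of size2 tokens."""
    total = size1 + size2
    # Expected frequencies if the item were equally frequent in both corpora.
    expected1 = size1 * (freq1 + freq2) / total
    expected2 = size2 * (freq1 + freq2) / total
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented frequencies: 150 vs 100 occurrences in two one-million-word subcorpora.
g2 = log_likelihood(150, 1_000_000, 100, 1_000_000)
# With one degree of freedom, values above 3.84 are significant at p < 0.05.
print(round(g2, 2), "significant" if g2 > 3.84 else "not significant")
```

A test of this kind compares one item across two subcorpora only; it does not by itself separate the effects of source language, genre and time, which is why the multivariate techniques mentioned above become relevant.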

As regards quality, qualitative analysis concerns first of all the interpretation of quantitative data in light of contextual variables associated with different groups of texts, which may be recovered from descriptive metadata, and the manual sifting of automatically generated results, for instance word lists, concordance lines or collocation tables. However, together with sophisticated ways of handling quantitative data, corpus-based translation research profits from conducting in-depth analyses on small, scrupulously collected samples of translated language. The precision and recall of the operators which instantiate a specific linguistic indicator of an assumed descriptive feature are increased by structural, linguistic, and manually or semi-manually coded interpretative annotation. While some phenomena are immediately observable on the surface of language as automatically retrieved from raw texts, others rely on the previous encoding of finer linguistic or operational categories. The analysis of occurrences of non-linguistic, interpretative categories such as translation shifts or translation errors presupposes accurate and consistent manual annotation implemented by a human encoder. Finally, manual editing of automatic alignment is a preliminary stage for the analysis of parallel corpora and concordances.
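Precision and recall can be made concrete with a short sketch. In the Python fragment below, the function name `precision_recall` and the identifiers of the concordance lines are hypothetical: `retrieved` stands for the occurrences returned by an operator (for instance a search string), and `relevant` for the occurrences which a human analyst has coded as genuine instances of the feature under investigation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of an operator against manual coding.

    precision: share of retrieved occurrences that are genuine instances
    recall:    share of genuine instances that the operator retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the operator returns lines 1-4, the analyst accepts 2, 3 and 5.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this invented case half of the retrieved lines are genuine instances, while one genuine instance out of three is missed by the operator altogether.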

For practical reasons corpus compilers may have to choose between focusing on quantity or on quality, often one at the expense of the other. However, quantitative and qualitative approaches are closely intertwined in corpus-based translation studies, and they are not mutually exclusive. On the one hand, larger corpora which typically display little annotation can be enriched with further layers of annotation. One example is the study on syntactic normalization described by Hansen-Schirra (2011), which used a POS-tagged version of the ECC, whereas previous studies used non-annotated versions. Selections from existing corpora can be integrated with new subcorpora (parallel or comparable) or combined with selections from other corpora to form new configurations. On the other hand, small-scale qualitative studies based on intensive annotation are needed to confirm the findings from large-scale quantitative studies. Small corpora in which occurrences of interpretative categories have been manually encoded can be integrated in larger collections, thus providing a larger quantitative basis for drawing generalizations.


References

Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis & E. Tognini-Bonelli (Eds.), Text and Technology (pp. 233-250). Amsterdam and Philadelphia: John Benjamins.

Becher, V. (2010). Abandoning the Notion of "Translation-inherent" Explicitation: Against a Dogma of Translation Studies. Across Languages and Cultures, 11(1), 1-28.

Cappelle, B. (2012). English is less rich in manner-of-motion verbs when translated from French. Across Languages and Cultures, 13(2), 173-195.

Corpas-Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo paradigma. Bern: Peter Lang.

Dayrell, C. (2007). A Quantitative Approach to Compare Collocational Patterns in Translated and Non-translated Texts. International Journal of Corpus Linguistics, 12(3), 375-414.

De Sutter, G., Delaere, I., & Plevoets, K. (2012). Lexical lectometry in corpus-based translation studies. Combining profile-based correspondence analysis and logistic regression modeling. In M. Oakes & M. Ji (Eds.), Quantitative Methods in Corpus-Based Translation Studies (pp. 325-346). Amsterdam and Philadelphia: John Benjamins.

Eskola, S. (2004). Untypical Frequencies in Translated Language: A Corpus-based Study on a Literary Corpus of Translated and Non-translated Finnish. In A. Mauranen & P. Kujamäki (Eds.), Translation Universals. Do they Exist? (pp. 83-99). Amsterdam and Philadelphia: John Benjamins.

Gellerstam, M. (1986). Translationese in Swedish Novels Translated from English. In L. Wollin & H. Lindquist (Eds.), Translation Studies in Scandinavia (pp. 88-95). Lund: CWK Gleerup.

Gellerstam, M. (2005). Fingerprints in Translation. In G. Anderman & M. Rogers (Eds.), In and Out of English: For Better, For Worse? (pp. 201-213). Clevedon: Multilingual Matters.

Hansen-Schirra, S. (2011). Between normalization and shining-through. Specific properties of English-German translations and their influence on the target language. In S. Kranich, V. Becher, S. Höder, & J. House (Eds.), Multilingual Discourse Production. Diachronic and Synchronic Perspectives (pp. 133-162). Amsterdam and Philadelphia: John Benjamins.

House, J. (2008). Beyond intervention: Universals in translation?. trans-kom, 1(1). <kom_01_01_02_House_Beyond_Intervention.20080707.pdf>.

Jantunen, J. H. (2002). Synonymity and lexical simplification in translations: a corpus-based approach. Across Languages and Cultures, 2(1), 97-112.

Jiménez-Crespo, M. A. (2011). The future of general tendencies in translation: Explicitation in web localization. Target, 23(1), 3-25.

Kenny, D. (2000). Lexis and Creativity in Translation: A Corpus-based Study. Manchester: St. Jerome Publishing.

Kenny, D. (2004). Parallel corpora and translation studies: old questions, new perspectives? Reporting "that" in Gepcolt: a case study. In G. Barnbrook, P. Danielsson, & M. Mahlberg (Eds.), Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora (pp. 154-165). London and New York: Continuum.

Klaudy, K. & K. Károly (2005). Implicitation in translation: Empirical evidence for operational asymmetry in translation. Across Languages and Cultures, 6(1), 13-28.

Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit X (pp. 79-86).

Kranich, S., V. Becher, & J. House (2012). Changing conventions in English-German translations of popular scientific texts. In K. Braunmüller & C. Gabriel (Eds.), Multilingual Individuals and Multilingual Societies (pp. 315-334). Amsterdam and Philadelphia: John Benjamins.

Kranich, S., V. Becher, & S. Höder (2011). A tentative typology of translation-induced language change. In S. Kranich, V. Becher, S. Höder, & J. House (Eds.), Multilingual Discourse Production. Diachronic and Synchronic Perspectives (pp. 9-44). Amsterdam and Philadelphia: John Benjamins.

Lanstyák, I. and P. Heltai (2012). Universals in language contact and translation. Across Languages and Cultures, 13(1), 99-121.

Laviosa, S. (1997). How Comparable Can Comparable Corpora Be?. Target, 9(2), 289-319.

Laviosa, S. (1998). Core patterns of lexical use in a comparable Corpus of English narrative prose. Meta, 43(4), 557-570.

Leech, G. (2007). New Resources, or just Better Old ones? The Holy Grail of Representativeness. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus Linguistics and the Web (pp. 133-149). Amsterdam and New York: Rodopi.

Malmkjaer, K. (2008). Norms and nature in translation studies. In G. Anderman & M. Rogers (Eds.), Incorporating corpora: the linguist and the translator (pp. 49-59). Clevedon: Multilingual Matters.

Mauranen, A. (2004). Corpora, universals and interference. In A. Mauranen & P. Kujamäki (Eds.), Translation universals. Do they exist? (pp. 143-164). Amsterdam and Philadelphia: John Benjamins.

Olohan, M. & M. Baker (2000). Reporting that in translated English: Evidence for subconscious processes of explicitation?. Across Languages and Cultures, 1(2), 141-158.

Olohan, M. (2004). Introducing corpora in translation studies. London: Routledge.

Øverås, L. (1998). In Search of the Third Code: An Investigation of Norms in Literary Translation. Meta, 43(4), 571-88.

Pápai, V. (2004). Explicitation: A universal of translated text?. In A. Mauranen & P. Kujamäki (Eds.), Translation universals. Do they exist? (pp. 143-164). Amsterdam and Philadelphia: John Benjamins.

Puurtinen, T. (2003). Genre-specific Features of Translationese? Linguistic Differences between Translated and Non-translated Finnish Children's Literature. Literary and Linguistic Computing, 18(4), 389-406.

Puurtinen, T. (2004). Explicitation of clausal relations: A corpus-based analysis of clause connectives in translated and non-translated Finnish children's literature. In A. Mauranen & P. Kujamäki (Eds.), Translation universals. Do they exist? (pp. 165-176). Amsterdam and Philadelphia: John Benjamins.

Saldanha, G. (2011). Style of Translation: The Use of Source Language Words in Translations by Margaret Jull Costa and Peter Bush. In A. Kruger, K. Wallmach, & J. Munday (Eds.), Corpus Based Translation Studies: Research and Applications (pp. 237-258). London: Continuum.

Setton, R. (2011). Corpus-based Interpretation Studies: Reflections and prospects. In A. Kruger, K. Wallmach, & J. Munday (Eds.), Corpus Based Translation Studies: Research and Applications (pp. 33-75). London and New York: Continuum.

Straniero Sergio, F., & Caterina Falbo (Eds.), Breaking Ground in Corpus-based Interpreting Studies. Bern: Peter Lang.

Teich, E. (2003). Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Berlin and New York: Mouton de Gruyter.

Tirkkonen-Condit, S. (2002). Translationese - A Myth or an Empirical Fact? A Study into the Linguistic Identifiability of Translated Language. Target, 14(2), 207-20.

Xiao, R. (2011). Word clusters and reformulation markers in Chinese and English: Implications for translation universal hypotheses. Languages in Contrast, 11(2), 145-171.

Xiao, R., L. He, & M. Yue (2010). In Pursuit of the Third Code: Using the ZJU Corpus of Translational Chinese in Translation Studies. In R. Xiao (Ed.), Using Corpora in Contrastive and Translation Studies (pp. 182-214). Newcastle: Cambridge Scholars.

Zanettin, F. (2011). Translation and Corpus Design. Synaps, 26, 14-23.


Zanettin, F. (2012). Translation-Driven Corpora. Corpus Resources in Descriptive and Applied Translation Studies. Manchester: St Jerome.