ICAME Journal, Volume 39, 2015, DOI: 10.1515/icame-2015-0004

Word frequency and collocation: Using children's literature in adult learning

Ed Thomas, Kanda University of International Studies, Japan


This study involved the creation of a corpus of children's literature spanning 5.5 million words. Concordance software was then used to extract the most frequent words and collocations from the corpus. These will be of interest both to literary researchers in the genre of children's literature and to teachers and applied linguists working with adult students of English.

1 Introduction

Modern computer technology offers us a view of language structure that was not available to researchers before (Sinclair 2004: 1). The growth of computer-aided corpus linguistics has seen a surge of interest in collocation (Shin and Nation 2008: 339) - something that distinguishes native from non-native language users (Schmitt 2000: 79).

Separate to that, literature and naturally-occurring text is once again becoming a primary source of input for language teachers and course designers (Khatib 2011: 201). While audiolingual and communicative language teaching (CLT) models replaced older text-reliant grammar translation methods in the last century, computer corpus linguistics is once again bringing text to the classroom as an authentic tool for learning. On a personal level, I have gainfully used children's stories in adult teaching as a way of inspiring students to develop a range of imaginative and linguistic skills. These include narrative building, character development, reading and writing skills (such as use of narrative verb tenses, or descriptive adjectives) as well as pronunciation aid through storytelling. Others too believe children's literature can play "a crucial role in the education of EFL adult [students]" as "an initial step in developing literary competence" (Ho 2000: 268). To date, a sizable corpus of children's text is yet to be compiled for pedagogical purposes. The aim of this study was to build such a corpus and extract a basic word frequency count and a set of common collocations.

Words are collocates if, in a given sample of language, they are found together more often than their individual frequencies would predict (Jones and Sinclair 1974: 19). This tendency to co-occur can be measured statistically with a computer.
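The definition above can be illustrated with a minimal sketch. It assumes the simplest possible case, in which "found together" means that one word is immediately followed by the other, and it uses a short invented sentence rather than any text from the corpus. The function name observed_vs_expected is hypothetical.

```python
from collections import Counter

def observed_vs_expected(tokens, w1, w2):
    """Compare how often w1 is immediately followed by w2 with the
    count that the two words' individual frequencies would predict."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    # If the words were independent, the pair would be expected
    # roughly f(w1) * f(w2) / N times in a text of N tokens.
    expected = unigrams[w1] * unigrams[w2] / n
    return observed, expected

# Invented toy text (an assumption for illustration only).
tokens = "a little girl and a little boy sat a long while by the little door".split()
observed, expected = observed_vs_expected(tokens, "a", "little")
print(observed, expected)
```

In this toy text the pair a little is observed more often than its expected count, which is the pattern the definition describes. A real analysis would of course need a far larger sample before such a comparison means anything.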

2 Theoretical background

J. R. Firth stated we best know the meaning of a word not by examining it in isolation but by the company it keeps (Firth 1957: 11). This emphasis on collocational properties within meaning is something now widely accepted by linguists (Nation 2001: 56; Milton 2009: 149). A 'neo-Firthian' school has emerged, looking at words in combinations, phrases, formulaic sequences and chunks (Halliday 1973; Leech 1974; Sinclair 1991; Nattinger and DeCarrico 1992; Louw 1993; Weinert 1995; Stubbs 1995; Schmitt 2004; Hoey 2005; Almela Sanchez 2006; Biber 2009). The advent of corpus techniques has done much to bolster the position, as "Analyses of corpora by computer have now revealed detailed and hitherto unsuspected patterns of idiomaticity... and in so doing have provided descriptive substantiation of the insights that Firth expressed over fifty years earlier" (Widdowson 2007: 410).

Hoey's recent theory of lexical priming marks "the pervasiveness of collocation" as its starting point (Hoey 2005: 1). His examination of the statistical distribution of words leads him to a theory that every word is lexically 'primed' to co-occur with others, and as a word is acquired through encounters with it in speech and writing it becomes "cumulatively loaded" with context and meaning (Hoey 2005: 8). Other researchers, too, owe their impetus to Firth (Halliday 1966; Leech 1974; Sinclair 1991; Nattinger and DeCarrico 1992; Louw 1993; Weinert 1995; Stubbs 1995; Hopper 1998; Schmitt 2004). It is becoming accepted that priming within language (certain words showing greater attraction to each other, such as coffee to mug more so than tea to mug, for example) is a well-established and widely-tested phenomenon of the human mind (Pace-Sigge 2013: 167).

3 Previous studies relating to the classroom

3.1 Shin and Nation (2008)

Shin and Nation (2008) analysed a spoken corpus to aid the teaching of collocations in ELT. Motivating them was the belief that learning common collocations will develop fluency and native-like selection. There is always more than one possible way to say something, but only one or two ways will sound natural to a native speaker (Pawley and Syder 1983). They cite examples from Korean students of English who used lying story instead of tall story or artificial teeth instead of false teeth. Both lying story and artificial teeth are grammatically and semantically correct, but they fail to sound native.

The authors used a computer to search the 10 million-word spoken section of the British National Corpus (BNC). Using only content words (nouns, adjectives, verbs and adverbs) as their pivot (search) words, they found the most frequent collocations to include you know, I think, a bit, always / never used to, as well, a lot of, [number] pounds, thank you, [number] years and in fact. They noted the "here-and-now" nature of spoken language in their findings, with many interjections and amplifiers (you know, a bit, come on and so on) in the high-frequency list.

3.2 Baker and Freebody (1989)

Corpus-based studies relating specifically to children's literature are small in number. Indeed, "The consideration of literature written for children from a linguistic perspective is a comparatively new field of study" (Knowles and Malmkjaer 1996: 1). Baker and Freebody uploaded text from 163 primary school readers and carried out an analysis. Of particular interest was the appearance of little in the highest frequency range of words: ranked number 18 in the corpus, and the only two-syllable word in the top 20. Its relative length (compared to words such as the, and, a, to, I) together with its "grapho-phonetic irregularity" thus invited detailed examination.

Another feature of children's books their corpus showed was the relative occurrence of boys and girls; boys appeared more frequently than girls by a ratio of about 3:2. An analysis of verbs showed boys to be more energetic in their interactions with others (shout, hurt, work being amongst the frequent collocations with masculine nouns and pronouns) while girls were seen to like, play with, talk to, walk with, hold onto and kiss. Their analysis suggested girls are more emotional and less physical in their portrayal. In terms of common adjectival collocation (besides little), girls were young, pretty and dancing, while boys were brave, kind, sad and naughty. Fathers did things like paint, pump, fix, drive, pull, start, shout and let, while mothers baked, dressed, hugged, kissed, packed, picked, set, splashed and thanked. They thus argued a certain social theory was "embedded" in the literature, with computer corpus methods bringing this to light (Baker and Freebody 1989: 135).

3.3 Stubbs (1995)

Baker and Freebody's corpus comprised 83,838 words - small by today's standards. Stubbs (1995) used larger corpora and the bearings of newer thinking in lexical studies (Sinclair 1991; Halliday 1993; Weinert 1995) to place collocation and 'chunks' at the centre of an emerging theory of language learning.

He confined his search to just four pivot words - large, small, big and little - and used both a 2.3 million word corpus of contemporary English and data from the Oxford English Dictionary CD-ROM to investigate their collocates. He made insights into ambiguity and meaning, noting how word clusters take on connotations of their own which dictionary definitions are unable to describe consistently (Stubbs 1995: 381). While little and small may appear to be synonymous, he sees Little Red Riding Hood as an example where small cannot be substituted as an adjective without meaning changing in some way. He follows Baker and Freebody in emphasising how little "connotes cuteness" in a way that small does not. Apart from in certain metaphorical and pejorative phrases (small fry, small beer), small is usually about physical size. Little, however, carries more in terms of ideological message; it has a "cuddle factor" (Stubbs 1995: 383).

Stubbs finds other examples of common collocations using his corpora - big toe and little finger, but large intestine and small intestine. He notes how usage is often nonliteral: one's big toe might be small in size, one's little brother might be bigger than oneself (Stubbs 1995: 384). Big can carry with it positive connotations of being grown-up (Big boys don't cry), or negative ones of self-importance (big fish, big mouth, big head, big guns). Large, on the other hand, is usually confined to mean more-than-average in terms of quantity (large amount, large majority, large part, large-scale).

Learning a language therefore means learning such fixed and semi-fixed units which are not always coextensive with traditionally recognized syntactic units (Stubbs 1995: 386). This is a point made by Sinclair (1987, 1991) when he talks of the "idiom principle" within language. Neo-Firthians like Stubbs and Sinclair believe that once a word is selected for use there is a high probability that other words and features of grammar are co-selected with it (Stubbs 1995: 386), and that these linguistic relations must be learned when one learns a language. An L2 learner cannot simply translate from one language to another using a dictionary; so much more in terms of connotation and ideological message is conveyed through words and their collocates.

3.4 Knowles and Malmkjaer (1996)

The following year Knowles and Malmkjaer carried out a study of children's books both from a literary and linguistic point of view. They developed ideas of ideology, stereotyping and implicit messaging using research from the social sciences (for example Thompson 1990) and applied them in a genre-specific way. They argued that writers' choices aid the creation and maintenance of relations of power in society (Knowles and Malmkjaer 1996: 68) - be it men dominating women, adults dominating children, or any other relation of power. This is so whether the writer intends it or not (Knowles and Malmkjaer 1996: 68), and impressions are built up by whole texts, passages, clauses, phrases or just collocations of words. Chunks of language become "linguistic facts", and what the world is actually like is not so much an issue for writers (Knowles and Malmkjaer 1996: 69). To use their example, a girl is more likely to be described as blonde than a car - even though the car's colour may be very similar to a fair-haired person. These "linguistic facts" have become firmly established, so that everyone talks of brains and eggs being addled, but butter and bacon as rancid (Knowles and Malmkjaer 1996: 69). They borrow Louw's (1993) term "semantic prosody" for describing words' tendencies to appear together and call certain associations to mind. Pretty calls to mind smallness and femininity, bringing them to the claim: "It would not be unreasonable to suggest that in the normal course of events semantic prosodies are... learnt collocationally" (Knowles and Malmkjaer 1996: 70).

Children, when learning language, are unlikely to be explicitly told which associations are called to mind when seeing words together. Rather, they "gain this impression through exposure, gradually, as part of a developing base of implicit knowledge about the language system" (Knowles and Malmkjaer 1996: 70). It is the task of the writer to exploit this implicit knowledge in presenting passages to the reader and successfully conjuring emotions and reactions (Knowles and Malmkjaer 1996: 71). Consequently, linguistic "habits" and stereotypes are perpetuated, potentially having damaging effects on groups or individuals (Knowles and Malmkjaer 1996: 71). They consider the word black as an example. It carries negative connotations due to well-known collocations such as black magic, Black Wednesday, black sheep, black cloud, so that when it occurs alongside a word like man, a negative impression is conveyed (even if the colour of the man's skin is, or is very close to, black). Traditional male and female roles, stereotypes about family relations, power structures within society and moral codes are all present in children's literature, they argue, and these stereotypes and language habits are slowly engrained into children's minds as they acquire language.

3.5 Recent years

Unfortunately, not a great deal more has been done in this research area in the past two decades. Thompson and Sealy (2007) compiled a corpus of children's texts from the BNC for comparison with texts written for adults. Their aim was "to explore the issue of whether language deployed in writing for children can be seen to represent the world and human experience differently from the ways in which they are represented in writing for adults" (Thompson and Sealy 2007: 3). However, their claims were limited: their corpus comprised 30 children's texts amounting to 698,286 words.

A current teaching trend to emerge from corpus linguistics is "data-driven learning" or DDL (Johns 1991; Gavioli 2000; Braun 2007; Chambers 2007). Students are now being given access to corpora to discover patterns and rules within authentic language themselves. Leel (2011) took this framework and used a modern work of children's literature - J. K. Rowling's Harry Potter and the Philosopher's Stone - as his authentic language data. A concordancer allowed his students to see collocational properties of prepositions such as about and around, in and on, with which his students made regular mistakes. Leel concluded it was a useful exercise for them which yielded improvement in their literacy.

This "data-driven" approach to teaching is not without its pedagogical issues. Criticisms of Leel's study rest on the size and nature of his corpus: a single novel by a single author. To date, a significant corpus of children's work is yet to be compiled for analysis, hence the need for this study.

4 This corpus

4.1 Compilation

Hunston (2002) notes the four key features of a successful corpus: size, content, representativeness and permanence. It is generally agreed that "bigger is better" (Sinclair 1991; Flowerdew 1996), so I collected children's texts totalling 5,481,834 words. Every text was taken from the online Children's Bookshelf of Project Gutenberg, with a full list of titles in Appendix 4 below. They were all published within what literary scholars term the 'Golden Period' of children's literature 1863-1913 (see Hunt 1994: 59; Knowles and Malmkjaer 1996: 16).

Being out of copyright, the text was free to take. One might claim Hunston's criterion of permanence was not met in this body of work due to its dated nature. However, I argue many of these texts are still widely read and experienced today. Alice in Wonderland, for example, has been translated into hundreds of languages around the world and to date there have been 29 films and nine television series made from it. Books like Alice in Wonderland, Treasure Island and Peter Pan are "quintessential classics" (Hunt 2001: 37) and surely rank among the permanent texts of any age. Fairy tales are from the oral and folk tradition and are therefore hard to date or assign authorship. But they were collected and printed in vast quantities during the 'Golden Age' of children's literature. Since then, they have become "ageless" and "inscribed on our minds", remaining with us from childhood throughout the rest of our lives (Zipes 1983: 1). I therefore believe there is a permanent factor to this corpus and much of the language in these texts is still current.

Regarding representativeness and balance of the corpus, both male and female authors' works were used, from a wide range of geographical locations and cultural / linguistic backgrounds. Poems were included as well as prose; novels as well as shorter fairy tales. There was also text from a number of magazines and 'penny dreadfuls' which were popular at the time, such as The Chatterbox, The Girl's Own Paper and St. Nicholas: Scribner's Illustrated Magazine for Girls and Boys. Bearing in mind the types of text a child living in 1863-1913 might have read (or had read to them), attempts were made at providing a representative balance of this language. Excluding editors of the magazines and the editing collectors of fairy and folk tales, some 47 authors were included. Two authors - Jules Verne and Johanna Spyri - appeared in translation, using the English versions of their work which appeared at the time. There were 65 single-author works, 21 editions of magazines, and two compendia of fairy tales, poems, stories, fables and nursery rhymes. Admittedly, poetry is an entirely separate genre to novels and brings with it different language. But analyses were carried out on the corpus as a whole, rather than sub-corpora, as a child would have experienced language from all these genres at the same time. One night he / she might read Kidnapped, the next The Owl and the Pussy-Cat. A first language learner arguably experiences an "immersion pedagogy" while growing up (Gee 1994), and this has guided theories of language teaching in applied linguis-

Laurence Anthony's program AntConc was chosen as the software to analyse the data. It is free to download and has powerful tools such as word frequency lists, a KWIC concordancer and collocation generators ranked according to raw frequency, mutual information or T-score.
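For readers unfamiliar with the two association measures named above, the following sketch shows one common textbook formulation of each. It is an assumption that these match what the software computes; AntConc's exact formulas and span handling may differ. The function names and all the counts in the example are hypothetical and are not taken from this corpus.

```python
import math

def expected_count(f_node, f_collocate, corpus_size, span=1):
    """Co-occurrences expected by chance within `span` word positions."""
    return f_node * f_collocate * span / corpus_size

def mi_score(observed, f_node, f_collocate, corpus_size, span=1):
    """Mutual information: log2 of the observed-to-expected ratio."""
    expected = expected_count(f_node, f_collocate, corpus_size, span)
    return math.log2(observed / expected)

def t_score(observed, f_node, f_collocate, corpus_size, span=1):
    """T-score: observed minus expected, scaled by sqrt(observed)."""
    expected = expected_count(f_node, f_collocate, corpus_size, span)
    return (observed - expected) / math.sqrt(observed)

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
mi = mi_score(observed=50, f_node=1000, f_collocate=2000, corpus_size=1_000_000)
t = t_score(observed=50, f_node=1000, f_collocate=2000, corpus_size=1_000_000)
print(round(mi, 2), round(t, 2))
```

Broadly, mutual information tends to favour rare but exclusive pairings, while the T-score tends to favour pairs that are frequent in absolute terms, which is why a tool offers both alongside raw frequency.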

4.2 Research questions

Compilation of the corpus was itself a goal of this study, and took some months to complete and yield results. I therefore limited myself to two very simple research questions as a preliminary use for this corpus. I hope in future to carry out more detailed analyses and further investigations (such as a study of collocates with the adjectives little and big, following Stubbs, or a description of collocates relating to gender and age). For now, my research questions were simply:

1. What are the most frequent words in this corpus?

2. What are the most frequent collocations of these words?

From here, possible classroom uses of the data can be discussed and an outline of a basic pedagogical application put forward.

5 Results

As will be the case in any corpus of natural language, the most frequent words were function words - the, and, to, of, a and so on. With the aim of finding meaningful chunks of language to teach, they were ignored as search words in favour of content (NAVA) words.

Many of these words are polysemous - having more than one sense (e.g. well can be a place to store water, or otherwise an adverb imparting a sense of 'goodness'. Well, it can even be used as an exclamation, to resume a narrative or change the subject!). Some of the polysemous senses see non-content words entering into consideration as collocation pivots (back as a preposition, for example, rather than the place where your spine is). Similarly, many of the verbs on the list can act as auxiliaries, which carry little content (was, had, is, were, are, do, did acting as function words as in "They had eaten lunch" or "Did she finish her homework?", rather than main verbs as in "They had lunch" or "She did her homework"). Word classification is therefore an issue in any corpus study, as a computer cannot distinguish between function and content. Trying to take this into consideration, I found the 100 most frequent content words in this corpus comprised:

Verbs - 44

Nouns - 23

Adjectives - 20

Adverbs - 13

To answer the first research question, the frequency list for the most common content words in this 5.5 million-word corpus is given in Appendix 1.
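A frequency list of this kind can be approximated with a few lines of code. The sketch below is not the procedure used in this study, which relied on AntConc and manual judgement about word class; it assumes a small illustrative stoplist of function words and a very simple tokeniser. The names FUNCTION_WORDS and content_word_frequencies, and the sample sentence, are hypothetical.

```python
import re
from collections import Counter

# Illustrative stoplist only (an assumption); a real list would be far longer.
FUNCTION_WORDS = {"the", "and", "to", "of", "a", "in", "it", "i"}

def content_word_frequencies(text, stoplist=FUNCTION_WORDS):
    """Return (word, count) pairs, most frequent first, skipping the stoplist."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stoplist)
    return counts.most_common()

# Invented sample sentence, not drawn from the corpus.
sample = "The little girl said the cat was little and the cat said nothing."
freqs = content_word_frequencies(sample)
print(freqs[:3])
```

A stoplist cannot resolve the classification problem discussed above: a form such as was or had is counted the same way whether it acts as a main verb or as an auxiliary.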

Approaching the second research question, these frequent content words were used as search pivots to find frequent collocations. An initial analysis yielded the results in Appendix 2. One would expect collocations involving verbs to be ranked highly, considering this is a corpus of largely narrative work. However, because of the computer's inability to distinguish between function and content, many of these verbal collocations failed to convey any real meaning (such as to be, had been, did not, was not and have been). Due to their lack of content, I therefore excluded these from the list. After all, the motivation behind this study is to help language learners. Chunks like to be and had been are more a part of functional grammar than lexis and should have been mastered by students before higher level vocabulary learning and literacy improvement. Similarly, article + noun collocations were excluded. A more conclusive list of collocations I found is listed in Appendix 3.

6 Discussion

Looking at Appendix 1, the first thing to note is the highest frequency content word: was. This past tense copula verb does a lot of work in this body of text, which is not surprising considering its narrative nature. Indeed, the three most common words are all past tense verbs: was, had and said. Other common verbs are do, see, go, come, know, make, think and take. From a teaching point of view, good use of these verbs amongst students should be an aim.

Perhaps the next thing to note is the only non-verb in the top ten words: little. This occurs 15,659 times in the 5.5 million-word corpus. Neither its synonym small nor its antonym big ranks in the top 100 words, suggesting little is given special status in these texts.
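To compare this figure with counts from corpora of other sizes, a raw count is usually converted to a rate per million words. The short calculation below uses only the two figures reported in this paper (15,659 occurrences of little and a corpus of 5,481,834 words); the function name per_million is hypothetical.

```python
def per_million(count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return count / corpus_size * 1_000_000

# Both figures are taken from the text of this paper.
little_pm = per_million(15659, 5_481_834)
print(round(little_pm))
```

The same conversion would be needed before setting the frequency of little here against its frequency in, say, a much smaller corpus of school readers.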

Looking at the collocations list (Appendix 3), a little jumps out - ranked the tenth highest collocation. Little therefore demands further attention in future research, but a brief investigation at this stage showed that common chunks were: a little girl, a little boy, a little while, a little more, a little way, a little bit, a little longer and a little later. For teachers, these collocations would be useful building blocks to use. (Try substituting small into these phrases, however, and you sound rather un-native. The same holds for big.)

The most common collocation in this corpus is it was. The concordancer helped put this phrase into more perspective, and we see that it often refers to very little (acting as a 'dummy' pronoun):

It was all very well to say "Drink me" but...

It was high time to go....

It was too late. The boat struck the bank full tilt.

Similar issues apply for other common collocations such as it is, there was, there is and there were. These chunks perform essential functions within storytelling, but the pronouns often refer to nothing in particular. For example:

There was nothing else to do, so Alice soon began talking again.

There was a large mushroom growing near her...

There was the cat again, sitting on the branch of a tree.
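Concordance lines like those above are what a KWIC (key word in context) display produces. As a rough sketch of the idea, and not a description of how AntConc works internally, the function below finds each occurrence of a phrase and returns a few tokens of context on either side. The function name kwic, the tokenising pattern and the default window of four tokens are all assumptions; the sample text reuses two of the examples quoted earlier.

```python
import re

def kwic(tokens, phrase, width=4):
    """Return (left context, match, right context) for each hit of `phrase`."""
    n = len(phrase)
    lowered = [t.lower() for t in tokens]
    target = [p.lower() for p in phrase]
    hits = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            hits.append((left, " ".join(tokens[i:i + n]), right))
    return hits

text = "It was high time to go. It was too late. The boat struck the bank full tilt."
# Split into words and punctuation marks; a deliberately simple tokeniser.
tokens = re.findall(r"\w+|[^\w\s]", text)
hits = kwic(tokens, ["it", "was"])
for left, match, right in hits:
    print(f"{left:>20} | {match} | {right}")
```

Reading the right-hand context down the page is what makes the 'dummy' use of it and there visible: what follows the pivot is rarely anything the pronoun could refer back to.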

He was and he had are the collocations ranked second and third (8,258 and 7,352 occurrences). She was is ranked 13th (3,960 occurrences), while she had does not feature in the list. Interpretation of these linguistic features must be left for future research, but for now it is useful to note the frequency of the verbs be and have within larger chunks of language.

7 Conclusion

Using corpora and the results of computer-based analysis within language learning syllabi is now widely held as a positive step (Braun 2005: 47), since language from a naturally occurring corpus better reflects linguistic reality compared to textbooks (Gavioli and Aston 2001: 238). A corpus can provide a "whole panoply" of activities for learning (Sinclair 2004: 297). Use of formulaic utterances is also a well-documented strategy for language learning success (Nattinger and DeCarrico 1992; Myles, Hooper and Mitchell 1998; Nunan 2001; Durrant and Schmitt 2009). Collocations and idioms are the building blocks of language (Murison-Bowie 1996: 183), and native-like selection involves the ability to select the preferred sequence from a number of grammatically acceptable variants (Weinert 1995: 184).

Our own use of lexical chunks may be subliminal, and the collocations we regularly use may not be obvious to us via introspection (Sinclair 1997: 29). But a corpus shows us the common word combinations we employ.

Teachers can use corpora as a base for material creation, test design, feedback and evaluation references (Braun 2005: 51). Or corpora can be used directly, by learners themselves. Tim Johns believes research is "too serious to be left to the researchers" (Johns 1991: 2), and students should be encouraged to "discover" foreign language and "learn how to learn" under a DDL framework (Johns 1991: 1). The corpus is not here a "surrogate teacher" but rather a "special type of informant" which can provide natural data (Johns 1991: 1).

The ability to comprehend language and the ability to produce language are very different things (DeKeyser 2007: 287). Using corpora in the classroom certainly involves students' passive reading skills and comprehension. But having selected, say, frequent collocations from a body of text, teachers will want students to use them. If communication is to be successful, a relevant context has to be constructed by the discourse participants (Sperber and Wilson 1995: 52). This is where the traditional role of the teacher returns, providing communicative activities to practise new lexis. I would therefore agree with Braun that corpora provide pedagogic "enrichment", rather than a primary focus for learning (Braun 2005: 55). Successful use of corpora for learning and teaching will hinge on successful "pedagogic mediation" between the corpus materials and the corpus users (Braun 2005: 61).

The two research questions I set yielded data which brought me to two conclusions:

• Past tense verbs dominate the word frequency list of this corpus, although the adjective little ranks highly too.

• Many of the most frequent collocations in this corpus were not particularly interesting lexically (it was, was a, I am, I have). In fact, the lack of lexical content in the most frequent collocation - it was - is perhaps its most salient feature, showing how much work the dummy pronoun does in a narrative setting.

Having created this corpus, I hope I and others can carry out more linguistic studies into children's literature and apply findings in new ways. I agree with Hoey (2005: 14) that a corpus can serve "as a kind of laboratory" for conducting experiments. Pedagogically too, I hope language teachers will see the value in using corpora to help language learners.


