Lexicography
While studies of grammatical constructions can be reliably conducted on corpora of varying length, to obtain valid information on vocabulary items, it is necessary to analyze corpora that are very large. To understand why this is the case, one need only investigate the frequency patterns of vocabulary in shorter corpora, such as the one-million-word LOB corpus.
The Bank of English Corpus has many potential uses, but it was designed primarily to help in the creation of dictionaries. Sections of the corpus were used as the basis of the BBC English Dictionary, a dictionary that was intended to reflect the type of vocabulary used in news broadcasts such as those on the BBC (Sinclair 1992). Consequently, the vocabulary included in the dictionary was based on sections of the Bank of English Corpus containing transcriptions of broadcasts on the BBC (70 million words) and on National Public Radio in Washington, DC (10 million words). The Bank of English Corpus was also used as the basis for a more general-purpose dictionary, the Collins COBUILD English Dictionary, and a range of other dictionaries on such topics as idioms and phrasal verbs. Other projects have used similar corpora for other types of dictionaries. The Cambridge Language Survey has developed two corpora, the Cambridge International Corpus and the Cambridge Learners' Corpus, to assist in the writing of a number of dictionaries, including the Cambridge International Dictionary of English. Longman publishers assembled a large corpus of spoken and written American English to serve as the basis of the Longman Dictionary of American English, and used the British National Corpus as the basis of the Longman Dictionary of Contemporary English.
To understand why dictionaries are increasingly being based on corpora, it is instructive to review precisely how corpora, and the software designed to analyze them, can not only automate the process of creating a dictionary but also improve the information contained in the dictionary. A typical dictionary, as Landau (1984: 76f) observes, provides its users with various kinds of information about words: their meaning, pronunciation, etymology, part of speech, and status (e.g. whether the word is considered "colloquial" or "non-standard"). In addition, dictionaries will contain a series of example sentences to illustrate in a meaningful context the various meanings that a given word has.
Prior to the introduction of computer corpora in lexicography, all of this information had to be collected manually. As a consequence, it took years to create a dictionary. For instance, the most comprehensive dictionary of English, the Oxford English Dictionary (originally entitled New English Dictionary), took fifty years to complete, largely because of the many stages of production that the dictionary went through. Landau (1984: 69) notes that the 5 million citations included in the OED had to be "painstakingly collected … subsorted … analyzed by assistant editors and defined, with representative citations chosen for inclusion; and checked and redefined by [James A. H.] Murray [main editor of the OED] or one of the other supervising editors." Of course, less ambitious dictionaries than the OED took less time to create, but still the creation of a dictionary is a lengthy and arduous process.
Because so much text is now available in computer-readable form, many stages of dictionary creation can be automated. Using a relatively inexpensive piece of software called a concordancing program (cf. section 5.3.2), the lexicographer can go through the stages of dictionary production described above, and instead of spending hours and weeks obtaining information on words, can obtain this information automatically from a computerized corpus. In a matter of seconds, a concordancing program can count the frequency of words in a corpus and rank them from most frequent to least frequent. In addition, some concordancing programs can detect prefixes and suffixes and irregular forms and sort words by "lemmas": words such as runs, running, and ran will not be counted as separate entries but rather as variable forms of the lemma run.
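The counting and lemma-grouping steps just described can be illustrated with a minimal Python sketch. It is not the method of any particular concordancing program: the function names and the tiny `LEMMAS` table are hypothetical, introduced here only for illustration, and a real program would rely on a full lexicon or morphological rules rather than a hand-written lookup.

```python
import re
from collections import Counter

# Hypothetical lemma table for illustration only; a real concordancer
# would use a large lexicon or rules for prefixes, suffixes, and
# irregular forms.
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lemma_frequencies(text, lemmas=LEMMAS):
    """Count tokens, folding known variant forms into their lemma,
    and return (lemma, count) pairs from most to least frequent."""
    counts = Counter(lemmas.get(token, token) for token in tokenize(text))
    return counts.most_common()


if __name__ == "__main__":
    sample = ("She runs daily. He ran yesterday. "
              "They are running now, and I run too.")
    for lemma, count in lemma_frequencies(sample):
        print(lemma, count)
```

In this sketch, runs, running, ran, and run are all counted under the single entry run, so the ranked list reflects lemmas rather than individual word forms.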
To study the meanings of individual words, the lexicographer can have a word displayed in KWIC (key word in context) format, and easily view the varying contexts in which a word occurs and the meanings it has in these contexts. And if the lexicographer desires a copy of the sentence in which a word occurs, it can be automatically extracted from the text and stored in a file, making obsolete the handwritten citation slip stored in a filing cabinet. If each word in a corpus has been tagged (i.e. assigned a tag designating its word class; cf. section 4.3), the part of speech of each word can be automatically determined. In short, the computer corpus and associated software have completely revolutionized the creation of dictionaries.
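A KWIC display of the kind described above can be approximated in a few lines of Python. This is a rough sketch under simple assumptions, not a description of any existing concordancer: the function name `kwic` and its `width` parameter are hypothetical, the corpus is treated as a single string, and only exact whole-word matches (ignoring case) are found, so inflected forms such as risking are not retrieved.

```python
import re


def kwic(text, keyword, width=30):
    """Return one display line per occurrence of `keyword`, with up to
    `width` characters of context on either side of the key word."""
    # Collapse line breaks and runs of spaces so context reads cleanly.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the key words line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = ("If you stay here you risk getting shot. "
              "I had no idea that I was risking my life. "
              "I'm going to risk this week's wages on my favorite horse.")
    for line in kwic(sample, "risk", width=25):
        print(line)
```

Because the left context is padded to a fixed width, every occurrence of the key word appears in the same column, which is what allows the lexicographer to scan the surrounding contexts quickly.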
In addition to making the process of creating a dictionary easier, corpora can improve the kinds of information about words contained in dictionaries, and address some of the deficiencies inherent in many dictionaries. One of the criticisms of the OED, Landau (1984: 71) notes, is that it contains relatively little information on scientific vocabulary. But as the BBC English Dictionary illustrates, if a truly "representative" corpus of a given kind of English is created (in this case, broadcast English), it becomes quite possible to produce a dictionary of any type of English (cf. section 2.5 for a discussion of representativeness in corpus design). And with the vast amount of scientific English available in computerized form, it would now be relatively easy to create a dictionary of scientific English that is corpus-based.
Dictionaries have also been criticized for the unscientific manner in which they define words, a shortcoming that is obviously a consequence of the fact that many of the more traditional dictionaries were created during times when well-defined theories of lexical meaning did not exist. But this situation is changing as semanticists turn to corpora to develop theories of lexical meaning based on the use of words in real contexts. Working within the theory of "frame" semantics, Fillmore (1992: 39-45) analyzed the meaning of the word risk in a 25-million-word corpus of written English created by the American Publishing House for the Blind. Fillmore (1992: 40) began his analysis of risk in this corpus working from the assumption that all uses of risk fit into a general frame of meaning that "there is a probability, greater than zero and less than one, that something bad will happen to someone or something." Within this general frame were three "frame elements," i.e. differing variations on the main meaning of risk, depending upon whether the "risk" is not caused by "someone's action" (e.g. if you stay here you risk getting shot), whether the "risk" is due in some part to what is termed "the Protagonist's Deed" (e.g. I had no idea when I stepped into that bar that I was risking my life), or whether the "risk" results from "the Protagonist's decision to perform the Deed" (e.g. I know I might lose everything, but what the hell, I'm going to risk this week's wages on my favorite horse) (Fillmore 1992: 41-2).
In a survey of ten monolingual dictionaries, Fillmore (1992: 39-40) found great variation in the meanings of risk. In his examination of the 25-million-word corpus he was working with, Fillmore (1992) found that of the 1,743 instances of risk he identified, most had one of the three meanings. However, there were some examples that did not fit into the risk frame, and it is these examples that Fillmore (1992: 43) finds significant, since without having examined a corpus, "we would not have thought of them on our own." Fillmore's (1992) analysis of the various meanings of the word risk in a corpus effectively illustrates the value of basing a dictionary on actual uses of a particular word. As Fillmore (1992: 39) correctly observes, "the citation slips the lexicographers observed were largely limited to examples that somebody happened to notice …" But by consulting a corpus, the lexicographer can be more confident that the results obtained more accurately reflect the actual meaning of a particular word.
-English Corpus Linguistics
An Introduction
CHARLES F. MEYER
University of Massachusetts at Boston, 2002