Corpora, computers and lexicography

The most significant developments in lexicography in the past two decades have involved more extensive corpora of spoken and written language and the creation of sophisticated computer-based tools for accessing such corpora. The greatest innovations have been stimulated by the COBUILD project at the University of Birmingham, UK, and the influence of such work can be measured by the fact that by the late 1990s all major English-language learner-dictionary projects had incorporated reference to extensive language corpora and developed computational techniques for extracting lexicographically significant information from such corpora.
The COBUILD project
The COBUILD project is one of the largest and most ambitious lexical research projects ever undertaken. COBUILD stands for Collins Birmingham University International Language Database, and the project is largely funded by the publisher William Collins (now HarperCollins). It is based in the School of English at the University of Birmingham under the direction of Professor John Sinclair, who, in addition to having major responsibility for lexical and lexico-grammatical research, is editor-in-chief of the major lexicographic and other related publications of COBUILD, which began with the publication in 1987 of the ground-breaking Collins COBUILD English Language Dictionary (CCELD). The latest edition is the Collins COBUILD English Dictionary (CCED), published in 1995. The COBUILD corpus, previously termed the Birmingham Collection of English Text (BCOET), was renamed The Bank of English in 1991 and at the time of writing (1997) stands at 320 million words.
The principal aim underlying COBUILD research is to investigate in as much detail as possible how the English language is actually used at a given moment in time in both speech and writing, and to allow such evidence to inform publications aimed at learners of the English language. As the project developed through the 1980s, it became clear that such evidence could only be made available by building a multi-million-word corpus.
CONT.

Topic: Corpora, computers and lexicography

Source: Takhatub Forums, ta5atub.com



Lexicography
While studies of grammatical constructions can be reliably conducted on corpora of varying length, to obtain valid information on vocabulary items, it is necessary to analyze corpora that are very large. To understand why this is the case, one need only investigate the frequency patterns of vocabulary in shorter corpora, such as the one-million-word LOB corpus.
The Bank of English Corpus has many potential uses, but it was designed primarily to help in the creation of dictionaries. Sections of the corpus were used as the basis of the BBC English Dictionary, a dictionary that was intended to reflect the type of vocabulary used in news broadcasts such as those on the BBC (Sinclair 1992). Consequently, the vocabulary included in the dictionary was based on sections of the Bank of English Corpus containing transcriptions of broadcasts on the BBC (70 million words) and on National Public Radio in Washington, DC (10 million words). The Bank of English Corpus was also used as the basis for a more general-purpose dictionary, the Collins COBUILD English Dictionary, and a range of other dictionaries on such topics as idioms and phrasal verbs. Other projects have used similar corpora for other types of dictionaries. The Cambridge Language Survey has developed two corpora, the Cambridge International Corpus and the Cambridge Learners' Corpus, to assist in the writing of a number of dictionaries, including the Cambridge International Dictionary of English. Longman publishers assembled a large corpus of spoken and written American English to serve as the basis of the Longman Dictionary of American English, and used the British National Corpus as the basis of the Longman Dictionary of Contemporary English.
To understand why dictionaries are increasingly being based on corpora, it is instructive to review precisely how corpora, and the software designed to analyze them, can not only automate the process of creating a dictionary but also improve the information contained in the dictionary. A typical dictionary, as Landau (1984: 76f) observes, provides its users with various kinds of information about words: their meaning, pronunciation, etymology, part of speech, and status (e.g. whether the word is considered "colloquial" or "non-standard"). In addition, dictionaries will contain a series of example sentences to illustrate in a meaningful context the various meanings that a given word has.
Prior to the introduction of computer corpora in lexicography, all of this information had to be collected manually. As a consequence, it took years to create a dictionary. For instance, the most comprehensive dictionary of English, the Oxford English Dictionary (originally entitled New English Dictionary), took fifty years to complete, largely because of the many stages of production that the dictionary went through. Landau (1984: 69) notes that the 5 million citations included in the OED had to be "painstakingly collected … subsorted … analyzed by assistant editors and defined, with representative citations chosen for inclusion; and checked and redefined by [James A. H.] Murray [main editor of the OED] or one of the other supervising editors." Of course, less ambitious dictionaries than the OED took less time to create, but the creation of a dictionary is still a lengthy and arduous process.
Because so much text is now available in computer-readable form, many stages of dictionary creation can be automated. Using a relatively inexpensive piece of software called a concordancing program (cf. section 5.3.2), the lexicographer can go through the stages of dictionary production described above, and instead of spending hours and weeks obtaining information on words, can obtain this information automatically from a computerized corpus. In a matter of seconds, a concordancing program can count the frequency of words in a corpus and rank them from most frequent to least frequent. In addition, some concordancing programs can detect prefixes and suffixes and irregular forms and sort words by "lemmas": words such as runs, running, and ran will not be counted as separate entries but rather as variable forms of the lemma run.
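The frequency counting and lemma grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LEMMAS table is a hypothetical three-entry stand-in for the morphological knowledge a real concordancing program would have, the sample sentence is invented, and the function names are not taken from any particular program.

```python
from collections import Counter
import re

# Hypothetical lemma table for illustration only; a real concordancer
# would rely on a full morphological lexicon or a lemmatizer.
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lemma_frequencies(text, lemmas=LEMMAS):
    """Count tokens after mapping each inflected form to its lemma."""
    return Counter(lemmas.get(tok, tok) for tok in tokenize(text))

# Invented sample text standing in for a corpus.
sample = "She runs daily. He ran yesterday. They are running now, and I run too."
freqs = lemma_frequencies(sample)

# Rank lemmas from most frequent to least frequent.
ranked = freqs.most_common()
```

Here runs, ran, running, and run are all counted under the single lemma run, so the ranked list begins with run rather than with four separate low-frequency entries.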
To study the meanings of individual words, the lexicographer can have a word displayed in KWIC (key word in context) format, and easily view the varying contexts in which a word occurs and the meanings it has in these contexts. And if the lexicographer desires a copy of the sentence in which a word occurs, it can be automatically extracted from the text and stored in a file, making obsolete the handwritten citation slip stored in a filing cabinet. If each word in a corpus has been tagged (i.e. assigned a tag designating its word class; cf. section 4.3), the part of speech of each word can be automatically determined. In short, the computer corpus and associated software have completely revolutionized the creation of dictionaries.
In addition to making the process of creating a dictionary easier, corpora can improve the kinds of information about words contained in dictionaries, and address some of the deficiencies inherent in many dictionaries. One of the criticisms of the OED, Landau (1984: 71) notes, is that it contains relatively little information on scientific vocabulary. But as the BBC English Dictionary illustrates, if a truly "representative" corpus of a given kind of English is created (in this case, broadcast English), it becomes quite possible to produce a dictionary of any type of English (cf. section 2.5 for a discussion of representativeness in corpus design). And with the vast amount of scientific English available in computerized form, it would now be relatively easy to create a dictionary of scientific English that is corpus-based.
Dictionaries have also been criticized for the unscientific manner in which they define words, a shortcoming that is obviously a consequence of the fact that many of the more traditional dictionaries were created during times when well-defined theories of lexical meaning did not exist. But this situation is changing as semanticists turn to corpora to develop theories of lexical meaning based on the use of words in real contexts. Working within the theory of "frame" semantics, Fillmore (1992: 39-45) analyzed the meaning of the word risk in a 25-million-word corpus of written English created by the American Publishing House for the Blind. Fillmore (1992: 40) began his analysis of risk in this corpus working from the assumption that all uses of risk fit into a general frame of meaning that "there is a probability, greater than zero and less than one, that something bad will happen to someone or something." Within this general frame were three "frame elements," i.e. differing variations on the main meaning of risk, depending upon whether the "risk" is not caused by "someone's action" (e.g. if you stay here you risk getting shot), whether the "risk" is due in some part to what is termed "the Protagonist's Deed" (e.g. I had no idea when I stepped into that bar that I was risking my life), or whether the "risk" results from "the Protagonist's decision to perform the Deed" (e.g. I know I might lose everything, but what the hell, I'm going to risk this week's wages on my favorite horse) (Fillmore 1992: 41-2).
In a survey of ten monolingual dictionaries, Fillmore (1992: 39-40) found great variation in the meanings of risk. In his examination of the 25-million-word corpus he was working with, Fillmore (1992) found that of the 1,743 instances of risk he identified, most had one of the three meanings. However, there were some examples that did not fit into the risk frame, and it is these examples that Fillmore (1992: 43) finds significant, since without having examined a corpus, "we would not have thought of them on our own." Fillmore's (1992) analysis of the various meanings of the word risk in a corpus effectively illustrates the value of basing a dictionary on actual uses of a particular word. As Fillmore (1992: 39) correctly observes, "the citation slips the lexicographers observed were largely limited to examples that somebody happened to notice …" But by consulting a corpus, the lexicographer can be more confident that the results obtained more accurately reflect the actual meaning of a particular word.
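A KWIC display of the kind described above can be approximated with a short Python sketch. The function name kwic, the width parameter, and the " | " separators are assumptions made for this illustration rather than features of any specific concordancing program, and the sample text simply strings together three short sentences using the word risk.

```python
import re

def kwic(text, keyword, width=30):
    """Return one display line per occurrence of keyword.

    Each line shows up to `width` characters of left context, the
    keyword itself, and up to `width` characters of right context,
    with the left context padded so the keywords align in a column.
    """
    flat = " ".join(text.split())  # collapse line breaks and extra spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}} | {m.group(0)} | {right}")
    return lines

# Sample text for illustration only.
sample = (
    "If you stay here you risk getting shot. "
    "I was risking my life. "
    "I am going to risk this week's wages."
)
lines = kwic(sample, "risk")
for line in lines:
    print(line)
```

Because the pattern matches whole words only, risking is not returned in a search for risk; a lexicographer who wants all forms of a word would search on each form or combine this display with lemma grouping.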
- English Corpus Linguistics: An Introduction
Charles F. Meyer
University of Massachusetts at Boston, 2002
