The notion of a lemma is so familiar in corpus linguistics that it scarcely
needs a formal definition. When a wordlist or a text is lemmatised, the
process is apparently transparent, so that any observer can understand how
the lemma relates to the original set or string of words. Yet the concept of
the lemma is not well defined, and is in need of a clear formal definition.
The lemma is a fundamental concept in the processing of texts in at least
some languages. It so happens that English lemmas are not
typical of the general category, so that linguists who base their understanding
of the lemma on English obtain a distorted view. It is essential to reverse the
direction of argument: to start with a general understanding of the
lemma, and then to consider English lemmas in the wider context.
Lemmas in corpus linguistics
According to the mainstream view, the basic purpose of the lemma is to enable
the corpus linguist to generalise about the behaviour of groups of words in
cases where their individual differences are irrelevant. For example, in compiling
a word frequency list or examining collocations we might wish to treat all
the forms of a verb together, or we might concordance on a noun irrespective
of whether it is in the singular or the plural. These groups can be created when
they are needed, for example by using regular expressions.
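As a minimal sketch of such an ad hoc grouping, the following Python fragment counts the forms of one verb with a regular expression. The sample text, the pattern and the helper name count_group are invented for illustration and are not drawn from any corpus or tool discussed here.

```python
import re
from collections import Counter

# Invented toy text, for illustration only (not corpus data).
text = ("She deals cards. He dealt with it. They deal daily. "
        "Dealing is hard. The dealer left.")

# An ad hoc group: the verb forms are simply listed in the pattern.
# The word boundaries keep 'dealer' out of the group.
DEAL_FORMS = re.compile(r"\b(?:deal|deals|dealing|dealt)\b", re.IGNORECASE)

def count_group(pattern, text):
    """Count each matched word form, lower-cased."""
    return Counter(m.group(0).lower() for m in pattern.finditer(text))

counts = count_group(DEAL_FORMS, text)
total = sum(counts.values())  # frequency of the group taken as a whole
```

The group exists only in the pattern: nothing is claimed about its theoretical status, and a different study could draw the boundary differently, for example by admitting dealer.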
In practice the first category to be ignored is inflection. Since inflectional
morphology depends on grammatical class (the conjugation of verbs, and the
declension of nouns, pronouns and perhaps adjectives),
the lemma is generally tied to traditional parts of speech. Francis and Kučera
(1982: 1) define the lemma as ‘a set of lexical forms having the same stem and
belonging to the same major word class, differing only in inflection and / or
spelling’. This definition is useful in practice, but brings in two further issues
in principle. The restriction to forms with the same stem raises the issue of
what to do with suppletive forms, such as go / went and the forms of be.
Spelling variants introduce a criterion of a different order, for it is one thing
to group singular and plural as different but related linguistic items, and quite
another to group different representations of what counts as exactly the same
linguistic item.
There is an obvious connection between what corpus linguists do when
grouping words under a single lemma, and what lexicographers do when they
group word forms under headwords. Crystal (1997) defines the lemma as a
‘dictionary headword; an abstract representation, subsuming all the formal lexical
variations which may apply: the verb walk, for example, subsumes walking,
walks and walked’. It is not at all clear how this lemma differs from a lexeme.
Grouping under headwords depends on what a particular editor decides to do
and varies according to the size of the dictionary and the needs of the intended
users, and the headword is not a rigorously defined theoretical concept. Nevertheless
Kennedy (1998: 97) treats lemmas and headwords as the same thing:
‘it is normal in corpus studies to list under the same headword or lemma the
inflectional variants’.
There is a well-known but ill-defined boundary between polysemy and homonymy
that affects the grouping of words under headwords and their assignment
to lemmas. For example, the metaphorical use of lion (e.g. John is a lion) is
likely to be treated as ‘the same word’, while the concrete and metaphorical uses
of crane (‘kind of bird’ and ‘machine for lifting heavy objects’) are more likely
to be treated as independent words and therefore members of different lemmas.
If it is difficult to group word meanings under headwords at the abstract level of
the dictionary, it is much more difficult to assign words in texts unambiguously
to their lemmas.
Biber et al. (1998: 29) use the lemma informally to ‘consider the different
forms of the word collectively’ in connection with frequency lists. They define
it rather vaguely as the base form of a word, disregarding grammatical changes
such as tense and plurality. However, they use small capitals to represent the
lemma, e.g. DEAL for deal, deals, dealing and dealt. It is a
small but highly significant further step to define the lemma as the name of
a lexical set, e.g. DEAL = {deal, deals, dealing, dealt, ...}.
Leech et al. (2001) treat lemmas in familiar terms as headwords and as
groups of inflectional variants (pp. 4–5), but in the main text present them
as sets of lexical items, with their members listed in italics. They also make a
distinction between the lemma and the simplex form. The usual practice of
lemmatising dealing to deal leaves the theoretical status of deal vague and unspecified.
The only interpretation that can ultimately be logically consistent is
to treat dealing as a member of the set DEAL. A consequence of this interpretation
is that the simplex form deal and the lemma DEAL are logically different
objects, and in fact deal is a member of DEAL. Using sets we can handle spelling
variants in a logical manner, for a member of a lemma can itself constitute a
set of spelling variants. But here Leech et al. are inconsistent, listing focussed
and focused as members of the verb focus (p. 60), while treating realise and realize
as separate lemmas.
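The set interpretation can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of how Leech et al. store their data: lemma names are written in capitals to keep them distinct from word forms, each member of a lemma is itself a set of spelling variants, and the names member, LEMMAS and lemmas_of are hypothetical.

```python
def member(*spellings):
    """A member of a lemma: a set of spelling variants (usually just one)."""
    return frozenset(spellings)

# Hypothetical inventory; only the focused / focussed pair comes from the text.
LEMMAS = {
    "DEAL": {member("deal"), member("deals"),
             member("dealing"), member("dealt")},
    "FOCUS": {member("focus"), member("focuses"),
              member("focusing"), member("focused", "focussed")},
}

def lemmas_of(word_form, lemmas=LEMMAS):
    """Return the sorted names of lemmas having a member that contains word_form."""
    return sorted(
        name
        for name, members in lemmas.items()
        if any(word_form in variants for variants in members)
    )
```

Under these assumptions lemmas_of("deal") returns ["DEAL"], while lemmas_of("DEAL") returns an empty list: the name of the set is not one of its members, which is the logical difference between the simplex form and the lemma. Both focused and focussed resolve to FOCUS through the same member.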
These different approaches constitute what might be called the ‘classical’
view of the lemma, treating it as a group of words that for practical purposes
can be treated as variants of the same word. However, as linguists have begun
to investigate increasingly large corpora, it has become apparent that individual
members of the lemma can behave independently and develop their own
meanings and collocations. For example, provided is the past participle of the
verb provide, but it has taken on a completely new role as a subordinating conjunction.
From a formal point of view it can be regarded as still a member of the
same lemma, but in that case we have to be careful about what inferences are
drawn from lemma membership: we certainly cannot make direct inferences
about distribution or meaning.
One of the major insights gained from the work of the ‘Birmingham’
school, at least from Sinclair (1987) on, is that as linguists examine more and
more data in greater and greater detail, generalisations about whole lemmas
become less and less convincing, and we have to consider individual words (and
actually even individual word meanings). Stubbs (1996: 172–173) discusses the
collocations of words related to educate: the commonest of these words is
education, which collocates with terms relating to institutions (e.g. higher
education) while the second most frequent is educated, which often collocates with
at and the name of a prestigious institution, e.g. educated at Cambridge. At this
level, whether or not educated and education belong to the same lemma is not
a matter of any great importance.
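The contrast between per-word and pooled counts can be illustrated with a small Python sketch. The sentences below are invented, not taken from Stubbs's data, and the function name neighbours and the one-word window are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Invented toy sentences, for illustration only (not corpus data).
sentences = [
    "he was educated at cambridge",
    "she was educated at oxford",
    "higher education is expensive",
    "further education is changing",
]
FORMS = {"educated", "education"}  # the two words discussed above

def neighbours(sentences, forms, window=1):
    """Map each target form to a Counter of words within `window` tokens of it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in forms:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                table[token].update(tokens[lo:i] + tokens[i + 1:hi])
    return table

by_form = neighbours(sentences, FORMS)
pooled = sum(by_form.values(), Counter())  # counts with the forms merged
```

In this toy data at occurs beside educated but never beside education. The pooled count still records at, but no longer shows which word it belongs to, which is the kind of detail that is lost when the forms are merged.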
Tognini-Bonelli (2001:92–98) challenges the assumption that members of
a lemma ‘are bound to share the same meaning and differ only in their grammatical
profile’. She contrasts the use of facing and faced, the former having a
concrete meaning (e.g. facing forwards) in addition to the metaphorical meaning
(e.g. facing a dilemma), and the latter having only the metaphorical meaning
(e.g. faced with a dilemma). Moreover, the concrete meaning of facing occurs
in what she calls the Birmingham corpus (general English), but not in the
more specialised Economist and Wall Street Journal corpus. At this level of detail,
it is difficult to see what significant generalisations (beyond spelling and
pronunciation) are captured by assigning facing and faced to the lemma face.
As the above discussion shows, the term lemma is currently used to
refer to a number of concepts which are undoubtedly related, but which are
logically different: an ad hoc group of words, a dictionary headword, a set of
inflectional variants, a label for a paradigm or set of paradigms, the name of a
set of lexical items, or a set of words including spelling variants. The concept
of the lemma is most useful at a general level in highly abstract discussions of
a language, but seems to be of doubtful value for detailed studies of real texts.
Source:
International Journal of Corpus Linguistics (2004)