What is corpus linguistics?
Corpus linguistics is viewed by some as an empirical method of linguistic analysis and description, using real-life examples of language data stored in corpora as the starting point (Crystal, 1992; Jackson, 2007). Corpus linguistics is maturing methodologically (McEnery and Wilson, 2001); it is an approach or methodology for studying language use (Bowker and Pearson, 2002: 9). Others view corpus linguistics as a theory, and so much more than a methodology (Sinclair, 1994, 1996, 2001, 2004). Halliday (1993:4) asserts that corpus linguistics 're-unites data gathering and theorizing and this is leading to a qualitative change in our understanding of language'. Teubert and Krishnamurthy (2007) suggest that corpus linguistics is a 'bottom-up' approach that looks at 'the full evidence of the corpus' and analyses that evidence with the aim of finding probabilities, trends, patterns, co-occurrences of elements, features or groupings of features. Corpus linguistics has also been regarded as a new philosophical approach to linguistic enquiry (Tognini-Bonelli, 2001: 1).
A corpus is not just any collection of texts; it is a collection of naturally occurring language texts, chosen to characterize a state or variety of a language (Sinclair, 1991:171). In other words, a corpus is designed and compiled according to corpus design principles. Sinclair (2005a) details a set of core principles, which are listed below:
1. Corpus contents are selected based on their communicative purpose in the community without regard for the language that they contain.
2. The control of subject matter in the corpus is imposed by the use of external, and not internal, criteria.
3. Only components in the corpus that are designed to be independently contrasted are contrasted (i.e., 'orientation').
4. Criteria determining the structure of the corpus are small in number, separate from each other, and efficient at delineating a corpus that is representative.
5. Samples of language for the corpus, whenever possible, consist of entire texts.
6. Any information about a text, such as part-of-speech tags and the typography and layout of a printed document, should be stored separately from the plain text (i.e., the words and punctuation of the text) and only merged when needed.
7. The design and composition of the corpus are fully documented with full justifications.
8. The corpus design includes, as target notions, representativeness and balance.
9. The corpus aims for consistency in its components while maintaining adequate coverage.
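Principle 6 in the list above, storing annotation separately from the plain text and merging only when needed, can be illustrated with a minimal sketch. It assumes that annotations are recorded as character offsets into the untouched text; the names Annotation, merge and pos_layer, the sample sentence and the tag labels are hypothetical and serve only as illustration, not as a format prescribed by Sinclair (2005a).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation (hypothetical format for illustration)."""
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    tag: str    # label attached to the span, e.g. a part-of-speech tag


def merge(text, annotations):
    """Build a tagged view of the text on demand; the text itself is not changed."""
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        parts.append(text[cursor:ann.start])                  # untagged material
        parts.append(f"{text[ann.start:ann.end]}_{ann.tag}")  # tagged span
        cursor = ann.end
    parts.append(text[cursor:])                               # trailing material
    return "".join(parts)


# The plain text (words and punctuation only) is stored on its own.
plain_text = "The corpus grows."

# The part-of-speech layer is stored separately and refers to the text by offset.
pos_layer = [
    Annotation(0, 3, "DET"),
    Annotation(4, 10, "NOUN"),
    Annotation(11, 16, "VERB"),
]

# Merging happens only when a tagged view is needed.
merged = merge(plain_text, pos_layer)
print(merged)
```

Because the annotation layer only points into the text, further layers (for example, typography or layout information) could be added or replaced without altering the stored plain text.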