FAQ
Frequently Asked Questions
created by F. Müller, L. Lehnen & H. Schmitz
Corpora are searchable collections of texts (spoken and written) in electronic format. Corpus linguistics is generally considered not a branch of linguistics but rather a method of carrying out linguistic research. With the help of corpora, you can access and analyse a vast amount of authentic language data.
Corpora can be used in many areas of linguistic research to answer both quantitative and qualitative research questions. Due to their specific design, i.e. mainly the texts they contain, some corpora suit certain research questions better than others. You can find a detailed account of a number of corpora in CoRD (The Corpus Resource Database) of the University of Helsinki.
A good way to get acquainted with corpora and corpus analyses is to get started right away and use one. A very suitable corpus to start with is the BNC (British National Corpus). By today’s standards, its size of 100 million words is comparatively small. However, the size of a corpus does not necessarily reflect its value. Moreover, the BNC website does not only host the corpus itself but also provides a built-in search engine. This search engine enables the user to perform all the basic tasks one needs for a linguistic analysis, e.g. creating word frequency lists. The website also offers a comprehensive help section in case you are not sure about your search set-up. These features make the BNC the perfect corpus to get started with corpus linguistics. As an unregistered user, you are only able to enter 20 queries per day; if you register, you can enter 50 queries per day.
Corpora basically differ in size, i.e. in the number of words and texts they include, and in the way they are compiled. The first decisions when creating a corpus normally concern the language(s) or language variety (or varieties) that the corpus is supposed to represent, the period from which the texts are taken and the text types. General corpora aim at providing a holistic representation of a certain language or language variety, whereas specialised corpora focus on particular domains, genres, periods in time, etc.
It might seem that the larger a corpus is, the better. Homogeneity, representativeness and comparability are, however, issues that have to be considered as well, which means that it might be more useful to restrict a corpus to certain varieties, genres, etc. so that it can be used to answer more specific research questions. The ICE corpora, for example, have been compiled with the aim of being comparable to each other, which means that their sampling frame is the same with regard to genres, number of files, words per genre, etc. However, in some varieties of English certain genres occur more (or less) frequently or carry different cultural implications; a standardised sampling frame may therefore diminish the representativeness of the corpus for that individual variety.
Another issue concerns the period from which the texts were sampled, since language change is a dynamic process and recent linguistic developments might not be represented in corpora whose texts were sampled some time ago. One way to avoid that problem is to use a so-called monitor corpus, which is not static but increases in size as new texts are constantly added. The freely available COCA (Corpus of Contemporary American English) is an example, and the internet may also be considered a monitor corpus. However, the problem with such corpora is often that the selection of texts is not carried out carefully, which is another drawback to the representativeness of the corpus. When writing a research paper, it is therefore perfectly fine to use a corpus with texts from the 1990s if you make explicit that it might not fully represent today’s language use. Language change does not take place within the blink of an eye anyway, which means that such a corpus is still up to date for most research questions.
As a large and ever-growing collection of texts, the internet can be considered a corpus as well, and search engines or specific interfaces, such as WebCorp Live, may serve to search it. Within this approach, however, some issues have to be treated with caution. First of all, the web is not designed as a linguistic corpus, which means that there is no division into different genres, and many texts online contain errors that are only of interest to specific research questions, for instance within the field of CMC (computer-mediated communication). In addition, there is almost no chance of replicating results because of the great speed with which the internet changes. Still, the internet provides rich data, and web-based corpus approaches may take advantage of this fact by compiling corpora from the material according to a sampling frame.
Some corpora are available online, and you can access them either directly or after prior registration, which means that you can conduct your research at home, provided you use the appropriate software. Other corpora are licensed and may not be copied. You can find lists of available corpora in databases such as CoRD (The Corpus Resource Database) online. There is also a variety of corpora available in the CIP-Pools of the University of Würzburg. An overview of these corpora is given in the section Types of Corpora.
This depends on which corpus you want to use for your analysis. As mentioned before, there are corpora that can be accessed online, like the BNC, COCA or WebCorp Live, which uses the web as a corpus. The BNC and COCA have a built-in search engine that allows you to conduct analyses without the need for any additional software. To analyse the web as a corpus, it is advisable to use the online engine WebCorp Live. To access corpora like FLOB or Brown in the CIP-Pool of the University of Würzburg, you open an analysis program (WordSmith Tools or AntConc) and then load the text files of the respective corpus before you start your analysis. If you create a corpus yourself, you need to download the corpus data onto your computer and then use appropriate analysis software. There is, however, enough software available for free that allows you to conduct your own research; which software you might use for that is described below.
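If you are curious what such an analysis looks like under the hood, even a few lines of a general-purpose language can produce a basic word frequency list from downloaded corpus files. The following Python sketch assumes your corpus consists of plain-text files in a folder called corpus_texts (a hypothetical path); dedicated tools like WordSmith Tools or AntConc perform the same task with far more options.

```python
# Minimal sketch: building a word frequency list from plain-text corpus
# files with Python's standard library only. "corpus_texts" is a
# placeholder for wherever your downloaded corpus files live.
from collections import Counter
from pathlib import Path
import re

freq = Counter()
for path in Path("corpus_texts").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    freq.update(re.findall(r"[a-z']+", text))  # very crude tokenisation

# Print the 20 most frequent word forms with their counts
for word, count in freq.most_common(20):
    print(f"{word}\t{count}")
```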
There are some software tools which are specifically designed for the analysis of certain corpora, e.g. ICECUP for ICE-GB and the DCPSE. WordSmith is another program with which you can conduct corpus analyses, provided that the sampled texts are plain texts (e.g. txt files). These programs are not free; WordSmith, however, is available on the computers in the CIP-Pools. The WordSmith homepage also offers tutorials on how to use this tool. One freely available program is KWIC-Concordance; it contains the basic functionalities to conduct small corpus analyses and is very easy to use. A short tutorial on how to use it can be found in the section Tutorials. AntConc is another free tool that can be used to conduct corpus research on plain, tagged and annotated texts. You can find tutorials on how to use AntConc on YouTube or on the website of its developer, Laurence Anthony.
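The central display of all these concordancers is the KWIC (keyword in context) view: every hit for a search word is shown centred, with a fixed span of context on either side. As a rough illustration of the idea (not of any particular tool's implementation), here is a minimal Python sketch; the sample sentence is invented.

```python
# Minimal sketch of a KWIC (keyword in context) display: each match is
# printed centred between a fixed window of left and right context.
import re

def kwic(text, keyword, width=30):
    pattern = r"\b%s\b" % re.escape(keyword)
    for match in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()].rjust(width)
        right = text[match.end():match.end() + width].ljust(width)
        print(f"{left} [{match.group()}] {right}")

kwic("The cook will cook dinner while the other cook watches.", "cook")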
The sampled texts within a corpus may be completely unedited or annotated, which means that additional information is added. The most common types of annotation are structural mark-up, POS-tagging and parsing. Structural mark-up includes the coding of the structural features of the text, such as different formats, e.g. <it>...</it> to indicate that a phrase was written in italics in the original text, or pauses and overlaps. In part-of-speech (POS) tagging, all the words of the corpus are assigned their word class, which means that you can differentiate, for instance, between cook as a noun and as a verb (in the corpus possibly visible as cook_NN and cook_VB), or search directly for specific parts of speech (e.g. for all adjectives in the superlative). Parsing adds the syntactic function of certain sentence parts to the POS-tagging. There are also other, more specific types of annotation, but generally, the more detailed the added information becomes, the more work has to be done manually before the release of the corpus.
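To make the word_TAG format concrete, here is a short Python sketch that splits such tagged text into word–tag pairs and then searches by tag. The example sentence and counts are invented, and the tag names (NN, VB, JJS) follow the Penn Treebank convention; always check your corpus documentation, since tagsets differ between corpora.

```python
# Minimal sketch: searching a POS-tagged corpus in the word_TAG format
# described above. Tags here follow the Penn Treebank tagset (an
# assumption; your corpus may use a different one).
tagged = "the_DT oldest_JJS cook_NN will_MD cook_VB the_DT finest_JJS meal_NN"

# Split each token into a (word, tag) pair at the last underscore
tokens = [tuple(t.rsplit("_", 1)) for t in tagged.split()]

nouns = [w for w, tag in tokens if tag == "NN"]          # cook as a noun
superlatives = [w for w, tag in tokens if tag == "JJS"]  # superlative adjectives
print(nouns)         # ['cook', 'meal']
print(superlatives)  # ['oldest', 'finest']
```

Note how the same word form cook is retrieved once as a noun and ignored as a verb, which is exactly the distinction an untagged corpus cannot make.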
For your first analyses using corpora, you don’t need broad statistical knowledge. At later stages, you should gain a general understanding of the statistics carried out by the corpus analysis tools so that you can interpret and discuss the numerical results correctly. Usually you don’t have to carry out statistical tests yourself, but if you do, it only makes sense for datasets of more than 100 hits.
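Should you ever need to run such a test yourself, a typical case is comparing the frequency of a feature in two corpora of different sizes with a chi-squared test. The sketch below uses the third-party SciPy library (one option among many statistics packages), and the counts are invented for illustration.

```python
# Minimal sketch of a chi-squared test comparing one feature's
# frequency in two corpora of different sizes (invented counts).
from scipy.stats import chi2_contingency

# Rows: corpus A, corpus B; columns: hits for the feature, all other words
table = [[150, 999850],   # corpus A: 150 hits in 1,000,000 words
         [90, 899910]]    # corpus B:  90 hits in   900,000 words

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # small p suggests a real difference
```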
Generally, this depends on your research question. Compiling a corpus, however, requires careful and sophisticated planning and effort, which normally exceeds the scope of a term paper. Therefore, you should probably reconsider your research question(s) if there are no existing corpora with which you can answer them. However, creating your own corpus might be an idea to keep in mind for your thesis.