Glossary
created by F. Müller
Authorship attribution
In authorship inquiries, forensic linguists can use corpora to identify the author of a specific text or set of texts, or to rule out a suspect's authorship. The analyst uses stylometric techniques such as the distribution of function words, the use of atypical words, etc., and draws on large corpora as a basis for the comparison between the text in question and the texts/language of the corpora.
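As a toy illustration of the function-word technique, the sketch below builds a relative-frequency profile over a handful of common function words for two short texts and compares the profiles with cosine similarity. The word list, the sample texts, and the scoring are invented for illustration; real stylometric methods (e.g. Burrows' Delta) use larger feature sets and more careful statistics.

```python
# Toy function-word profiling: not a real forensic method, just the idea.
from collections import Counter
import math

# Illustrative function-word list; real studies use far more features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

disputed = "the letter was found in the drawer and it was unsigned"
suspect = "it is clear that the evidence was planted in the room"
print(cosine(profile(disputed), profile(suspect)))
```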
Collocation
A sequence of words that occur next to each other more often than they would by chance. Collocations possess a certain degree of fixedness. An example of a typical collocation is to brush one's teeth. Although e.g. to clean one's teeth is technically correct as well, it is hardly used in natural language.
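As a sketch of how candidate collocations can be extracted automatically, the snippet below uses NLTK's collocation tools and ranks bigrams by pointwise mutual information; the file name corpus.txt is a placeholder, and the frequency filter and PMI are just one common choice among several association measures.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)  # tokenizer models, needed on first run

text = open("corpus.txt", encoding="utf-8").read()  # placeholder corpus file
tokens = nltk.word_tokenize(text.lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore bigrams occurring fewer than 3 times

# Rank the remaining bigrams by pointwise mutual information.
measures = BigramAssocMeasures()
for pair in finder.nbest(measures.pmi, 10):
    print(pair)
```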
Computer-mediated communication (CMC)
A form of communication between two or more people that takes place via electronic devices. Typical examples of CMC are text messaging, instant messaging, forums, chats, etc.
Concordance
A concordance is an alphabetical list which shows the results for the search term(s) along with the respective context for each result.
Corpus-based approach
A corpus-based approach describes a method in which the researcher starts from a hypothesis/theory and uses the corpus to investigate whether that hypothesis/theory proves to be valid or whether it is refuted by the data provided in the corpus. The features to search for are determined prior to the analysis of the corpus.
Every approach that is not corpus-driven is a corpus-based approach.
Corpus-driven approach
The corpus-driven approach views the corpus not as a means of testing a pre-existing hypothesis but as the source of data from which a new hypothesis is formulated. The researcher investigates the corpus without any previously postulated hypothesis and develops his/her theory while screening the data provided by the corpus. Only during the analysis of the corpus are the features decided on that become the focus of the study. The choice of features is ‘driven’ by the content of the corpus, so to speak.
Diachronic corpus analysis
A diachronic analysis is concerned with language change over a predefined span of time. Such a study can deal with grammatical change, semantic change, etc. Corpora that cover several periods of time offer a good basis for diachronic analyses. One corpus that enables the user to conduct diachronic analyses is COHA (Corpus of Historical American English). It spans the years 1810 to 2009, is regularly updated, and consists of more than 400 million words (as of March 2016).
Keyword in context
Also referred to as KWIC; the keyword is the search term itself. A KWIC search displays not only every token of the word in a simple list but also, for each token, the sentence in which the word was used, with the search string aligned in the middle of the screen. Usually, the user can determine the range of context shown in the results, e.g. five words before and after the search string. An advantage of this kind of display is that the user can see how the word is used and, in case of doubt, clarify its meaning by reading through the context.
There are tools that allow you to analyse corpora and extract KWIC lists from a text; free software includes e.g. KWIC Concordance.
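As a minimal sketch of what such tools do internally, the function below prints a KWIC display in plain Python; the sample sentence and the default window of five words are arbitrary choices.

```python
def kwic(tokens, keyword, window=5):
    """Print every occurrence of keyword with `window` words of context."""
    keyword = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the keywords line up in a column.
            print(f"{left:>40}  {tok}  {right}")

text = "she decided to brush her teeth before bed and then brush her hair"
kwic(text.split(), "brush")
```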
Monitor corpora
A monitor corpus is dynamic: over the course of time, new data is added, resulting in a continuously growing corpus (provided older data is not deleted at the same time). Monitor corpora are well suited to investigating (a) the most recent language developments, as the data provided is up to date, and (b) low-frequency features, since these corpora tend to be extremely large; they are therefore often used to search for neologisms and hapax legomena. One such monitor corpus is the Bank of English, which consists of 650 million words (as of 14 June 2016) and comprises written as well as spoken language from eight varieties of English. Monitor corpora may be financed by publishing companies for the compilation of dictionaries, in which case they are not available to the public.
N-Gram
A sequence of n elements appearing consecutively (3 elements = a 3-gram). N-grams are closely related to collocations in terms of the relation of the elements to each other. Google has developed an online tool, the Ngram Viewer, that allows users to search for a word sequence in a corpus of choice and to visualise its frequency over time in a chart.
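As a minimal sketch, n-gram extraction itself is a one-liner over a token sequence; the sample sentence is arbitrary.

```python
def ngrams(tokens, n):
    """Return all consecutive n-element sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 3))
# [('to', 'be', 'or'), ('be', 'or', 'not'), ('or', 'not', 'to'), ('not', 'to', 'be')]
```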
Qualitative analysis
In corpus linguistics, a quantitative analysis should be combined with a qualitative analysis, i.e. instead of just looking at the frequency of, say, a given word, one should also look more closely at its context of use and its meaning or function in individual cases.
Quantitative Analysis
In corpus linguistics, a large amount of data is usually analysed quantitatively, i.e. aided by software, large quantities of data are processed in the blink of an eye to determine the frequency of occurrence of a word, morpheme, phrase, etc.
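A minimal sketch of such an analysis: a word-frequency count with Python's standard library. The file name corpus.txt is a placeholder, and the naive lowercase-and-split tokenisation is far cruder than what real corpus software does.

```python
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read()  # placeholder corpus file
tokens = text.lower().split()  # naive tokenisation for illustration

freq = Counter(tokens)
for word, count in freq.most_common(20):  # 20 most frequent words
    print(f"{count:6d}  {word}")
```

Printed in full rather than cut off at the top 20 entries, the output of such a count is essentially the frequency-sorted word list described below.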
Static Corpus
A corpus that contains data on a specific field or topic over a fixed span of time; no additional data is added to the corpus afterwards. It enables the user to investigate a more specific topic than general corpora like COCA (Corpus of Contemporary American English) allow. Static corpora further aim to be as representative as possible, in that the data gathered covers every aspect of the topic as evenly as possible.
Synchronic corpus analysis
A synchronic study focuses on one period of time rather than comparing several periods with each other, e.g. the 1990s or the Early Modern English period.
Tagging
A form of annotation in which information about its word class is added to each word in a corpus; it is also referred to as part-of-speech (POS) tagging. To illustrate: a tag such as VB is assigned to every verb in its base form in the corpus. For this process, a specific tagset with fixed underlying rules is used. Software like TreeTagger is able to tag a corpus automatically in several languages with a relatively low error rate.
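As a minimal sketch of automatic tagging, the snippet below uses NLTK rather than TreeTagger itself (NLTK's default English tagger also outputs Penn Treebank tags such as VB); the example sentence is arbitrary, and the exact model names required by nltk.download may vary across NLTK versions.

```python
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = nltk.word_tokenize("The corpus software tags every word automatically.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('corpus', 'NN'), ('software', 'NN'), ('tags', 'VBZ'), ...]
```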
Type/token distinction
Suppose a search query returns 265 results, consisting of 100 verbs, 100 nouns, and 65 adjectives. The number of tokens then equals the number of results, in this case 265. The number of types in this example is 3: verbs, nouns, and adjectives. The number of tokens for the type verb is 100, etc. Note that in corpus linguistics the distinction is more commonly drawn at the level of word forms: each running word is a token, and each distinct word form is a type.
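In the word-form sense, the distinction is straightforward to compute; a minimal sketch with an arbitrary sample sentence:

```python
tokens = "the cat sat on the mat and the dog sat too".split()

n_tokens = len(tokens)      # every running word counts: 11
n_types = len(set(tokens))  # distinct word forms only: 8
print(n_tokens, n_types, n_types / n_tokens)  # the type/token ratio
```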
Word list
A list of the words of a corpus, typically sorted by frequency. The sorting criterion can be changed within the analysis software.