Exploring word order in learner corpora: The Woslac Project

Mendikoetxea, Amaya (2006) Exploring word order in learner corpora: The Woslac Project. In: Corpus Research Group, 2006-11-20. (Unpublished)

Microsoft Powerpoint (CORPUS_RESEARCH_SEMINAR-ultima.ppt)

Abstract

This presentation reports on work in progress within a research project investigating word order in Second Language Acquisition (WOSLAC), based on two written learner corpora: WriCLE (L1 Spanish - L2 English) and CEDEL2 (L1 English - L2 Spanish). In the first part of the presentation I will discuss (i) the motivation and objectives of the project, (ii) data collection, (iii) query software and (iv) data analysis. In the second part, I will briefly present the results of a preliminary study on the production of postverbal subjects by Spanish learners of English.
The purpose of this three-year project is to determine the properties which constrain word order in the interlanguage of L2 learners of English (with L1 Spanish) and L2 learners of Spanish (with L1 English). We examine both lexicon-syntax and syntax-discourse properties. Word order in English and Spanish differs significantly: in English word order is often said to be "fixed", while Spanish allows for what is often referred to as "free order". The two languages differ in the devices they employ to order constituents in the sentence. In languages with free word order, information structure properties and discourse properties in general play a crucial role in the position occupied by constituents in sentences, while lexico-syntactic properties mostly determine the ordering of constituents in fixed word order languages. An in-depth investigation into word order in advanced learners of L2 English and L2 Spanish will thus offer answers to questions regarding the relative difficulty of acquiring lexico-syntactic and syntax-discourse properties, as well as general issues related to L1 transfer and the occurrence of constructions which can be attributed neither to the L1 nor to the target language. Learner corpora are an invaluable tool for exploring these issues. Our target is for WriCLE and CEDEL2 to reach 1 million words by the end of the three-year period.
The corpora will be annotated using UAM CorpusTool, which has been adapted for this study. The tool allows an analyst to select a text from the corpus and annotate it in various ways. The analyst can highlight a segment (e.g., an it-cleft) and then assign features to that segment. The tool produces an XML-encoded version of the text file, including the features assigned to the segments. Because hand-annotation is slow, the tool will allow the analyst to associate lexico-syntactic patterns with each feature, so that the tool can automatically detect instances of the pattern. For instance, a pattern like "it be# NP that" would match sentences in the corpus like "It was John that we saw", and tentatively mark them with the feature it-cleft. The tool would then ask the user to eliminate false matches. This approach eliminates much of the corpus annotation effort.
In the second part of the talk I will briefly present the results in Lozano & Mendikoetxea (in press), a preliminary study whose purpose is to characterise the production of postverbal subjects in the Italian and Spanish subcorpora of ICLE (Granger et al. 2002). Our approach seeks to identify the conditions under which learners produce inverted subjects, regardless of problems to do with grammaticalisation. Our findings reveal that Spanish and Italian learners of L2 English produce postverbal subjects in the same contexts in which these are found in native English, though they show persistent grammaticalisation errors. That is, postverbal subjects are found when (H1) the verb is unaccusative, (H2) the subject is long or "heavy", and (H3) the subject is new (or relatively new) information or "focus".
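The pattern-based pre-annotation step described in the abstract can be illustrated with a rough sketch. This is not UAM CorpusTool's implementation or its pattern syntax: it is a plain regular-expression approximation in Python, in which the names `BE_FORMS`, `NP`, `IT_CLEFT` and `find_candidates` are hypothetical, "be#" is assumed to mean any inflected form of "be", and "NP" is crudely approximated as one to four word tokens, since no parser is used.

```python
import re

# Assumption: "be#" stands for any inflected form of "be".
BE_FORMS = r"(?:is|was|are|were|be|been|being|am)"
# Crude stand-in for NP: one to four word tokens (no parsing is done).
NP = r"(?:[A-Za-z][\w'-]*\s+){1,4}?"
# Approximation of the pattern "it be# NP that".
IT_CLEFT = re.compile(rf"\bit\s+{BE_FORMS}\s+(?P<np>{NP})that\b", re.IGNORECASE)


def find_candidates(text, feature="it-cleft"):
    """Return tentative matches; each still has to be vetted by an analyst."""
    return [
        {
            "feature": feature,
            "start": m.start(),
            "end": m.end(),
            "segment": m.group(0),
            "confirmed": False,  # flipped only after manual review
        }
        for m in IT_CLEFT.finditer(text)
    ]


sample = "It was John that we saw. It is clear that he left."
for candidate in find_candidates(sample):
    print(candidate)
```

On the sample text the sketch flags both "It was John that" and "It is clear that". The second is an extraposition rather than an it-cleft, which is the kind of false match the analyst would be asked to eliminate.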

Item Type:
Contribution to Conference (Paper)
Journal or Publication Title:
Corpus Research Group
Uncontrolled Keywords:
learner corpora, word order, unaccusativity, heaviness, topic and focus
Subjects:
P Philology. Linguistics
ID Code:
285
Deposited On:
24 Nov 2006
Refereed?:
No
Published?:
Unpublished
Last Modified:
11 Sep 2023 11:46