Humboldt-Universität zu Berlin

The Bangla RST Discourse Treebank

Project PIs: Dr. Debopam Das and Dr. Manfred Stede; May 2017 – July 2018; University of Potsdam

This project aims to develop a corpus in Bangla (an Indo-Aryan language) annotated for coherence relations, following Rhetorical Structure Theory (RST), and for relational signals. The corpus contains 266 texts comprising 71,009 words, an average of 267 words per text. It represents the newspaper genre: the texts were collected from Anandabazar Patrika, a popular Bangla daily published in India. Work on the corpus began with the annotation of 16 texts, which were evaluated for agreement among the annotators. The present work covers the annotation of the remaining 250 texts, which represent different sub-genres within the newspaper genre.