Corpora

The most substantial corpus of the CompLing database is TITUS 2.0, which contains most of the datasets of the older TITUS database.

However, some corpora and datasets have been taken out of TITUS and rebuilt with more modern interfaces. This page gives an overview of these independent corpora (note that some will be integrated into TITUS 2.0 later on). For dictionaries, see the page Lexica.

Indo-European: Germanic

ReA/LeA

LeA – Lesekorpus Altdeutsch is a grammatically annotated corpus of Old High German and Old Saxon texts.

Indo-European: Celtic

Ogam

The Ogam corpus includes photos, transliterations, transcriptions, translations, and metadata of Ogam inscriptions. For the moment, the Ogam database remains available outside TITUS 2.0, but it will be integrated into the new corpus in the future.

Indo-European: Tocharian

Tocharica

TITUS: Tocharica contains photos, transliterations, and transcriptions of the Tocharian manuscripts of the Berlin Turfan Collection. Note that this data is integrated with the CEToM – A Comprehensive Edition of Tocharian Manuscripts database at the University of Vienna.

Indo-European: Baltic

SLIEKKAS

SLIEKKAS is a corpus of Old Lithuanian texts (1500-1800), containing around 10 million textwords.

CorDon

CorDon is an annotated Old Lithuanian corpus (24,000 words) of the works of the Lithuanian national poet Kristijonas Donelaitis.


Kartvelian: Georgian

Rustaveli

Rustaveli goes Digital is a database of translations of the medieval Georgian epic poem The Knight in the Panther’s Skin by Shota Rustaveli.

ECLING

The ECLING corpus and archive contains texts and recordings of the endangered languages Batsbi, Svan, and Udi, spoken in Georgia.


Turkic

VATEC

VATEC is a corpus of pre-Islamic Old Turkic texts.