DiACL is an open database with grammatical and lexical data for comparative and phylogenetic linguistics. It holds data from thousands of languages, but focuses mainly on Eurasia, the Pacific, and South America. The database has the following content:
- Lexical data sets with basic vocabularies (Swadesh lists)
- Lexical data sets with culture vocabularies, focusing on subsistence system vocabulary
- Typological/morphosyntactic data sets including the main types Word Order, Alignment, and Nominal/Verbal Morphology

DiACL contains data from contemporary and historical languages and, where possible, reconstructed languages. Data is derived from dictionaries, grammars, or new fieldwork (in particular data from the Caucasus and the Amazon). All data is sourced from scientifically reliable literature, which can be retrieved through the page Sources.
Language metadata (retrievable via Language) includes names, ISO 639-3 code, Glottocode, timeframe, focal point, focus area, language area (=Glottolog area), language reliability (living, moribund, extinct, fragmentary), and position in family tree.
Data from DiACL have been used in several studies (see Publications), most importantly in the monograph Mouton Atlas of Languages and Cultures (2019). Fieldwork data (ELAN, audio/video files) from several of the targeted languages have been made available via the Lund Corpus Server at the Lund Humanities Lab. Other fieldwork data will be made available via the CompLing platform under a future Field Archive.
DiACL is a SWE-CLARIN resource, hosted by the Goethe University Frankfurt.
Lexical data
Lexical data is organized by concept lists (Word lists in the database). There are currently two main types of Word lists: Swadesh lists and Culture lists. The Swadesh lists contain either 100 or 200 concepts. These are the same regardless of language family, language area or focal area. Culture lists, on the other hand, differ between focal areas. Culture lists are organized into semantic taxonomies of lexical meanings, which are adapted to the macro-areas. Culture vocabulary meanings are selected according to geography and environment (by identifying culture-relevant flora and fauna of macro-areas), relevance to the subsistence systems of language families, cultural function or affordance, and occurrence in reconstructed vocabularies of targeted language families. For a more detailed description of the motivations for selecting culture lists, see Carling (2019). Further, lexemes are organized under etymologies (cognates), which are graphically reproduced as trees and maps on the database frontend. Lexical etymologies account for borrowing, morphological derivation, and semantic change.
Typological and morphosyntactic data
Typological/morphosyntactic data in DiACL are organized into a four-level hierarchy, which enables coding of polymorphic behaviour (e.g., several word orders) in individual languages. Typological/morphosyntactic features are selected to match known prototypical features of the linguistic areas of the included macro-areas, targeting properties which ensure typological variation and which are known to correlate typologically with each other. In addition, typological/morphosyntactic features are selected according to whether they can be identified in historical languages.
The data differ from several other similar resources, such as WALS, mainly because the data is organized according to hierarchical categorical features. The hierarchical model is described in several publications, such as Carling et al (2018) and Carling (2019).
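To illustrate why a hierarchy allows polymorphic coding, the following is a minimal sketch in Python. It does not reproduce the DiACL schema: the level names (grid, feature, variant, value), the language names and the coded values are all assumptions made for the example. The point is only that a language can have more than one variant of the same feature coded as present.

```python
# Hypothetical records: (language, grid, feature, variant, value).
# Level names and all values are illustrative, not taken from DiACL.
RECORDS = [
    ("Language A", "Word Order", "Main clause", "SVO", 1),
    ("Language A", "Word Order", "Main clause", "SOV", 1),
    ("Language A", "Word Order", "Main clause", "VSO", 0),
    ("Language B", "Word Order", "Main clause", "SOV", 1),
]


def attested_variants(records, language, grid, feature):
    """Return every variant coded as present (value 1) for one feature."""
    return sorted(
        variant
        for lang, g, f, variant, value in records
        if (lang, g, f) == (language, grid, feature) and value == 1
    )


# Language A is coded with two word orders for the same feature.
print(attested_variants(RECORDS, "Language A", "Word Order", "Main clause"))
```

A flat one-value-per-feature table would force a single choice for Language A here; coding each variant separately avoids that.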
Downloading data
Data and datasets in DiACL can be downloaded in several ways: directly from the database as json or csv files, from the DiACL Zenodo library, or via other, external sources.
The language metadata of DiACL can be obtained in several ways. Via the Language Index Page, it is possible to download all metadata of the database as json/csv, or the data for each individual language (icon in the right-most column).
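Once a metadata csv file has been downloaded, it can be read with standard tools. The sketch below uses Python's csv module on an inline sample. The column names (name, glottocode, status) and the rows are placeholders invented for the example; the actual export may use different headers, so they would need to be adapted.

```python
import csv
import io

# Stand-in for a downloaded metadata file. Column names and rows are
# hypothetical and must be adapted to the actual export.
SAMPLE_CSV = """name,glottocode,status
Language A,xxxx0001,living
Language B,xxxx0002,extinct
"""


def load_metadata(handle):
    """Read csv rows into a list of dicts keyed by the header line."""
    return list(csv.DictReader(handle))


def filter_by(rows, column, value):
    """Keep the rows whose given column equals the given value."""
    return [row for row in rows if row.get(column) == value]


rows = load_metadata(io.StringIO(SAMPLE_CSV))
living = filter_by(rows, "status", "living")
print([row["name"] for row in living])
```

For a real file, the io.StringIO object would be replaced by open(path, newline="", encoding="utf-8"), where path is the location of the downloaded file.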
Family information can be obtained in json format via the Language Tree Page, either for all family trees (json icon at the top) or by family (in the drop-down menu for each family).
Language metadata can also be obtained via the DiACL data for CLICS, published by Rzymski et al (2019) and stored in a Zenodo library, which gives the equivalent Glottolog names, Glottocodes and families for most (but not all) DiACL languages. The latest version (3.0) is found here.
Lexical data can be downloaded in json format from the Word List page. For each Culture list, there is an icon next to the list by which the entire list can be downloaded. Swadesh lists can be downloaded either as an entire list (100 or 200) for all languages that have data, or by family (again distinguished by Swadesh 100 or 200).
Lexical data that have been published in the Mouton Atlas (Carling 2019) (that is, data from the Eurasian continent) can be downloaded from the DiACL Zenodo Library, as an xlsx file named Appendix 3b. Lexical data in DiACL can also be downloaded via Lexibank on Zenodo or GitHub.
Typological/morphosyntactic data in the DiACL database is divided by macro-region. Currently, there are three regions: Eurasia, Pacific, and South America. Of these, Eurasia is best provided with data. The datasets can be downloaded from the Typological Data Sets page, where each download produces json/xlsx files with all the data and the metadata for the languages.
The curated and recoded grammar files used in the Mouton Atlas (Carling 2019) (that is, data from the Eurasian continent) can be downloaded from the DiACL Zenodo Library. Grammar data from this publication are given as Appendix 2b and 2c (Appendix 2a of the volume is a list of features, which can be downloaded on the webpage). Appendix 2b gives the grammar features as state combinations with labels, as they are used in the maps of the atlas. Appendix 2c gives the state combinations of 2b as they appear in the languages (see the atlas, pp. 211-225).
