Corpora and Demos


PADIC is a Parallel Arabic DIalect Corpus

PADIC is composed of about 6400 sentences in dialects from both the Maghreb and the Middle East. Each dialect has been aligned with Modern Standard Arabic (MSA). PADIC includes four dialects from the Maghreb (two from Algeria, one from Tunisia and one from Morocco) and two dialects from the Middle East (Syria and Palestine). PADIC was built from scratch by members of the SMarT research team: Salima Harrat, Karima Meftouh and K. Smaïli, with the participation of M. Abbas. The Tunisian translation was done by Salma Jamoussi, the Moroccan by Samia Haddouchi, the Palestinian by Motaz Saad and the Syrian by Charif Alchieekh Haydar.

Any use of PADIC shall include the following acknowledgement: “Programme material SMarT”, and shall cite the following articles:

  1. K. Meftouh, S. Harrat, M. Abbas, S. Jamoussi, and K. Smaïli, “Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus”, PACLIC29, Shanghai, 2015.
  2. K. Meftouh, S. Harrat, and K. Smaïli, “PADIC: extension and new experiments”, 7th International Conference on Advanced Technologies (ICAT), Apr 2018, Antalya, Turkey.

Note that in the first version of PADIC, and in this last paper, the Moroccan dialect was not available and is therefore not mentioned. Two Moroccan speakers, from Casablanca and Rabat, contributed the Moroccan part. Consequently, we recommend that you reference the above article.

Download: A Parallel Arabic DIalect Corpus

CALYOU is a Comparable Spoken Algerian Corpus Harvested from YouTube

We developed an approach based on word embeddings that aligns each comment written in Algerian dialect with the best-matching comment written in Latin script. This means that an Arabic dialect sentence can be aligned with a French or Arabizi sentence. A sample is given below.

Karima Abidi, Mohamed Amine Menacer, Kamel Smaïli, “CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube”, 18th Annual Conference of the International Speech Communication Association (Interspeech), Aug 2017, Stockholm, Sweden.
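As a rough illustration of the alignment idea described above, the sketch below pairs a comment with the candidate whose averaged word vectors are closest under cosine similarity. This is not the authors' implementation: the averaging step, the cosine measure, the function names and the tiny two-dimensional vectors are all assumptions made for the example, and it presumes that words from both scripts already live in one shared embedding space.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (assumed pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_match(source_tokens, candidates, embeddings):
    """Return (candidate, score) for the most similar candidate, or (None, None)."""
    source = sentence_vector(source_tokens, embeddings)
    best, best_score = None, None
    if source is None:
        return best, best_score
    for candidate in candidates:
        vector = sentence_vector(candidate, embeddings)
        if vector is None:
            continue
        score = cosine(source, vector)
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Hypothetical toy vectors in a shared space; real embeddings would be learned.
toy_embeddings = {
    "فلم": [1.0, 0.0],
    "film": [0.9, 0.1],
    "يرحمك": [0.0, 1.0],
    "yarhmek": [0.1, 0.9],
}

latin_comments = [["film"], ["yarhmek"]]
match, score = best_match(["فلم"], latin_comments, toy_embeddings)
print(match)
```

With these toy vectors the Arabic-script comment is paired with the Latin-script comment containing `film`. On real data one would typically also apply a similarity threshold so that comments with no good counterpart stay unaligned.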

Download: CALYOU (Comparable Spoken Algerian Corpus harvested from YOUtube)

Lexicon of variable forms of Algerian words

This resource gathers words together with their different spellings (orthographic variability). This variability is the main characteristic of the Arabic dialects, especially those used on social networks. It is due to the lack of a standardized writing system, the use of Arabizi (writing Arabic words with Latin characters) and the lack of grammatical rules for the dialects. The lexicon was built automatically using a word embedding approach. Each entry consists of a word followed by its different written forms. This resource can be useful in many natural language processing applications. Some examples of the dictionary entries are given below.

Karima Abidi, Kamel Smaïli, “An Automatic Learning of an Algerian Dialect Lexicon by using Multilingual Word Embeddings”, 11th edition of the Language Resources and Evaluation Conference, LREC 2018, May 2018, Miyazaki, Japan.

يحفدك yahafdak yahfdek yahfedk yahefdek yahafdeek yahfdak yahafdek yahefdak yehafdak yahafadak
يرحمك yr7mk yr7mak yrhamak yarhamek yarhemak yarhamk yarhmek yr7mek yere7mek yarhamak yarhemek yrhmk yar7mak yarhmak yer7mak yarahmak yar7mek yarahmk yerhemek yarahmek yerehmek yerhamek yar7mik yare7mek yerhamak yer7mek yerehemek yarhmeke rahimaka yrahmek yrahmak irahmak irhmak irahmek irhmk yra7mk yerahmak yrehmak yera7mak yerehmk yrhmak yera7mek yrehmek yara7mak yarehmek yara7mek yerahmeke yrhmek yarehmak yarhmk yerhmk yarhmeek yra7mak yerahmek ir7mak yra7mek yrahmk yarhamoka yrehmk yar7mk yerhmk ira7mak irehmek yerhmek yarahemek yerahmk yerhmek yrhmek yerahmak
فلم film filme
Mister مستر ميستر
Mansotich مانسوطيش منسطيش منسوطييش مانصوتيش مانسوطييش ماانسووطيش منسوطيش مانسوطيوش مانصوطوش منصوطيش
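
One possible way to use entries like those above is to map every written variant back to its head word. The sketch below is an illustration only: it assumes one entry per line, with the head word first and its variants after it, and it tokenizes by runs of Arabic letters or Latin letters and digits. The actual file format of the distributed lexicon is not described here, and the function names are hypothetical.

```python
import re

# A token is a run of Arabic-block characters or a run of Latin letters/digits
# (Arabizi spellings use digits such as 7). Splitting on script changes also
# separates a head word that is written directly next to its first variant.
TOKEN = re.compile(r"[\u0600-\u06FF]+|[A-Za-z0-9]+")

def parse_entry(line):
    """Split one entry into (head_word, list_of_variants)."""
    tokens = TOKEN.findall(line)
    if not tokens:
        raise ValueError("empty lexicon entry")
    return tokens[0], tokens[1:]

def build_variant_index(lines):
    """Map each head word and each variant to the head word of its entry."""
    index = {}
    for line in lines:
        if not line.strip():
            continue
        head, variants = parse_entry(line)
        index[head] = head
        for variant in variants:
            # Keep the first head word seen if a variant occurs in several entries.
            index.setdefault(variant, head)
    return index

sample_entries = [
    "فلم film filme",
    "Mister مستر ميستر",
]
variant_index = build_variant_index(sample_entries)
print(variant_index["filme"])
print(variant_index["ميستر"])
```

Such an index could serve as a simple normalization step, replacing each spelling found in a social-media text with a single canonical form before further processing.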