Corpora and Demos

PADIC

PADIC is a Parallel Arabic DIalect Corpus composed of about 6,400 sentences of dialects from both the Maghreb and the Middle East. Each dialect has been aligned with Modern Standard Arabic (MSA). PADIC includes four dialects from the Maghreb (two from Algeria, one from Tunisia, one from Morocco) and two dialects from the Middle East (Syria and Palestine). PADIC has been built from scratch by the members of the SMarT research group: Salima Harrat, Karima Meftouh and K. Smaïli, with the participation of M. Abbas. The translation into Tunisian was done by Salma Jamoussi, into Moroccan by Samia Haddouchi, into Palestinian by Motaz Saad and into Syrian by Charif Alchieekh Haydar.

Any use of PADIC must include the following acknowledgement: “Programme material SMarT” and must cite the following articles:

  1. K. Meftouh, S. Harrat, M. Abbas, S. Jamoussi, and K. Smaïli, “Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus”, PACLIC 29, Shanghai, 2015.
  2. K. Meftouh, S. Harrat, and K. Smaïli, “PADIC: Extension and New Experiments”, 7th International Conference on Advanced Technologies (ICAT), Antalya, Turkey, April 2018.

Note that the Moroccan dialect was not available in the first version of PADIC and is not mentioned in the corresponding paper. The Moroccan part, contributed by two speakers from Casablanca and Rabat, was added later. Consequently, we recommend referencing the articles above.

Download: A Parallel Arabic DIalect Corpus

CALYOU is a Comparable Spoken Algerian Corpus Harvested from YouTube

We developed an approach based on word embeddings that aligns each comment written in Algerian dialect (in Arabic script) with the best-matching comment written in Latin script. This means that an Arabic dialect sentence can be aligned with a French or Arabizi sentence. A sample is given below.

Karima Abidi, Mohamed Amine Menacer, Kamel Smaïli, “CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube”, 18th Annual Conference of the International Speech Communication Association (Interspeech 2017), Aug 2017, Stockholm, Sweden.

Download: CALYOU (Comparable Spoken Algerian Corpus harvested from YOUtube)
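As a rough sketch of the alignment idea (not the exact CALYOU pipeline), the snippet below represents each comment by the average of its word vectors in a shared embedding space and aligns each Arabic-script comment with the closest Latin-script comment by cosine similarity. The toy vectors, words and dimensionality are illustrative assumptions.

```python
# Minimal sketch of cross-script comment alignment via word embeddings.
# Assumptions (illustrative, not from the CALYOU paper): embeddings live in
# a shared space for Arabic-script and Latin-script (Arabizi/French) words,
# and a comment is represented by the average of its word vectors.
import numpy as np

def comment_vector(comment, embeddings, dim=4):
    """Average the vectors of the known words in a comment."""
    vecs = [embeddings[w] for w in comment.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def align(arabic_comments, latin_comments, embeddings):
    """For each Arabic-script comment, pick the closest Latin-script one."""
    pairs = []
    for a in arabic_comments:
        va = comment_vector(a, embeddings)
        sims = []
        for l in latin_comments:
            vl = comment_vector(l, embeddings)
            denom = np.linalg.norm(va) * np.linalg.norm(vl)
            sims.append(va @ vl / denom if denom else 0.0)
        pairs.append((a, latin_comments[int(np.argmax(sims))]))
    return pairs

# Toy shared embedding space (hypothetical 4-d vectors).
emb = {
    "مليح": np.array([0.9, 0.1, 0.0, 0.0]),   # "good" in Algerian dialect
    "mli7": np.array([0.88, 0.12, 0.0, 0.0]),  # same word in Arabizi
    "bravo": np.array([0.7, 0.3, 0.1, 0.0]),
}
print(align(["مليح"], ["mli7", "bravo"], emb))  # aligns مليح with mli7
```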

Lexicon of variable forms of Algerian words

This resource gathers words together with their different possible written forms (orthographic variability). This variability is the main characteristic of the Arabic dialects, especially those used on social networks. It is due to the lack of a writing standard, the use of Arabizi (writing Arabic words with Latin characters) and the lack of grammatical rules for the dialects. The lexicon was built automatically using a word embedding approach: each entry is composed of a word and its different written forms. This resource can be very useful in many natural language processing applications. Some examples of the dictionary entries are given below, followed by a sketch of the embedding-based construction.

Karima Abidi, Kamel Smaïli, “An Automatic Learning of an Algerian Dialect Lexicon by using Multilingual Word Embeddings”, 11th Language Resources and Evaluation Conference (LREC 2018), May 2018, Miyazaki, Japan.

Entry       Forms
يحفدك       yahafdak yahfdek yahfedk yahefdek yahafdeek yahfdak yahafdek yahefdak yehafdak yahafadak
يرحمك       yr7mk yr7mak yrhamak yarhamek yarhemak yarhamk yarhmek yr7mek yere7mek yarhamak yarhemek yrhmk yar7mak yarhmak yer7mak yarahmak yar7mek yarahmk yerhemek yarahmek yerehmek yerhamek yar7mik yare7mek yerhamak yer7mek yerehemek yarhmeke rahimaka yrahmek yrahmak irahmak irhmak irahmek irhmk yra7mk yerahmak yrehmak yera7mak yerehmk yrhmak yera7mek yrehmek yara7mak yarehmek yara7mek yerahmeke yrhmek yarehmak yarhmk yerhmk yarhmeek yra7mak yerahmek ir7mak yra7mek yrahmk yarhamoka yrehmk yar7mk yerhmk ira7mak irehmek yerhmek yarahemek yerahmk yerhmek yrhmek yerahmak
فلم         film filme
Mister      مستر ميستر
Mansotich   مانسوطيش منسطيش منسوطييش مانصوتيش مانسوطييش ماانسووطيش منسوطيش مانسوطيوش مانصوطوش منصوطيش
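A minimal sketch of how such a lexicon can be learned, assuming (as an illustration, not the paper's exact procedure) that spelling variants occur in similar contexts and therefore get close embeddings: neighbours in the embedding space are kept as variants only if they are also orthographically close. The toy corpus, hyperparameters and threshold are made up.

```python
# Minimal sketch of grouping spelling variants with word embeddings.
# Assumption: variants of the same dialect word occur in similar contexts,
# so their vectors are close; a string-similarity filter removes neighbours
# that are merely related words rather than spelling variants.
from difflib import SequenceMatcher
from gensim.models import Word2Vec

# Toy corpus; in practice this would be millions of social-network comments.
corpus = [
    ["rabi", "yrhmk", "w", "yrahmek"],
    ["rabi", "yrahmek", "w", "yr7mk"],
    ["rabi", "yr7mk", "w", "yrhmk"],
]
model = Word2Vec(corpus, vector_size=32, window=2, min_count=1, epochs=50)

def variants(word, topn=10, min_ratio=0.5):
    """Embedding neighbours that are also orthographically close."""
    out = []
    for cand, sim in model.wv.most_similar(word, topn=topn):
        if SequenceMatcher(None, word, cand).ratio() >= min_ratio:
            out.append((cand, sim))
    return out

print(variants("yrhmk"))  # e.g. keeps yr7mk/yrahmek, drops rabi/w
```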

Demos

Voice conversion is an important problem in audio signal processing. The goal of voice conversion is to transform the speech signal of a source speaker in such a way that it sounds as if it had been uttered by a target speaker, while preserving the linguistic content of the original source signal. We propose a novel methodology for mapping between two sets of spectral envelopes. Our system works by: 1) cascading Deep Neural Networks (DNN) and Gaussian Mixture Models (GMM) to construct DNN-GMM and GMM-DNN-GMM predictors that find a global mapping between the cepstral vocal tract vectors of the two speakers; 2) using a new spectral synthesis process with excitation and phase extracted from the target training space encoded as a KD-tree. This demo presents samples of voice conversion outputs from male and female source speakers.
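As a rough illustration of step 2, the sketch below indexes the target training frames in a KD-tree and, for each converted vocal tract vector, retrieves the excitation and phase of the nearest target frame for use in synthesis. The array shapes, cepstral order and variable names are illustrative assumptions, not details taken from the system.

```python
# Minimal sketch of the frame-selection step: target training frames are
# indexed in a KD-tree, and each converted vocal-tract (cepstral) vector is
# mapped to its nearest target frame, whose excitation and phase are reused.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n_target, dim = 1000, 24            # target training frames, cepstral order
target_cepstra = rng.normal(size=(n_target, dim))
target_excitation = rng.normal(size=(n_target, 64))       # per-frame excitation
target_phase = rng.uniform(-np.pi, np.pi, size=(n_target, 64))

tree = cKDTree(target_cepstra)      # built once on the target training space

def select_frames(converted_cepstra):
    """Nearest-neighbour excitation/phase lookup for converted frames."""
    _, idx = tree.query(converted_cepstra, k=1)
    return target_excitation[idx], target_phase[idx]

# In the real system, 'converted_cepstra' would come from the DNN-GMM mapping.
converted_cepstra = rng.normal(size=(50, dim))
exc, phase = select_frames(converted_cepstra)
print(exc.shape, phase.shape)       # (50, 64) (50, 64)
```

Building the tree once over the target space keeps each per-frame lookup at roughly logarithmic cost rather than a linear scan over all training frames.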

Audio demos: for each conversion pair below, the source speaker's original utterance, the target speaker's original utterance, and the source utterance converted with each of the four methods are provided.

Conversion pair   Methods
CLB → BDL         GMM, DNN, GMM-DNN-GMM, DNN-GMM
BDL → RMS         GMM, DNN, GMM-DNN-GMM, DNN-GMM
SLT → CLB         GMM, DNN, GMM-DNN-GMM, DNN-GMM
RMS → SLT         GMM, DNN, GMM-DNN-GMM, DNN-GMM
The method adopted in this work for enhancing esophageal voice consists in combining a voice conversion technique with a time dilation algorithm. The proposed system extracts and separates excitation and vocal tract parameters using a speech parameterization process (Fourier cepstra). Next, a Deep Neural Network (DNN) is used as a nonlinear mapping function for vocal tract vector transformation, and the converted vectors are used to determine a realistic excitation and phase from the target training space using a frame selection algorithm. However, in order to preserve the laryngectomee speaker's identity, we use the source vocal tract features and apply a time dilation algorithm to them to reduce the unpleasant esophageal noises. Finally, the converted speech is resynthesized using the dilated source vocal tract parameters and the predicted excitation and phase. An experimental comparison study was carried out to review the changes in speech quality and intelligibility of the different enhanced samples. The results demonstrate that the proposed method yields great improvements in the intelligibility and naturalness of the converted esophageal stimuli, and objective and subjective evaluations of the voice conversion validate the proposed approach.

The table below exhibits some examples obtained by the proposed voice enhancement technology. The “Source Speech” column indicates examples of utterances pronounced by two laryngectomees. The next four columns, “GMM”, “DNN”, “DNN_Src” and “DNN_Srcdilated”, contain the resynthesized results produced by our voice conversion technology.
Source Speech   GMM   DNN   DNN_Src   DNN_Srcdilated
(audio samples: one row per laryngectomee utterance)
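For the time-dilation step, a rough waveform-level stand-in is sketched below using librosa's phase-vocoder time stretch. Note that the actual method dilates the source vocal tract parameters rather than the waveform; the file names and dilation factor here are illustrative assumptions.

```python
# Rough stand-in for time dilation: slow the signal down before resynthesis.
# The real system operates on vocal tract parameters, not the raw waveform.
import librosa
import soundfile as sf

y, sr = librosa.load("esophageal_utterance.wav", sr=None)  # hypothetical file
rate = 0.8                       # rate < 1 dilates (lengthens) the signal
y_dilated = librosa.effects.time_stretch(y, rate=rate)
sf.write("esophageal_dilated.wav", y_dilated, sr)
```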
This section provides an example of the performance of the AMIS system. On the left is an original video taken from the news; on the right is the summary video produced by the system.

In the context of Arab maqamic music, the mawwāl is a non-metric vocal improvisation, often applied to narrative poetry. Upon the completion of each vocal sentence in the mawwāl, the instrumentalist performs a recapitulation, or a translation, of the vocal sentence. This musical submission features the new and improved version of the Mawaweel computer application for automatic accompaniment to the mawwāl. The application provides real-time instrumental translations of vocal sentences and is primarily based on machine learning.

To train our model, and for technical reasons, we constructed our own parallel corpus. Because of the high expense and great effort this required, our corpus was small, consisting of only 4,041 vocal phrases and their corresponding melodic instrumental responses, or translations. Such a small corpus made it appropriate to treat it as corpora of under-resourced languages are treated, and accordingly to apply statistical machine translation rather than neural machine translation.

“Andalusian Fragrance” is a mawwāl narrating poetry excerpts from the dramatic love story of two very famous poets and lovers in Andalusia: Ibn Zaydún and Wallada. Throughout this artwork, the improviser Muhannad Alkhateeb used elaborate expressiveness, maqam modulation and choice of register to convey the deep and diverse feelings expressed in the poetry, such as passion, doubt, anger, sorrow and hope. The computer successfully stood as a peer while translating the creative vocal sentences. As this was a studio recording, we had the privilege of recording several takes and selecting the best ones from two perspectives: vocal and instrumental. We then applied slight reverb, EQ and compression; the model is, however, robust for live performances. This artwork was presented in the NIPS 2018 Creativity Art Gallery (Machine Learning for Creativity and Design).