
APE Shared Task WMT17: Human Post-edits Test Data EN-DE

Turchi, Marco; Chatterjee, Rajen; Negri, Matteo (2017)
Publisher: Fondazione Bruno Kessler, Trento, Italy
Projects: EC | QT21 (645452)
Embargo end date: 2017/10/17
Human post-edited test sentences for the WMT 2017 Automatic Post-Editing task. The set consists of 2,000 German sentences from the IT domain, already tokenized. Source and target segments can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2133. All data is provided by the EU project QT21 (http://www.qt21.eu/).

Restaurant Reviews CZ ABSA corpus v2

Hercig, Tomáš; Brychcín, Tomáš; Svoboda, Lukáš; Konkol, Michal; Steinberger, Josef (2016)
Publisher: University of West Bohemia, Department of Computer Science and Engineering
Projects: EC | MEDIAGIST (630786)
Embargo end date: 2016/12/07
Restaurant Reviews CZ ABSA - 2.15k reviews with their related targets and categories. The work is described in the paper: https://doi.org/10.13053/CyS-20-3-2469

A Small Dataset for English-to-Czech Speech Translation in the Travel Domain

Cífka, Ondřej; Bojar, Ondřej (2016)
Publisher: Charles University in Prague, UFAL
Projects: EC | QT21 (645452)
Embargo end date: 2016/06/14
This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied. The "cstest" corpus contains recordings of artificia...

CsEnVi Pairwise Parallel Corpora

Hoang, Duc Tam; Bojar, Ondřej (2015)
Publisher: Charles University in Prague, UFAL
Projects: EC | QT21 (645452)
Embargo end date: 2015/12/25
CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources: - OPUS, the open parallel corpus, is a growing multilingual corpus of translated open source documents. The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The bitexts are paraphrases of each other's meaning rather than translations. - TED talks, a collecti...

Khresmoi Summary Translation Test Data 2.0

Dušek, Ondřej; Hajič, Jan; Hlaváčová, Jaroslava; Libovický, Jindřich; Pecina, Pavel; Tamchyna, Aleš; Urešová, Zdeňka (2017)
Publisher: Charles University in Prague, UFAL
Projects: EC | KHRESMOI (257528)
Embargo end date: 2017/04/03
This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech, English, French, German, Hungarian, Polish, Spanish and Swedish. Version 2.0 extends the previous version by adding Hungarian, Polish, Spanish, and Swedish translations.

Prague Czech-English Dependency Treebank 2.0

Texts: The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part. Data: The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 ...

WMT16 Quality Estimation Shared Task Training and Development Data

Specia, Lucia; Logacheva, Varvara; Scarton, Carolina (2016)
Publisher: University of Sheffield
Projects: EC | QT21 (645452)
Embargo end date: 2016/02/29
Training and development data for the WMT16 QE task. Test data will be published as a separate item. This shared task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, sentence-level and document-level estimation. The sentence and word-level tasks will explore a large dataset produced from post-editions by professional translat...

Additional German-Czech reference translations of the WMT'11 test set

Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.

Europarl QTLeap WSD/NED corpus

Agirre, Eneko; Branco, António; Popel, Martin; Simov, Kiril (2015)
Publisher: University of the Basque Country, UPV/EHU
Projects: EC | QTLEAP (610516)
Embargo end date: 2015/05/16
This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are sentences from the Europarl parallel corpus (Koehn, 2005). We selected the monolingual sentences from parallel corpora for the following pairs: Bulgarian-English, Czech-English, Portuguese-English and Spanish-English. The English corpus consists of the English side of the Spanish-English corpus. Basque is not in Europarl. In addition, it con...

WMT16 Tuning Shared Task Models (Czech-to-English)

Kamran, Amir; Jawaid, Bushra; Bojar, Ondřej; Stanojevic, Milos (2016)
Publisher: Charles University in Prague, UFAL
Projects: EC | QT21 (645452)
Embargo end date: 2016/03/22
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram la...
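
The lowercasing and length-filtering step described in this record can be illustrated with a minimal sketch. It is not the authors' actual pipeline: the function names `keep_pair` and `preprocess` are hypothetical, the input is assumed to be already tokenized (one space between tokens), and dropping a pair when either side falls outside the 4 to 60 word range is an assumption about how the filter is applied to parallel data.

```python
def keep_pair(src, tgt, min_len=4, max_len=60):
    """Return True if both sides have between min_len and max_len tokens.

    Assumption: a sentence pair is dropped when EITHER side is too short
    or too long; the record does not say how the two sides are combined.
    """
    return all(min_len <= len(side.split()) <= max_len for side in (src, tgt))


def preprocess(pairs, min_len=4, max_len=60):
    """Lowercase already-tokenized sentence pairs and filter them by length."""
    kept = []
    for src, tgt in pairs:
        src, tgt = src.lower(), tgt.lower()
        if keep_pair(src, tgt, min_len, max_len):
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    # Toy, made-up sentence pairs (Czech source, English target).
    sample = [
        ("To je malý test .", "This is a small test ."),
        ("Ano .", "Yes ."),  # too short on both sides, so it is dropped
    ]
    for src, tgt in preprocess(sample):
        print(src, "|||", tgt)
```

The `src ||| tgt` output format matches the input format that fast_align expects, so the filtered pairs could be written to a file and passed to the aligner directly.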