39 research data, page 1 of 4

Hindi Web Texts

Bojar, Ondřej; Straňák, Pavel; Zeman, Daniel (2011)
Publisher: Charles University in Prague, UFAL
Projects: EC | EUROMATRIXPLUS (231720)
Embargo end date: 2011/11/23
A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens

Eye-Tracking Recordings from a Pilot Study of WMT-style MT Outputs Ranking

Bojar, Ondřej; Děchtěrenko, Filip; Zelenina, Maria (2016)
Publisher: Charles University in Prague, UFAL
Projects: EC | QT21 (645452)
Embargo end date: 2016/04/01
This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences. We provide the set of sentences evaluated, the exact screens presented to the annotators (including bounding box information for every area of interest and even for individual letters in the text) and finally the raw EyeLink II files with gaze trajectories. The description of the experiment can be found in the paper: Ondře...

WMT16 Tuning Shared Task Models (English-to-Czech)

Kamran, Amir; Jawaid, Bushra; Bojar, Ondřej; Stanojevic, Milos (2016)
Publisher: Charles University in Prague, UFAL
Projects: EC | QT21 (645452)
Embargo end date: 2016/03/22
This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram l...

Khresmoi Summary Translation Test Data 1.1

Dušek, Ondřej; Hajič, Jan; Hlaváčová, Jaroslava; Pecina, Pavel; Tamchyna, Aleš; Urešová, Zdeňka (2014)
Publisher: Charles University in Prague, UFAL
Projects: EC | KHRESMOI (257528)
Embargo end date: 2014/04/28
This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German.

WMT 2011 Testing Set

Galuščáková, Petra; Bojar, Ondřej (2012)
Publisher: Charles University in Prague, UFAL
Projects: EC | EUROMATRIXPLUS (231720)
Embargo end date: 2012/05/15
Testing set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2]. References: [1] http://www.statmt.org/wmt11/evaluation-task.html [2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 20...

Lingua::Interset 2.026

Zeman, Daniel (2014)
Publisher: Faculty of Mathematics and Physics, Charles University in Prague
Projects: EC | QTLEAP (610516)
Embargo end date: 2015/01/28
Lingua::Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped. Version 2.026 covers 37 different tagsets of 21 languages. Limited support of the older drivers for other languages (which are not included in this package but are available for download elsewhere) is also available; these will be fully ported to Interset 2 in future. Interset is implemented as Perl libraries. It is also available via CPAN.

Prague Czech-English Dependency Treebank 2.0

Texts: The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part. Data: The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 ...

Depfix: Automatic Post-editing of SMT

Rosa, Rudolf (2015)
Publisher: Charles University in Prague, UFAL
Projects: EC | QTLEAP (610516)
Embargo end date: 2015/01/29
Depfix, a tool for Automatic Post-editing of SMT. See the project website for more information.

WMT17 Quality Estimation Shared Task Training and Development Data

Specia, Lucia; Logacheva, Varvara (2017)
Publisher: University of Sheffield
Projects: EC | QT21 (645452)
Embargo end date: 2017/02/27
Training and development data for the WMT17 QE task. Test data will be published as a separate item. This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be ...

WMT16 Tuning Shared Task Models (Czech-to-English)

Kamran, Amir; Jawaid, Bushra; Bojar, Ondřej; Stanojevic, Milos (2016)
Publisher: Charles University in Prague, UFAL
Projects: EC | QT21 (645452)
Embargo end date: 2016/03/22
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram la...