34 research data, page 1 of 4

Urdu Monolingual Corpus

Jawaid, Bushra; Kamran, Amir; Bojar, Ondřej (2014)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | MOSESCORE (288487)
Embargo end date: 2014/03/27
We release a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the different choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release both the plain and the tagged corpora.
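The voting step mentioned in this description can be illustrated with a minimal sketch. It assumes plain majority voting over three taggers, with ties resolved in favour of one designated tagger; the scheme actually used by Jawaid and Bojar (2012) is not given here and may differ. The function name `vote_tags`, the `fallback_index` parameter and the tag labels are hypothetical.

```python
from collections import Counter


def vote_tags(tagger_outputs, fallback_index=0):
    """Combine per-token tags from several taggers by majority vote.

    tagger_outputs: list of tag sequences, one per tagger, all of equal length.
    fallback_index: which tagger to trust when the vote is tied (an assumption
    made for this sketch, not taken from the corpus description).
    """
    if not tagger_outputs:
        return []
    length = len(tagger_outputs[0])
    if any(len(seq) != length for seq in tagger_outputs):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for candidates in zip(*tagger_outputs):
        counts = Counter(candidates)
        best_count = max(counts.values())
        # Tags that share the highest vote count; more than one means a tie.
        winners = [tag for tag, count in counts.items() if count == best_count]
        if len(winners) == 1:
            voted.append(winners[0])
        else:
            voted.append(candidates[fallback_index])
    return voted


# Hypothetical tags for a three-token sentence, one list per tagger.
outputs = [
    ["NN", "VB", "JJ"],
    ["NN", "NN", "RB"],
    ["PN", "VB", "CC"],
]
print(vote_tags(outputs))
```

In this example the first two tokens get the tag chosen by two of the three taggers, and the third token, where all taggers disagree, takes the tag from the fallback tagger.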

Czech-Slovak Parallel Corpus

Galuščáková, Petra; Garabík, Radovan; Bojar, Ondřej (2012)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | EUROMATRIXPLUS (231720)
Embargo end date: 2012/05/15
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with automatic morphological annotation. References: [1] http://langtech.jrc.it/JRC-Acquis.html/ [2] http://www.statmt.org/europarl/ [3] http://apertium.eu/data [4] ht...

Prague Czech-English Dependency Treebank 2.0 Coref

Nedoluzhko, Anna; Novák, Michal; Cinková, Silvie; Mikulová, Marie; Mírovský, Jiří (2016)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | QTLEAP (610516)
Embargo end date: 2016/03/30
The Prague Czech-English Dependency Treebank 2.0 Coref (PCEDT 2.0 Coref) is a parallel treebank building upon the original PCEDT 2.0 release and enriching it with the extended manual annotation of coreference, as well as with an improved automatic annotation of the coreferential expression alignment.

WMT 2011 Testing Set

Galuščáková, Petra; Bojar, Ondřej (2012)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | EUROMATRIXPLUS (231720)
Embargo end date: 2012/05/15
Test set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2]. References: [1] http://www.statmt.org/wmt11/evaluation-task.html [2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 20...

Eye-Tracking Recordings from a Pilot Study of WMT-style MT Outputs Ranking

Bojar, Ondřej; Děchtěrenko, Filip; Zelenina, Maria (2016)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | QT21 (645452)
Embargo end date: 2016/04/01
This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences. We provide the set of sentences evaluated, the exact screens presented to the annotators (including bounding box information for every area of interest and even for individual letters in the text) and finally the raw EyeLink II files with gaze trajectories. The description of the experiment can be found in the paper: Ondře...

Prague Czech-English Dependency Treebank 2.0

Hajič, Jan; Hajičová, Eva; Panevová, Jarmila; Sgall, Petr; Cinková, Silvie; Fučíková, Eva; Mikulová, Marie; Pajas, Petr; Popelka, Jan; Semecký, Jiří; Šindlerová, Jana; Štěpánek, Jan; Toman, Josef; Urešová, Zdeňka; Žabokrtský, Zdeněk (2012)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | EUROMATRIXPLUS (231720)
Embargo end date: 2013/03/28
Texts: The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part. Data: The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 ...

QTLeap WSD/NED corpus

Agirre, Eneko; Branco, António; Popel, Martin; Simov, Kiril (2015)
Publisher: University of the Basque Country, UPV/EHU
Projects: EC | QTLEAP (610516)
Embargo end date: 2015/05/15
This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the real-user scenario (batches 1 and 2). The interactions in this corpus are available in Basque, Bulgarian, Czech, English, Portuguese and Spanish. The texts have been automatically annotated with NLP tools, including Word Sense Disambiguation, Named Entity Disambiguation and Coreference resolution. Please check deliverab...

Additional German-Czech reference translations of the WMT'11 test set

Bojar, Ondřej; Zeman, Daniel; Dušek, Ondřej; Břečková, Jana; Farkačová, Hana; Grošpic, Pavel; Kačenová, Kristýna; Knechtová, Eva; Koubová, Anna; Lukavská, Jana; Nováková, Petra; Petrdlíková, Jana (2012)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | EUROMATRIXPLUS (231720)
Embargo end date: 2012/11/13
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.

WMT17 Quality Estimation Shared Test Data

Specia, Lucia; Logacheva, Varvara (2017)
Publisher: University of Sheffield
Projects: EC | QT21 (645452)
Embargo end date: 2017/04/13
Test data for the WMT17 QE task. Training data can be downloaded from http://hdl.handle.net/11372/LRT-1974. This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will ...

WMT16 Tuning Shared Task Models (Czech-to-English)

Kamran, Amir; Jawaid, Bushra; Bojar, Ondřej; Stanojevic, Milos (2016)
Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Projects: EC | QT21 (645452)
Embargo end date: 2016/03/22
The item contains models to tune for the WMT16 Tuning shared task for Czech-to-English. The CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for training the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram la...
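
The preprocessing described above (lowercasing and dropping sentences outside the 4 to 60 word range) might look roughly like the following sketch. It is not the authors' script: whitespace splitting stands in for the Moses tokenizer, and the function name `preprocess` and its parameters are hypothetical.

```python
def preprocess(sentences, min_words=4, max_words=60):
    """Lowercase sentences and keep those with min_words..max_words tokens.

    Assumption: str.split() is used as a stand-in for the Moses tokenizer,
    so token counts may differ from those of the real pipeline.
    """
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Remove sentences shorter than min_words or longer than max_words.
        if min_words <= len(tokens) <= max_words:
            kept.append(" ".join(tokens))
    return kept


corpus = [
    "Too short .",                    # 3 tokens: removed
    "This sentence has five tokens",  # 5 tokens: kept, lowercased
    " ".join(["word"] * 61),          # 61 tokens: removed
]
print(preprocess(corpus))
```

The bounds are inclusive here, which matches the wording that only sentences longer than 60 words or shorter than 4 words are removed.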