In this paper, we investigate the CoNLL Shared Task format, its properties, and the possibility of using it for complex annotations. We argue that, perhaps despite the original intent, it is one of the most important current formats for syntactically annotated data.
We show the limits of the CoNLL-ST data format in its current form and propose several simple enhancements that push those limits further and make the format more robust and future-proof. We analyse several different linguistic
...
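As a rough illustration of the kind of data the abstract above discusses, here is a minimal sketch of reading tab-separated, blank-line-delimited CoNLL-style text. It assumes the ten-column CoNLL-X layout; the abstract does not say which column set the paper analyses, and the names `parse_conll`, `heads`, and `SAMPLE` are hypothetical helpers introduced only for this example.

```python
from typing import Dict, List, Tuple

# Column names of the ten-column CoNLL-X layout. This is an assumption:
# the abstract does not specify which shared-task variant it covers.
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences, each a list of token dicts."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


def heads(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (form, head form) pairs; HEAD 0 denotes the artificial root."""
    forms = {tok["ID"]: tok["FORM"] for tok in sentence}
    return [(tok["FORM"], forms.get(tok["HEAD"], "ROOT")) for tok in sentence]


# Hypothetical two-sentence sample, invented for illustration.
SAMPLE = (
    "1\tDogs\tdog\tN\tNNS\t_\t2\tSBJ\t_\t_\n"
    "2\tbark\tbark\tV\tVBP\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tHello\thello\tI\tUH\t_\t0\tROOT\t_\t_\n"
)

if __name__ == "__main__":
    for sent in parse_conll(SAMPLE):
        print(heads(sent))
```

Each token carries a single `HEAD` value, so one tree per sentence is the natural fit for this layout; annotations that need more than one parent per token would require extra columns or a different encoding.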
This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German.
We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12. We describe a simple technique for reducing out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair. We introduce a novel approach to data selection by full-text indexing and search: we select sentences similar to th...
Following up on last year's CUNI system for automatic post-editing of machine translation output, we focus on exploiting the potential of sequence-to-sequence neural models for this task. In this system description paper, we compare several encoder-decoder architectures on smaller-scale models and present the system we submitted to the WMT 2017 Automatic Post-Editing shared task based on this preliminary comparison. We also show how simple inclusion of synthetic data can improve the overa...
In this paper, we present a set of improvements introduced to MUMULS, a tagger for the automatic detection of verbal multiword expressions. Our tagger participated in the PARSEME shared task and was the only one based on neural networks. We show that character-level embeddings can improve the performance, mainly by reducing the out-of-vocabulary rate. Furthermore, replacing the softmax layer in the decoder with a conditional random field classifier brings additional improvements. Finally, we...
In this paper, we focus on the incorporation of a valency lexicon into the TectoMT system for the Czech-Russian language pair. We demonstrate valency errors in MT output and describe how the introduction of a lexicon influenced the translation results. Though there was no impact on the BLEU score, manual inspection of concrete cases showed some improvement.
Translating into morphologically rich languages is difficult. Although the coverage of lemmas may be reasonable, many morphological variants cannot be learned from the training data. We present a statistical translation system that is able to produce these inflected word forms. Unlike most previous work, we do not separate morphological prediction from lexical choice into two consecutive steps. Our approach is novel in that it is integrated in decoding and takes advanta...
A summary of research activities in the area of machine translation in the EU (Technologies – Demands – Gaps – Roadmaps) has been presented, including contributions from multiple other EU-funded projects.
The Maximum Entropy Principle has been used successfully in various NLP tasks. In this paper we propose a forward translation model consisting of a set of maximum entropy classifiers: a separate classifier is trained for each (sufficiently frequent) source-side lemma. In this way the estimates of translation probabilities can be sensitive to a large number of features derived from the source sentence (including non-local features, features making use of sentence s...
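The per-lemma setup described in this abstract can be sketched roughly as follows. This is not the authors' implementation: the `MaxEnt` class, the `train_per_lemma` helper, the `min_count` threshold, the plain stochastic gradient training, and the toy English-Czech examples are all assumptions made for illustration. A maximum entropy classifier is treated here as multinomial logistic regression over binary features.

```python
import math
from collections import defaultdict


class MaxEnt:
    """Tiny multinomial logistic regression (maximum entropy) classifier."""

    def __init__(self, labels):
        self.labels = sorted(labels)
        self.w = defaultdict(float)  # one weight per (label, feature) pair

    def probs(self, feats):
        """Return p(label | feats) as a softmax over summed feature weights."""
        scores = {y: sum(self.w[(y, f)] for f in feats) for y in self.labels}
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {y: math.exp(s - top) for y, s in scores.items()}
        z = sum(exps.values())
        return {y: e / z for y, e in exps.items()}

    def train(self, data, epochs=50, lr=0.5):
        """Stochastic gradient ascent on the conditional log-likelihood."""
        for _ in range(epochs):
            for feats, gold in data:
                p = self.probs(feats)
                for y in self.labels:
                    grad = (1.0 if y == gold else 0.0) - p[y]
                    for f in feats:
                        self.w[(y, f)] += lr * grad


def train_per_lemma(examples, min_count=2):
    """Train one classifier per source lemma seen at least min_count times."""
    by_lemma = defaultdict(list)
    for lemma, feats, target in examples:
        by_lemma[lemma].append((feats, target))
    models = {}
    for lemma, data in by_lemma.items():
        if len(data) < min_count:
            continue  # too rare: no dedicated classifier for this lemma
        model = MaxEnt({target for _, target in data})
        model.train(data)
        models[lemma] = model
    return models


# Hypothetical training triples: (source lemma, source-side features, target).
EXAMPLES = [
    ("bank", ["ctx=money"], "banka"),
    ("bank", ["ctx=river"], "břeh"),
    ("bank", ["ctx=loan"], "banka"),
    ("bank", ["ctx=water"], "břeh"),
    ("rare", ["ctx=x"], "vzácný"),
]

if __name__ == "__main__":
    models = train_per_lemma(EXAMPLES)
    print(models["bank"].probs(["ctx=river"]))
```

Because each lemma has its own classifier, the feature set can be arbitrarily rich without one model having to cover the whole vocabulary; lemmas below the frequency threshold would need some fallback estimate, which this sketch leaves out.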