Why such a bad BLEU score for Moses using Europarl?

I started playing with Moses and tried to build what I thought would be a fairly standard baseline. I basically followed the steps described on the website, but instead of using news-commentary

I used Europarl v7 for training, with the WMT 2006 dev set and the original common test set from Europarl. My idea was to do something similar to Le Nagard and Koehn (2010), who obtained a BLEU score of 0.68 with their baseline English-French system.

To summarize, my workflow was more or less the following (a rough sketch of the actual commands is given after the list):

  • tokenizer.perl on all of the data
  • lowercase.perl (instead of truecase.perl)
  • clean-corpus-n.perl
  • IRSTLM training, using only the French side of Europarl v7
  • train-model.perl exactly as described
  • mert-moses.pl using the WMT 2006 dev set
  • testing and scoring as described
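Concretely, the commands I ran looked roughly like the sketch below. I am reconstructing this from memory of the Moses baseline tutorial, so treat the paths, file names and some of the flags (for example the LM type code passed to train-model.perl and the exact IRSTLM options) as assumptions rather than an exact log:

```sh
# Tokenize both sides of Europarl v7 (scripts ship with Moses)
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < europarl-v7.fr-en.en > corpus.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < europarl-v7.fr-en.fr > corpus.tok.fr

# Lowercase (instead of truecasing)
~/mosesdecoder/scripts/tokenizer/lowercase.perl < corpus.tok.en > corpus.lc.en
~/mosesdecoder/scripts/tokenizer/lowercase.perl < corpus.tok.fr > corpus.lc.fr

# Drop empty, overlong and badly length-ratioed sentence pairs
~/mosesdecoder/scripts/training/clean-corpus-n.perl corpus.lc en fr corpus.clean 1 80

# 3-gram French language model with IRSTLM
# (add-start-end.sh / build-lm.sh / compile-lm come with IRSTLM;
#  the exact options differ between IRSTLM versions)
add-start-end.sh < corpus.clean.fr > lm.sb.fr
build-lm.sh -i lm.sb.fr -n 3 -o lm.fr.ilm.gz
compile-lm --text=yes lm.fr.ilm.gz lm.fr.arpa

# Translation model training, English -> French
# (the last field of -lm selects the LM implementation; check which value
#  your Moses version expects for IRSTLM)
~/mosesdecoder/scripts/training/train-model.perl \
    -root-dir train -corpus corpus.clean -f en -e fr \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$PWD/lm.fr.arpa:1 \
    -external-bin-dir ~/mosesdecoder/tools

# Tuning on the WMT 2006 dev set
~/mosesdecoder/scripts/training/mert-moses.pl \
    dev2006.lc.en dev2006.lc.fr \
    ~/mosesdecoder/bin/moses train/model/moses.ini \
    --mertdir ~/mosesdecoder/bin
```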

And the resulting BLEU score is 0.26... This leads me to two questions:

  • Is this a typical BLEU score for such a baseline system? I realize that Europarl is a pretty small corpus for training a monolingual language model, although this is how things are done on the Moses website.
  • Are there any typical pitfalls for someone just starting out with SMT and/or Moses that I may have fallen into? Or do researchers like Le Nagard and Koehn build their baseline systems differently from what is described on the Moses site, for example using some larger, undisclosed corpus to train the language model?


1 answer


Just to set things straight: the 0.68 you are talking about has nothing to do with BLEU.

My idea was to do something similar to Le Nagard and Koehn (2010), who obtained a BLEU score of 0.68 with their baseline English-French system.

The paper you linked states that 68% of the pronouns (using co-reference resolution) were translated correctly. It is nowhere mentioned that a 0.68 BLEU score was obtained. In fact, no BLEU scores are reported at all, probably because the qualitative improvement proposed in the paper cannot be measured with statistical significance (which is hard to reach when you only improve a small number of words). For this reason, the paper relies only on manual evaluation of the pronouns:

The best scoring metric is the number of correctly translated pronouns. This requires manually inspecting the translation results.

This is where the 0.68 comes into play.



Now, to answer your questions regarding the 0.26 you got:

Is this a typical BLEU score for such a baseline system? I realize that Europarl is a pretty small corpus for training a monolingual language model, although this is how things are done on the Moses website.

Yes, it is. You can look up the performance of WMT language pairs here: http://matrix.statmt.org/
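Also note that the Moses scoring scripts report BLEU on a 0-100 scale, so the ~26 they print corresponds to the 0.26 you quote. A minimal sketch of how such a baseline is usually evaluated (the file names here are placeholders for your own test set and tuned configuration):

```sh
# Translate the (tokenized, lowercased) test set with the tuned configuration
~/mosesdecoder/bin/moses -f mert-work/moses.ini \
    < test2006.lc.en > test2006.translated.fr

# Score against the reference; multi-bleu.perl prints BLEU on a 0-100 scale
~/mosesdecoder/scripts/generic/multi-bleu.perl -lc test2006.lc.fr \
    < test2006.translated.fr
```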

Are there any typical pitfalls for someone just starting out with SMT and/or Moses that I may have fallen into? Or do researchers like Le Nagard and Koehn build their baseline systems differently from what is described on the Moses site, for example using some larger, undisclosed corpus to train the language model?

I am assuming that you trained your system correctly. As for the "undisclosed corpus" question, members of the academic community normally state for each experiment which datasets were used for training, tuning and testing, at least in peer-reviewed publications. The only exception is the WMT translation task (see, for example, http://www.statmt.org/wmt14/translation-task.html ), where private corpora can be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.
