Sentence segmentation and alignment in a noisy text corpus

Question

I have a parallel corpus containing about 100,000 aligned paragraphs in Arabic and Persian.

My corpus is noisy: its paragraphs are incomplete translations of each other (i.e., parts of the Arabic paragraphs are not translated into Persian, and the punctuation marks do not match either).

I used punctuation marks to split the paragraphs into sentences, but the number of sentences on the two sides does not match.
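For illustration, punctuation-based splitting of this kind might look like the sketch below. The set of sentence-final marks and the name `split_sentences` are assumptions, not the exact procedure used on this corpus.

```python
import re

# Assumed sentence-final marks: Latin . ! ? plus the Arabic question mark
# (U+061F) and the Arabic full stop (U+06D4). Adjust the set for the corpus.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")

def split_sentences(paragraph):
    """Split a paragraph at whitespace that follows a sentence-final mark."""
    parts = SENTENCE_END.split(paragraph.strip())
    return [p for p in parts if p]
```

Because the two languages are punctuated independently, a splitter like this will usually return different sentence counts for the two sides of a paragraph pair.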

Then I used Microsoft Aligner to align the sentences, but the results are badly wrong.

How can I segment and align the sentences of this corpus?

+3

alignment text-segmentation nlp corpus giza++

htaghizadeh Jan 31 '13 at 12:48

1 answer

Ben allison · Answer 1 · 2013-02-06T09:47:09+0000

You used the giza++ tag in your question: have you looked at its alignment tools? Another option that I know quite a few people use is Moses, which is a full-featured statistical MT package, but I believe you can use the alignment models in isolation if that's really all you want.
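Tools such as GIZA++ are typically run on text that is already paired sentence by sentence, so a sentence-level pass may still be needed first. As a rough sketch of what that pass can look like, here is a simplified length-based aligner in the spirit of the Gale and Church approach. It is not what GIZA++ or Moses do internally; the bead penalties, the cost function, and the name `align` are all assumptions chosen for illustration.

```python
import math

# Bead types (source sentences, target sentences) with assumed penalties.
# The 1-0 and 0-1 beads let untranslated sentences stay unaligned.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}

def length_cost(src_len, tgt_len):
    """Penalise mismatched character lengths; 0 when the lengths are equal."""
    if src_len == 0 or tgt_len == 0:
        return 0.0  # the skip penalty in BEADS already covers this case
    return abs(math.log(src_len / tgt_len))

def align(src, tgt):
    """Return the minimum-cost list of (src_indices, tgt_indices) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programme over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads in order.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Called on the sentence lists of one paragraph pair, `align(arabic_sentences, persian_sentences)` returns index groups; a bead with an empty target list marks an Arabic sentence that appears to have no Persian counterpart. Character length is a crude signal for an Arabic-Persian pair, so the penalties would need tuning on a hand-checked sample.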
