Sentence segmentation and alignment in a noisy text corpus

I have a parallel corpus containing about 100,000 aligned paragraphs in Arabic and Persian.

My corpus is noisy: its paragraphs are incomplete translations of each other (i.e. parts of the Arabic paragraphs are not translated into Persian, and the punctuation marks do not match either).

I used punctuation marks to split the paragraphs into sentences, but the number of sentences does not match.
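For reference, a punctuation-based split of the kind described above might look like the following minimal sketch. The set of sentence-final marks and the names `SENTENCE_END`, `split_sentences` and `count_mismatches` are assumptions for illustration, not the exact procedure I used.

```python
import re

# Sentence-final marks (an assumed set, extend as needed):
# Latin . ! ? plus the Arabic question mark (U+061F) and Arabic full stop (U+06D4).
SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")


def split_sentences(paragraph):
    """Split a paragraph after sentence-final punctuation followed by whitespace."""
    parts = SENTENCE_END.split(paragraph.strip())
    return [p for p in parts if p]


def count_mismatches(paragraph_pairs):
    """Count paragraph pairs whose two sides split into different numbers of sentences."""
    return sum(
        1
        for arabic, persian in paragraph_pairs
        if len(split_sentences(arabic)) != len(split_sentences(persian))
    )
```

Running `count_mismatches` over the paragraph pairs gives a rough measure of how often this naive split disagrees between the two sides.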

Then I used Microsoft Aligner to align the sentences, but the results were badly wrong.

How can I segment and align the sentences of this corpus?

+3




1 answer


You used the GIZA++ tag in your question: have you looked at its alignment tools? Another option that I know quite a few people use is Moses, which is a full-featured statistical MT package, but I believe you can run the alignment models in isolation if that's really all you want.
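Both tools are normally fed two plain-text files that are parallel line by line, one segment per line. As a rough sketch of that preparation step (the helper names and the file names `corpus.ar` and `corpus.fa` are hypothetical, and real pipelines usually add tokenization and length filtering):

```python
def to_parallel_lines(pairs):
    """Return two equally long lists of single-line segments from (source, target) pairs."""
    src_lines, tgt_lines = [], []
    for src, tgt in pairs:
        # Collapse newlines and repeated whitespace so each segment fits on one line.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        # Drop pairs where either side is empty, so the two files stay parallel.
        if src and tgt:
            src_lines.append(src)
            tgt_lines.append(tgt)
    return src_lines, tgt_lines


def write_parallel(pairs, src_path="corpus.ar", tgt_path="corpus.fa"):
    """Write the pairs to two UTF-8 files with the same number of lines."""
    src_lines, tgt_lines = to_parallel_lines(pairs)
    with open(src_path, "w", encoding="utf-8") as src_file:
        src_file.write("\n".join(src_lines) + "\n")
    with open(tgt_path, "w", encoding="utf-8") as tgt_file:
        tgt_file.write("\n".join(tgt_lines) + "\n")
    return len(src_lines)
```

Line N of one file then corresponds to line N of the other, which is the assumption the word-alignment models rely on.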



0








