Sentence segmentation and alignment in a noisy parallel corpus
I have a parallel corpus containing about 100,000 aligned paragraphs in Arabic and Persian.
The corpus is noisy: the paragraphs are incomplete translations of each other (i.e. parts of the Arabic paragraphs are not translated into Persian, and the punctuation marks do not match either).
I used punctuation marks to split the paragraphs into sentences, but the number of sentences does not match.
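For reference, the punctuation-based splitting described above can be sketched roughly as follows. This is only an illustration: the function name `split_sentences` and the set of sentence-final marks (Latin `.`, `!`, `?`, the Arabic question mark U+061F and the Arabic full stop U+06D4) are assumptions, not the exact rules used on the corpus.

```python
import re
from typing import List

# Sentence-final marks (assumed set, adjust for the corpus):
# Latin . ! ? plus Arabic question mark (U+061F) and Arabic full stop (U+06D4).
_SENT_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph on whitespace that follows a sentence-final mark."""
    parts = _SENT_END.split(paragraph.strip())
    # Drop empty fragments left by leading or trailing whitespace.
    return [p.strip() for p in parts if p.strip()]
```

Applied to both sides of a paragraph pair, this yields two sentence lists whose lengths can differ whenever the translator merged, split or dropped sentences, which is the mismatch described here.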
Then I used Microsoft Aligner to align the sentences, but the results were badly wrong.
How can I segment and align the sentences of this corpus?
You tagged your question with GIZA++: have you looked at its alignment tools? Another option that I know quite a few people use is Moses, which is a full-featured statistical MT package, but I believe you can use the alignment models in isolation if that's really all you want.
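Note that GIZA++ trains word alignment models on input that is already sentence-aligned, so a sentence-level pass has to come first. As a rough illustration of what such a pass can look like, here is a minimal length-based dynamic-programming sketch, loosely in the spirit of length-based aligners. It is not what GIZA++ or Moses do internally; the bead shapes, the penalty values in `BEADS`, and the names `align` and `_cost` are all assumptions chosen for the example, not tuned settings.

```python
from typing import List, Tuple

# Allowed bead shapes (source sentences, target sentences) with a fixed
# penalty each. The values are illustrative assumptions, not tuned.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 1.5, (0, 1): 1.5}

Bead = Tuple[List[int], List[int]]


def _cost(src_len: int, tgt_len: int, penalty: float) -> float:
    """Bead penalty plus relative character-length mismatch."""
    if src_len == 0 or tgt_len == 0:
        return penalty  # skipped sentence (or empty text): penalty only
    return penalty + abs(src_len - tgt_len) / max(src_len, tgt_len)


def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return beads as (source indices, target indices), in document order."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + _cost(s_len, t_len, penalty)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (di, dj)
    # Walk the back-pointers from the end to recover the cheapest path.
    beads: List[Bead] = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Beads with an empty side mark sentences that have no counterpart, which is how the untranslated parts of a paragraph would show up. Character length is a weak signal across Arabic and Persian, so for a corpus this noisy the costs would probably need a lexical component (for example a bilingual dictionary) on top of this.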