How can I capture each page of text in a Word document separately (using .NET)?

I need to determine which pages of a Word document contain a keyword. I have some tools that can get me the text of the document, but nothing that tells me which page a given piece of text falls on. Does anyone know a good place for me to start? I am using .NET.

Thanks!

edit: Additional restriction: I cannot use any Interop stuff.

edit2: If anyone knows of stable libraries that can do this, that would be helpful as well. I am using Aspose, but as far as I know it has nothing for this.

+1




4 answers


This is how I get the text. I believe you can set the selection range to a single page and then test that text. It may be a little backward from what you need, but it might be a place to start.



// Requires references to Microsoft.Office.Interop.Word and System.Windows.Forms.
// Clipboard access only works on an STA thread.
using System;
using System.Windows.Forms;

Microsoft.Office.Interop.Word.Application wordApplication = new Microsoft.Office.Interop.Word.Application();
object missing = Type.Missing;
object fileName = @"c:\file.doc";
object objFalse = false;

// Suppress dialogs and open the document without showing a window (12th argument: Visible = false).
wordApplication.DisplayAlerts = Microsoft.Office.Interop.Word.WdAlertLevel.wdAlertsNone;
Microsoft.Office.Interop.Word.Document doc = wordApplication.Documents.Open(ref fileName, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref objFalse, ref missing, ref missing, ref missing, ref missing);

// I believe you can define a selection range for a single page here
// instead of selecting the whole story.
doc.ActiveWindow.Selection.WholeStory();
doc.ActiveWindow.Selection.Copy();

// Read the copied text back from the clipboard; it may be missing if the copy failed.
IDataObject data = Clipboard.GetDataObject();
string text = data != null ? data.GetData(DataFormats.Text) as string : null;

// Close the document, then shut Word down.
doc.Close(ref missing, ref missing, ref missing);
doc = null;

wordApplication.Quit(ref missing, ref missing, ref missing);
wordApplication = null;

      

+2




How do you define a page?



If you mean section breaks / hard page breaks, it's tricky but doable. If you mean soft page breaks, the task becomes very difficult and somewhat pointless. Note that soft page breaks are determined dynamically at run time and are not stored in the file itself. Where they fall depends on a lot of factors, including the active printer driver (yes, pagination can change for the same file on a different computer), fonts, kerning, line spacing, margins, and so on.
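
For the hard-break case only, a minimal sketch might look like the following. It assumes your text extractor emits a form feed character (U+000C) wherever the document has a manual page or section break; check that against your tool's actual output before relying on it. The class and method names (HardPageSearch, PagesContaining) are made up for illustration, and the "page" numbers it returns count hard breaks, not printed pages.

```csharp
using System;
using System.Collections.Generic;

public static class HardPageSearch
{
    // Returns the 1-based indexes of the hard-break-delimited chunks that
    // contain the keyword (case-insensitive).
    // Assumption: every hard page/section break shows up in documentText
    // as a form feed ('\f'). Soft page breaks are not detected at all.
    public static List<int> PagesContaining(string documentText, string keyword)
    {
        var pages = new List<int>();
        if (string.IsNullOrEmpty(documentText) || string.IsNullOrEmpty(keyword))
            return pages;

        // Each chunk is the text between two consecutive hard breaks.
        string[] chunks = documentText.Split('\f');
        for (int i = 0; i < chunks.Length; i++)
        {
            if (chunks[i].IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                pages.Add(i + 1);
        }
        return pages;
    }
}
```

For example, calling PagesContaining("alpha\fbeta keyword\fgamma KEYWORD", "keyword") returns a list containing 2 and 3. A keyword that straddles a break is not found, which seems acceptable here because a hard break also ends the paragraph.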

0




One crappy way to do it with Aspose is to convert the Word file to PDF and then grab the text on each page.

I don't know anything about the internals of Aspose or how they define their soft pages when converted, but this is the best I have so far.

0




Thanks for using Aspose.Words.

In the public API we currently expose only flow-document information, e.g. paragraphs, tables, lists, etc. Internally, we build a page layout model that has classes like page, block of text, line of text, etc. Of course, there are internal links between the document model and the layout model, so it is possible to find out which page a given piece of content ends up on. Exposing this information through a public API is (well, still) on our priority list.

Have you registered your request in the Aspose.Words support forums? We use those requests to maintain a voting system, and we work on the features that get the most votes.

0








