Microsoft Word text parser in C

I would like to know how to parse and retrieve the text content of Microsoft Word documents (.doc and .docx). The programming language must be plain C (compiled with gcc).

Are there any libraries that already do the job?

Extension: Can I use the same procedure to parse text from Microsoft PowerPoint files?



4 answers


Microsoft Word documents are a huge beast; you definitely don't want to write this code yourself. Take a look at an existing free Word tool or library such as antiword or wvWare.
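As a rough illustration of reusing an existing tool from plain C, the sketch below runs the antiword command-line program through POSIX popen() and captures the text it prints. It assumes antiword is installed and on the PATH, and that it is invoked as `antiword file.doc`. The helper names build_antiword_command and run_and_capture are made up for this example. antiword reads the older binary .doc format, so .docx files need a different route.

```c
#define _POSIX_C_SOURCE 200809L  /* expose popen/pclose under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "antiword '<path>'" into cmd.
 * Returns 0 on success, -1 if the path contains a single quote
 * (refused rather than escaped) or the command does not fit. */
int build_antiword_command(const char *path, char *cmd, size_t cap)
{
    assert(path != NULL && cmd != NULL);
    if (strchr(path, '\'') != NULL)
        return -1;
    int n = snprintf(cmd, cap, "antiword '%s'", path);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Hypothetical helper: run cmd through the shell and copy at most
 * cap-1 bytes of its standard output into out (always NUL-terminated).
 * Returns the number of bytes stored, or -1 if the command could not
 * be started. Output beyond the buffer size is silently dropped. */
long run_and_capture(const char *cmd, char *out, size_t cap)
{
    assert(cmd != NULL && out != NULL && cap > 0);
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;

    size_t len = 0, n;
    while (len < cap - 1 &&
           (n = fread(out + len, 1, cap - 1 - len, pipe)) > 0)
        len += n;
    out[len] = '\0';

    pclose(pipe);
    return (long)len;
}
```

A caller would build the command for a given file with build_antiword_command, pass it to run_and_capture, and then work with the returned buffer. A real program would probably read in a loop into a growing buffer instead of truncating.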





I don't know about existing libraries, but the format specifications are available from Microsoft for free and under a promise not to sue you for using them.
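To give a feel for what working from the specifications involves in the .docx case: a .docx file is a ZIP package whose main body is normally stored as word/document.xml, with the visible text held in w:t elements inside w:p paragraphs. The sketch below assumes that XML has already been extracted into memory (for example with `unzip -p file.docx word/document.xml`) and copies out the text runs. The function name docx_xml_to_text is invented for this example. It does no real XML parsing, does not decode entities such as `&amp;`, and ignores tables, headers, footnotes and everything else, so treat it as an illustration of the idea, not as a parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one byte if there is room, always leaving space for '\0'. */
static void put(char *out, size_t cap, size_t *len, char c)
{
    if (*len + 1 < cap)
        out[(*len)++] = c;
}

/* Hypothetical helper: copy the contents of every <w:t> element found
 * in xml into out, emitting '\n' at each paragraph end (</w:p>).
 * Returns the number of bytes stored; out is always NUL-terminated. */
size_t docx_xml_to_text(const char *xml, char *out, size_t cap)
{
    assert(xml != NULL && out != NULL && cap > 0);
    size_t len = 0;
    const char *p = xml;

    while ((p = strchr(p, '<')) != NULL) {
        if (strncmp(p, "</w:p>", 6) == 0) {
            put(out, cap, &len, '\n');
            p += 6;
        } else if (strncmp(p, "<w:t", 4) == 0 &&
                   (p[4] == '>' || p[4] == ' ')) {
            /* "<w:t>" or "<w:t attr=...>", but not <w:tab/>, <w:tbl>, ... */
            const char *start = strchr(p, '>');
            if (start == NULL)
                break;
            if (start[-1] == '/') {      /* empty self-closing element */
                p = start + 1;
                continue;
            }
            start++;
            const char *end = strstr(start, "</w:t>");
            if (end == NULL)
                break;
            for (; start < end; start++)
                put(out, cap, &len, *start);
            p = end + 6;
        } else {
            p++;                         /* some other tag: skip it */
        }
    }
    out[len] = '\0';
    return len;
}
```

The older binary .doc format is a very different matter (it is a compound file with its own internal streams), which is a large part of why reusing an existing tool is usually recommended for it.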





On Windows, let Word do the job and interface with its COM object; on Linux, antiword does the job. Or you can automate OpenOffice.org on any platform with UNO.



If you are willing to go to the effort of using COM from C, you can use the IFilter interface built into every version of Windows since Windows 2000. You can use it to extract text from any Office document (Word, Excel, etc.), PDF file, or any other file type that has an IFilter installed.

I wrote a blog post about this a few years ago. It is all C++, but you can use COM objects from C.
