How to extract Word documents from data recovered from a USB device?

I was able to copy the raw data from an inaccessible USB stick into a single file of about 250 MB. Somewhere in this block of bytes, there are about 40 Word documents.

  • Where can I find documentation on the internal structure of Word documents so that I can parse the byte stream, find out where the Word document starts and ends, and retrieve a copy?

  • Are there any programming language libraries for this task?

  • Can anyone suggest an already existing software solution for this problem?

+1




2 answers


Two approaches:

You can mount image files as volumes on Linux. If your image isn't too corrupted, you can probably walk the filesystem to find out where your files are. Is (was) it a FAT or NTFS partition?
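To answer the FAT-or-NTFS question without mounting anything, a quick look at the first sector may be enough. The sketch below assumes the dump starts directly with a volume boot sector (a stick with a partition table starts with an MBR instead, and this check will then report "unknown"). The function name `guess_filesystem` is made up for illustration, and the FAT type strings are informational fields, so treat the result as a hint only.

```python
def guess_filesystem(boot_sector: bytes) -> str:
    """Guess the filesystem from the first 512 bytes of a volume image."""
    # A valid boot sector ends with the signature bytes 55 AA.
    if len(boot_sector) < 512 or boot_sector[510:512] != b"\x55\xaa":
        return "unknown"
    # NTFS stores the OEM ID "NTFS    " at offset 3.
    if boot_sector[3:11] == b"NTFS    ":
        return "NTFS"
    # FAT32 keeps its type string at offset 82, FAT12/FAT16 at offset 54.
    if boot_sector[82:90] == b"FAT32   ":
        return "FAT32"
    if boot_sector[54:62] in (b"FAT12   ", b"FAT16   "):
        return boot_sector[54:62].decode("ascii").strip()
    return "unknown"


# Usage (the file name is a placeholder for your dump):
# with open("usb_image.bin", "rb") as image:
#     print(guess_filesystem(image.read(512)))
```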

If that doesn't work, I would search for this sequence of bytes:



D0 CF 11 E0 A1 B1 1A E1

      

These are the "magic bytes", the signature at the start of Office document files (the OLE2 compound file format used by .doc). They may appear randomly in other data, but this is a start. If the files are fragmented, you will run into serious problems.
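A minimal sketch of that search in Python, assuming the whole 250 MB dump fits in memory. The names `find_signature` and `sector_aligned` are invented for this example. Files on FAT and NTFS normally start on a cluster boundary, so hits at a multiple of 512 bytes are the more promising candidates; the rest are more likely random matches.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_signature(blob: bytes, magic: bytes = OLE2_MAGIC) -> list:
    """Return every offset in blob at which magic occurs."""
    offsets = []
    position = blob.find(magic)
    while position != -1:
        offsets.append(position)
        position = blob.find(magic, position + 1)
    return offsets


def sector_aligned(offsets: list, sector_size: int = 512) -> list:
    """Keep only the offsets that fall on a sector boundary."""
    return [offset for offset in offsets if offset % sector_size == 0]


# Usage (the file name is a placeholder for your dump):
# with open("usb_image.bin", "rb") as image:
#     blob = image.read()
# for offset in sector_aligned(find_signature(blob)):
#     print(hex(offset))
```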

Also, try recreating parts of the document(s) in Word as they were, save that to a file, and extract chunks from it to search for in the blob (using a binary grep or whatever). If you have information from all parts of the file, you should be able to work out WHERE in the blob they are. Pulling it back together into a working DOC binary seems a long shot, but recovering the text itself shouldn't be impossible.
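If you remember a phrase from one of the documents, you can also search for the text directly. This sketch assumes the phrase was stored either as 8-bit Windows-1252 text or as UTF-16LE, the two encodings binary .doc files use for their text; `find_phrase` is a made-up helper name.

```python
def find_phrase(blob: bytes, phrase: str) -> dict:
    """Map each candidate encoding to the offsets where phrase occurs."""
    results = {}
    for encoding in ("cp1252", "utf-16-le"):
        try:
            needle = phrase.encode(encoding)
        except UnicodeEncodeError:
            # The phrase contains characters this encoding cannot represent.
            results[encoding] = []
            continue
        offsets = []
        position = blob.find(needle)
        while position != -1:
            offsets.append(position)
            position = blob.find(needle, position + 1)
        results[encoding] = offsets
    return results


# Usage: find_phrase(blob, "a sentence you remember typing")
```

Hits that cluster shortly after one of the signature offsets are a good sign that the document in between is not fragmented.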

+5




The Apache POI project has a library for reading and writing all kinds of MS Office documents. If the files are in the newer XML-based OOXML format (.docx), you will be looking for the beginning of a zip file, since the XML is compressed.
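For the zip case, here is a rough Python sketch using only the standard library. It looks for a zip local file header (`PK\x03\x04`), pairs it with a following end-of-central-directory record (`PK\x05\x06`), and keeps the slice only if it opens as a zip, contains `word/document.xml`, and passes the CRC check. The function names and the 64 MB size limit are assumptions for illustration, and this only works for files that are not fragmented.

```python
import io
import struct
import zipfile

LOCAL_HEADER = b"PK\x03\x04"
END_OF_CENTRAL_DIR = b"PK\x05\x06"


def looks_like_docx(chunk: bytes) -> bool:
    """True if chunk is an intact zip archive holding word/document.xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(chunk)) as archive:
            names = archive.namelist()
            return "word/document.xml" in names and archive.testzip() is None
    except Exception:
        # Corrupt candidates can fail in many ways; treat all as "not a docx".
        return False


def first_valid_docx(blob: bytes, start: int, max_size: int):
    """Try each end record after start; return the first valid slice."""
    limit = min(len(blob), start + max_size)
    end = blob.find(END_OF_CENTRAL_DIR, start, limit)
    while end != -1:
        if end + 22 <= len(blob):
            # The end record is 22 bytes plus a comment whose length is at +20.
            (comment_len,) = struct.unpack_from("<H", blob, end + 20)
            chunk = blob[start:end + 22 + comment_len]
            if looks_like_docx(chunk):
                return chunk
        end = blob.find(END_OF_CENTRAL_DIR, end + 1, limit)
    return None


def carve_docx_candidates(blob: bytes, max_size: int = 64 * 1024 * 1024) -> list:
    """Return (offset, data) pairs for every intact .docx found in blob."""
    carved = []
    start = blob.find(LOCAL_HEADER)
    while start != -1:
        chunk = first_valid_docx(blob, start, max_size)
        if chunk is None:
            start = blob.find(LOCAL_HEADER, start + 1)
        else:
            carved.append((start, chunk))
            # Skip past the carved file so its inner entries are not retried.
            start = blob.find(LOCAL_HEADER, start + len(chunk))
    return carved
```

Each carved `data` can be written to a `.docx` file and opened in Word. On a dump full of zip fragments this brute-force pairing can be slow, so it may be worth lowering `max_size`.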



+2








