HD Regular Expression Search

I am working on a project for my computer security class and I have a couple of questions. I got the idea to write a program that will search the entire hard drive to find email addresses. I'm just looking for addresses stored in plain text as it would be hard to find anything else. I figured that the best way to find addresses is to use a regular expression.

I've written an app in C# that works well enough, but I'd like to see if anyone has any better ideas. I am open to rewriting it in another language, since I assume C# is not the best fit for this type of thing. So far, the application starts at C:\ and recursively lists every file on the disk, skipping over the ones that are not accessible. It also skips all common image, video, audio, and compressed files, as well as files over 512 MB. This speeds it up quite a bit, but there is a slight chance that a large file might contain something useful. It takes about 12 seconds to build the list of files, and I estimate about an hour to check them all. One drawback is that it uses about 50% of the CPU while scanning.
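
For readers who want something concrete to compare against, the scan described above might be sketched roughly as follows in Python. This is not the asker's C# code. The email pattern, the extension list, and the names `EMAIL_RE`, `SKIP_EXTENSIONS`, `find_emails_in_file`, and `scan` are all assumptions made for illustration; only the 512 MB cap and the skip rules come from the description.

```python
import os
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 matcher).
# It works on bytes so files can be read without guessing their encoding.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical skip list; extend it with whatever media and archive types you see.
SKIP_EXTENSIONS = {".jpg", ".png", ".gif", ".mp3", ".avi", ".mp4", ".zip", ".rar"}
MAX_SIZE = 512 * 1024 * 1024  # the 512 MB cap mentioned in the question


def find_emails_in_file(path):
    """Return the set of address-like strings found in one file."""
    found = set()
    with open(path, "rb") as f:
        # An address cannot contain a newline, so scanning line by line
        # never splits a match and keeps memory use low for text files.
        for line in f:
            for match in EMAIL_RE.finditer(line):
                found.add(bytes(match.group()).decode("ascii"))
    return found


def scan(root):
    """Walk root recursively and map each file path to the addresses it holds."""
    results = {}
    # os.walk ignores directories it cannot list (its default onerror is None).
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in SKIP_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                if not os.path.isfile(path) or os.path.getsize(path) > MAX_SIZE:
                    continue
                emails = find_emails_in_file(path)
            except OSError:
                continue  # locked, deleted meanwhile, or permission denied
            if emails:
                results[path] = emails
    return results
```

Calling `scan("C:\\")` would return a dictionary keyed by file path. Text stored as UTF-16 would be missed by this byte pattern, which is one of the limits of looking for plain text only.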

I am looking for ideas on how to improve the search. Is there a faster, more efficient, or more thorough way to do this? I was also trying to figure out whether there is some way to tell in advance if a file contains plain text or not. Just let me know if you have any interesting ideas. Thank you.
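
On the plain-text question, one common heuristic is to read a small sample from the start of the file and reject it if it contains NUL bytes or too many control characters. The sketch below is only an illustration of that idea in Python; the function name `looks_like_text`, the 8192-byte sample size, and the 30% threshold are assumptions, not values taken from any particular tool.

```python
# Bytes treated as "texty": printable ASCII, common whitespace, and all
# high bytes (so UTF-8 and Latin-1 text is not penalised).
TEXT_BYTES = bytes(range(32, 127)) + b"\t\n\r\f\b" + bytes(range(128, 256))


def looks_like_text(path, sample_size=8192, max_odd_ratio=0.30):
    """Guess whether a file is plain text by sampling its first bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return False  # empty file: nothing worth scanning
    if b"\x00" in sample:
        return False  # NUL bytes almost never appear in 8-bit text files
    # Delete every texty byte; whatever remains is control characters.
    odd = sample.translate(None, TEXT_BYTES)
    return len(odd) / len(sample) <= max_odd_ratio
```

A check like this could replace or complement the extension list, at the cost of opening every file once. Note that it rejects UTF-16 text, because that encoding stores ASCII characters with NUL bytes.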

+2




3 answers


Honestly, the easiest way to do this is to use grep. As you improve your program, compare its speed to grep's, and once you get close, stop worrying about optimization. Alternatively, look at grep's source as an existing implementation of what you are trying to do.



+5




As mentioned, tools for this already exist if you install the Win32 ports of the UNIX tools. Alternatively, here is the Windows equivalent, as typed at the command prompt (inside a batch file, %i becomes %%i):



for /r c:\ %i in (*.*) do findstr /i /r "regular expression" "%i"

      

+1




You should just use grep + find. grep is optimized for quickly searching the contents of files, and find is optimized for producing lists of matching files for things like that. People have been optimizing these tools for a long time, so there is no need to reinvent the wheel.

0

