Lucene.NET - indexing one large file > 1 GB

Question

I have one XML file that I want to index using Lucene.NET. The file is basically a large collection of logs. The file itself is over 5 GB and I am developing on a system with 2 GB of RAM, so how can I do the indexing if I am not parsing the file and am creating no fields other than a single "text" field that should contain the file's data?

I am using some code from CodeClimber and am not sure at the moment what the best approach would be for indexing such a large single file.

Is there a way to pass the file data to the index in chunks? Below is the code that basically creates a text field and associates the file data with it.

Document doc = new Document();
doc.Add(new Field("Body", text, Field.Store.YES, Field.Index.TOKENIZED));
writer.AddDocument(doc);

Thanks for the guidance

+3

.net lucene lucene.net

user349026 19 Mar '12 at 9:30

2 answers

Indexing such large files is not a problem. Just parse your XML file with a SAX parser (which is event based and doesn't require the file to be loaded into memory to process it), buffer your input, and then add the document to your index at the end of each log event.
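.NET has no built-in SAX parser, but System.Xml.XmlReader gives the same forward-only, low-memory behaviour in pull style. The following is only a minimal sketch of the "buffer, then emit at the end of each log event" idea; the class name LogEventReader, the method ForEachEntry and the element name passed in are hypothetical, since the question does not show the file's actual structure.

```csharp
using System;
using System.Text;
using System.Xml;

public static class LogEventReader
{
    // Walks the XML node by node (the pull-style counterpart of SAX events),
    // buffers the text found inside each log element, and hands the buffered
    // text to onEntry when that element closes. Only one entry is ever held
    // in memory at a time.
    public static void ForEachEntry(XmlReader reader, string elementName, Action<string> onEntry)
    {
        StringBuilder buffer = null; // non-null only while inside an entry
        int entryDepth = -1;         // depth of the entry currently being buffered

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (buffer == null && reader.Name == elementName)
                    {
                        if (reader.IsEmptyElement)
                        {
                            // <entry/> has no end tag event, so emit it right away.
                            onEntry(string.Empty);
                        }
                        else
                        {
                            buffer = new StringBuilder();
                            entryDepth = reader.Depth;
                        }
                    }
                    break;

                case XmlNodeType.Text:
                case XmlNodeType.CDATA:
                    if (buffer != null)
                    {
                        buffer.Append(reader.Value);
                    }
                    break;

                case XmlNodeType.EndElement:
                    if (buffer != null && reader.Depth == entryDepth && reader.Name == elementName)
                    {
                        onEntry(buffer.ToString());
                        buffer = null;
                    }
                    break;
            }
        }
    }
}
```

In an indexer, the onEntry callback would be the place to build a Document from the buffered text and call writer.AddDocument(doc), with the reader created through XmlReader.Create on the file path so the file is streamed from disk.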

0

jpountz 19 Mar '12 at 11:05

LB · Accepted Answer · 2012-03-19T23:07:52+0000

You should use something like System.Xml.XmlReader, which doesn't load the whole XML into memory. But indexing the entire XML as a single document does not make sense, because every search would return either zero or one document (found or not found). So being able to pass the data in chunks wouldn't help you. Instead, while reading your XML file, you should split it into many documents (and fields) so that you can get reasonable search results.
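As a rough sketch of that splitting, the code below adds one Lucene document per log entry instead of one document for the whole file. It assumes each entry is an element whose name you pass in and that it may carry a "time" attribute; both are hypothetical, as are the LogIndexer and IndexEntries names. Field.Index.ANALYZED and Field.Index.NOT_ANALYZED are the names used since Lucene 2.4; older releases call them TOKENIZED (as in the question) and UN_TOKENIZED.

```csharp
using System.Xml;
using System.Xml.Linq;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class LogIndexer
{
    // Streams the XML and adds one Lucene document per matching element.
    // Returns the number of documents added.
    public static int IndexEntries(XmlReader reader, IndexWriter writer, string elementName)
    {
        int count = 0;
        reader.MoveToContent();

        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // XNode.ReadFrom materialises only this one element and leaves
                // the reader just past its end tag, so no extra Read() is needed.
                XElement entry = (XElement)XNode.ReadFrom(reader);

                Document doc = new Document();

                // Hypothetical "time" attribute, stored as-is for filtering and display.
                string time = (string)entry.Attribute("time") ?? string.Empty;
                doc.Add(new Field("Time", time, Field.Store.YES, Field.Index.NOT_ANALYZED));

                // The entry's text goes into the searchable "Body" field.
                doc.Add(new Field("Body", entry.Value, Field.Store.YES, Field.Index.ANALYZED));

                writer.AddDocument(doc);
                count++;
            }
            else
            {
                reader.Read();
            }
        }

        return count;
    }
}
```

With the reader opened through XmlReader.Create on the file path, memory use is bounded by the largest single entry rather than by the size of the file, and each search hit points at an individual log entry.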

"how can I do the indexing when I am not parsing the file and creating no other fields other than 'text' that the file should contain data"

What a wonderful world it will be
