What changes in the Lucene index files when a document is added, updated, or deleted?

I am working with Lucene 4.10.2 (the latest version) from Java, with Oracle 12c as the database.

I have indexed one user table that has 1 million rows. (Consider LinkedIn Users Table)

Can anyone explain exactly what happens to the index folder (where the index files are stored) when we add, update, or delete a document?

[Image: common Lucene index folder]

I am trying to understand the file structure of the Lucene folder where all the index files are located.

This is a one-to-many relationship structure (we are looking for no login); later I will move on to many-to-many relationships (connections, connections of connections, 1:1 for users).

Let me know if I'm right / wrong.


1 answer


A Lucene index consists of several "segments". Each segment is written only once: when you call commit(), or when the IndexWriter flushes automatically (it can be configured to flush when RAM usage reaches a given threshold). Typically, in an index search, each segment is searched sequentially and the results are merged together. Lucene works this way because changing an existing segment would be a very slow process. Segments can be merged to improve search performance. [1]
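The write-once behaviour described above can be sketched with a toy model in plain Java. This is not Lucene code: ToySegmentedIndex and all of its methods are hypothetical names, documents are plain strings, and "search" is a substring match. It only illustrates that a commit freezes buffered documents into a new segment, that a search visits each segment in turn and merges the hits, and that merging segments changes the layout but not the results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only (hypothetical names, not the Lucene API).
final class ToySegmentedIndex {
    // Committed segments; each one is frozen and never modified again.
    private final List<List<String>> segments = new ArrayList<>();
    // Documents added since the last commit; not yet searchable in this model.
    private List<String> buffer = new ArrayList<>();

    void add(String text) {
        buffer.add(text);
    }

    // Freezes the buffered documents into one new segment.
    void commit() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(Collections.unmodifiableList(buffer));
        buffer = new ArrayList<>();
    }

    // Visits every segment in order and merges the per-segment hits.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    // Rewrites all segments into a single new one; search results stay the same.
    void mergeAll() {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(Collections.unmodifiableList(merged));
        }
    }
}
```

In this model, two commits produce two segments, and mergeAll() collapses them into one without changing what search() returns.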

In your example, the files starting with _0 are the first segment and the files starting with _1 are the second segment. The .cfe and .cfs files are "compound files": they contain all the index files for that segment (like a zip file). See the documentation on file extensions and file formats for the default codec for more information.
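To see which files belong to which segment, the file names can be grouped by their prefix. The sketch below uses only the Java standard library and assumes the naming convention visible in the example (per-segment files start with an underscore followed by the segment name, everything else is index-wide). SegmentFileGrouper and its methods are hypothetical helper names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: groups index file names by segment prefix.
final class SegmentFileGrouper {

    // Returns the segment prefix such as "_0", or null for index-wide files
    // (for example segments_N or write.lock).
    static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = 1;
        // The segment name is the run of letters and digits after the underscore.
        while (end < fileName.length() && Character.isLetterOrDigit(fileName.charAt(end))) {
            end++;
        }
        return end > 1 ? fileName.substring(0, end) : null;
    }

    // Maps each segment prefix to its files; index-wide files go under "(index-wide)".
    static Map<String, List<String>> group(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentOf(name);
            String key = segment == null ? "(index-wide)" : segment;
            bySegment.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        return bySegment;
    }
}
```

For a real index folder, the names could come from java.io.File.list() on the index directory; running the grouping before and after a commit shows which segment files appeared or disappeared.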

So your three operations work like this:

Add: Documents will always be added to a new segment.



Delete: Deleted documents are not actually removed from the index. Instead, a flag is set to mark the document as deleted. Documents that are not deleted are called "live documents". Deleted documents still affect scoring through the document frequency statistics, which are not updated until the segments are merged.

Update: An update is just an atomic delete followed by an add.
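The three operations can be illustrated with a small toy model in plain Java. It is a sketch under simplifying assumptions, not Lucene code: ToyLiveDocs, Doc and Segment are hypothetical names, every add call immediately becomes its own segment, and the atomicity of a real update is not modelled. What it does show is that adds never touch existing segments, deletes only set a flag, and flagged documents disappear only when segments are merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model only (hypothetical names, not the Lucene API).
final class ToyLiveDocs {
    static final class Doc {
        final String id;
        final String text;
        Doc(String id, String text) { this.id = id; this.text = text; }
    }

    static final class Segment {
        final List<Doc> docs;                 // never changed after creation
        final BitSet deleted = new BitSet();  // deletion flags, the only mutable part
        Segment(List<Doc> docs) { this.docs = docs; }
    }

    final List<Segment> segments = new ArrayList<>();

    // Add: the batch becomes a brand-new segment; existing segments are untouched.
    void add(Doc... batch) {
        segments.add(new Segment(new ArrayList<>(Arrays.asList(batch))));
    }

    // Delete: only flags the matching documents; their data stays where it is.
    void delete(String id) {
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (s.docs.get(i).id.equals(id)) {
                    s.deleted.set(i);
                }
            }
        }
    }

    // Update: flag the old version as deleted, then add the new version.
    void update(String id, String text) {
        delete(id);
        add(new Doc(id, text));
    }

    // Counts every stored document, including those flagged as deleted.
    int storedDocs() {
        int n = 0;
        for (Segment s : segments) n += s.docs.size();
        return n;
    }

    // Counts only the documents that are not flagged as deleted.
    int liveDocs() {
        int n = 0;
        for (Segment s : segments) n += s.docs.size() - s.deleted.cardinality();
        return n;
    }

    // Merge: copies live documents into one new segment; flagged ones are dropped.
    void merge() {
        List<Doc> live = new ArrayList<>();
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (!s.deleted.get(i)) live.add(s.docs.get(i));
            }
        }
        segments.clear();
        if (!live.isEmpty()) segments.add(new Segment(live));
    }
}
```

In this model, updating one of two documents leaves three stored documents in two segments but only two live ones; the stored count drops to match the live count only after merge().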

[1] http://blog.trifork.com/2011/11/21/simon-says-optimize-is-bad-for-you/
