What's a good database for full-text search over a large number of relatively small text documents? (C# backend)

I am developing a system that needs to ingest a large number of documents. I want to support full-text search of the document content as well as of other metadata (keywords, sentiment analysis). How the keyword and sentiment analysis is done is beyond the scope of this question, but it is worth noting that this metadata should live alongside the searchable documents.

Basic assumptions:

  • By "large" I mean a few hundred thousand documents initially, with the goal of reaching millions.
  • Documents are 0-15 KB each.
  • The documents are text (UTF-8).
  • Full-text search of the document content is required.
  • Everything is hosted on a single machine; no cloud or distributed services.
  • New documents are inserted continuously (about 1-2 per second).
  • Simple free-text searches.
  • More complex query use cases:
    • Show me all the docs that refer to "widgets" and are positive within a given date range.
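
To make that last use case concrete, here is a minimal, language-neutral sketch in Python (not C#, and not tied to Lucene or any particular database). It shows one way a combined query could work: an inverted index maps terms to document IDs, and the metadata stored alongside each document is used to filter the candidates. The names `Doc`, `TinyIndex`, the `sentiment` labels and the sample documents are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
import re


@dataclass
class Doc:
    doc_id: int
    text: str
    sentiment: str  # hypothetical label, e.g. "positive" or "negative"
    created: date


class TinyIndex:
    """Toy inverted index: term -> set of doc ids, with metadata kept alongside."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        # Lowercase and split on word characters; a real engine would use an analyzer.
        for term in re.findall(r"\w+", doc.text.lower()):
            self.postings[term].add(doc.doc_id)

    def search(self, term, sentiment, start, end):
        # Full-text lookup first, then filter the candidates on metadata.
        candidates = self.postings.get(term.lower(), set())
        return sorted(
            doc_id
            for doc_id in candidates
            if self.docs[doc_id].sentiment == sentiment
            and start <= self.docs[doc_id].created <= end
        )


index = TinyIndex()
index.add(Doc(1, "Our widgets shipped early and work great", "positive", date(2013, 5, 2)))
index.add(Doc(2, "The widgets broke after a week", "negative", date(2013, 5, 3)))
index.add(Doc(3, "Widgets are still great", "positive", date(2012, 1, 9)))
index.add(Doc(4, "Nothing to report on gadgets", "positive", date(2013, 5, 4)))

hits = index.search("widgets", "positive", date(2013, 1, 1), date(2013, 12, 31))
print(hits)
```

Whichever engine ends up being used, the point is the same: if sentiment and date are stored as fields of the indexed document, the whole query can be answered in one place rather than by joining search hits against a separate store.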

C# is the language of choice for fetching documents, processing them, and storing them in and retrieving them from the database. So having C# bindings is a big plus, or at least an easy way to bridge the gap.

Naive approach

A naive approach is to use MySQL along with Apache Lucene. The document content would either be stored in files, with links to them in the database, or be stored directly as a text field in the database.

Then I could use one of the C# wrappers for Lucene, such as Lucene.Net.

My concern with this approach is whether the volume of my data, and what I want to do with it, is too much for MySQL. I know premature optimization is unwise, and people often assume they need some kind of big-data solution when a regular SQL database would work just fine. My other main concern is that this approach will be clunky and cumbersome to develop against compared to some of the potential alternatives.

Alternatives

From some research, one alternative that looks promising is using CouchDB with Lucene. I came across two libraries that address this:

What I am looking for:

I haven't worked much with data at this scale. I'm interested in:

  • Do this data volume and use case call for a non-relational database?
  • Should the documents be stored in a database or as files with links in the database?
  • Is there a database / full text search technology that is particularly suited to this scenario that I have not considered?

1 answer


I suggest you take a look at RavenDB. It uses Lucene and is 100% .NET. It has text analyzers for full-text indexing and supports fuzzy searches.


