Why should (or shouldn't) a search query return only document IDs?

Question

For a new project, I am building a system for an e-commerce site. The idea is to import products from suppliers and, instead of inserting them directly into our catalog, store all the information in a staging area. Each vendor has its own staging area (i.e. a table in the database), and I will flatten the multiple staging areas into a single entity (currently one table, but later perhaps Sphinx or Solr). Our merchandisers will then be able to search the staged product fields (name and description), see a list of matching products, and choose to have those products migrated to the live catalog. The search will query one table (the flattened staging areas).

My design only calls for storing searchable and filterable fields in the one flattened table, e.g. name, description, provider_id, provider_prod_id, etc. The searches will return only the ID and class (vendor_id) of each matching item, which together identify the staging area the product came from.

Another senior engineer thinks the flattened search table should also include other meta fields (which would not be searchable) that could be used when "pushing" products from staging into the live catalog. He also believes the search should return all of this other information.

I feel very strongly that only searchable fields belong in the flattened table, and that the search should return only class/ID pairs, which can then be used to retrieve all the other necessary metadata about the product (a simple select * from table_class where id in (1,2,3)).
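As a minimal sketch of that second step, the class/ID pairs can be grouped by class and resolved with one `IN (...)` query per staging table. The table names (`staging_vendor_1`, `staging_vendor_2`), their columns, and the `hydrate` helper are hypothetical, and SQLite stands in for whatever database the project actually uses.

```python
import sqlite3
from collections import defaultdict


def hydrate(conn, hits, table_for_class):
    """Fetch full staging rows for (class, id) pairs returned by a search."""
    ids_by_class = defaultdict(list)
    for cls, item_id in hits:
        ids_by_class[cls].append(item_id)

    rows = {}
    for cls, ids in ids_by_class.items():
        # Table names come from a fixed whitelist, never from user input.
        table = table_for_class[cls]
        placeholders = ",".join("?" for _ in ids)
        sql = f"SELECT * FROM {table} WHERE id IN ({placeholders})"
        for row in conn.execute(sql, ids):
            rows[(cls, row["id"])] = dict(row)

    # Keep the order in which the search ranked the hits.
    return [rows[hit] for hit in hits if hit in rows]


# Hypothetical staging tables, one per vendor, with different extra columns.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE staging_vendor_1 (id INTEGER PRIMARY KEY, name TEXT, weight_kg REAL);
    CREATE TABLE staging_vendor_2 (id INTEGER PRIMARY KEY, name TEXT, colour TEXT);
    INSERT INTO staging_vendor_1 VALUES (1, 'Red kettle', 1.2), (2, 'Blue kettle', 1.3);
    INSERT INTO staging_vendor_2 VALUES (1, 'Steel kettle', 'silver');
""")
tables = {1: "staging_vendor_1", 2: "staging_vendor_2"}

hits = [(2, 1), (1, 2)]  # (vendor_id, id) pairs as a search might return them
products = hydrate(conn, hits, tables)
```

Each vendor table can keep its own unvalidated meta columns, and the flattened table never needs to know about them.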

Part of my reasoning is that this will make it easier to move the flattened table from the database to a search server like Sphinx or Solr later on, and the rest of the code should not have to change just because the search implementation has changed.
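One way to picture this argument is a small search interface that only ever hands back (vendor_id, id) pairs. The sketch below is an assumption about how such a boundary might look, not a description of the actual system: `ProductSearch`, `InMemorySearch`, and `find_candidates` are hypothetical names, and the in-memory backend merely stands in for the flattened table.

```python
from typing import Protocol

Hit = tuple[int, int]  # (vendor_id, staged product id)


class ProductSearch(Protocol):
    """Anything that can turn a text query into (vendor_id, id) pairs."""

    def search(self, text: str) -> list[Hit]: ...


class InMemorySearch:
    """Stand-in for the flattened database table."""

    def __init__(self, rows):
        # Each row is a dict with vendor_id, id, name and description.
        self._rows = rows

    def search(self, text):
        needle = text.lower()
        return [
            (r["vendor_id"], r["id"])
            for r in self._rows
            if needle in r["name"].lower() or needle in r["description"].lower()
        ]


def find_candidates(backend: ProductSearch, text: str) -> list[Hit]:
    # Callers depend only on the pair format, not on the backend.
    return backend.search(text)


backend = InMemorySearch([
    {"vendor_id": 1, "id": 10, "name": "Red kettle", "description": "1.7 l"},
    {"vendor_id": 2, "id": 7, "name": "Toaster", "description": "Matches the kettle"},
    {"vendor_id": 2, "id": 8, "name": "Blender", "description": "600 W"},
])
candidates = find_candidates(backend, "kettle")
```

A Sphinx- or Solr-backed class with the same `search` method could replace `InMemorySearch` without touching `find_candidates` or its callers, because the return type contains nothing backend-specific.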

Am I on the right track? How can I convince the other engineer that it is important to keep only searchable fields and return only IDs? Or, more specifically, why should a search application return only object IDs?

+2

search full-text-search solr sphinx

safoo 29 Sep '09 at 21:30

5 answers

You should use each tool for what it does best. A full-text search engine like Solr or Sphinx is great at searching text fields and ranking hits. It has no particular advantage in selectively retrieving stored data; that is what the database is optimized for. So yes, you are on the right track. See Search Engine Versus DBMS for other issues related to deciding what to store in a search engine.

+2

Yuval F 01 oct. '09 at 6:18

In the case of Sphinx, it only returns document IDs and named attributes (in most cases, attributes are numeric data). I would say you have the right idea, as the other metadata is just a JOIN away from the flattened table if you need it.

0

Ty W Sep 30 '09 at 12:40

You can think of Solr as a powerful index. Since an index returns an ID, it makes sense for Solr to do the same.

You can use Solr's fl query parameter to ask for IDs only, e.g. fl=id.
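For illustration, here is a small sketch of building such a request URL. The base URL, the core name `staging`, and the `solr_id_query_url` helper are assumptions; `q`, `fl`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def solr_id_query_url(base_url, text, rows=20):
    """Build a Solr select URL that asks for the id field only."""
    params = {
        "q": text,     # the query itself
        "fl": "id",    # field list: return only the id of each match
        "rows": rows,  # page size
        "wt": "json",  # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name.
url = solr_id_query_url("http://localhost:8983/solr/staging", "description:kettle")
```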

There is, however, one feature for which Solr has to return some data to you: highlighting the search terms in matched documents. If you don't need that, then using Solr to get only IDs is fine (I'm assuming you only want a list of documents, plus other features like facets, related documents, or the spell checker).

What matters, however, is how you build your objects in your search function: from the DB, using Solr only to get IDs; from the returned Solr fields (assuming they are stored); or even from a combination of both. Think of Solr providing the "highlighted" content and the DB providing the other fields. Again, if you don't need highlighting, this is not an issue.
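The combined approach can be sketched as follows, assuming a JSON response in Solr's usual shape, where `response.docs` holds the matched IDs and `highlighting` is keyed by document ID. The sample response, the `db_rows` dictionary, and the `merge_hits` helper are made up for illustration.

```python
def merge_hits(solr_response, db_rows, highlight_field="description"):
    """Combine Solr ids and highlight snippets with rows loaded from the DB."""
    highlights = solr_response.get("highlighting", {})
    merged = []
    for doc in solr_response["response"]["docs"]:
        doc_id = doc["id"]
        row = db_rows.get(doc_id)
        if row is None:
            continue  # the index and the database are out of sync; skip the hit
        snippets = highlights.get(doc_id, {}).get(highlight_field, [])
        merged.append({**row, "snippet": snippets[0] if snippets else None})
    return merged


# Hand-written sample response; only ids were requested (fl=id).
sample_response = {
    "response": {"numFound": 2, "docs": [{"id": "7"}, {"id": "3"}]},
    "highlighting": {
        "7": {"description": ["A stainless steel <em>kettle</em>"]},
        "3": {},
    },
}
# Rows as they might come back from the database, keyed by id.
db_rows = {
    "7": {"id": "7", "name": "Steel kettle", "price": 29.0},
    "3": {"id": "3", "name": "Kettle descaler", "price": 4.5},
}
results = merge_hits(sample_response, db_rows)
```

Everything except the snippet comes from the database, so a stale index can affect which items are listed but not the values shown for them.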

0

jeje 06 oct. '09 at 15:21

I use Solr with thousands of documents, but I only return IDs for the following reasons:

For Solr:
- if a sync error creeps in, it doesn't really matter (especially in your case, where displaying a wrong price could be a big problem; the item may show up in the wrong place in the results, but the data displayed is correct)
- you save a lot of time by not asking Solr to return the "description" of the documents (that is, many lines of text)

For your database:
- you can cache your results by ID, which makes it even faster (you don't need all the data from Solr every time!)
- you build results the same way everywhere (you don't need one method to build HTML from Solr data and another to build it from your database)
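The caching point can be sketched like this, assuming a plain dictionary as the cache and a hypothetical `load_from_db` callable that returns rows keyed by ID. A real setup might use memcached or similar instead.

```python
def fetch_products(ids, cache, load_from_db):
    """Return products for ids in order, loading only cache misses from the DB."""
    missing = [i for i in ids if i not in cache]
    if missing:
        cache.update(load_from_db(missing))
    return [cache[i] for i in ids if i in cache]


# Fake database and a counter to show how often it is actually hit.
FAKE_DB = {1: {"id": 1, "name": "Red kettle"}, 2: {"id": 2, "name": "Blue kettle"}}
db_calls = []


def load_from_db(ids):
    db_calls.append(list(ids))
    return {i: FAKE_DB[i] for i in ids if i in FAKE_DB}


cache = {}
first = fetch_products([2, 1], cache, load_from_db)   # both ids loaded from the DB
second = fetch_products([1, 2], cache, load_from_db)  # served entirely from the cache
```

The same function serves any code path that has IDs, whether they came from Solr or from somewhere else.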

I think there are many more reasons...

0

Vincent peres 13 oct. '09 at 10:47

Jacob G · Accepted Answer · 2009-09-29T21:41:32+0000

I think you are on the right track. If these other fields do not help to uniquely identify a staged item, or to let the user filter staged items, then the data is basically useless until the item is brought into the live environment. If the other engineer thinks the additional metadata will help users make a more informed decision, then you can make those additional fields searchable as well (which keeps them within your stated goals for the table).

The only reason I can think of for including other non-searchable data is to improve performance when pushing items live.
