Bad score due to different maxDocs in the IDF
I have two documents whose title field is:
- News
- New Website
If I search for the term "new website", the score for the News document is significantly higher than for the New Website one, which is clearly not what I want. I ran the query with explain enabled and got:
'hits': [{'_explanation': {'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=6)',
'value': 2.0986123},
{'desc': 'queryNorm',
'value': 0.14544667}],
'value': 0.3052362},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=6)',
'value': 2.0986123},
{'desc': 'fieldNorm(doc=0)',
'value': 0.625}],
'value': 1.3116326}],
'value': 0.40035775}],
'value': 0.40035775}],
'value': 0.40035775},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.20017888}],
'value': 0.20017888},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.10008944},
'_id': '2ff1307b536102e41e7daaccaf7edc69b16a348c',
'_index': 'scrapy',
'_node': 'D9SgrDb5RnO4NMAJMHiAOA',
'_score': 0.100089446,
'_shard': 3,
'_source': {'title': ['\n News ? E/CIS\n '],
'url': 'http://178.4.12.128:8888/news/'},
'_type': 'pages'},
{'_explanation': {'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'queryNorm',
'value': 0.46183997}],
'value': 0.1417169},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'fieldNorm(doc=0)',
'value': 0.5}],
'value': 0.15342641}],
'value': 0.021743115}],
'value': 0.021743115},
{'desc': 'weight(title:websit in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'queryNorm',
'value': 0.46183997}],
'value': 0.1417169},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'fieldNorm(doc=0)',
'value': 0.5}],
'value': 0.15342641}],
'value': 0.021743115}],
'value': 0.021743115}],
'value': 0.04348623}],
'value': 0.04348623},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.021743115},
'_id': '265988d175a2b4a2ae2e462509089d5f701ed372',
'_index': 'scrapy',
'_node': 'D9SgrDb5RnO4NMAJMHiAOA',
'_score': 0.021743115,
'_shard': 0,
'_source': {'title': ['\n New Website ? E/CIS\n '],
'url': 'http://178.4.12.128:8888/news/2015-new-website/'},
'_type': 'pages'}],
'max_score': 0.100089446,
'total': 2}
Note: I've shortened "details" to "det" and "description" to "desc" to save space.
It looks like the biggest difference comes from the maxDocs values used in the idf calculation. Why do they differ? I thought maxDocs was the number of documents in the index, so shouldn't it be the same for both hits?
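For reference, both idf values in the explain output can be reproduced with Lucene's classic TF/IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)); a quick sketch:

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    # Lucene classic similarity: idf(t) = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# First hit: shard-local statistics with maxDocs=6
print(idf(1, 6))  # ~2.0986123
# Second hit: shard-local statistics with maxDocs=1
print(idf(1, 1))  # ~0.30685282
```

So the scores themselves are computed consistently; it is the maxDocs input that differs between the two hits.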
More details
The full query, mapping, and analyzer settings follow, in case they are relevant:
Query
My request:
'multi_match': {
'query': 'new website',
'type': 'most_fields',
'fields': ['title.raw^15', 'title^10'],
'analyzer': 'whitespace_analyzer',
}
Mapping
'title': {
'type': 'string',
'store': 'yes',
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
'fields': {
'raw': {
'type': 'string',
'store': 'yes',
"search_analyzer": "whitespace_analyzer",
"index": "not_analyzed",
},
}
},
Analyzer and filter
'analysis': {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"html_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": ["html_strip"],
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
]
},
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": ["html_strip"], # Strips the html tags
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
]
}
}
}
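For intuition about how the nGram_filter above (min_gram=2, max_gram=20) shreds each token, here is a rough pure-Python sketch; it is illustrative only, not Elasticsearch's actual implementation, and the emission order may differ:

```python
def ngrams(token: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    # Emit every substring of token whose length is between
    # min_gram and max_gram (capped at the token's own length).
    return [token[i:i + n]
            for n in range(min_gram, min(max_gram, len(token)) + 1)
            for i in range(len(token) - n + 1)]

print(ngrams("new"))  # ['ne', 'ew', 'new']
```

This is why the ngram-analyzed title field indexes many short fragments per word, while the whitespace_analyzer used at search time emits whole terms.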
The default search type is query_then_fetch. Both query_then_fetch and query_and_fetch compute term and document frequencies locally, per shard. Your two documents live on different shards (shard 3 and shard 0 in the output above), so each one is scored against different local statistics, which is why maxDocs differs.
If you need more accurate term/document frequencies, use dfs_query_then_fetch or dfs_query_and_fetch instead. These first collect the frequencies across all shards of the involved indices and then score with the global statistics.
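As a sketch (assuming the Python elasticsearch client and the index name "scrapy" from the question), the change is a single request parameter; the query body itself stays exactly as in the question:

```python
# The same multi_match query from the question, unchanged.
body = {
    "query": {
        "multi_match": {
            "query": "new website",
            "type": "most_fields",
            "fields": ["title.raw^15", "title^10"],
            "analyzer": "whitespace_analyzer",
        }
    }
}

# With the default query_then_fetch, each shard scores with its own
# local docFreq/maxDocs. Passing dfs_query_then_fetch makes
# Elasticsearch pre-fetch global term statistics before scoring:
params = {"search_type": "dfs_query_then_fetch"}

# Hypothetical client call (not run here):
# es.search(index="scrapy", body=body, params=params)
```

With single-shard indices (or enough documents that per-shard statistics even out) the default search type usually behaves fine, which is why this mostly bites small test datasets like this one.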
This article gives a more detailed explanation