Bad score due to different maxDocs in IDF

I have two documents with a title field:

  • News
  • New Website

If I search for the term "new website", the score for the News document is significantly higher than for the other, which is clearly not what I want. I ran the query with explain enabled and got:

'hits': [{'_explanation': {'desc': 'product of:',
   'det': [{'desc': 'sum of:',
    'det': [{'desc': 'product of:',
     'det': [{'desc': 'sum of:',
      'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
       'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
        'det': [{'desc': 'queryWeight, product of:',
         'det': [{'desc': 'idf(docFreq=1, maxDocs=6)',
          'value': 2.0986123},
          {'desc': 'queryNorm',
           'value': 0.14544667}],
          'value': 0.3052362},
          {'desc': 'fieldWeight in 0, product of:',
           'det': [{'desc': 'tf(freq=1.0), with freq of:',
            'det': [{'desc': 'termFreq=1.0',
             'value': 1.0}],
            'value': 1.0},
            {'desc': 'idf(docFreq=1, maxDocs=6)',
             'value': 2.0986123},
            {'desc': 'fieldNorm(doc=0)',
             'value': 0.625}],
            'value': 1.3116326}],
          'value': 0.40035775}],
       'value': 0.40035775}],
      'value': 0.40035775},
      {'desc': 'coord(1/2)',
       'value': 0.5}],
      'value': 0.20017888}],
    'value': 0.20017888},
    {'desc': 'coord(1/2)',
     'value': 0.5}],
    'value': 0.10008944},
    '_id': '2ff1307b536102e41e7daaccaf7edc69b16a348c',
    '_index': 'scrapy',
    '_node': 'D9SgrDb5RnO4NMAJMHiAOA',
    '_score': 0.100089446,
    '_shard': 3,
    '_source': {'title': ['\n       News ?  E/CIS\n    '],
     'url': 'http://178.4.12.128:8888/news/'},
    '_type': 'pages'},
    {'_explanation': {'desc': 'product of:',
     'det': [{'desc': 'sum of:',
      'det': [{'desc': 'sum of:',
       'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
        'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
         'det': [{'desc': 'queryWeight, product of:',
          'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
           'value': 0.30685282},
           {'desc': 'queryNorm',
            'value': 0.46183997}],
           'value': 0.1417169},
           {'desc': 'fieldWeight in 0, product of:',
            'det': [{'desc': 'tf(freq=1.0), with freq of:',
             'det': [{'desc': 'termFreq=1.0',
              'value': 1.0}],
             'value': 1.0},
             {'desc': 'idf(docFreq=1, maxDocs=1)',
              'value': 0.30685282},
             {'desc': 'fieldNorm(doc=0)',
              'value': 0.5}],
             'value': 0.15342641}],
           'value': 0.021743115}],
        'value': 0.021743115},
        {'desc': 'weight(title:websit in 0) [PerFieldSimilarity], result of:',
         'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
          'det': [{'desc': 'queryWeight, product of:',
           'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
            'value': 0.30685282},
            {'desc': 'queryNorm',
             'value': 0.46183997}],
            'value': 0.1417169},
            {'desc': 'fieldWeight in 0, product of:',
             'det': [{'desc': 'tf(freq=1.0), with freq of:',
              'det': [{'desc': 'termFreq=1.0',
               'value': 1.0}],
              'value': 1.0},
              {'desc': 'idf(docFreq=1, maxDocs=1)',
               'value': 0.30685282},
              {'desc': 'fieldNorm(doc=0)',
               'value': 0.5}],
              'value': 0.15342641}],
            'value': 0.021743115}],
         'value': 0.021743115}],
        'value': 0.04348623}],
      'value': 0.04348623},
      {'desc': 'coord(1/2)',
       'value': 0.5}],
      'value': 0.021743115},
      '_id': '265988d175a2b4a2ae2e462509089d5f701ed372',
      '_index': 'scrapy',
      '_node': 'D9SgrDb5RnO4NMAJMHiAOA',
     '_score': 0.021743115,
     '_shard': 0,
     '_source': {'title': ['\n       New Website ?  E/CIS\n    '],
      'url': 'http://178.4.12.128:8888/news/2015-new-website/'},
     '_type': 'pages'}],
          'max_score': 0.100089446,
          'total': 2}

      

Note: I've shortened "details" to "det" and "description" to "desc" to save space.

It looks like the biggest difference comes from the maxDocs values used in scoring. Why do they differ? I thought maxDocs was the number of documents in the index, so shouldn't it be the same for both hits?
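For what it's worth, both idf values in the explain output above can be reproduced with Lucene's classic formula, idf = 1 + ln(maxDocs / (docFreq + 1)); a minimal sketch:

```python
import math

def lucene_idf(doc_freq, max_docs):
    # Classic Lucene IDF: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

# The shard holding the "News" doc sees 6 documents locally:
print(lucene_idf(1, 6))  # ~2.0986123, matching the first explanation
# The shard holding the "New Website" doc sees only 1 document:
print(lucene_idf(1, 1))  # ~0.3068528, matching the second explanation
```

So the differing maxDocs values are each shard's local document count, not the index-wide total.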

More details

Details can be found below, but you may not need them:

Query

My query:

 'multi_match': {
    'query': 'new website',
    'type': 'most_fields',
    'fields': ['title.raw^15', 'title^10'],
    'analyzer': 'whitespace_analyzer',
 }
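The '_explanation' trees in the output above come from running this query with explain enabled. A minimal sketch of the full search body (the multi_match clause is the one above; 'explain' is the only addition):

```python
# Sketch: the search body with scoring explanations turned on.
# 'explain': True adds an '_explanation' tree to every hit.
search_body = {
    'explain': True,
    'query': {
        'multi_match': {
            'query': 'new website',
            'type': 'most_fields',
            'fields': ['title.raw^15', 'title^10'],
            'analyzer': 'whitespace_analyzer',
        }
    }
}
```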

      

Mapping

 'title': {
     'type': 'string',
     'store': 'yes',
     "index_analyzer": "nGram_analyzer",
     "search_analyzer": "whitespace_analyzer",
     'fields': {
         'raw': {
             'type': 'string',
             'store': 'yes',
             "search_analyzer": "whitespace_analyzer",
             "index": "not_analyzed",
         },
     }
 },

      

Analyzer and filter

  'analysis': {
      "filter": {
          "nGram_filter": {
              "type": "nGram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
              ]
          },
          "english_stop": {
              "type":       "stop",
              "stopwords":  "_english_"
          },
          "english_stemmer": {
              "type":       "stemmer",
              "language":   "english"
          },
          "english_possessive_stemmer": {
              "type":       "stemmer",
              "language":   "possessive_english"
          }
      },
      "analyzer": {
          "html_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "char_filter": ["html_strip"],
              "filter": [
                  'english_possessive_stemmer',
                  "lowercase",
                  'english_stop',
                  'english_stemmer',
                  "asciifolding",
              ]
          },
          "nGram_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "char_filter": ["html_strip"], # Strips the html tags
              "filter": [
                  'english_possessive_stemmer',
                  "lowercase",
                  'english_stop',
                  'english_stemmer',
                  "asciifolding",
                  "nGram_filter"
              ]
          },
          "whitespace_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [
                  'english_possessive_stemmer',
                  "lowercase",
                  'english_stop',
                  'english_stemmer',
                  "asciifolding",
              ]
          }
      }
  }
      



1 answer


The default search type is query_then_fetch. With both query_then_fetch and query_and_fetch, the term and document frequencies are calculated locally for each shard of the index, so each shard's maxDocs is its own local document count.

However, if you need more accurate term/document frequencies, you can use dfs_query_then_fetch / dfs_query_and_fetch. There the frequencies are calculated across all shards of the involved indices.
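A hedged sketch of what that change looks like, reusing the query from the question (search_type is passed as a URL parameter on the _search request):

```python
# Sketch: the same multi_match search, but with
# search_type=dfs_query_then_fetch so term/document frequencies are
# gathered from all shards before scoring, instead of per shard.
search_params = {'search_type': 'dfs_query_then_fetch'}
search_body = {
    'query': {
        'multi_match': {
            'query': 'new website',
            'type': 'most_fields',
            'fields': ['title.raw^15', 'title^10'],
            'analyzer': 'whitespace_analyzer',
        }
    }
}
# e.g. GET /scrapy/pages/_search?search_type=dfs_query_then_fetch
```

With only a handful of documents spread over several shards, the per-shard statistics diverge badly, which is why the effect is so visible here; on large indices the difference is usually negligible.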



This article gives a more detailed explanation







