Cardinality (distinct count) is larger than doc_count in Elasticsearch aggs

I wrote an aggs query to get the total count and the number of unique values per bucket, but the result is a little confusing.

The unique value is greater than doc_count.
Is that even possible?

I know that the cardinality aggregation is experimental and only returns an approximate count of distinct values.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

But this result is far too inaccurate. As you can see, many buckets report a unique count greater than their doc_count.
Is there a problem with the format of my request, or with the cardinality aggregation itself?

About half a million documents are indexed,
and there are 15 distinct eventID values.
I am using ES 1.4.

request

{
    "size": 0,
    "_source": false,
    "aggs": {
        "eventIds": {
            "terms": {
                "field": "_EventID_",
                "size": 0
            },
            "aggs": {
                "unique": {
                    "cardinality": {
                        "field": "UUID"
                    }
                }
            }
        }
    }
}

response

{
"took": 383,
"timed_out": false,
"_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
},
"hits": {
    "total": 550971,
    "max_score": 0,
    "hits": [

    ]
},
"aggregations": {
    "eventIds": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {
                "key": "red",
                "doc_count": 165110,
                "unique": {
                    "value": 27423
                }
            },
            {
                "key": "blue",
                "doc_count": 108376,
                "unique": {
                    "value": 94775
                }
            },
            {
                "key": "yellow",
                "doc_count": 78919,
                "unique": {
                    "value": 70094
                }
            },
            {
                "key": "green",
                "doc_count": 60580,
                "unique": {
                    "value": 78945
                }
            },
            {
                "key": "black",
                "doc_count": 49923,
                "unique": {
                    "value": 56200
                }
            },
            {
                "key": "white",
                "doc_count": 38744,
                "unique": {
                    "value": 45229
                }
            },

EDIT: more tests

I tried again with precision_threshold set to 10000, filtering on only one eventId, but the error in the result is the same: I expected a cardinality of less than 30,000, but got over 66,000 (which is more than the total number of matching documents).

doc_count: 65,672 (fine). Cardinality: 66,037 (greater than doc_count). Actual cardinality: about 23,000 (calculated with RDBMS scripts...).

request

{
    "size": 0,
    "_source": false,
    "query": {
        "term": {
            "_EventID_": "packdownload"
        }
    },
    "aggs": {
        "unique": {
            "cardinality": {
                "field": "UUID",
                "precision_threshold": 10000
            }
        }
    }
}

response

{
    "took": 28,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 65672,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "unique": {
            "value": 66037
        }
    }
}


1 answer


The highest allowed value for precision_threshold is 40,000. Raising it should improve the results somewhat, but with this many distinct values the count can still be off by around 20% in either direction, and that can happen even at lower cardinalities.
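For illustration, a sketch of the filtered request from the EDIT above with precision_threshold raised to its 40,000 maximum (same index and field names as in the question) could look like this:

{
    "size": 0,
    "_source": false,
    "query": {
        "term": {
            "_EventID_": "packdownload"
        }
    },
    "aggs": {
        "unique": {
            "cardinality": {
                "field": "UUID",
                "precision_threshold": 40000
            }
        }
    }
}

Counts at or below the precision threshold are close to exact; beyond it, the HyperLogLog++-based estimate becomes increasingly approximate, and higher thresholds also cost more memory per bucket.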


