Django Queryset: Need help optimizing this queryset

Question

Django Queryset: Need help optimizing this queryset

I'm trying to figure out some common combinations of tags from a list of educational question records.

In this example, I only consider an example with two tags (tag-tag), which should get an example of the result, for example: "point" + "curve" (65 records) "add" + "subtract" (40 records) ...

This is the desired output in the SQL statement:

SELECT a.tag, b.tag, count(*)
FROM examquestions.dbmanagement_tag as a
INNER JOIN examquestions.dbmanagement_tag as b on a.question_id_id = b.question_id_id
where a.tag != b.tag
group by a.tag, b.tag

Basically we get different tags with common questions that need to be identified in the list and grouped into the same combinations of matching tags.

I tried a similar query using django queryset:

    twotaglist = [] #final set of results

    alphatags = tag.objects.all().values('tag', 'type').annotate().order_by('tag')
    betatags = tag.objects.all().values('tag', 'type').annotate().order_by('tag')
    startindex = 0 #startindex reduced by 1 to shorten betatag range each time the atag changes. this is to reduce the double count of comparison of similar matches of tags
    for atag in alphatags:
        for btag in betatags[startindex:]:
            if (atag['tag'] != btag['tag']):
                commonQns = [] #to check how many common qns
                atagQns = tag.objects.filter(tag=atag['tag'], question_id__in=qnlist).values('question_id').annotate()
                btagQns = tag.objects.filter(tag=btag['tag'], question_id__in=qnlist).values('question_id').annotate()
                for atagQ in atagQns:
                    for btagQ in btagQns:
                        if (atagQ['question_id'] == btagQ['question_id']):
                            commonQns.append(atagQ['question_id'])
                if (len(commonQns) > 0):
                    twotaglist.append({'atag': atag['tag'],
                                        'btag': btag['tag'],
                                        'count': len(commonQns)})
        startindex=startindex+1

The logic works great, however, since I am fairly new to this platform, I am not sure if there is a shorter workaround to make it very efficient.

It currently takes about 45 seconds to query for roughly 5K X 5K comparisons :(

Addon: tag class

class tag(models.Model):
    id = models.IntegerField('id',primary_key=True,null=False)
    question_id = models.ForeignKey(question,null=False)
    tag = models.TextField('tag',null=True)
    type = models.CharField('type',max_length=1)

    def __str__(self):
        return str(self.tag)

+3

python sql django django-views django-queryset

jdtoh 20 jan. '13 at 9:45

source to share

2 answers

If I understood your question correctly, I would simplify things and do something like this

relevant_tags = Tag.objects.filter(question_id__in=qnlist)
#Here relevant_tags has both a and b tags

unique_tags = set()
for tag_item in relevant_tags:
    unique_tags.add(tag_item.tag)

#unique_tags should have your A and B tags

a_tag = unique_tags.pop()
b_tag = unique_tags.pop() 

#Some logic to make sure what is A and what is B

a_tags = filter(lambda t : t.tag == a_tag, relevant_tags)
b_tags = filter(lambda t : t.tag == b_tag, relevant_tags)

#a_tags and b_tags contain A and B tags filtered from relevant_tags

same_question_tags = dict()

for q in qnlist:
  a_list = filter(lambda a: a.question_id == q.id, a_tags)
  b_list = filter(lambda a: a.question_id == q.id, b_tags)
  same_question_tags[q] = a_list+b_list

The good thing is that you can expand it to N number of tags, iterate over the returned tags in a loop to get all unique ones, and then iterate to filter out the tag wise.

There are other ways to do this.

+2

Pratik mandrekar 20 jan. 13 at 10:06

source to share

Josh smeaton · Accepted Answer · 2013-01-20T11:49:21+0000

Unfortunately django doesn't allow joining unless a foreign key (or one-to-one) is used. You will need to do this in code. I found a way (completely untested) to do this with a single query, which should significantly improve the runtime.

from collections import Counter
from itertools import combinations

# Assuming Models
class Question(models.Model):
    ...

class Tag(models.Model):
    tag = models.CharField(..)
    question = models.ForeignKey(Question, related_name='tags')

c = Counter()
questions = Question.objects.all().prefetch_related('tags') # prefetch M2M
for q in questions:
    # sort them so 'point' + 'curve' == 'curve' + 'point'
    tags = sorted([tag.name for tag in q.tags.all()])
    c.update(combinations(tags,2)) # get all 2-pair combinations and update counter
c.most_common(5) # show the top 5

The above code uses Counters , itertools.combinations , and django prefetch_related , which should cover most of the above bits that might not be known. Have a look at these resources if the above code doesn't work exactly, and change accordingly.

If you are not using an M2M field in your model Question

, you can still refer to tags as if it were an M2M field using inverse relationships . See my edit, which changes the inverse relationship from tag_set

to tags

. I've made a few other changes that should work with how you defined your models.

If you don't specify related_name='tags'

, just change tags

in filters and prefetch_related to tag_set

and you should be good to go.

Django Queryset: Need help optimizing this queryset

More articles: