How do I optimize or simplify django object copy operations? (cannot use modules and db types)

Question

How do I optimize or simplify django object copy operations? (cannot use modules and db types)

I have to ask for some help with the assignment I received as an internship test in django. I had to make an imaginary api with rabbits and their carrots. Each rabbit had to have a lot of carrots, but the api had to be designed so that other vegetables could be added easily. I rejected a whole field for each vegetable and instead went for a vegetable object with the type and value of vegetables.

The problem is that the task includes a descending enumeration of rabbits sorted by carrots. They wanted me to implement heapsort, not allow database collation, no external libraries. While I had no problem with it, I did have problems with the time constraints they were thinking - for 20,000 rabbits, it needs to sort in less than 30 seconds, ideally 5 seconds. And it already takes 5 seconds with 200 rabbits (just sorting and serializing for json).

I make a request that only has rabbits with "carrot" vegetables. Then I insert it into a regular list and run the heapsort function on it.

How do I need to change it faster? Is it possible? I will be very happy if someone can help at least a little. Thank you in advance!

My models:

class Bunny(models.Model):
    """Bunny model for bunny usage"""
    def __str__(self):
        return self.name + " " + str(list(self.vegetables.all()))

    name = models.CharField("Name", max_length=50)
    userAccount = models.ForeignKey(User, on_delete=models.CASCADE)

    def getVegetable(self, vegetableType):
        for vegetable in self.vegetables.all():
            if vegetable.vegetableType == vegetableType:
                return vegetable
        return False


class Vegetable(models.Model):
    """Vegetable model for storing vegetable counts"""
    def __str__(self):
        return self.vegetableType + ":" + str(self.value)

    vegetableType = models.CharField(max_length=30, choices=vegetableChoices)
    value = models.PositiveIntegerField(default=0, validators=[MinValueValidator(0)])
    bunny = models.ForeignKey(Bunny, related_name="vegetables", on_delete=models.CASCADE)

My heapsort function:

def heapsort(bunnies, vegetableType):
    """Heapsort function for bunnies, works in place, descending"""

    for start in range((len(bunnies) - 2) // 2, -1, -1):
        siftdown(bunnies, start, len(bunnies) - 1, vegetableType)

    for end in range(len(bunnies) - 1, 0, -1):
        bunnies[end], bunnies[0] = bunnies[0], bunnies[end]
        siftdown(bunnies, 0, end - 1, vegetableType)
    return bunnies


def siftdown(bunnies, start, end, vegetableType):
    """helper function for heapsort"""
    root = start
    while True:
        child = root * 2 + 1
        if child > end:
            break
        if child + 1 <= end and bunnies[child].vegetables.get(vegetableType=vegetableType).value > bunnies[
                    child + 1].vegetables.get(vegetableType=vegetableType).value:
            child += 1
        if bunnies[root].vegetables.get(vegetableType=vegetableType).value > bunnies[child].vegetables.get(
                vegetableType=vegetableType).value:
            bunnies[root], bunnies[child] = bunnies[child], bunnies[root]
            root = child
        else:
            break

And also those performance tests they asked for (I don't know a better way. It just takes a long time to create rabbits)

def test_20000_rabbits_performance(self):
    print("Creating bunnies")
    register20000Bunnies()

    print("Created bunnies")
    timestart = time()

    url = reverse("api:list", args=["carrots"])

    response = self.client.get(url)
    timeMeasured = time() - timestart
    print("Sorted. Took: " + str(timeMeasured))

    self.assertEqual(response.status_code, status.HTTP_200_OK)

My opinion:

@api_view(["GET"])
def bunnyList(request, vegetableType):
""" Displays heap-sorted list of bunnies, in decreasing order.
    Takes word after list ("/list/xxx") as argument to determine
    which vegetable list to display"""
    if vegetableType in vegetablesChoices:
        bunnies =
    Bunny.objects.filter(vegetables__vegetableType=vegetableType)
        bunnies = list(bunnies)  # force into normal list

        if len(bunnies) == 0:
            return Response({"No bunnies": "there is %d bunnies with this vegetable" % len(bunnies)},
                        status=status.HTTP_204_NO_CONTENT)

        heapsort(bunnies, vegetableType)
        serialized = BunnySerializerPartial(bunnies, many=True)
        return Response(serialized.data, status=status.HTTP_200_OK)
    else:
        raise serializers.ValidationError("No such vegetable. Available are: " + ", ".join(vegetablesChoices))

Edit: just tested now, it currently takes 1202 seconds to sort ... My machine has 2 1.86 GHz cores, but still.

Edit2, new code:

@api_view(["GET"])
def bunnyList(request, vegetableType):
""" Displays heap-sorted list of bunnies, in decreasing order.
    Takes word after list ("/list/xxx") as argument to determine
    which vegetable list to display"""
if vegetableType in vegetablesChoices:
    vegetables =  Vegetable.objects.filter(vegetableType=vegetableType).select_related('bunny')
    vegetables = list(vegetables)

    if len(vegetables) == 0:
        return Response({"No bunnies": "there is 0 bunnies with this vegetable"},
                        status=status.HTTP_204_NO_CONTENT)

    heapsort(vegetables)

    bunnies = [vegetable.bunny for vegetable in vegetables]
    serialized = BunnySerializerPartial(bunnies, many=True)
    return Response(serialized.data, status=status.HTTP_200_OK)
else:
    raise serializers.ValidationError("No such vegetable. Available are: " + ", ".join(vegetablesChoices))

Updated folders:

def heapsort(vegetables):
"""Heapsort function for vegetables, works in place, descending"""

for start in range((len(vegetables) - 2) // 2, -1, -1):
    siftdown(vegetables, start, len(vegetables) - 1)

for end in range(len(vegetables) - 1, 0, -1):
    vegetables[end], vegetables[0] = vegetables[0], vegetables[end]
    siftdown(vegetables, 0, end - 1)
return vegetables


def siftdown(vegetables, start, end):
"""helper function for heapsort"""
root = start
while True:
    child = root * 2 + 1
    if child > end:
        break
    if child + 1 <= end and vegetables[child].value > vegetables[child+1].value:
        child += 1
    if vegetables[root].value > vegetables[child].value:
        vegetables[root], vegetables[child] = vegetables[child], vegetables[root]
        root = child
    else:
        break

My serializers:

class BunnySerializerPartial(serializers.ModelSerializer):
"""Used in list view, mirrors BunnySerializerFull but without account details"""
    vegetables = VegetableSerializer(many=True)

    class Meta:
        model = Bunny
        fields = ("name", "vegetables")


class VegetableSerializer(serializers.ModelSerializer):
"""Used for displaying vegetables, for example in list view"""
    class Meta:
        model = Vegetable
        fields = ("vegetableType", "value")

And queries from the toolbar:

SELECT ••• FROM "zajaczkowskiBoardApi_vegetable" INNER JOIN "zajaczkowskiBoardApi_bunny" ON ("zajaczkowskiBoardApi_vegetable"."bunny_id" = "zajaczkowskiBoardApi_bunny"."id") WHERE "zajaczkowskiBoardApi_vegetable"."vegetableType" = '''carrots'''


SELECT ••• FROM "zajaczkowskiBoardApi_vegetable" WHERE "zajaczkowskiBoardApi_vegetable"."bunny_id" = '141'

Second duplicate 20,000 times

+3

optimization python sorting django django-rest-framework

Nullbyte 01 jul. 17 at 15:49

source to share

1 answer

knbk · Accepted Answer · 2017-07-01T16:41:31+0000

This is a classic N + 1 query problem. You do one query to retrieve all the rabbits, but then keep doing bunnies[child].vegetables.get(vegetableType=vegetableType)

for each rabbit, which makes an additional query, and thus an additional reverse database for each rabbit. So you run 1 query for N bunnies and also around N queries to get all the vegetables (hence N + 1).

Database Feedback is one of the most expensive resources available to web developers. While the comparison takes place somewhere in the order of a nanosecond, the database feedback takes on the order of milliseconds. Make ~ 20K requests and it will add a few minutes soon.

The quick solution is to use prefetch_related('vegetables')

and exclusively use bunny.getVegetable('carrot')

to produce carrots. prefetch_related()

will execute one request to get all vegetables for all rabbits and cache them, so repeating self.vegetables.all()

in getVegetables()

won't do any additional requests.

However, there are better solutions. In this case, it seems that each rabbit should have at most 1 Vegetable

object defined vegetableType

. If you apply this at the database level, you don't have to worry about mistakes in your sorting algorithm if someone decides to add a tiny second type of Vegetable

type 'carrot'

. Instead, the database will stop them from doing this in the first place. To do this, you need a constraint unique_together

:

class Vegetable(models.Model):
    ...
    class Meta:
        unique_together = [
            ('vegetableType', 'bunny'),
        ]

Then, instead of collecting all the rabbits and preloading all the linked vegetables, you can get all the carrots and join the linked rabbits. Now you will have only one request:

carrots = Vegetable.objects.filter(vegetableType='carrot').select_related('bunny')

Since the combination is vegetableType

and is bunny

unique, you will not get any duplicate rabbits, and you will still get all bunnies that have carrots.

Of course, you'll have to adapt your algorithm to work with vegetables, not rabbits.

How do I optimize or simplify django object copy operations? (cannot use modules and db types)

More articles: