Limit child collections in the original SQLAlchemy query

I am creating an API that can return child resources if the user requests it. For example, a User has messages. I want the query to be able to limit the number of Message objects returned.

I found a helpful tip on how to limit the number of objects in child collections here. It basically points to the following approach:

class User(...):
    # ...
    messages = relationship('Messages', order_by='desc(Messages.date)', lazy='dynamic')

user = User.query.one()
user.messages.limit(10)

      

My use case sometimes involves returning a large number of users. If I were to follow the guidelines in that link and use .limit(), I would have to iterate over the entire set of users, calling .limit() on each one. That is much less efficient than, say, using LIMIT in the original SQL expression that created the collection.

My question is: is it possible to efficiently use declarative (N + 0) loading of a large set of objects while limiting the number of children in their child collections with SQLAlchemy?

UPDATE

To be clear, below is what I am trying to avoid:

users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()

      

I want to do something more like:

users = User.query.option(User.messages.limit(10)).all()

      

+4




4 answers


This answer comes from Mike Bayer on the sqlalchemy Google group. I am posting it here to help people.

TL;DR: I used version 1 of Mike's answer to solve my problem, because in this case I have no foreign keys involved in the relationship and therefore cannot use LATERAL. Version 1 worked fine, but don't forget to note the effect of the offset. It threw me for a while during testing, because I hadn't noticed it was set to something other than 0.

Code block for version 1:

subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))

      

Mike's answer: so, first off, ignore that you are using "declarative", which has nothing to do with querying, and in fact ignore Query at first as well, because this is primarily a SQL problem. You want a single SQL statement that does this. What query in SQL would load lots of rows from a primary table, joined to the first ten rows of a secondary table for each primary row?

LIMIT is tricky because it is not actually part of the usual "relational algebra" calculation. It sits outside of that, because it is an artificial cutoff on the rows. For example, my first thought on how to do this was wrong:

    select * from users left outer join (select * from messages limit 10) as anon_1 on users.id = anon_1.user_id

      

This is wrong because it only gets the first ten messages in the aggregate, regardless of user. We want to get the first ten messages for each user, which means we need to do this "select from messages limit 10" individually for each user. That is, we need to correlate somehow. A correlated subquery, though, is not normally allowed as a FROM element, only as a SQL expression: it can return just one column and one row; we cannot join to a correlated subquery in plain vanilla SQL. We can, however, correlate inside the ON clause of a JOIN to make this possible in vanilla SQL.
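To see concretely why the single global LIMIT is wrong, here is the difference simulated in plain Python over stand-in (user_id, date_key) tuples (hypothetical data, not the ORM):

```python
from collections import defaultdict

# Stand-in (user_id, date_key) tuples -- hypothetical data, not the ORM.
# User 1's messages are all newer than user 2's.
messages = [(1, d) for d in range(20, 40)] + [(2, d) for d in range(20)]

# The wrong version: one global LIMIT 10 over the whole set -- every
# surviving row comes from user 1, and user 2 gets nothing at all.
newest_first = sorted(messages, key=lambda m: m[1], reverse=True)
globally_limited = newest_first[:10]
print({u for u, _ in globally_limited})  # {1}

# What we actually want: the ten newest messages per user.
per_user = defaultdict(list)
for u, d in newest_first:
    if len(per_user[u]) < 10:
        per_user[u].append(d)
print({u: len(v) for u, v in per_user.items()})  # {1: 10, 2: 10}
```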

But first, if we are on a modern version of Postgresql, we can break that usual correlation rule with the LATERAL keyword, which allows correlation in the FROM clause. LATERAL is only supported on modern Postgresql versions, and it makes the task easy:

    select * from users left outer join lateral
    (select * from messages where messages.user_id = users.id order by messages.date desc limit 10) as anon_1 on users.id = anon_1.user_id

      

On the SQLAlchemy side, the LATERAL keyword is supported as of version 1.1. The above query looks like this:



subq = s.query(Messages).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
     options(contains_eager(User.messages, alias=subq))

      

Note that above, in order to SELECT both users and messages and produce them in the User.messages collection, the contains_eager() option must be used, and for that the "dynamic" relationship has to go away. This is not the only option: you could, for example, build a second non-"dynamic" relationship alongside User.messages, or just load from a query of (User, Message) tuples directly and organize them as needed.
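A minimal sketch of that last alternative, grouping loaded tuples yourself, using hypothetical plain (user_id, message_id) rows in place of real ORM objects:

```python
from collections import defaultdict

# Hypothetical (user_id, message_id) rows, standing in for the tuples a
# query of (User, Messages) would return.
rows = [(1, 101), (1, 102), (2, 201), (1, 103), (2, 202)]

# Group each user's messages under its id, preserving query order.
messages_by_user = defaultdict(list)
for user_id, message_id in rows:
    messages_by_user[user_id].append(message_id)

print(dict(messages_by_user))  # {1: [101, 102, 103], 2: [201, 202]}
```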

If you are not on Postgresql, or are on a Postgresql version that does not support LATERAL, the correlation has to be worked into the ON clause of the join instead. The SQL looks like this:

select * from users left outer join messages on
users.id = messages.user_id and messages.date > (select date from messages where messages.user_id = users.id order by date desc limit 1 offset 10)

      

Here, to sneak the LIMIT in, we actually skip past the first 10 rows with OFFSET and then do LIMIT 1 to get the date that represents the lower bound of the dates we want for each user. We then join on a comparison with that date, which can be expensive if the column is not indexed, and can also be inaccurate if there are duplicate dates.

This query looks like this:

subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))
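The OFFSET trick can be sanity-checked in plain Python, using the same 19 dates per user that the POC below generates (the list itself is stand-in data):

```python
import datetime

# 19 distinct dates, March 1-19, like each user's messages in the POC.
dates = [datetime.date(2017, 3, d) for d in range(1, 20)]

# LIMIT 1 OFFSET 10 on a newest-first ordering returns the 11th-newest
# date; everything strictly newer than that bound is exactly the top ten.
newest_first = sorted(dates, reverse=True)
bound = newest_first[10]
top_ten = [d for d in dates if d > bound]

print(bound, len(top_ten))  # 2017-03-09 10
```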

      

Queries like these are the kind I don't trust without a good test, so both versions are presented in the POC below, including a sanity check.

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base
import datetime

Base = declarative_base()


class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    messages = relationship(
        'Messages', order_by='desc(Messages.date)')

class Messages(Base):
    __tablename__ = 'message'
    id = Column(Integer, primary_key=True)
    user_id = Column(ForeignKey('user.id'))
    date = Column(Date)

e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
Base.metadata.drop_all(e)
Base.metadata.create_all(e)

s = Session(e)

s.add_all([
    User(id=i, messages=[
        Messages(id=(i * 20) + j, date=datetime.date(2017, 3, j))
        for j in range(1, 20)
    ]) for i in range(1, 51)
])

s.commit()

top_ten_dates = set(datetime.date(2017, 3, j) for j in range(10, 20))


def run_test(q):
    all_u = q.all()
    assert len(all_u) == 50
    for u in all_u:

        messages = u.messages
        assert len(messages) == 10

        for m in messages:
            assert m.user_id == u.id

        received = set(m.date for m in messages)

        assert received == top_ten_dates

# version 1.   no LATERAL

s.close()

subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))

run_test(q)

# version 2.  LATERAL

s.close()

subq = s.query(Messages).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
    options(contains_eager(User.messages, alias=subq))

run_test(q)

      

+1




If you apply a limit and then call .all() on it, you get all the objects at once; it won't fetch the objects one by one, causing the performance problems you mentioned. Just apply the limit and get all the objects:

users = User.query.limit(50).all()
print(len(users))  # 50

      



Or, for child objects / relationships:

user = User.query.one()
all_messages = user.messages.limit(10).all()


users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()

      

0




So, I think you will need to load the messages in a second query and then associate them with your users afterwards. Exactly how is database-dependent; as in this question, MySQL doesn't support LIMIT inside such subqueries, but sqlite will at least parse the query. I didn't look at the plan to see whether it executes it well. The following code finds all the message objects you are interested in; you then need to associate them with users.
I checked this only to confirm that it produces a query sqlite can parse; I have not confirmed that sqlite or any other database actually does the right thing with it. I had to cheat a bit and use the text() construct to refer to the outer table's users.id in the inner select, because SQLAlchemy kept wanting to include an extra join to users in the inner subquery.

from sqlalchemy import Column, Integer, String, ForeignKey, alias
from sqlalchemy.sql import text

from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key = True)
    name = Column(String)

class Message(Base):
    __tablename__ = 'messages'
    user_id = Column(Integer, ForeignKey(User.id), nullable = False)
    id = Column(Integer, primary_key = True)


s = Session()
m1 = alias(Message.__table__)

user_query = s.query(User) # add any user filtering you want
inner_query = s.query(m1.c.id).filter(m1.c.user_id == text('users.id')).limit(10)
all_messages_you_want = s.query(Message).join(User).filter(Message.id.in_(inner_query))

      

To associate the messages with users, you could do something like the following, assuming your Message has a user relationship and your User objects have a got_child_message method that does whatever you like with the message:

users_resulting = user_query.all() #load objects into session and hold a reference
for m in all_messages_you_want: m.user.got_child_message(m)

      

Since you already have the users in the session, and since the relationship is against the user's primary key, m.user resolves as a query.get against the identity map. Hope this helps.

0




@Melchoirs answer above is the best; I am mostly posting this here for posterity.

I experimented with the answer above and it works. I needed it mainly to limit the number of associations returned before passing them to a Marshmallow serializer.

Some clarifications:

  • the subquery is executed per association, which is how it finds the matching date for the correct parent
  • think of the limit/offset as "give me 1 (limit) record starting after the first X (offset)". It therefore finds the Xth-newest record, and the main query then returns everything newer than that. It's damn clever
  • note that if the association has fewer than X records, it returns nothing, because the offset is greater than the number of records, and hence the main query finds no record

Using the above as a template, I came up with the answer below. The initial query/count guard exists because, if there are fewer related records than the offset, nothing is found. I also needed an outer join for the case where there are no associations at all.
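The guard covers the edge case from the last bullet; in miniature, with a stand-in list rather than the ORM:

```python
# Only 4 records (newest first), but the offset skips 10: the subquery
# yields no bound row at all, so the date comparison matches nothing.
records = [5, 4, 3, 2]
offset = 10

bound_rows = records[offset:offset + 1]  # what LIMIT 1 OFFSET 10 returns
print(bound_rows)  # []
```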

In the end I found this query a bit of ORM voodoo and didn't want to go that route. Instead, I exclude the device's histories from the serializer and require a second lookup of the history using the device ID. That set can be paginated, and it keeps things a little cleaner.

Both methods work; it comes down to why you need a single query versus a pair. In the description above there may have been business reasons for the more efficient single query. For my use case, readability and convention won out over the voodoo.

@classmethod
def get_limited_histories(cls, uuid, limit=10):
    count = DeviceHistory.query.filter(DeviceHistory.device_id == uuid).count()

    # guard: with fewer records than the offset, the subquery returns
    # no bound row and the main query would find nothing
    if count <= limit:
        return Device.query.get(uuid)

    sq = db.session.query(DeviceHistory.created_at) \
        .filter(DeviceHistory.device_id == Device.uuid) \
        .order_by(DeviceHistory.created_at.desc()) \
        .limit(1).offset(limit).correlate(Device)

    return db.session.query(Device).filter(Device.uuid == uuid) \
        .outerjoin(DeviceHistory,
            and_(DeviceHistory.device_id == Device.uuid, DeviceHistory.created_at > sq)) \
        .options(contains_eager(Device.device_histories)).all()[0]


      

It then behaves like Device.query.get(id), but as Device.get_limited_histories(id).

Enjoy!
0








