SQLAlchemy is much slower if no primary key is specified

I am using SQLAlchemy version 1.2.0b1.

Given a table that looks like this:

class Company(Base):
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True, autoincrement=True)
    cik = Column(String(10), nullable=False, index=True, unique=True)
    name = Column(String(71), nullable=False)


When I insert new rows into the table and specify the id explicitly,

company = Company()
company.id = counter
company.cik = ...
company.name = ...


the program is very fast. The INSERT statements SQLAlchemy issues to the server are a single bulk insert.

If I omit the id and rely on the database to generate a unique id,

company=Company()
company.cik = ...
company.name = ...


the code becomes extremely slow, and the engine's echo output shows that SQLAlchemy issues a separate INSERT statement for every single Company object. There is no bulk insert.

Is there a way to avoid this behavior while still relying on the database to generate the IDs?
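For what it's worth, the asymmetry can be reproduced outside SQLAlchemy with the stdlib sqlite3 driver (hypothetical table and values): a batched executemany cannot hand back a generated key for each row, so an ORM that needs those keys has to fall back to one INSERT per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company "
             "(id INTEGER PRIMARY KEY AUTOINCREMENT, cik TEXT, name TEXT)")

# Fast path: one batched statement -- but no per-row generated ids come back.
rows = [("0000000001", "Acme"), ("0000000002", "Globex")]
conn.executemany("INSERT INTO company (cik, name) VALUES (?, ?)", rows)

# To learn each generated id, the driver must insert row by row and read
# lastrowid after every statement -- the per-row pattern the ORM falls into
# when the primary key is database-generated.
cur = conn.execute("INSERT INTO company (cik, name) VALUES (?, ?)",
                   ("0000000003", "Initech"))
print(cur.lastrowid)  # 3
```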



1 answer


What I ended up doing was staging the data load. First, I create a structural copy of the table that I plan to load data into, following the recommendation in "sqlalchemy build a new declarative class from existing ones":

from sqlalchemy import Column, Table

def mutate_declarative(source):
    # Collect the column definitions of the source class, minus the
    # housekeeping columns we don't want in the staging table.
    columns = []
    omit_columns = ['created_at', 'updated_at']
    for c in source.__table__.c:
        if c.name not in omit_columns:
            columns.append(((c.name, c.type),
                            {'primary_key': c.primary_key,
                             'nullable': c.nullable,
                             'doc': c.doc,
                             'default': c.default,
                             'unique': c.unique,
                             'autoincrement': c.autoincrement}))

    class Stage(get_base()):
        original = source
        __tablename__ = source.__tablename__ + '_staging'
        __table__ = Table(source.__tablename__ + '_staging',
                          get_base().metadata,
                          *[Column(*c[0], **c[1]) for c in columns])

    return Stage

def create_staging_table(source):
    new_class = mutate_declarative(source)
    engine = get_base().metadata.bind
    new_class.__table__.drop(engine, checkfirst=True)
    new_class.__table__.create(engine)
    return new_class


def drop_staging_table(source):
    engine = get_base().metadata.bind
    source.__table__.drop(engine, checkfirst=True)
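The staging pattern itself is independent of SQLAlchemy; a minimal stdlib sqlite3 sketch (hypothetical schema) shows the two phases -- a batched load into the staging table with application-generated keys, then one set-based transfer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, cik TEXT UNIQUE, name TEXT);
    CREATE TABLE company_staging (id INTEGER PRIMARY KEY, cik TEXT UNIQUE, name TEXT);
""")

# Phase 1: keys are generated in application code, so executemany can batch.
rows = [(i, "%010d" % i, "Company %d" % i) for i in range(1, 4)]
conn.executemany("INSERT INTO company_staging (id, cik, name) VALUES (?, ?, ?)", rows)

# Phase 2: a single set-based INSERT ... SELECT moves the non-key columns to
# the main table, letting the database generate the real ids.
conn.execute("INSERT INTO company (cik, name) SELECT cik, name FROM company_staging")
moved = conn.execute("SELECT COUNT(*) FROM company").fetchone()[0]
print(moved)  # 3
```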




The code above lets me quickly create a blank staging table and use it as temporary storage to load my data with keys generated in code. As I showed in the original question, this mode is comparatively fast. After that, the data from the staging table must be transferred to the main table. The catch is that existing rows must be reconciled with the staged rows. This can be done using the ON DUPLICATE KEY UPDATE clause supported by MySQL. Unfortunately, SQLAlchemy does not support it out of the box. To work around that, I followed the recommendations in "SQLAlchemy ON DUPLICATE KEY UPDATE":

def move_data_from_staging_to_main(session, staging, attempt_update=False):
    # attempt_update controls whether new data should overwrite existing data:
    # if attempt_update is True, existing rows are overwritten with staged data;
    # otherwise the presence of conflicting existing rows results in an error.
    main_table = staging.original.__table__
    staged_table = staging.__table__
    column_list = [c for c in staged_table.columns if not c.primary_key]

    staged_data = staged_table.select()
    staged_data_1 = staged_data.with_only_columns(column_list).alias("subquery1")

    if attempt_update:
        # SQLAlchemy (1.2) has no ON DUPLICATE KEY UPDATE support, so we extend
        # the compiler; see "ON DUPLICATE KEY UPDATE in the SQL statement" and
        # "SQLAlchemy ON DUPLICATE KEY UPDATE" on Stack Overflow.
        from sqlalchemy.ext.compiler import compiles
        from sqlalchemy.sql.expression import Insert

        # The custom compilation step simply appends the string we provide as a
        # parameter to the end of the generated INSERT.
        @compiles(Insert, "mysql")
        def append_string(insert, compiler, **kw):
            s = compiler.visit_insert(insert, **kw)
            # Check both that the parameter is present AND that its value is
            # not None: "mysql_appendstring" lingers in kwargs once registered,
            # which is why the additional test for None is necessary.
            if ('mysql_appendstring' in insert.kwargs) and insert.kwargs['mysql_appendstring']:
                return s + " " + insert.kwargs['mysql_appendstring']
            return s

        # Register the custom argument on Insert constructs; this silences the
        # "unknown dialect argument" warning.
        Insert.argument_for("mysql", "appendstring", None)

        # Build the "ON DUPLICATE KEY UPDATE a=VALUES(a), b=VALUES(b), ..."
        # tail that turns the plain INSERT into an insert-or-update-if-exists.
        value_string = ' ON DUPLICATE KEY UPDATE '
        value_string += ', '.join(
            '{0}=VALUES({0})'.format(c.name) for c in staged_data_1.columns)

        insert = main_table.insert(mysql_appendstring=value_string).from_select(
            [c.name for c in staged_data_1.columns],
            staged_data_1.select()
        )
    else:
        insert = main_table.insert().from_select(
            [c.name for c in staged_data_1.columns],
            staged_data_1.select()
        )
    session.execute(insert)
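The appended clause turns the compiled statement into a MySQL upsert. Its effect can be sketched with the stdlib sqlite3 driver, whose analogous ON CONFLICT ... DO UPDATE clause (SQLite 3.24+) plays the same role (hypothetical data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company "
             "(id INTEGER PRIMARY KEY, cik TEXT UNIQUE, name TEXT)")
conn.execute("INSERT INTO company (cik, name) VALUES ('0000320193', 'Apple Computer')")

# Insert-or-update on a unique-key conflict: the staged row wins and the row
# count stays the same -- the effect ON DUPLICATE KEY UPDATE gives on MySQL.
conn.execute(
    "INSERT INTO company (cik, name) VALUES ('0000320193', 'Apple Inc.') "
    "ON CONFLICT(cik) DO UPDATE SET name = excluded.name"
)
name = conn.execute(
    "SELECT name FROM company WHERE cik = '0000320193'").fetchone()[0]
print(name)  # Apple Inc.
```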

