Renaming columns when reading a SQLAlchemy query into a pandas DataFrame

Is there a way to preserve the SQLAlchemy attribute names when reading query results into a pandas DataFrame?

Here's a simple mapping of my database. For the School table, I renamed "SchoolDistrict", the database column name, to the shorter "district". I am several layers removed from the DBA, so changing the names at the source is not feasible.

class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


class StudentScore(Base):
    __tablename__ = 'FactStudentScore'

    SchoolKey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key = True)
    PointsPossible = Column('PointsPossible', Integer)
    PointsReceived = Column('PointsReceived', Integer)

    school = relationship("School", backref='studentscore')

      

So, when I run a query like:

query = session.query(StudentScore, School).join(School)
df = pd.read_sql(query.statement, query.session.bind)

      

I end up with the "SchoolDistrict" name for the column, not my attribute name, in the returned DataFrame df.

EDIT: An even more frustrating case is when the same column names exist in several tables. For example:

class Teacher(Base):
    __tablename__ = 'DimTeacher'

    id = Column('TeacherKey', Integer, primary_key=True)
    fname = Column('FirstName', String)
    lname = Column('LastName', String)

class Student(Base):
    __tablename__ = 'DimStudent'

    id = Column('StudentKey', Integer, primary_key=True)
    fname = Column('FirstName', String)
    lname = Column('LastName', String)

      

Thus, a query across both tables (for example, the one below) produces a DataFrame with duplicate FirstName and LastName columns.

query = session.query(StudentScore, Student, Teacher).join(Student).join(Teacher)

      

Can I rename these columns at query time? Right now I'm having trouble keeping my head straight with these two column naming systems.

2 answers


This is a solution I would complain bitterly about if I had to maintain the code afterwards. But your question has so many constraints that I cannot find anything better.

First, you create dictionaries that map the schema names to the class names using introspection, like this (I'm using the first example you posted):

In [132]:

def add_to_dict(c_map, t_map, table):
    name = table.__tablename__
    t_map[name] = table.__name__
    #print name
    c_map[name] = {}
    for column in dir(table):
        c_schema_name = table.__mapper__.columns.get(column)
        if isinstance(c_schema_name, Column):
            #print column, c_schema_name.name
            c_map[name][c_schema_name.name] = column

c_map = {}
t_map = {}
add_to_dict(c_map, t_map, School)
add_to_dict(c_map, t_map, StudentScore)
print(c_map['DimSchool']['SchoolKey'])
print(c_map['FactStudentScore']['SchoolKey'])
print(t_map['DimSchool'])
id
SchoolKey
School

      

[EDIT: clarifications on how the dictionaries are built with introspection

  • c_map - dictionary mapping database column names to attribute names, per table
  • t_map - dictionary mapping table names to class names
  • add_to_dict must be called once for every table class
  • for table names the mapping is easy, since they are just attributes of the table class
  • for class column names, first iterate over the class attributes with dir
  • for each of the class attributes (which will include the table columns, but also many others) try to get the database column using the sqlalchemy mapper
  • the mapper will only return a Column object if the attribute is indeed a column
  • thus, for Column objects, add them to the column names dictionary. The database name is obtained with .name and the other name is just the attribute

Run this once per table class, right after all the mapped classes have been defined. ]
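As a possible alternative to scanning dir(), the same column map could be built from the mapper's column attributes. The following is only a sketch: it assumes SQLAlchemy 1.4 or later (for sqlalchemy.orm.declarative_base), and column_name_map is a hypothetical helper name, not part of the answer above.

```python
from sqlalchemy import Column, Integer, String, inspect
from sqlalchemy.orm import declarative_base  # assumption: SQLAlchemy 1.4+

Base = declarative_base()


class School(Base):
    __tablename__ = "DimSchool"

    id = Column("SchoolKey", Integer, primary_key=True)
    name = Column("SchoolName", String)
    district = Column("SchoolDistrict", String)


def column_name_map(model):
    """Map database column names to attribute names (hypothetical helper)."""
    mapper = inspect(model)
    # column_attrs yields only column-based attributes, so neither a dir()
    # scan nor an isinstance check is needed.
    return {prop.columns[0].name: prop.key for prop in mapper.column_attrs}


school_map = column_name_map(School)
print(school_map)
```

The result plays the role of c_map['DimSchool'] in the answer's code.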



Then you take the sql statement and create a list of the column translations you are going to get:

In [134]:

df_columns = []
# Parse the SELECT list of the generated SQL, one "table.column" at a time
for column in str(query.statement).split('FROM')[0].split('SELECT')[1].split(','):
    table = column.split('.')[0].replace('"', '').strip()
    c_schema = column.split('.')[1].replace('"', '').strip()
    df_columns += [t_map[table] + '.' + c_map[table][c_schema]]
print(df_columns)
['StudentScore.SchoolKey', 'StudentScore.PointsPossible', 'StudentScore.PointsReceived', 'School.id', 'School.name', 'School.district']
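Splitting the SQL text on 'FROM', ',' and '.' depends on the exact shape of the generated statement. A slightly more tolerant variant might use a regular expression instead. This is a sketch only: t_map and c_map are hand-written stand-ins for the dictionaries built earlier, the SQL string is typed by hand rather than produced by SQLAlchemy, and translate_select_list is a hypothetical name.

```python
import re

# Hand-written stand-ins for the dictionaries built by add_to_dict.
t_map = {"DimSchool": "School", "FactStudentScore": "StudentScore"}
c_map = {
    "DimSchool": {"SchoolKey": "id", "SchoolName": "name", "SchoolDistrict": "district"},
    "FactStudentScore": {
        "SchoolKey": "SchoolKey",
        "PointsPossible": "PointsPossible",
        "PointsReceived": "PointsReceived",
    },
}

# Matches table.column, with optional double quotes around either part.
QUALIFIED = re.compile(r'"?(\w+)"?\."?(\w+)"?')


def translate_select_list(sql):
    """Map each table.column in the SELECT list to 'Class.attribute'."""
    # Only the text before the first FROM keyword is the SELECT list.
    select_list = re.split(r"\bFROM\b", sql, maxsplit=1)[0]
    return [
        t_map[table] + "." + c_map[table][column]
        for table, column in QUALIFIED.findall(select_list)
    ]


# Hand-typed statement in the style SQLAlchemy emits for mixed-case names.
sql = (
    'SELECT "FactStudentScore"."SchoolKey", "FactStudentScore"."PointsPossible", '
    '"DimSchool"."SchoolKey", "DimSchool"."SchoolDistrict" '
    'FROM "FactStudentScore" JOIN "DimSchool" ON ...'
)
names = translate_select_list(sql)
print(names)
```

It still parses SQL text, so it shares the maintenance concern raised at the top of this answer.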

      

Finally, you read the dataframe as in your question and change the column names:

In [137]:

df.columns = df_columns
In [138]:

df
Out[138]:
StudentScore.SchoolKey  StudentScore.PointsPossible StudentScore.PointsReceived School.id   School.name School.district
0   1   1   None    1   School1 None

      

(The data is just a dummy record I created.)

Hope it helps!



I am by no means a SQLAlchemy expert, but I found a more generalized solution (or at least a start).

Cautions

  • Will not handle mapped columns with the same name across different models. You can solve this by adding a suffix, or you can modify my answer below to create pandas columns like <tablename/model name>.<mapper column name>.

It includes four key steps:

  1. Refine the query statement with labels, which results in pandas column names of the form <table name>_<column name>:

df = pd.read_sql(query.with_labels().statement, query.session.bind)

      

  2. Separate the table name from the (actual) column name:

table_name, col = col_name.split('_', 1)

      

  3. Get the model based on the table name (taken from an answer to a related question):

for c in Base._decl_class_registry.values():
    if hasattr(c, '__tablename__') and c.__tablename__ == table_name:
        return c

      

  4. Find the corresponding mapped attribute name:

for k, v in sa_class.__mapper__.columns.items():
    if v.name == col:
        return k
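The split in step 2 assumes the table name itself contains no underscore. One way around that, sketched here with plain dictionaries, is to match the longest known table name that prefixes the label. The table "Fact_Student_Score" and the helper names split_label and attribute_for are hypothetical and exist only for illustration.

```python
# Hand-written stand-in for {table name: {database column: attribute name}}.
c_map = {
    "DimSchool": {"SchoolKey": "id", "SchoolName": "name"},
    "Fact_Student_Score": {"Points_Possible": "points_possible"},  # hypothetical table
}


def split_label(label, table_names):
    """Split '<table>_<column>' using the longest table name that prefixes it."""
    for table in sorted(table_names, key=len, reverse=True):
        prefix = table + "_"
        if label.startswith(prefix):
            return table, label[len(prefix):]
    raise KeyError("no known table name prefixes " + repr(label))


def attribute_for(label):
    """Translate a labeled column name into the mapped attribute name."""
    table, column = split_label(label, c_map)
    return c_map[table][column]


first = attribute_for("DimSchool_SchoolKey")
second = attribute_for("Fact_Student_Score_Points_Possible")
print(first, second)
```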

      

Putting it all together, this is the solution I came up with. The major caveat is that you will get duplicate column names in your DataFrame if (as is likely) you have duplicate mapped names between classes.

import pandas as pd
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


class StudentScore(Base):
    __tablename__ = 'FactStudentScore'

    SchoolKey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key = True)
    PointsPossible = Column('PointsPossible', Integer)
    PointsReceived = Column('PointsReceived', Integer)

    school = relationship("School", backref='studentscore')


def mapped_col_name(col_name):
    ''' Translates a labeled column name (as given by pandas.read_sql)
    into the attribute name on the mapped Model
    '''

    def sa_class(table_name):
        # _decl_class_registry is a private attribute, so it may not be
        # available in every SQLAlchemy version
        for c in Base._decl_class_registry.values():
            if hasattr(c, '__tablename__') and c.__tablename__ == table_name:
                return c

    table_name, col = col_name.split('_', 1)
    model = sa_class(table_name)

    for k, v in model.__mapper__.columns.items():
        if v.name == col:
            return k

# `session` is assumed to be an existing Session bound to the database
query = session.query(StudentScore, School).join(School)
# Labels are applied to the query, not to the resulting DataFrame
df = pd.read_sql(query.with_labels().statement, query.session.bind)
df.columns = [mapped_col_name(c) for c in df.columns]
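The caution at the top of this answer suggests column names of the form <model name>.<mapper column name>. One way that might be done is to label every mapped column at query time, so nothing has to be translated afterwards. This is a sketch under assumptions: it targets SQLAlchemy 1.4 or later, uses an in-memory SQLite engine purely for illustration, reuses the Teacher and Student classes from the question (with lname mapped to 'LastName'), and labeled_columns is a hypothetical helper name.

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base, sessionmaker  # assumption: SQLAlchemy 1.4+

Base = declarative_base()


class Teacher(Base):
    __tablename__ = "DimTeacher"

    id = Column("TeacherKey", Integer, primary_key=True)
    fname = Column("FirstName", String)
    lname = Column("LastName", String)


class Student(Base):
    __tablename__ = "DimStudent"

    id = Column("StudentKey", Integer, primary_key=True)
    fname = Column("FirstName", String)
    lname = Column("LastName", String)


def labeled_columns(*models):
    """Return each mapped column labeled '<Model>.<attribute>' (hypothetical helper)."""
    cols = []
    for model in models:
        for prop in inspect(model).column_attrs:
            label = "{}.{}".format(model.__name__, prop.key)
            cols.append(getattr(model, prop.key).label(label))
    return cols


# In-memory SQLite database, used only so the sketch is self-contained.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Joins are omitted here; only the column labels are being demonstrated.
query = session.query(*labeled_columns(Student, Teacher))
labels = [d["name"] for d in query.column_descriptions]
print(labels)
```

Passing query.statement to pd.read_sql should then yield these labels as the DataFrame column names, which keeps Student.fname and Teacher.fname distinct.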

      
