In response to my question How to speed up data wrangling A LOT in Python + Pandas + sqlAlchemy + MSSQL/T-SQL, I was pointed by @IljaEverilä to Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC.
NB: for test purposes I am only reading/writing 10k rows.
I added the event listener and a) the function is called, but b) evidently executemany is not set, since the IF fails and cursor.fast_executemany is never set.
import sqlalchemy as sqla
from sqlalchemy import event

def namedDbSqlAEngineCreate(dbName):
    # Create an engine and switch to the named db
    # returns the engine if successful and None if not
    # 2018-08-23 added fast_executemany according to this https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc?rq=1
    engineStr = 'mssql+pyodbc://@' + defaultDSN  # defaultDSN is defined elsewhere in the module
    engine = sqla.create_engine(engineStr, echo=False)

    @event.listens_for(engine, 'before_cursor_execute')
    def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
        # print("FUNC call")
        if executemany:
            print('executemany')
            cursor.fast_executemany = True

    try:
        engine.execute('USE ' + dbName)
        return engine
    except sqla.exc.SQLAlchemyError as ex:
        if ex.orig.args[0] == '08004':
            print('namedDbSqlAEngineCreate:Database %s does not exist' % dbName)
        else:
            print(ex.args[0])
        return None
Naturally, there was no change in speed.
The to_sql call from my original question is unchanged:
nasToFillDF.to_sql(name=tempTableName, con=engine.engine, if_exists='replace', chunksize=100, index=False)
because when I tried, per that example, setting chunksize=None, I received the error message (which I had encountered before):
(pyodbc.ProgrammingError) ('The SQL contains -31072 parameter markers,
but 100000 parameters were supplied', 'HY000')
What am I doing wrong? I presume the executemany parameter of receive_before_cursor_execute is not being set, but if that is the answer, I have no idea how to fix it.
Setup is pyodbc 4.0.23, sqlAlchemy 1.2.6, Python 3.6.something.
Solution:
The error you are receiving is caused by a change introduced in Pandas version 0.23.0, reverted in 0.23.1, and reintroduced in 0.24.0, as explained here. The generated VALUES clause contains 100,000 parameter markers, and the count seems to be stored in a signed 16-bit integer, so it overflows and you get the funny
The SQL contains -31072 parameter markers, but 100000 parameters were supplied
You can check this for yourself:
In [16]: 100000 % (2 ** 16) - 2 ** 16
Out[16]: -31072
If you wish to keep using Pandas as-is, you will have to calculate and provide a suitable chunksize value, such as the 100 you were using, taking into account both the maximum row limit of 1,000 for the VALUES clause and the maximum parameter limit of 2,100 for stored procedures. The details are again explained in the linked Q/A.
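For instance, a safe chunk size can be derived from the column count; a minimal sketch, assuming the SQL Server limits above (safe_chunksize is a hypothetical helper, not a Pandas API):

def safe_chunksize(df, index=True):
    # One parameter marker per column, plus one for the index if it is written.
    params_per_row = len(df.columns) + (1 if index else 0)
    # Stay strictly below the 2,100-parameter limit (hence the -1) and within
    # the 1,000-row VALUES limit; for the 10-parameter frame used in the
    # comparison below this yields 209.
    return min(1000, (2100 - 1) // params_per_row)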
Before the change, Pandas would always use executemany() when inserting data. Newer versions detect whether the dialect in use supports the VALUES clause in INSERT. This detection happens in SQLTable.insert_statement() and cannot be controlled, which is a shame since PyODBC fixed their executemany() performance, given the right flag is enabled.
In order to force Pandas to use executemany() with PyODBC again, SQLTable has to be monkeypatched:
import pandas.io.sql

def insert_statement(self, data, conn):
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement
This will be horribly slow if the Cursor.fast_executemany flag is not set, so remember to set the proper event handler.
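A minimal standalone version of that handler, following the same pattern as the function in the question (the DSN name here is a placeholder):

from sqlalchemy import create_engine, event

engine = create_engine('mssql+pyodbc://@mydsn')  # 'mydsn' is a placeholder DSN

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        # Enable pyodbc's bulk insert mode before the statement executes.
        cursor.fast_executemany = True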
Here is a simple performance comparison, using the following dataframe:
In [12]: df = pd.DataFrame({f'X{i}': range(1000000) for i in range(9)})
Vanilla Pandas 0.24.0:
In [14]: %time df.to_sql('foo', engine, chunksize=209)
CPU times: user 2min 9s, sys: 2.16 s, total: 2min 11s
Wall time: 2min 26s
Monkeypatched Pandas with fast executemany enabled:
In [10]: %time df.to_sql('foo', engine, chunksize=500000)
CPU times: user 12.2 s, sys: 981 ms, total: 13.2 s
Wall time: 38 s