The Question :
460 people think this question is useful
What the difference is between
commit() in SQLAlchemy?
I’ve read the docs, but am none the wiser – they seem to assume a pre-understanding that I don’t have.
I’m particularly interested in their impact on memory usage. I’m loading some data into a database from a series of files (around 5 million rows in total) and my session is occasionally falling over – it’s a large database and a machine with not much memory.
I’m wondering if I’m using too many
commit() and not enough
flush() calls – but without really understanding what the difference is, it’s hard to tell!
The Question Comments :
The Answer 1
593 people think this answer is useful
A Session object is basically an ongoing transaction of changes to a database (update, insert, delete). These operations aren’t persisted to the database until they are committed (if your program aborts for some reason in mid-session transaction, any uncommitted changes within are lost).
The session object registers transaction operations with
session.add(), but doesn’t yet communicate them to the database until
session.flush() is called.
session.flush() communicates a series of operations to the database (insert, update, delete). The database maintains them as pending operations in a transaction. The changes aren’t persisted permanently to disk, or visible to other transactions until the database receives a COMMIT for the current transaction (which is what
session.commit() commits (persists) those changes to the database.
flush() is always called as part of a call to
When you use a Session object to query the database, the query will return results both from the database and from the flushed parts of the uncommitted transaction it holds. By default, Session objects
autoflush their operations, but this can be disabled.
Hopefully this example will make this clearer:
s = Session()
s.add(Foo('A')) # The Foo('A') object has been added to the session.
# It has not been committed to the database yet,
# but is returned as part of a query.
print 1, s.query(Foo).all()
s2 = Session()
s2.autoflush = False
print 2, s2.query(Foo).all() # The Foo('B') object is *not* returned
# as part of this query because it hasn't
# been flushed yet.
s2.flush() # Now, Foo('B') is in the same state as
# Foo('A') was above.
print 3, s2.query(Foo).all()
s2.rollback() # Foo('B') has not been committed, and rolling
# back the session's transaction removes it
# from the session.
print 4, s2.query(Foo).all()
3 [<Foo('A')>, <Foo('B')>]
The Answer 2
27 people think this answer is useful
As @snapshoe says
flush() sends your SQL statements to the database
commit() commits the transaction.
session.autocommit == False:
commit() will call
flush() if you set
autoflush == True.
session.autocommit == True:
You can’t call
commit() if you haven’t started a transaction (which you probably haven’t since you would probably only use this mode to avoid manually managing transactions).
In this mode, you must call
flush() to save your ORM changes. The flush effectively also commits your data.
The Answer 3
12 people think this answer is useful
Why flush if you can commit?
As someone new to working with databases and sqlalchemy, the previous answers – that
flush() sends SQL statements to the DB and
commit() persists them – were not clear to me. The definitions make sense but it isn’t immediately clear from the definitions why you would use a flush instead of just committing.
Since a commit always flushes (https://docs.sqlalchemy.org/en/13/orm/session_basics.html#committing) these sound really similar. I think the big issue to highlight is that a flush is not permanent and can be undone, whereas a commit is permanent, in the sense that you can’t ask the database to undo the last commit (I think)
@snapshoe highlights that if you want to query the database and get results that include newly added objects, you need to have flushed first (or committed, which will flush for you). Perhaps this is useful for some people although I’m not sure why you would want to flush rather than commit (other than the trivial answer that it can be undone).
In another example I was syncing documents between a local DB and a remote server, and if the user decided to cancel, all adds/updates/deletes should be undone (i.e. no partial sync, only a full sync). When updating a single document I’ve decided to simply delete the old row and add the updated version from the remote server. It turns out that due to the way sqlalchemy is written, order of operations when committing is not guaranteed. This resulted in adding a duplicate version (before attempting to delete the old one), which resulted in the DB failing a unique constraint. To get around this I used
flush() so that order was maintained, but I could still undo if later the sync process failed.
See my post on this at: Is there any order for add versus delete when committing in sqlalchemy
Similarly, someone wanted to know whether add order is maintained when committing, i.e. if I add
object1 then add
object1 get added to the database before
Does SQLAlchemy save order when adding objects to session?
Again, here presumably the use of a flush() would ensure the desired behavior. So in summary, one use for flush is to provide order guarantees (I think), again while still allowing yourself an “undo” option that commit does not provide.
Autoflush and Autocommit
Note, autoflush can be used to ensure queries act on an updated database as sqlalchemy will flush before executing the query. https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autoflush
Autocommit is something else that I don’t completely understand but it sounds like its use is discouraged:
Now the original question actually wanted to know about the impact of flush vs. commit for memory purposes. As the ability to persist or not is something the database offers (I think), simply flushing should be sufficient to offload to the database – although committing shouldn’t hurt (actually probably helps – see below) if you don’t care about undoing.
sqlalchemy uses weak referencing for objects that have been flushed: https://docs.sqlalchemy.org/en/13/orm/session_state_management.html#session-referencing-behavior
This means if you don’t have an object explicitly held onto somewhere, like in a list or dict, sqlalchemy won’t keep it in memory.
However, then you have the database side of things to worry about. Presumably flushing without committing comes with some memory penalty to maintain the transaction. Again, I’m new to this but here’s a link that seems to suggest exactly this: https://stackoverflow.com/a/15305650/764365
In other words, commits should reduce memory usage, although presumably there is a trade-off between memory and performance here. In other words, you probably don’t want to commit every single database change, one at a time (for performance reasons), but waiting too long will increase memory usage.
The Answer 4
8 people think this answer is useful
This does not strictly answer the original question but some people have mentioned that with
session.autoflush = True you don’t have to use
session.flush()… And this is not always true.
If you want to use the id of a newly created object in the middle of a transaction, you must call
# Given a model with at least this id
id = Column(Integer, primary_key=True) # autoincrement by default on integer primary key
session.autoflush = True
a = AModel()
a.id # None
a.id # autoincremented integer
This is because
autoflush does NOT auto fill the id (although a query of the object will, which sometimes can cause confusion as in “why this works here but not there?” But snapshoe already covered this part).
One related aspect that seems pretty important to me and wasn’t really mentioned:
Why would you not commit all the time? – The answer is atomicity.
A fancy word to say: an ensemble of operations have to all be executed successfully OR none of them will take effect.
For example, if you want to create/update/delete some object (A) and then create/update/delete another (B), but if (B) fails you want to revert (A). This means those 2 operations are atomic.
Therefore, if (B) needs a result of (A), you want to call
flush after (A) and
commit after (B).
session.autoflush is True, except for the case that I mentioned above or others in Jimbo‘s answer, you will not need to call
The Answer 5
2 people think this answer is useful
Use flush when you need to write, for example to get a primary key ID from an autoincrementing counter.
john=Person(name='John Smith', parent=None)
son=Person(name='Bill Smith', parent=john.id)
Without flushing, john would never get an ID from the DB and so couldn’t represent the parent/child relationship in code.
Like others have said, without
commit() none of this will be permanently persisted to DB.