Patchwork is a handy system that takes patches sent to a mailing list and gives you a nice web interface to track their progress through review and into the respective upstream project. It's used heavily in kernel development, and I'm one of the Patchwork maintainers.
We have had some annoying bugs floating around for a while now - occasionally, patches to the Buildroot mailing list just seem to go missing.
Andrew, an OzLabber and Patchwork contributor, suggested that there might be race conditions at play, as the mail server might spin up multiple process to parse messages - and these processes could run simultaneously.
I have spent some time chasing down the bugs and recently sent a patch series which I hope will clean the up. I think the bugs are a pretty interesting example of how database processing can go wrong in interesting ways when you have unexpected parallelism.
First, some background.
There's a very straightforward TOCTTOU bug with creating Persons.
What we would do is:
This would abort processing of the patch, leading Patchwork to miss it.
We can reduce the window for the race to occur - and we did. But this isn't a full fix - you could still get very unlucky. A transaction won't save you here either. What we need to do is embrace failure: catch the integrity error, accept we lost the race, and use the winner's entry.
_mysql_exceptions.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')
Looking at the backtrace, it was thrown when we created a patch like this:
patch = Patch(...)
patch.save()
To understand why this was breaking, I looked at the SQL statements that were being generated every time a patch was created. They were weird:
UPDATE "patchwork_patch" SET ...
INSERT INTO "patchwork_patch" ("submission_ptr_id", "diff", "commit_ref", "pull_url", "delegate_id", "state_id", "archived", "hash") VALUES (...)
The update could never work, because it was trying to update a patch that didn't exist yet. My hypothesis is that Django somehow didn't quite know that because of the backend complexity of the Patch model, so it tried to do an update, failed, and then tried an insert.
To fix this, we can easily just change the code around:
patch = Patch.objects.create(...)
Sure enough, this made the UPDATE statements - and the weird MySQL errors - go away.
So, in conclusion, if you do complex database processing, check if your service is safe to be run in parallel!
We have had some annoying bugs floating around for a while now - occasionally, patches to the Buildroot mailing list just seem to go missing.
Andrew, an OzLabber and Patchwork contributor, suggested that there might be race conditions at play, as the mail server might spin up multiple process to parse messages - and these processes could run simultaneously.
I have spent some time chasing down the bugs and recently sent a patch series which I hope will clean the up. I think the bugs are a pretty interesting example of how database processing can go wrong in interesting ways when you have unexpected parallelism.
First, some background.
TOCTTOU
A couple of these bugs are TOCTTOU bugs - "Time of Check to Time of Use" - a particular class of race condition. The Wikipedia article is pretty good, but let's do a worked example from a database perspective so we're all on the same page.
Say we have a table describing items, and the items' names must be unique. If we want to create an item, we first check if it exists: SELECT * FROM items WHERE items.name = "Widget";
- If it exists, we don't need to create it, and we're done.
- If it doesn't exist, we go to create it: INSERT INTO items (name) VALUES ("Widget");
Now, what happens if someone else has inserted Widget into items between the time we checked and the time we inserted? We will violate the integrity constraint and the database will complain. We have committed a TOCTTOU error: we have assumed that the property (the absence of Widget) will not vary between the time we checked, and the time we used that assumption.
Personnel issues
A Person in Patchwork represents someone who sends an email - we keep track of their email and the name they go by, so - for example - you can see all the patches someone has sent. People must have unique email addresses.There's a very straightforward TOCTTOU bug with creating Persons.
What we would do is:
- check if a person with this email address exists.
- if not, create a Person model in Django to represent them
- check that the email is something we want to save in the database (patch, cover letter, comment on an existing patch), and if so save the person.
Hopefully it's not hard to see that if you have a series of patches from the same new email address, you could have two processes which:
- simultaneously check for the user, and both find it doesn't exist
- both create a model to represent the new person
- Process A saves first, winning the race
- Process B tries to save, loses the race, and gets an integrity error from the database.
We can reduce the window for the race to occur - and we did. But this isn't a full fix - you could still get very unlucky. A transaction won't save you here either. What we need to do is embrace failure: catch the integrity error, accept we lost the race, and use the winner's entry.
MySQL weird deadlock issue
MySQL has this extra fun quirk: occasionally we'd get the following exception:_mysql_exceptions.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')
patch = Patch(...)
patch.save()
To understand why this was breaking, I looked at the SQL statements that were being generated every time a patch was created. They were weird:
UPDATE "patchwork_patch" SET ...
INSERT INTO "patchwork_patch" ("submission_ptr_id", "diff", "commit_ref", "pull_url", "delegate_id", "state_id", "archived", "hash") VALUES (...)
The update could never work, because it was trying to update a patch that didn't exist yet. My hypothesis is that Django somehow didn't quite know that because of the backend complexity of the Patch model, so it tried to do an update, failed, and then tried an insert.
To fix this, we can easily just change the code around:
patch = Patch.objects.create(...)
Sure enough, this made the UPDATE statements - and the weird MySQL errors - go away.
Series issues
First-class support for patch series is the big feature in Patchwork 2. Unfortunately, it does not work well under parallel mail processing. I did have a long explanation of the bug and how it ended up happening, but it's conceptually similar to the Person issue and so deep in the weeds of the Patchwork schema I don't think it's of much general interest. Feel free to check out the patch! The patch is also not a complete fix so there's a decent chance I will have more to say on it later.So, in conclusion, if you do complex database processing, check if your service is safe to be run in parallel!
Comments
Post a Comment