Skip to main content

Patchwork and database races

Patchwork is a handy system that takes patches sent to a mailing list and gives you a nice web interface to track their progress through review and into the respective upstream project. It's used heavily in kernel development, and I'm one of the Patchwork maintainers.

We have had some annoying bugs floating around for a while now - occasionally, patches to the Buildroot mailing list just seem to go missing.

Andrew, an OzLabber and Patchwork contributor, suggested that there might be race conditions at play, as the mail server might spin up multiple process to parse messages - and these processes could run simultaneously.

I have spent some time chasing down the bugs and recently sent a patch series which I hope will clean the up. I think the bugs are a pretty interesting example of how database processing can go wrong in interesting ways when you have unexpected parallelism.

First, some background.

TOCTTOU

A couple of these bugs are TOCTTOU bugs - "Time of Check to Time of Use" - a particular class of race condition. The Wikipedia article is pretty good, but let's do a worked example from a database perspective so we're all on the same page.

Say we have a table describing items, and the items' names must be unique. If we want to create an item, we first check if it exists: SELECT * FROM items WHERE items.name = "Widget";
  • If it exists, we don't need to create it, and we're done.
  • If it doesn't exist, we go to create it: INSERT INTO items (name) VALUES ("Widget");
Now, what happens if someone else has inserted Widget into items between the time we checked and the time we inserted? We will violate the integrity constraint and the database will complain. We have committed a TOCTTOU error: we have assumed that the property (the absence of Widget) will not vary between the time we checked, and the time we used that assumption. 

Personnel issues

A Person in Patchwork represents someone who sends an email - we keep track of their email and the name they go by, so - for example - you can see all the patches someone has sent. People must have unique email addresses.

There's a very straightforward TOCTTOU bug with creating Persons.

What we would do is:
  • check if a person with this email address exists.
  • if not, create a Person model in Django to represent them
  • check that the email is something we want to save in the database (patch, cover letter, comment on an existing patch), and if so save the person.
Hopefully it's not hard to see that if you have a series of patches from the same new email address, you could have two processes which:
  • simultaneously check for the user, and both find it doesn't exist
  • both create a model to represent the new person
  • Process A saves first, winning the race
  • Process B tries to save, loses the race, and gets an integrity error from the database.
This would abort processing of the patch, leading Patchwork to miss it.

We can reduce the window for the race to occur - and we did. But this isn't a full fix - you could still get very unlucky. A transaction won't save you here either. What we need to do is embrace failure: catch the integrity error, accept we lost the race, and use the winner's entry.

MySQL weird deadlock issue

MySQL has this extra fun quirk: occasionally we'd get the following exception:

_mysql_exceptions.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')

Looking at the backtrace, it was thrown when we created a patch like this:

patch = Patch(...)
patch.save()

To understand why this was breaking, I looked at the SQL statements that were being generated every time a patch was created. They were weird:

UPDATE "patchwork_patch" SET ...
INSERT INTO "patchwork_patch" ("submission_ptr_id", "diff", "commit_ref", "pull_url", "delegate_id", "state_id", "archived", "hash") VALUES (...)

The update could never work, because it was trying to update a patch that didn't exist yet. My hypothesis is that Django somehow didn't quite know that because of the backend complexity of the Patch model, so it tried to do an update, failed, and then tried an insert.

To fix this, we can easily just change the code around:

patch = Patch.objects.create(...)

Sure enough, this made the UPDATE statements - and the weird MySQL errors - go away.

Series issues

First-class support for patch series is the big feature in Patchwork 2. Unfortunately, it does not work well under parallel mail processing. I did have a long explanation of the bug and how it ended up happening, but it's conceptually similar to the Person issue and so deep in the weeds of the Patchwork schema I don't think it's of much general interest. Feel free to check out the patch! The patch is also not a complete fix so there's a decent chance I will have more to say on it later.

So, in conclusion, if you do complex database processing, check if your service is safe to be run in parallel!

Comments

Popular posts from this blog

Connecting to a wifi network with netplan

How do you connect to a a wifi network with netplan? I hang out on the #netplan IRC channel on Freenode, and this comes up every so often. netplan - the default network configuration tool in Ubuntu 17.10 onwards - currently supports WPA2 Personal networks, and open (unencrypted) networks only. If you need something else, consider using NetworkManager directly, or falling back to ifupdown and wpa_supplicant for a little longer. Without further ado, here are tested, working YAML files for connection to my local WPA2 and unencrypted network. The only things that have been changed are the SSIDs and password. Both networks have a router providing dhcp4. In both cases I assume there's only one wifi device in the system - if this is not true, replace match: {} with something more specific. You can drop these in  /etc/netplan and run netplan generate; netplan apply  and things should work. The network will also be brought up on subsequent boots. Note that, as always in YAML, ind

Netplan by example

netplan  is the default network configuration system for new installs of Ubuntu 18.04 (Bionic). It uses YAML to configure network interfaces, instead of  /etc/network/interfaces . I've been testing netplan for a while, so in light of the release of Bionic, here's my set of examples, caveats, tips and tricks. Contents General tips and tricks Matching Basic IPv4 configuration MTUs Bridges, Bonds and VLANs Wifi IPv6 Supplementing or replacing netplan Going Further General tips and tricks Tabs are not allowed in YAML and currently you get a very useless error message if you use them: "Invalid YAML at //etc/netplan/10-bridge.yaml line 5 column 0: found character that cannot start any token". If you see this, check for tabs! Indentation matters in YAML. Make sure that things line up where they're supposed to. Rebooting is somewhat more reliable than netplan apply , but make sure  there are no errors in your YAML before you reboot or no network

Anonymous bridges in netplan

netplan is the default network configuration system for new installs of Ubuntu 18.04 (Bionic). Introduced as the default in Artful, it replaces /etc/network/interfaces . One question that gets asked repeatedly is: "How do I set up an anonymous bridge in netplan?" (An anonymous bridge, I discovered, is one where the bridge doesn't have an IP address; it's more akin to a switch or hub.) It's been approached on  Launchpad , and comes up on the IRC channel. If you're trying to create a bridge without an IP address, the obvious first thing to try is this: network: version: 2 ethernets: ens8: match: macaddress: 52:54:00:f9:e9:dd ens9: match: macaddress: 52:54:00:56:0d:ce bridges: br0: interfaces: [ens8, ens9] This is neat, plausible, and wrong - the bridge will be created but will stay 'down'. Per ip a : 5: br0: <BROADCAST,MULTICAST> mtu 15