Message ID | 20220802131405.318201-1-JPEWhacker@gmail.com |
---|---|
State | New |
Headers | show |
Series | [bitbake-devel,v2] siggen: Fix sigtask data not being renamed atomically | expand |
On 02/08/2022 15.14, Joshua Watt via lists.openembedded.org wrote: > Signature generation uses mkstemp() to get a file descriptor to a unique > file and then write the signature into it, however it closed the file > before doing the chmod() and rename() operations. Closing the file means > that other mkstemp() could potentially open the same file and race with > the chmod() and rename(), causing a error. While it may not sound like > this would be very likely, glibc (at least) generates the filename for > mkstemp() using the system clock, meaning that it is much more likely > for highly parallel builds sharing sstate over NFS to encounter the race > condition. > > To fix the problem, perform the chmod() and rename() while the file is > still open, since this prevents other mkstemp() calls from being able to > open the file (due to the O_EXCL flag). Eh, O_CREAT|O_EXCL should prevent opening an already existing file, whether somebody has it open or not, so I don't see how another mkstemp() could end up opening "our" file. Unless O_EXCL semantics are broken on NFS and only work by accident when the local client does have the file open - but what would then prevent other clients with, presumably, the exact same system time from clobbering our file, whether or not the chmod+rename is done with the fd still held? The patch is probably fine (as in, shouldn't make anything worse), but I think it would be nice to understand just what the problem actually is, and if this fixes it, how. Rasmus
On 8/3/22 06:40, Rasmus Villemoes wrote: > On 02/08/2022 15.14, Joshua Watt via lists.openembedded.org wrote: >> Signature generation uses mkstemp() to get a file descriptor to a unique >> file and then write the signature into it, however it closed the file >> before doing the chmod() and rename() operations. Closing the file means >> that other mkstemp() could potentially open the same file and race with >> the chmod() and rename(), causing a error. While it may not sound like >> this would be very likely, glibc (at least) generates the filename for >> mkstemp() using the system clock, meaning that it is much more likely >> for highly parallel builds sharing sstate over NFS to encounter the race >> condition. >> >> To fix the problem, perform the chmod() and rename() while the file is >> still open, since this prevents other mkstemp() calls from being able to >> open the file (due to the O_EXCL flag). > Eh, O_CREAT|O_EXCL should prevent opening an already existing file, > whether somebody has it open or not, so I don't see how another > mkstemp() could end up opening "our" file. Unless O_EXCL semantics are > broken on NFS and only work by accident when the local client does have > the file open - but what would then prevent other clients with, > presumably, the exact same system time from clobbering our file, whether > or not the chmod+rename is done with the fd still held? Yes, for some reason I had it my head that O_EXCL only affects open files; so I'm not sure. Maybe NFS is the culprit :/ > > The patch is probably fine (as in, shouldn't make anything worse), but I > think it would be nice to understand just what the problem actually is, > and if this fixes it, how. Ya, we are definetely getting errors like: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/releases/sstate/99/sigtask.wq6gkrm7' -> '/mnt/releases/sstate/99/sstate:hdparm::9.48:r0::3:990d8b39c15752a2dd68369cb3609283_unpack.tgz.siginfo' With a backtrace to the os.rename() under heavy load. I don't understand how the chmod() can succeed and the rename fail unless there is a race. It's very hard to diagnose what went wrong post-mortem. My other option was to manually add more entropy to the file names by prepending some random characters after "sigtask", since the "randomeness" of mkstemp() isn't great all tasks use the same sigtask file prefix (which increases the likely hood of a collision). > > Rasmus
On Wed, Aug 3, 2022 at 8:21 AM Joshua Watt <jpewhacker@gmail.com> wrote: > > > On 8/3/22 06:40, Rasmus Villemoes wrote: > > On 02/08/2022 15.14, Joshua Watt via lists.openembedded.org wrote: > > Signature generation uses mkstemp() to get a file descriptor to a unique > file and then write the signature into it, however it closed the file > before doing the chmod() and rename() operations. Closing the file means > that other mkstemp() could potentially open the same file and race with > the chmod() and rename(), causing a error. While it may not sound like > this would be very likely, glibc (at least) generates the filename for > mkstemp() using the system clock, meaning that it is much more likely > for highly parallel builds sharing sstate over NFS to encounter the race > condition. > > To fix the problem, perform the chmod() and rename() while the file is > still open, since this prevents other mkstemp() calls from being able to > open the file (due to the O_EXCL flag). > > Eh, O_CREAT|O_EXCL should prevent opening an already existing file, > whether somebody has it open or not, so I don't see how another > mkstemp() could end up opening "our" file. Unless O_EXCL semantics are > broken on NFS and only work by accident when the local client does have > the file open - but what would then prevent other clients with, > presumably, the exact same system time from clobbering our file, whether > or not the chmod+rename is done with the fd still held? > > Yes, for some reason I had it my head that O_EXCL only affects open files; so I'm not sure. Maybe NFS is the culprit :/ > > The patch is probably fine (as in, shouldn't make anything worse), but I > think it would be nice to understand just what the problem actually is, > and if this fixes it, how. > > Ya, we are definetely getting errors like: > > > FileNotFoundError: [Errno 2] No such file or directory: '/mnt/releases/sstate/99/sigtask.wq6gkrm7' -> '/mnt/releases/sstate/99/sstate:hdparm::9.48:r0::3:990d8b39c15752a2dd68369cb3609283_unpack.tgz.siginfo' > > With a backtrace to the os.rename() under heavy load. I don't understand how the chmod() can succeed and the rename fail unless there is a race. It's very hard to diagnose what went wrong post-mortem. > > My other option was to manually add more entropy to the file names by prepending some random characters after "sigtask", since the "randomeness" of mkstemp() isn't great all tasks use the same sigtask file prefix (which increases the likely hood of a collision). Ok, after further investigation, I realized that our NFS server (which I don't have any control over) is not Linux, so I have the sneaking suspicion that it doesn't implement O_EXCL correctly, or perhaps not correctly across multiple clients. In light of that, this patch isn't going to do what I want anyway, and I've sent a new patch that increases the entropy to mkstemp() so it's not time based which should solve the problem in all cases > > Rasmus
diff --git a/bitbake/lib/bb/siggen.py b/bitbake/lib/bb/siggen.py index 3f3d6df54d..f3c8ff796f 100644 --- a/bitbake/lib/bb/siggen.py +++ b/bitbake/lib/bb/siggen.py @@ -426,18 +426,18 @@ class SignatureGeneratorBasic(SignatureGenerator): sigfile = sigfile.replace(self.taskhash[tid], computed_taskhash) fd, tmpfile = tempfile.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.") - try: - with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f: + with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f: + try: json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder) f.flush() - os.chmod(tmpfile, 0o664) - bb.utils.rename(tmpfile, sigfile) - except (OSError, IOError) as err: - try: - os.unlink(tmpfile) - except OSError: - pass - raise err + os.fchmod(fd, 0o664) + bb.utils.rename(tmpfile, sigfile) + except (OSError, IOError) as err: + try: + os.unlink(tmpfile) + except OSError: + pass + raise err def dump_sigfn(self, fn, dataCaches, options): if fn in self.taskdeps:
Signature generation uses mkstemp() to get a file descriptor to a unique file and then write the signature into it, however it closed the file before doing the chmod() and rename() operations. Closing the file means that other mkstemp() could potentially open the same file and race with the chmod() and rename(), causing a error. While it may not sound like this would be very likely, glibc (at least) generates the filename for mkstemp() using the system clock, meaning that it is much more likely for highly parallel builds sharing sstate over NFS to encounter the race condition. To fix the problem, perform the chmod() and rename() while the file is still open, since this prevents other mkstemp() calls from being able to open the file (due to the O_EXCL flag). Signed-off-by: Joshua Watt <JPEWhacker@gmail.com> --- bitbake/lib/bb/siggen.py | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-)