Many tests are flaky #3412

Closed
opened 2020-09-12 14:39:18 +00:00 by jaraco · 11 comments

As observed in [this run](https://app.circleci.com/pipelines/github/tahoe-lafs/tahoe-lafs/803/workflows/5d9c7811-0f9a-4c33-8a12-07d1c20714bd/jobs/28462), part of a PR that doesn't change any code, `test_POST_upload_replace` fails twice, once with a `twisted.web.error` and once with a `KeyError`.

In [this build](https://app.circleci.com/pipelines/github/tahoe-lafs/tahoe-lafs/806/workflows/32a32357-63ef-41b0-ab32-fbc10971ddd3/jobs/28503), `allmydata.test.test_system.SystemTest` fails with a `FailTest` (#3321).

[This run](https://app.circleci.com/pipelines/github/tahoe-lafs/tahoe-lafs/808/workflows/71b9035e-0cba-418b-9b41-83f9af1c63ab/jobs/28538) reveals that `test_filesystem` is flaky.

Also, [this run looks flaky](https://app.circleci.com/pipelines/github/tahoe-lafs/tahoe-lafs/803/workflows/5d9c7811-0f9a-4c33-8a12-07d1c20714bd/jobs/28458) in `test_status_path_404_error`.

In [this run](https://app.circleci.com/pipelines/github/tahoe-lafs/tahoe-lafs/810/workflows/928f94ca-f135-4472-93b0-fe804ade396c/jobs/28561), `test_POST_rename_file` failed with the same errors as `test_POST_upload_replace`, suggesting all of `test_web` is flaky.
jaraco added the unknown, normal, defect, n/a labels 2020-09-12 14:39:18 +00:00
jaraco added this to the undecided milestone 2020-09-12 14:39:18 +00:00
Author
[pull request](https://github.com/tahoe-lafs/tahoe-lafs/pull/807)
jaraco self-assigned this 2020-09-12 16:33:58 +00:00
jaraco changed title from test_POST_upload_replace is flaky to Many tests are flaky 2020-09-12 16:33:58 +00:00
Author

My plan is to just use this issue as an umbrella to capture most of the tests that are currently flaky and mark each with a 'retry' decorator to bypass the flakiness until someone has time to address the root cause.

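For reference, here is a minimal sketch of what such a synchronous `retry` decorator could look like (the name, signature, and retry count are illustrative, not necessarily what the PR uses):

```python
import functools


def retry(times=3):
    """Re-run a synchronous test method up to ``times`` times.

    Illustrative sketch only; it does nothing useful for trial tests
    that return a Deferred (see the discussion below).
    """
    def decorate(test_method):
        @functools.wraps(test_method)
        def wrapper(self, *args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return test_method(self, *args, **kwargs)
                except Exception as exc:
                    # Retry on any failure, since the goal is only to
                    # paper over flakiness until the root cause is fixed.
                    last_error = exc
            raise last_error
        return wrapper
    return decorate
```

Applied as `@retry()` on an individual flaky test method.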
jaraco modified the milestone from undecided to Support Python 3 2020-09-12 16:47:25 +00:00
Author

I've flagged this as review-needed, but I'm not confident the PR covers all of the flaky tests. Even the most recent runs are still failing on deprecations (tracked separately in #3414). Given the amount of toil this work requires, I'd like to get an initial review of the approach and feedback on the issue generally.

wearpants commented 2020-09-13 15:18:45 +00:00
Owner

The collective consensus has been that we'll live with flaky tests... though personally, I'd like to see some improvement here, as it comes up multiple times per week and CI is slow. As a result, we have eyeballed failing tests and decided to merge anyway (IIRC, this introduced a bug into master at least once, because the test was actually failing, not just flaky).

So I guess if there's a quick and dirty "fix" like retrying, that's worth doing, but we should discuss further before investing significant effort in addressing root causes.

> though personally, I'd like to see some improvement here

I think everyone would like to see some improvement. It's just not clear how to make that happen.

This ticket is a partial duplicate of https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3321

Author

In the PR, exarkun advised:

> [retries] These only work for synchronous tests. If a test returns a Deferred and that Deferred fires with a failure, do they do anything at all?

So I reached out for advice.

In the #twisted IRC channel, I got some advice from tos9, Glyph, and others regarding the fact that `trial` allows test methods to return a Deferred object, which may have one or more callbacks that will only be executed later, so wrapping those methods in `retry` will have little effect.
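
To make the problem concrete, here is a hedged sketch (the test body is invented for illustration) of a trial test that returns a Deferred: the method itself returns almost immediately, so a synchronous retry decorator like the sketch above never sees the eventual failure.

```python
from twisted.internet import reactor, task
from twisted.trial import unittest


class ExampleTests(unittest.TestCase):
    def test_async_thing(self):
        # The method returns the Deferred right away; trial waits on it
        # and only sees the failure later, long after a synchronous
        # retry decorator has already returned.
        d = task.deferLater(reactor, 0.01, lambda: None)

        def check(_):
            raise self.failureException("flaky condition observed")

        d.addCallback(check)
        return d
```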

I created [trial-retry](https://github.com/jaraco/trial-retry) to work out the kinks, and there I landed on a technique that seems to have the intended effect.

Glyph had this to say about the approach:

> Hopefully when we have some type-checking you'll get an error if you try to manipulate `.callbacks` like that
> entirely "from the outside", i.e. if you have a Deferred which definitely isn't in the process of firing, you can probably get away with `.callbacks=something`, but in the middle of execution this may have arbitrarily weird effects; running some callbacks twice, failing to run others

It remains to be seen whether the issue is merely theoretical or whether such limitations might be triggered in this codebase.

Acknowledging this potential weakness of the approach, I plan to apply this technique to the work in progress and see if it improves the reliability of the test suite.

Author

Turns out the problem is even more complicated than I imagined. As [illustrated here](https://github.com/tahoe-lafs/tahoe-lafs/blob/51709d55629f45be19f2c89b2d8b17a149e19881/src/allmydata/test/web/test_web.py#L1126-L1150), tests can create Deferred objects and do not even need to return them for them to be honored in the test results. And because the handling of those results happens outside the scope of the method, no decorator on the method can intervene when those deferred tests fail... at least, not without some support from the test runner.

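A hedged illustration of that pattern (not the linked test itself; the helper is made up): the Deferred is never returned, yet trial can still report its failure, because the unhandled error is logged only after the method has returned and is then picked up by the test runner.

```python
from twisted.internet import defer
from twisted.trial import unittest


def some_flaky_operation():
    # Hypothetical stand-in for a call that sometimes fails.
    return defer.succeed(False)


class ExampleTests(unittest.TestCase):
    def test_fire_and_forget(self):
        d = defer.maybeDeferred(some_flaky_operation)
        # The Deferred is deliberately not returned.  If a callback on it
        # fails, the unhandled error is logged when the Deferred is
        # garbage-collected, and trial turns that logged error into a
        # test failure, after this method has already returned and beyond
        # the reach of any decorator applied to the method.
        d.addCallback(lambda result: self.assertTrue(result))
```
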
Author

Given that:

- there's little prior art for dealing with flaky tests in Trial,
- it's possible for deferred functionality to be present without any signal on the method itself,
- any retry functionality is going to require special handling in the test runner,
- the consensus is that flaky tests are acceptable,

I'm going to shelve this effort for now.

Author

Here is some of the conversation about this effort:

```
[2020-09-25 11:33:39] <jaraco> Is there a way for a decorator on a method to detect all of the deferreds that were created during the method and wait on all of them to succeed and retry if any of them fail?
[2020-09-25 11:37:02] <jaraco> I’ve written https://github.com/jaraco/trial-retry/blob/13499eb3a5d9e8d068c28b3c40e52ca71a56a13a/test_everything.py#L40-L71 to achieve that for a single deferred in the return value.
[2020-09-25 11:38:30] <jaraco> Except, even that approach won’t work because it’s only retrying the callback. It’s not retrying the test.
[2020-09-25 11:58:17] <itamarst> there's gather_results
[2020-09-25 11:58:36] <itamarst> but...
[2020-09-25 11:58:54] <itamarst> I'm not sure why you can't retry whole method
[2020-09-25 11:59:19] <itamarst> oh, you don't want to mess with internal list
[2020-09-25 11:59:52] <itamarst> you want to addErrback a function that, for N times, returns the result of the original decorated function
[2020-09-25 12:00:50] <itamarst> a single retry would look like this:  `return result.addErrback(lambda _: f(*args, **kwargs))`
[2020-09-25 12:01:35] <itamarst> a niave "retry 3 times" is `return result.addErrback(lambda _: f(*args, **kwargs)).addErrback(lambda _: f(*args, **kwargs)).addErrback(lambda _: f(*args, **kwargs))`
[2020-09-25 12:01:53] <itamarst> you don't want to mess with `.callbacks` at all, just call `addErrback`
[2020-09-25 12:10:33] <jaraco> The problem is that the deferreds aren’t always returned by the method.
[2020-09-25 12:16:48] <jaraco> Such as in `test_status_path_404_error` - how does one attach an errback to the “up” call?
[2020-09-25 12:20:40] <jaraco> Also, adding errback isn’t enough to suppress the failure. You want to trap exceptions in the callback and errbacks of the deferreds created.
[2020-09-25 12:25:53] <itamarst> if isinstance(result, Deferred): result = result.addErrback(lambda _: [_.trap(), f(*args, **kwargs)][0])
[2020-09-25 12:25:56] <itamarst> return result
[2020-09-25 12:26:04] <itamarst> should be [1]
[2020-09-25 12:26:11] <itamarst> and really, shouldn't be lambda should be real function
[2020-09-25 12:26:16] <itamarst> but easier to do one liners in IRC :)
[2020-09-25 12:27:39] <itamarst> and it's probably not trap(), it's probably something with self.expectedFailures or something
[2020-09-25 12:27:45] <itamarst> I forget the API name
[2020-09-25 12:27:52] <itamarst> but the basic structure is as above, just addErrback
[2020-09-25 12:28:07] <itamarst> Deferreds get chained automatically
[2020-09-25 12:28:26] <itamarst> so you just need to suppress the "logged exceptions get marked as failures logic"
[2020-09-25 12:28:31] <itamarst> that Twisted's TestCase adds
[2020-09-25 12:28:43] <itamarst> back later if you want to pair
[2020-09-25 13:18:10] <exarkun> jaraco: I'm not sure this is going to be a fruitful avenue
[2020-09-25 13:18:27] <exarkun> jaraco: If just retrying the test method isn't sufficient then the test is probably so broken it needs to be rewritten
[2020-09-25 13:19:09] <exarkun> jaraco: Clobbering `Deferred.callbacks` is pretty much guaranteed to make it more broken rather than less
[2020-09-25 13:29:48] <itamarst> jaraco: if the worry is "what if someone does addCallback after the return", that will break the normal test infrastructure too, so you don't need to handle that case
```
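
For reference, a hedged sketch of the `addErrback`-based retry itamarst outlines above (names are illustrative; as noted in the log, a real version would also need to suppress trial's logged-error handling, and it still cannot help with Deferreds the test never returns):

```python
import functools

from twisted.internet import defer


def retry_async(times=3):
    """Retry a trial test method whose *returned* Deferred fails.

    Sketch of the addErrback approach discussed above; illustrative only.
    """
    def decorate(test_method):
        @functools.wraps(test_method)
        def wrapper(self, *args, **kwargs):
            def attempt(remaining):
                # maybeDeferred also catches synchronous exceptions.
                d = defer.maybeDeferred(test_method, self, *args, **kwargs)
                if remaining > 0:
                    d.addErrback(lambda _failure: attempt(remaining - 1))
                return d
            return attempt(times)
        return wrapper
    return decorate
```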

I've thought about this some more and had some ideas. Like what if the asynchronous test could be synchronized then retried? That doesn't work because the event loop is already created for setup/teardown.

jaraco removed their assignment 2021-01-30 14:50:29 +00:00

This is happening... a lot less? Going to close it, can open new one if new issues arise.

itamarst added the fixed label 2021-07-05 22:17:53 +00:00
Reference: tahoe-lafs/trac-2024-07-25#3412