Retry moody GitHub Actions steps #3945

Closed
opened 2022-11-27 02:21:29 +00:00 by sajith · 6 comments

Some workflows fail on GitHub Actions either because the tests are moody or GitHub Actions itself is moody. Example: https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477

2022-11-27T01:09:13.3236569Z [FAIL]
2022-11-27T01:09:13.3236873Z Traceback (most recent call last):
2022-11-27T01:09:13.3237795Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\pollmixin.py", line 47, in _convert_done
2022-11-27T01:09:13.3238340Z     f.trap(PollComplete)
2022-11-27T01:09:13.3239166Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 480, in trap
2022-11-27T01:09:13.3244610Z     self.raiseException()
2022-11-27T01:09:13.3245778Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 504, in raiseException
2022-11-27T01:09:13.3259779Z     raise self.value.with_traceback(self.tb)
2022-11-27T01:09:13.3260719Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\internet\defer.py", line 206, in maybeDeferred
2022-11-27T01:09:13.3261254Z     result = f(*args, **kwargs)
2022-11-27T01:09:13.3261923Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\pollmixin.py", line 69, in _poll
2022-11-27T01:09:13.3262457Z     self.fail("Errors snooped, terminating early")
2022-11-27T01:09:13.3262935Z twisted.trial.unittest.FailTest: Errors snooped, terminating early
2022-11-27T01:09:13.3263257Z 
2022-11-27T01:09:13.3263547Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3263989Z ===============================================================================
2022-11-27T01:09:13.3264288Z [ERROR]
2022-11-27T01:09:13.3264609Z Traceback (most recent call last):
2022-11-27T01:09:13.3265386Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\rrefutil.py", line 26, in _no_get_version
2022-11-27T01:09:13.3268422Z     f.trap(Violation, RemoteException)
2022-11-27T01:09:13.3269217Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 480, in trap
2022-11-27T01:09:13.3269711Z     self.raiseException()
2022-11-27T01:09:13.3270396Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 504, in raiseException
2022-11-27T01:09:13.3270976Z     raise self.value.with_traceback(self.tb)
2022-11-27T01:09:13.3271553Z foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=4vg7) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
2022-11-27T01:09:13.3271977Z 
2022-11-27T01:09:13.3272448Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3272884Z ===============================================================================
2022-11-27T01:09:13.3273207Z [ERROR]
2022-11-27T01:09:13.3273530Z Traceback (most recent call last):
2022-11-27T01:09:13.3274088Z Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=4vg7) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
2022-11-27T01:09:13.3274512Z 
2022-11-27T01:09:13.3274802Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3275437Z -------------------------------------------------------------------------------
2022-11-27T01:09:13.3275958Z Ran 1776 tests in 1302.475s
2022-11-27T01:09:13.3276195Z 
2022-11-27T01:09:13.3276435Z FAILED (skips=27, failures=1, errors=2, successes=1748)

That failure has nothing to do with the changes that triggered that workflow; it might be a good idea to retry that step.

Some other workflows take a long time to run. Examples: on https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477, coverage (ubuntu-latest, pypy-37), integration (ubuntu-latest, 3.7), and integration (ubuntu-latest, 3.9). Although in this specific instance integration tests are failing due to #3943, it might be a good idea to retry them after a reasonable timeout, and give up altogether after a number of tries instead of spinning for many hours on end.

Perhaps this would be a good use of actions/retry-step (https://github.com/marketplace/actions/retry-step)?
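For illustration, here is a rough sketch of what wrapping a flaky step could look like. I'm assuming the marketplace listing resolves to the nick-fields/retry action; the step name, tox environment, and version pin below are placeholders, not copied from our actual workflow files:

```yaml
# Hypothetical sketch only: step name, tox environment, and version pin
# are illustrative, not taken from tahoe-lafs's real workflow.
- name: Run unit tests (with retry)
  uses: nick-fields/retry@v2
  with:
    max_attempts: 3        # give up after three tries instead of failing on the first flake
    timeout_minutes: 45    # treat a hung run as a failure so it can be retried
    retry_on: any          # retry on both errors and timeouts
    command: python -m tox -e py310-coverage
```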

sajith added the dev-infrastructure, normal, task, and n/a labels 2022-11-27 02:21:29 +00:00
sajith added this to the undecided milestone 2022-11-27 02:21:29 +00:00
sajith self-assigned this 2022-11-27 02:21:29 +00:00

I don't think automatically doing a rerun of the whole test suite when a test fails is a good idea.

If there is a real test failure, then the result is that CI takes N times as long to complete. If there is a spurious test failure that we're not aware of, then the result is that we don't become aware of it for much longer. If there is a spurious test failure that we are aware of, then the result is that it is swept under the rug and becomes much easier to ignore for much longer.

These all seem like downsides to me.

Author

Hmm, that is true. Do you think there's value in using a smaller timeout, though? Sometimes test runs seem to get stuck without terminating cleanly. Like in this case, for example:

https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3525447679

Integration tests on Ubuntu ran for six hours, which I believe is GitHub's default timeout. From a developer experience perspective, it would be useful for them to fail sooner than that.
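As a rough sketch of what a shorter cap could look like (job and step names here are just placeholders, not our actual workflow), GitHub's built-in timeout-minutes setting can be applied at the job level, the step level, or both:

```yaml
# Hypothetical sketch: cap a job well below GitHub's 6-hour default.
integration:
  runs-on: ubuntu-latest
  timeout-minutes: 90          # fail the whole job after 90 minutes
  steps:
    - name: Run integration tests
      timeout-minutes: 60      # optionally cap an individual step as well
      run: python -m tox -e integration
```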

Owner

A timeout of less than 6 hours sounds good (!!), but yeah, I mostly agree with what jean-paul is saying.

That said, is there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)

Author

Replying to meejah:

> That said, is there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)

A quick search for "flaky", "spurious", and "test_upload_and_download_convergent" here in Trac turned up #3413, #3412, #1768, #1084, and the "Integration and Unit Testing" milestone.

There might be more tickets. I guess all those tickets ideally should belong to that milestone.

Perhaps it might be worth collecting some data about these failures when testing the master branch alone, since PR branches are likely to add too much noise? https://github.com/tahoe-lafs/tahoe-lafs/actions?query=branch%3Amaster does not look ideal. However, since GitHub doesn't keep test logs long enough for organizations on free plans, collecting that data is going to be rather challenging.

Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this never has been annoying enough to become a priority. :-)


> Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this never has been annoying enough to become a priority. :-)

I wouldn't say this is the case. I spent a large chunk of time last year fixing flaky tests. The test suite is currently much more reliable than it was before that effort.

Closing per <https://github.com/tahoe-lafs/tahoe-lafs/pull/1230>
exarkun added the wontfix label 2022-12-12 17:45:47 +00:00