Retry moody GitHub Actions steps #3945
Some workflows fail on GitHub Actions either because the tests are moody or GitHub Actions itself is moody. Example: https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477
That failure has nothing to do with the changes that triggered that workflow; it might be a good idea to retry that step.
Some other workflows take a long time to run. Examples, on https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477: coverage (ubuntu-latest, pypy-37), integration (ubuntu-latest, 3.7), and integration (ubuntu-latest, 3.9). Although in this specific instance the integration tests are failing due to #3943, it might be a good idea to retry them after a reasonable timeout, and to give up altogether after a number of tries instead of spinning for many hours on end. Perhaps this would be a good use of actions/retry-step?
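A minimal sketch of what that could look like, assuming the third-party nick-fields/retry action (the action name, inputs, version pins, and the test command below are illustrative assumptions, not taken from the project's actual workflow files):

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    timeout-minutes: 90            # bound the whole job well below GitHub's 6-hour default
    steps:
      - uses: actions/checkout@v3
      - name: Run integration tests with bounded retries
        # Hypothetical use of a third-party retry wrapper; the exact action
        # and its inputs are an assumption, not the project's current config.
        uses: nick-fields/retry@v2
        with:
          max_attempts: 3          # give up altogether after a few tries
          timeout_minutes: 30      # per-attempt timeout
          command: tox -e integration   # illustrative command
```

The job-level timeout-minutes is a built-in GitHub Actions setting; the per-step retry is the part that would need a third-party action or a small shell loop.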
I don't think automatically doing a rerun of the whole test suite when a test fails is a good idea.
If there is a real test failure, then the result is that CI takes N times as long to complete. If there is a spurious test failure that we're not aware of, then the result is that we don't become aware of it for much longer. If there is a spurious test failure that we are aware of, then the result is that it is swept under the rug and is much easier to ignore for much longer.
These all seem like downsides to me.
Hmm, that is true. Do you think there's value in using a smaller timeout value though? Sometimes running tests seem to get stuck without terminating cleanly. Like in this case, for example:
https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3525447679
Integration tests on Ubuntu ran for six hours, which I guess is GitHub's default timeout. From a developer experience perspective, I guess it would be useful for them to fail sooner than that.
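For what it's worth, the job-level knob for that is GitHub Actions' built-in timeout-minutes setting (the default is 360 minutes, i.e. 6 hours); a rough sketch, with an illustrative job name and value:

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    # Fail a hung run after 1 hour instead of GitHub's default 360 minutes.
    timeout-minutes: 60
```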
A timeout less than 6 hours sounds good (!!) but yeah I mostly agree with what jean-paul is saying.
That said, is there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)
Replying to meejah:
A quick search for "flaky", "spurious", and "test_upload_and_download_convergent" here in Trac turned up #3413, #3412, #1768, #1084, and this milestone: "Integration and Unit Testing".
There might be more tickets. I guess all those tickets ideally should belong to that milestone.
Perhaps it might be worth collecting some data about these failures when testing the master branch alone, since PR branches are likely to add too much noise? https://github.com/tahoe-lafs/tahoe-lafs/actions?query=branch%3Amaster does not look ideal. However, since GitHub doesn't keep test logs long enough for organizations on free plans, collecting that data is going to be rather challenging.

Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this has never been annoying enough to become a priority. :-)
I wouldn't say this is the case. I spent a large chunk of time last year fixing flaky tests. The test suite is currently much more reliable than it was before that effort.
Closing per https://github.com/tahoe-lafs/tahoe-lafs/pull/1230