Closed
Description
When the graceful recovery test is run against NGF with NGINX Plus, it fails.
NGF log excerpts:
. . .
{"level":"error","ts":"2024-05-22T21:15:36Z","logger":"eventLoop.eventHandler","msg":"failed to get upstreams from API, reloading configuration instead","batchID":16,"error":"failed to get upstreams: failed to get http/upstreams: Get \"http://nginx-plus-api/api/9/http/upstreams\": dial unix /var/run/nginx/nginx-plus-api.sock: connect: no such file or directory","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).updateUpstreamServers\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:357\ngithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:204\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"}
Acceptance criteria:
- Further identify any bugs
Activity
salonichf5 commented on Jun 3, 2024
Seeing this error in the pipeline as well
salonichf5 commented on Jun 3, 2024
I have handled this error case in my current PR. We should update the bug description to avoid confusion.
pleshakov commented on Jun 4, 2024
@salonichf5
That could happen if NGF updated NGINX with upstream servers that are not ready yet.
We do have a check to counter it here:
https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L267
but it doesn't fully counter it, as another error can appear.
Moreover, I don't think countering it is the right approach. I think we have a bug in either the NGF code or the test code.
Note what we do in the test (https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L142-L153): we redeploy the app and then send requests to NGINX.
The fact that we call
Expect(resourceManager.WaitForAppsToBeReady(ns.Name)).To(Succeed())
combined with the readiness probe enabled in the backend pods (https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/manifests/graceful-recovery/cafe.yaml#L57-L61) should have prevented unready endpoints from being propagated to NGINX upstreams in the first place. So either:
(1) we have a bug in the code that propagates not-ready endpoints to NGINX upstreams, or
(2) resourceManager.WaitForAppsToBeReady(ns.Name) doesn't work as expected, or something else is going on.
To test (1), I suggest deploying the cafe example https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/cafe-example and configuring the tea pod spec with a readiness probe that sends a request to a non-existent port (like 1234), so that the pod doesn't become ready:
and then check what config NGF generates and if it includes any tea endpoints. Since the test failed with Plus, let's check it with the Plus image.
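For reference when inspecting the generated config in (1), here is a minimal sketch of the filtering behavior we would expect. It is not NGF's actual code: the `Endpoint` type and `readyAddresses` function are hypothetical stand-ins, and the nil handling assumes the EndpointSlice convention that an unknown `ready` condition is treated as ready.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified stand-in for a Kubernetes
// EndpointSlice endpoint. It is not NGF's actual type.
type Endpoint struct {
	Address string
	// Ready mirrors the EndpointSlice "ready" condition. nil means unknown,
	// which this sketch treats as ready (an assumption, see above).
	Ready *bool
}

// readyAddresses returns only the addresses that should end up in an
// NGINX upstream: endpoints explicitly marked not ready are skipped.
func readyAddresses(endpoints []Endpoint) []string {
	addrs := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if ep.Ready != nil && !*ep.Ready {
			continue // explicitly not ready: keep it out of the upstream
		}
		addrs = append(addrs, ep.Address)
	}
	return addrs
}

func main() {
	notReady, ready := false, true
	eps := []Endpoint{
		{Address: "10.0.0.1:8080", Ready: &ready},
		{Address: "10.0.0.2:8080", Ready: &notReady}, // e.g. the tea pod with the failing probe
		{Address: "10.0.0.3:8080"},                   // unknown readiness
	}
	fmt.Println(readyAddresses(eps)) // [10.0.0.1:8080 10.0.0.3:8080]
}
```

If the tea pod never becomes ready and its address still shows up in the generated upstream, that would point to (1).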
Regarding (2), I suspect we have a race condition in https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L311, which calls https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L312, specifically here: https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L335
When we get the pod list, the pods might not have been created yet, so we get 0 pods and the function quickly returns success.
To test this theory, we create separate functions like below that check the count:
and update the graceful recovery test to call those functions.
so instead of
do
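A minimal sketch of the count-checking idea, to illustrate why an empty pod list passes the current check. Everything here is hypothetical: `Pod`, `allReady`, `readyWithCount`, and `waitForPodsWithCount` are simplified stand-ins, not the real ResourceManager methods or Kubernetes client types.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	Ready bool
}

// allReady mirrors the suspected bug: with zero pods the loop body never
// runs, so an empty list is reported as ready.
func allReady(pods []Pod) bool {
	for _, p := range pods {
		if !p.Ready {
			return false
		}
	}
	return true
}

// readyWithCount additionally requires the expected number of pods to exist.
func readyWithCount(pods []Pod, want int) bool {
	return len(pods) == want && allReady(pods)
}

// waitForPodsWithCount polls list until want pods exist and are all ready,
// or until ctx expires.
func waitForPodsWithCount(ctx context.Context, list func() []Pod, want int, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if readyWithCount(list(), want) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for %d ready pods: %w", want, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	fmt.Println(allReady(nil))          // true: the empty list passes vacuously
	fmt.Println(readyWithCount(nil, 2)) // false: the count check catches it

	// Simulate pods that only show up on the third list call.
	calls := 0
	list := func() []Pod {
		calls++
		if calls < 3 {
			return nil
		}
		return []Pod{{Name: "coffee", Ready: true}, {Name: "tea", Ready: true}}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitForPodsWithCount(ctx, list, 2, 10*time.Millisecond)) // <nil>
}
```

With the count check in place, the wait keeps polling while the list is still empty instead of returning success before the redeployed pods exist.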
salonichf5 commented on Jun 4, 2024
I could not reproduce the error above, but I found a race condition in the graceful recovery tests, so I'm closing this. Opened another PR.