
Graceful recovery test fails with NGINX Plus #2022

Closed

Description

@pleshakov (Contributor)

When the graceful recovery test is run against NGF with NGINX Plus, it fails.

See https://github.com/nginxinc/nginx-gateway-fabric/actions/runs/9198281036/job/25300657770

NGF log excerpt:

. . .
 {"level":"error","ts":"2024-05-22T21:15:36Z","logger":"eventLoop.eventHandler","msg":"failed to get upstreams from API, reloading configuration instead","batchID":16,"error":"failed to get upstreams: failed to get http/upstreams: Get \"http://nginx-plus-api/api/9/http/upstreams\": dial unix /var/run/nginx/nginx-plus-api.sock: connect: no such file or directory","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).updateUpstreamServers\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:357\ngithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:204\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"}

Acceptance criteria:

  • Investigate further and identify any bugs

Activity

added this to the v1.3.0 milestone on Jun 3, 2024

salonichf5 (Contributor) commented on Jun 3, 2024

Seeing this error in the pipeline as well

Error: [FAILED] 2024/06/03 19:56:02 [error] 56#56: *8 no live upstreams while connecting to upstream, client: 127.0.0.1, server: cafe.example.com, request: "GET /tea HTTP/2.0", upstream: "http://graceful-recovery_tea_80/tea", host: "cafe.example.com:10443"
  Expected
      <string>: 2024/06/03 19:56:02 [error] 56#56: *8 no live upstreams while connecting to upstream, client: 127.0.0.1, server: cafe.example.com, request: "GET /tea HTTP/2.0", upstream: "http://graceful-recovery_tea_80/tea", host: "cafe.example.com:10443"
  to contain substring
      <string>: connect() failed (111: Connection refused)

salonichf5 (Contributor) commented on Jun 3, 2024

I have handled this error case in my current PR. We should update the bug description to avoid confusion:

{"level":"error","ts":"2024-05-22T21:15:18Z","logger":"usageReporter","msg":"Usage reporting must be enabled when using NGINX Plus; redeploy with usage reporting enabled","error":"usage reporting not enabled","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.createUsageWarningJob.func1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/manager.go:601\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tpkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tpkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tpkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tpkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tpkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/backoff.go:259\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables.(*CronJob).Start\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/runnables/cronjob.go:53\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/manager/runnable_group.go:226"}

pleshakov (Contributor, Author) commented on Jun 4, 2024

> Seeing this error in the pipeline as well

@salonichf5

That could happen if NGF updated NGINX with upstream servers that are not ready yet.

We do have a check to counter it here:
https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L267
but it doesn't fully counter it, as another error can appear.

Moreover, I don't think countering it is the right thing to do -- I think we have a bug in either the NGF code or the test code.

Note what we do in the test -- https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/graceful_recovery_test.go#L142-L153 -- we redeploy the app and then send requests to NGINX.

The fact that we call Expect(resourceManager.WaitForAppsToBeReady(ns.Name)).To(Succeed()), combined with the readiness probe enabled in the backend pods -- https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/suite/manifests/graceful-recovery/cafe.yaml#L57-L61 -- should have prevented unready endpoints from being propagated to NGINX upstreams in the first place.
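
For reference, the kind of readiness filtering involved when translating EndpointSlices into upstream servers looks roughly like the sketch below (illustrative, not NGF's actual code). Note that the EndpointSlice API says consumers should interpret a nil Ready condition as ready, which is one way an unready endpoint could in principle slip through:

package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// readyAddresses keeps only the addresses of endpoints whose Ready
// condition is true. Per the EndpointSlice API, a nil Ready should be
// interpreted as ready, so endpoints in an unknown state still pass.
func readyAddresses(slice discoveryv1.EndpointSlice) []string {
	var addrs []string
	for _, ep := range slice.Endpoints {
		if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
			continue
		}
		addrs = append(addrs, ep.Addresses...)
	}
	return addrs
}

func main() {
	ready, notReady := true, false
	slice := discoveryv1.EndpointSlice{
		Endpoints: []discoveryv1.Endpoint{
			{Addresses: []string{"10.0.0.1"}, Conditions: discoveryv1.EndpointConditions{Ready: &ready}},
			{Addresses: []string{"10.0.0.2"}, Conditions: discoveryv1.EndpointConditions{Ready: &notReady}},
		},
	}
	fmt.Println(readyAddresses(slice)) // prints [10.0.0.1]
}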

So:

(1)
We could have a bug in the code that propagates unready endpoints to NGINX upstreams.

(2)
Or resourceManager.WaitForAppsToBeReady(ns.Name) doesn't work as expected.

Or it's something else entirely.

To test (1), I suggest deploying the cafe example https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/cafe-example and configuring the tea pod spec with a readiness probe that sends requests to a non-existent port (like 1234), so that the pod never becomes ready:

      containers:
      - name: tea
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 1234 # nothing listens on this port, so the probe always fails

and then check what config NGF generates and whether it includes any tea endpoints. Since the test failed with Plus, let's check it with the Plus image.

Regarding (2), I suspect we have a race condition in https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L311, as it calls https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L312, specifically here: https://github.com/nginxinc/nginx-gateway-fabric/blob/38b5498c1008f13bed0811dbbad978a96e3eb827/tests/framework/resourcemanager.go#L335

When we get the pod list, the pods might not have been created yet, so we get 0 ready pods and the function quickly returns success.
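
In other words, the current poll condition has roughly this shape (a condensed paraphrase for illustration, not a verbatim copy; isPodReady stands in for the inline readiness check):

// With zero pods listed, the loop body never runs, the condition is
// vacuously true, and the wait returns success before the pods exist.
func(ctx context.Context) (bool, error) {
	var podList core.PodList
	if err := rm.K8sClient.List(ctx, &podList, client.InNamespace(namespace)); err != nil {
		return false, err
	}
	for _, pod := range podList.Items {
		if !isPodReady(pod) { // hypothetical helper for the inline check
			return false, nil
		}
	}
	return true, nil // reached immediately when len(podList.Items) == 0
}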

To test this theory, we can create separate functions like the ones below that check the pod count:

func (rm *ResourceManager) WaitForAppsToBeReadyWithPodCount(namespace string, podCount int) error {
	ctx, cancel := context.WithTimeout(context.Background(), rm.TimeoutConfig.CreateTimeout)
	defer cancel()

	return rm.WaitForAppsToBeReadyWithCtxWithPodCount(ctx, namespace, podCount)
}

func (rm *ResourceManager) WaitForAppsToBeReadyWithCtxWithPodCount(ctx context.Context, namespace string, podCount int) error {
	if err := rm.WaitForPodsToBeReadyWithCount(ctx, namespace, podCount); err != nil {
		return err
	}

	if err := rm.waitForHTTPRoutesToBeReady(ctx, namespace); err != nil {
		return err
	}

	if err := rm.waitForGRPCRoutesToBeReady(ctx, namespace); err != nil {
		return err
	}

	return rm.waitForGatewaysToBeReady(ctx, namespace)
}

// WaitForPodsToBeReadyWithCount waits for the given number of Pods in the
// specified namespace to be ready or until the provided context is cancelled.
func (rm *ResourceManager) WaitForPodsToBeReadyWithCount(ctx context.Context, namespace string, count int) error {
	return wait.PollUntilContextCancel(
		ctx,
		500*time.Millisecond,
		true, /* poll immediately */
		func(ctx context.Context) (bool, error) {
			var podList core.PodList
			if err := rm.K8sClient.List(ctx, &podList, client.InNamespace(namespace)); err != nil {
				return false, err
			}

			var podsReady int
			for _, pod := range podList.Items {
				for _, cond := range pod.Status.Conditions {
					if cond.Type == core.PodReady && cond.Status == core.ConditionTrue {
						podsReady++
					}
				}
			}

			return podsReady == count, nil
		},
	)
}

and update the graceful recovery test to call those functions.

So instead of

	Expect(resourceManager.WaitForAppsToBeReady(ns.Name)).To(Succeed())

do

	Expect(resourceManager.WaitForAppsToBeReadyWithPodCount(ns.Name, 2)).To(Succeed())

salonichf5 (Contributor) commented on Jun 4, 2024

Could not reproduce the error above, but found a race condition in the graceful recovery tests, so I'm closing this. Opened another PR.
