Postgres cluster failing over almost daily #1163

Open
@boris-savic

Hi there,

We've been using this project in production for almost 6 months, and we've noticed that the database cluster fails over almost daily, sometimes multiple times per day.

Our hardware nodes are not failing, nor were they restarted. We only noticed this recently because some of our UPDATE queries failed when the node the client was connected to switched to read-only mode.

Below is the output of patronictl history:

+----+--------------+------------------------------+---------------------------+
| TL |          LSN |            Reason            |         Timestamp         |
+----+--------------+------------------------------+---------------------------+
| 40 | 138415350016 | no recovery target specified | 2020-09-19T10:31:39+00:00 |
| 41 | 147132529184 | no recovery target specified | 2020-09-29T13:21:39+00:00 |
| 42 | 147174790504 | no recovery target specified | 2020-09-29T14:56:34+00:00 |
| 43 | 147180413280 | no recovery target specified | 2020-09-29T15:12:55+00:00 |
| 44 | 148553492152 | no recovery target specified | 2020-10-01T06:46:56+00:00 |
| 45 | 148556602160 | no recovery target specified | 2020-10-01T06:54:06+00:00 |
| 46 | 148560542520 | no recovery target specified | 2020-10-01T07:15:46+00:00 |
| 47 | 148563980512 | no recovery target specified | 2020-10-01T07:31:26+00:00 |
| 48 | 148567915040 | no recovery target specified | 2020-10-01T07:40:18+00:00 |
| 49 | 148570174560 | no recovery target specified | 2020-10-01T07:40:50+00:00 |
| 50 | 148574039736 | no recovery target specified | 2020-10-01T07:48:40+00:00 |
| 51 | 148577000976 | no recovery target specified | 2020-10-01T07:58:09+00:00 |
| 52 | 148579532488 | no recovery target specified | 2020-10-01T07:59:21+00:00 |
| 53 | 148581642320 | no recovery target specified | 2020-10-01T08:00:33+00:00 |
| 54 | 149048863192 | no recovery target specified | 2020-10-01T22:15:07+00:00 |
| 55 | 151318348672 | no recovery target specified | 2020-10-04T17:59:06+00:00 |
| 56 | 151333603376 | no recovery target specified | 2020-10-04T18:58:56+00:00 |
| 57 | 151336409800 | no recovery target specified | 2020-10-04T19:04:45+00:00 |
| 58 | 151339036912 | no recovery target specified | 2020-10-04T19:07:05+00:00 |
| 59 | 151340848080 | no recovery target specified | 2020-10-04T19:07:37+00:00 |
| 60 | 152692595768 | no recovery target specified | 2020-10-06T10:46:46+00:00 |
| 61 | 153344788488 | no recovery target specified | 2020-10-07T06:12:16+00:00 |
| 62 | 154048413760 | no recovery target specified | 2020-10-08T03:40:06+00:00 |
| 63 | 154385293904 | no recovery target specified | 2020-10-08T13:08:26+00:00 |
| 64 | 154387331144 | no recovery target specified | 2020-10-08T13:10:40+00:00 |
+----+--------------+------------------------------+---------------------------+
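To put a number on "almost daily", the history table can be tallied per calendar day. This is only a rough sketch: it assumes the patronictl history output has been saved as text in exactly the layout shown above, and the helper name `count_switches_per_day` is made up for illustration. The sample uses a few rows copied from the table.

```python
from collections import Counter
from datetime import datetime


def count_switches_per_day(history_text):
    """Count timeline switches per calendar day from history table text.

    Hypothetical helper: expects the four-column table layout shown above.
    """
    per_day = Counter()
    for line in history_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have four cells and a numeric timeline (TL) in the first one;
        # border rows and the header row are skipped.
        if len(cells) != 4 or not cells[0].isdigit():
            continue
        timestamp = datetime.fromisoformat(cells[3])
        per_day[timestamp.date().isoformat()] += 1
    return dict(per_day)


# A few rows copied from the table above.
sample = """\
+----+--------------+------------------------------+---------------------------+
| TL |          LSN |            Reason            |         Timestamp         |
+----+--------------+------------------------------+---------------------------+
| 52 | 148579532488 | no recovery target specified | 2020-10-01T07:59:21+00:00 |
| 53 | 148581642320 | no recovery target specified | 2020-10-01T08:00:33+00:00 |
| 54 | 149048863192 | no recovery target specified | 2020-10-01T22:15:07+00:00 |
| 55 | 151318348672 | no recovery target specified | 2020-10-04T17:59:06+00:00 |
+----+--------------+------------------------------+---------------------------+
"""

print(count_switches_per_day(sample))
```

Running the same function over the full table shows which days had clusters of timeline switches, which may help line them up with pod or operator logs from the same time window.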

Our cluster configuration YAML:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: cluster-db 
  namespace: database  
spec:
  dockerImage: registry.opensource.zalan.do/acid/spilo-12:1.6-p2
  teamId: "cluster" 
  volume:
    size: 50Gi
  numberOfInstances: 3
  postgresql: 
    version: "12"       
    parameters:
      max_connections: "250"
  tolerations:
  - key: postgres
    operator: Exists
    effect: NoSchedule

Is there a way to fix this?
