Hi there,
we've been using this project in production for almost 6 months now, and we've noticed that the database cluster fails over almost daily, sometimes multiple times per day.
Our hardware nodes are not failing, nor were they restarted. We only noticed this recently, when some of our UPDATE queries failed because the node the client was connected to had switched to read-only mode.
Below is the output of patronictl history:
+----+--------------+------------------------------+---------------------------+
| TL | LSN | Reason | Timestamp |
+----+--------------+------------------------------+---------------------------+
| 40 | 138415350016 | no recovery target specified | 2020-09-19T10:31:39+00:00 |
| 41 | 147132529184 | no recovery target specified | 2020-09-29T13:21:39+00:00 |
| 42 | 147174790504 | no recovery target specified | 2020-09-29T14:56:34+00:00 |
| 43 | 147180413280 | no recovery target specified | 2020-09-29T15:12:55+00:00 |
| 44 | 148553492152 | no recovery target specified | 2020-10-01T06:46:56+00:00 |
| 45 | 148556602160 | no recovery target specified | 2020-10-01T06:54:06+00:00 |
| 46 | 148560542520 | no recovery target specified | 2020-10-01T07:15:46+00:00 |
| 47 | 148563980512 | no recovery target specified | 2020-10-01T07:31:26+00:00 |
| 48 | 148567915040 | no recovery target specified | 2020-10-01T07:40:18+00:00 |
| 49 | 148570174560 | no recovery target specified | 2020-10-01T07:40:50+00:00 |
| 50 | 148574039736 | no recovery target specified | 2020-10-01T07:48:40+00:00 |
| 51 | 148577000976 | no recovery target specified | 2020-10-01T07:58:09+00:00 |
| 52 | 148579532488 | no recovery target specified | 2020-10-01T07:59:21+00:00 |
| 53 | 148581642320 | no recovery target specified | 2020-10-01T08:00:33+00:00 |
| 54 | 149048863192 | no recovery target specified | 2020-10-01T22:15:07+00:00 |
| 55 | 151318348672 | no recovery target specified | 2020-10-04T17:59:06+00:00 |
| 56 | 151333603376 | no recovery target specified | 2020-10-04T18:58:56+00:00 |
| 57 | 151336409800 | no recovery target specified | 2020-10-04T19:04:45+00:00 |
| 58 | 151339036912 | no recovery target specified | 2020-10-04T19:07:05+00:00 |
| 59 | 151340848080 | no recovery target specified | 2020-10-04T19:07:37+00:00 |
| 60 | 152692595768 | no recovery target specified | 2020-10-06T10:46:46+00:00 |
| 61 | 153344788488 | no recovery target specified | 2020-10-07T06:12:16+00:00 |
| 62 | 154048413760 | no recovery target specified | 2020-10-08T03:40:06+00:00 |
| 63 | 154385293904 | no recovery target specified | 2020-10-08T13:08:26+00:00 |
| 64 | 154387331144 | no recovery target specified | 2020-10-08T13:10:40+00:00 |
+----+--------------+------------------------------+---------------------------+
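To put a number on "almost daily", the history output can be tallied per calendar day. This is only a rough sketch: the function names are made up for illustration and are not part of Patroni, and only a few rows from the table above are pasted in (the full output parses the same way).

```python
from collections import Counter
from datetime import datetime

# A few rows copied from the history table above; paste the full output to tally everything.
HISTORY = """\
+----+--------------+------------------------------+---------------------------+
| TL | LSN | Reason | Timestamp |
+----+--------------+------------------------------+---------------------------+
| 52 | 148579532488 | no recovery target specified | 2020-10-01T07:59:21+00:00 |
| 53 | 148581642320 | no recovery target specified | 2020-10-01T08:00:33+00:00 |
| 54 | 149048863192 | no recovery target specified | 2020-10-01T22:15:07+00:00 |
| 55 | 151318348672 | no recovery target specified | 2020-10-04T17:59:06+00:00 |
+----+--------------+------------------------------+---------------------------+
"""


def parse_history(text):
    """Return (timeline, timestamp) tuples from the table rows (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Border lines and the header row are skipped: data rows have
        # exactly four cells and a numeric timeline in the first one.
        if len(cells) != 4 or not cells[0].isdigit():
            continue
        rows.append((int(cells[0]), datetime.fromisoformat(cells[3])))
    return rows


def switches_per_day(rows):
    """Count timeline switches per calendar day, using the dates as printed."""
    return Counter(ts.date().isoformat() for _, ts in rows)


rows = parse_history(HISTORY)
per_day = switches_per_day(rows)
for day, count in sorted(per_day.items()):
    print(day, count)
```

Each new timeline corresponds to one promotion, so a day with a high count is a good place to start looking in the Patroni and pod logs.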
Our cluster configuration yaml:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: cluster-db
  namespace: database
spec:
  dockerImage: registry.opensource.zalan.do/acid/spilo-12:1.6-p2
  teamId: "cluster"
  volume:
    size: 50Gi
  numberOfInstances: 3
  postgresql:
    version: "12"
    parameters:
      max_connections: "250"
  tolerations:
  - key: postgres
    operator: Exists
    effect: NoSchedule
Is there a way to fix this?