Same here - latest Psalm + the plugin, Symfony 5.4 LTS: all autoloaded code (including controllers) is marked as UnusedClass.
uname -a
Linux <redacted> 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
blackbox_exporter --version
blackbox_exporter, version 0.23.0 (branch: HEAD, revision: 26fc98b9c6db21457653ed752f34d1b7fb5bba43)
build user: root@f360719453e3
build date: 20221202-12:26:32
go version: go1.19.3
platform: linux/amd64
Here is the output with &debug=true added to the probe URL.
Logs for the probe:
ts=2023-03-19T20:59:46.593442215Z caller=main.go:181 module=dns_redacted_A target=ns1.redacted level=info msg="Beginning probe" probe=dns timeout_seconds=9.5
ts=2023-03-19T20:59:46.593527928Z caller=dns.go:200 module=dns_redacted_A target=ns1.redacted level=info msg="Resolving target address" target=ns1.redacted ip_protocol=ip6
ts=2023-03-19T20:59:46.59856768Z caller=dns.go:200 module=dns_redacted_A target=ns1.redacted level=info msg="Resolved target address" target=ns1.redacted ip=redacted
ts=2023-03-19T20:59:46.59858928Z caller=handler.go:117 module=dns_redacted_A target=ns1.redacted level=info msg="Making DNS query" target=redacted:53 dial_protocol=udp4 query=redacted type=1 class=1
ts=2023-03-19T20:59:56.094248741Z caller=handler.go:117 module=dns_redacted_A target=ns1.redacted level=error msg="Error while sending a DNS query" err="read udp4 redacted:55614->redacted:53: i/o timeout"
ts=2023-03-19T20:59:56.094292173Z caller=main.go:181 module=dns_redacted_A target=ns1.redacted level=error msg="Probe failed" duration_seconds=9.500818867
Metrics that would have been returned:
# HELP probe_dns_additional_rrs Returns number of entries in the additional resource record list
# TYPE probe_dns_additional_rrs gauge
probe_dns_additional_rrs 0
# HELP probe_dns_answer_rrs Returns number of entries in the answer resource record list
# TYPE probe_dns_answer_rrs gauge
probe_dns_answer_rrs 0
# HELP probe_dns_authority_rrs Returns number of entries in the authority resource record list
# TYPE probe_dns_authority_rrs gauge
probe_dns_authority_rrs 0
# HELP probe_dns_duration_seconds Duration of DNS request by phase
# TYPE probe_dns_duration_seconds gauge
probe_dns_duration_seconds{phase="connect"} 4.5651e-05
probe_dns_duration_seconds{phase="request"} 9.495595939
probe_dns_duration_seconds{phase="resolve"} 0.005040202
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.005040202
# HELP probe_dns_query_succeeded Displays whether or not the query was executed successfully
# TYPE probe_dns_query_succeeded gauge
probe_dns_query_succeeded 0
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 9.500818867
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.682327368e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0
Module configuration:
prober: dns
timeout: 21s
http:
    ip_protocol_fallback: true
    follow_redirects: true
    enable_http2: true
tcp:
    ip_protocol_fallback: true
icmp:
    ip_protocol_fallback: true
    ttl: 64
dns:
    ip_protocol_fallback: true
    query_name: redacted
    query_type: A
    recursion_desired: true
    validate_answer_rrs:
        fail_if_not_matches_regexp:
        - <redacted>
Note that the timeout is set to 21s, yet the prober times out after just 9.5s.
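If this probe is scraped by Prometheus, one possible explanation for the 9.5s is that blackbox_exporter caps the effective probe timeout at the value of the X-Prometheus-Scrape-Timeout-Seconds header minus its --timeout-offset (0.5s by default), so a default scrape_timeout of 10s would yield exactly 9.5s regardless of the 21s module timeout. A minimal sketch of a scrape config that raises that cap, assuming Prometheus defaults elsewhere (the job name, target list and exporter address are hypothetical; the module name is the one from the logs above):

scrape_configs:
  - job_name: blackbox-dns            # hypothetical job name
    metrics_path: /probe
    scrape_timeout: 25s               # must be at least module timeout + 0.5s offset for the 21s to take effect
    params:
      module: [dns_redacted_A]
    static_configs:
      - targets: [ns1.redacted]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # wherever blackbox_exporter listens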
@stayallive oh wow, I didn't know about JSON_PARTIAL_OUTPUT_ON_ERROR yet; it would indeed suit better.
One suggested solution might look like:
public static function encode($data, int $options = 0, int $maxDepth = 512): string
{
    if ($maxDepth < 1) {
        throw new \InvalidArgumentException('The $maxDepth argument must be an integer greater than 0.');
    }

    $options |= \JSON_UNESCAPED_UNICODE | \JSON_INVALID_UTF8_SUBSTITUTE;

    // Replace anything nested deeper than the limit (including circular
    // references, which would otherwise exceed any depth) before encoding.
    $normalised = self::truncate($data, 0, $maxDepth - 1);

    $encodedData = json_encode($normalised, $options, $maxDepth);

    if (\JSON_ERROR_NONE !== json_last_error()) {
        throw new JsonException(sprintf('Could not encode value into JSON format. Error was: "%s".', json_last_error_msg()));
    }

    return $encodedData;
}

private static function truncate($data, int $depth, int $maxDepth)
{
    if ($depth >= $maxDepth) {
        return 'max serialisation depth reached';
    }

    if (!is_array($data)) {
        return $data;
    }

    foreach ($data as $k => $v) {
        // Unset before reassigning so that reference entries (e.g. a key
        // pointing back at the array itself) are broken in the copy instead
        // of being written through to the original value.
        unset($data[$k]);
        $data[$k] = self::truncate($v, $depth + 1, $maxDepth);
    }

    return $data;
}
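For illustration, a usage sketch against the reproduction below (the tiny $maxDepth of 3 is purely hypothetical, to keep the output short):

$data = ['foo' => 'bar'];
$data['ref'] = &$data; // circular reference, as in the reproduction below

// With the truncation in place this no longer throws; levels beyond the
// limit are replaced with the marker string instead, e.g.:
// {"foo":"bar","ref":{"foo":"max serialisation depth reached","ref":"max serialisation depth reached"}}
echo \Sentry\Util\JSON::encode($data, 0, 3);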
Sentry Saas (sentry.io)
3.13.1
$data = [
    'foo' => 'bar',
];
$data['ref'] = &$data;

$event = \Sentry\Event::createEvent();
$event->setExtra([
    'data' => $data,
]);

\Sentry\captureEvent($event);
Event reported to the Sentry server.
No event is sent at all.
It happens because the Sentry\Util\JSON::encode method throws.
Capturing anything would be better; at the moment no event is sent to the server at all.
I'm keen to create a PR with an approach similar to monolog/monolog's, where a too-deeply-nested key is substituted with a "reached max level" message.
Sentry Saas (sentry.io)
3.13.1
$serializer is not used. It should be used, or documented as "kept for historical reasons" or something along those lines.
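If the "document it" route is taken, a minimal sketch of such a note (the property name is the one from this issue; the wording and placement are hypothetical):

/**
 * Kept for historical / backward-compatibility reasons; currently not used
 * by this class.
 */
private $serializer;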
Not stale.
/remove-lifecycle stale
And during another failover it's here again.
The fix works as expected:
postgres=# SELECT
CASE
WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() = false) THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replay_time
, CASE
WHEN pg_is_in_recovery() = false THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS received_time;
replay_time | received_time
-------------+---------------
0 | 0
(1 row)
Other stats:
postgres=# SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replay_time
, EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
AS received_time
;
replay_time | received_time
-------------+---------------
3968 | 3968
(1 row)
postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_receive_lsn();
pg_last_wal_receive_lsn | pg_last_wal_receive_lsn
-------------------------+-------------------------
15E/15704000 | 15E/15704000
(1 row)
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
Failover logs:
2023-01-29 20:50:05,257 INFO: no action. I am (db-db-64qb-0), a secondary, and following a leader (db-db-ndwz-0)
2023-01-29 20:50:15,316 INFO: no action. I am (db-db-64qb-0), a secondary, and following a leader (db-db-ndwz-0)
2023-01-29 20:50:25,337 INFO: no action. I am (db-db-64qb-0), a secondary, and following a leader (db-db-ndwz-0)
2023-01-29 20:50:32,603 WARNING: Request failed to db-db-ndwz-0: GET https://db-db-ndwz-0.db-pods:8008/patroni (HTTPSConnectionPool(host='db-db-ndwz-0.db-pods', port=8008): Max retries exceeded with url: /patroni (Caused by ReadTimeoutError("HTTPSConnectionPool(host='db-db-ndwz-0.db-pods', port=8008): Read timed out. (read timeout=2)",)))
server signaled
2023-01-29 20:50:32,823 INFO: promoted self to leader by acquiring session lock
server promoting
2023-01-29 20:50:32,861 INFO: cleared rewind state after becoming the leader
2023-01-29 20:50:32,858 INFO: Lock owner: db-db-64qb-0; I am db-db-64qb-0
2023-01-29 20:50:32,997 INFO: updated leader lock during promote
2023-01-29 20:50:37,373 INFO: no action. I am (db-db-64qb-0), the leader with the lock
2023-01-29 20:50:39,503 INFO: Lock owner: db-db-64qb-0; I am db-db-64qb-0
2023-01-29 20:50:39,530 INFO: Assigning synchronous standby status to ['db-db-8p9d-0']
server signaled
2023-01-29 20:50:39,604 INFO: Synchronous standby status assigned to ['db-db-8p9d-0']
2023-01-29 20:50:39,666 INFO: no action. I am (db-db-64qb-0), the leader with the lock
2023-01-29 20:50:50,331 INFO: no action. I am (db-db-64qb-0), the leader with the lock
2023-01-29 20:51:00,153 INFO: no action. I am (db-db-64qb-0), the leader with the lock
2023-01-29 20:51:10,018 INFO: no action. I am (db-db-64qb-0), the leader with the lock
Yep, I understand what it means - I'm just wondering why it's used as a criterion. The PostgreSQL documentation defines pg_last_wal_receive_lsn and pg_last_wal_replay_lsn really poorly for nodes that are acting as master. Why is it used instead of pg_is_in_recovery?
The role label is already present; it's added by the scraper, I think. This is a metric I already have without changing anything:
ccp_replication_lag_replay_time{deployment="db-db-5hnc", instance="10.51.0.83:9187", ip="10.51.0.83", job="crunchy-postgres-exporter", kubernetes_namespace="<redacted>", pg_cluster="<redacted>", pod="db-db-5hnc-0", role="replica", server="localhost:5432"}
It's not consistent: sometimes it happens, sometimes not. For the 14.5 -> 14.6 upgrade it happened with a higher chance, but most often failover goes fine too. So it's not after every failover.
after using these metrics for quite a long time to monitor replication.
What expression do you use to trigger the alert? E.g., does it not exclude role="master", for instance?
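For instance, an expression of roughly this shape (the 5-minute threshold is purely hypothetical; the label names are as in the metric shown above) would not fire for the primary:

ccp_replication_lag_replay_time{role!="master"} > 300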
so it doesn't clear until some writes occur
It's totally not the case: I have observed it on write-heavy databases too. I'm not a professional PostgreSQL administrator, but I suspect it's the ccp_replication_lag query that misuses the pg_last_wal_receive_lsn and pg_last_wal_replay_lsn functions (or those functions are broken in PostgreSQL 14.x).
Okay, during a 14.5 -> 14.6 upgrade I have another cluster in the same state.
Queries from the master:
postgres=# SELECT
postgres-# CASE
postgres-# WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# END
postgres-# AS replay_time
postgres-# , EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# AS received_time;
replay_time | received_time
-------------+---------------
3859 | 3859
(1 row)
postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
pg_last_wal_receive_lsn | pg_last_wal_replay_lsn
-------------------------+------------------------
12A/50E80000 | 12A/51000000
(1 row)
postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
pg_last_wal_receive_lsn | pg_last_wal_replay_lsn
-------------------------+------------------------
12A/50E80000 | 12A/51000000
(1 row)
postgres=# SELECT pg_is_in_recovery()
postgres-# ;
pg_is_in_recovery
-------------------
f
(1 row)
And from one of the replicas:
postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
pg_last_wal_receive_lsn | pg_last_wal_replay_lsn
-------------------------+------------------------
12A/60000000 | 12A/60000000
(1 row)
postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
pg_last_wal_receive_lsn | pg_last_wal_replay_lsn
-------------------------+------------------------
12A/60000000 | 12A/60000000
(1 row)
postgres=# SELECT pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
postgres=# SELECT
postgres-# CASE
postgres-# WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# END
postgres-# AS replay_time
postgres-# , EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# AS received_time
postgres-# ;
replay_time | received_time
-------------+---------------
0 | 127
(1 row)
Shouldn't the ccp_replication_lag query use pg_is_in_recovery() in the CASE condition instead of pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()?
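A sketch of what I mean for the replay_time branch (my suggestion, not the query the exporter currently ships):

SELECT CASE
         WHEN pg_is_in_recovery() = false THEN 0
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
       END AS replay_time;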
If it happens again, can you run that last query while the exporter is reporting a replay_time value? If the database itself is also reporting a value here, then it's not the exporter reporting erroneous data.
Isn't it there in my previous post already?
postgres=# SELECT
postgres-# CASE
postgres-# WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# END
postgres-# AS replay_time
postgres-# , EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# AS received_time
postgres-# ;
replay_time | received_time
-------------+---------------
1402 | 1402
(1 row)
Okay, it happened again (this time during a 14.5 -> 14.6 rollout).
Answering all your questions (I'm sorry I didn't do a proper investigation prior to the report):
what do you see in the Pod logs for the primary when this occurs, e.g. is the HA system (Patroni) logging anything?
2023-01-24 20:30:25,283 INFO: no action. I am (db-db-lqkx-0), a secondary, and following a leader (db-db-8xhw-0)
2023-01-24 20:30:32,952 INFO: no action. I am (db-db-lqkx-0), a secondary, and following a leader (db-db-8xhw-0)
2023-01-24 20:30:43,015 INFO: no action. I am (db-db-lqkx-0), a secondary, and following a leader (db-db-8xhw-0)
2023-01-24 20:30:52,972 INFO: no action. I am (db-db-lqkx-0), a secondary, and following a leader (db-db-8xhw-0)
2023-01-24 20:31:02,441 INFO: no action. I am (db-db-lqkx-0), a secondary, and following a leader (db-db-8xhw-0)
/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:1052: InsecureRequestWarning: Unverified HTTPS request is being made to host 'db-db-f9tf-0.db-pods'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.ht
InsecureRequestWarning,
2023-01-24 20:31:05,041 INFO: Got response from db-db-f9tf-0 https://db-db-f9tf-0.db-pods:8008/patroni: {"state": "running", "postmaster_start_time": "2023-01-24 20:26:32.768802+00:00", "role": "replica", "server_version": 140006, "xlog": {"received_location": 69966364672, "replaye
2023-01-24 20:31:05,181 INFO: Cleaning up failover key after acquiring leader lock...
server signaled
2023-01-24 20:31:05,299 INFO: promoted self to leader by acquiring session lock
server promoting
2023-01-24 20:31:05,323 INFO: cleared rewind state after becoming the leader
2023-01-24 20:31:06,468 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:31:06,653 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:31:17,084 INFO: Lock owner: db-db-lqkx-0; I am db-db-lqkx-0
2023-01-24 20:31:17,103 INFO: Assigning synchronous standby status to ['db-db-f9tf-0']
server signaled
2023-01-24 20:31:17,221 INFO: Synchronous standby status assigned to ['db-db-f9tf-0']
2023-01-24 20:31:17,270 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:31:27,154 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:31:37,119 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:31:47,181 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:31:57,103 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:32:07,194 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:32:17,102 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
2023-01-24 20:32:27,120 INFO: no action. I am (db-db-lqkx-0), the leader with the lock
And then db-db-lqkx-0 has the master role and acts as a master, yet has a replication lag metric exported.
Curious, do the ccp_replication_lag_size_bytes or ccp_replication_lag_received_time metrics also start reporting any lag?
These are the metrics from a pod with the misbehaving exporter:
# TYPE ccp_replication_lag_size_bytes gauge
ccp_replication_lag_size_bytes{replica="10.31.0.144",replica_hostname="",replica_port="53868",server="localhost:5432"} 0
ccp_replication_lag_size_bytes{replica="10.31.2.195",replica_hostname="",replica_port="39438",server="localhost:5432"} 0
# TYPE ccp_replication_lag_received_time gauge
ccp_replication_lag_received_time{server="localhost:5432"} 1278
# TYPE ccp_replication_lag_replay_time gauge
ccp_replication_lag_replay_time{server="localhost:5432"} 1278
Curious if you can look at the actual systems themselves and see what values are being reported by running the queries directly next time this happens. You can see the queries here:
psql (14.6)
Type "help" for help.
postgres=# SELECT
postgres-# CASE
postgres-# WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# END
postgres-# AS replay_time
postgres-# , EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
postgres-# AS received_time
postgres-# ;
replay_time | received_time
-------------+---------------
1402 | 1402
(1 row)
postgres=# SELECT client_addr as replica
postgres-# , client_hostname as replica_hostname
postgres-# , client_port as replica_port
postgres-# , pg_wal_lsn_diff(sent_lsn, replay_lsn) as bytes
postgres-# FROM pg_catalog.pg_stat_replication
postgres-# ;
replica | replica_hostname | replica_port | bytes
-------------+------------------+--------------+-------
10.31.0.144 | | 53868 | 0
10.31.2.195 | | 39438 | 0
(2 rows)
And is it just the pod with the exporter that you kill, or are you also doing any sort of restarts/rebuilds on any of the database systems themselves when this happens?
I just delete the current master pod and let Kubernetes reschedule it somewhere else; more often than not it starts up quickly enough to be re-elected as master again. After that it behaves fine.
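Concretely, something along these lines (the namespace is a placeholder):

kubectl delete pod db-db-lqkx-0 -n <namespace>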
I have just killed db-db-lqkx-0 and within several seconds it was up and running, but this time another replica picked up the master role.
Now, after the restart, if I run the first query on the new master I get this:
postgres=# SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replay_time
, EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
AS received_time;
replay_time | received_time
-------------+---------------
0 | 149
The first column is constantly 0 and the second grows monotonically (but this DB has almost no traffic, if any at all, so no clients read from or write to it).
Describe the issue: sometimes, during an intermittent short network hiccup, the exporter starts exporting replication-lag-related metrics as if the pod were a replica rather than the master (while it certainly is the master).
Here is an example of a metric with its corresponding labels:
ccp_replication_lag_replay_time{deployment="db-db-j555", instance="10.51.7.155:9187", ip="10.51.7.155", job="crunchy-postgres-exporter", kubernetes_namespace="<redacted>", pg_cluster="<redacted>:db", pod="db-db-j555-0", role="master", server="localhost:5432"}
At the time of writing this message it had a constantly, monotonically growing value (over 1000 seconds).
I resolve it by just killing the pod.
Describe the expected behavior: the exporter should watch the node status more carefully.
Tell us about your environment:
registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0
Not stale.
It is still relevant.