jacksontj (Repos: 111, Followers: 68)

Events

Merge pull request #21 from wish/jyewrd

fix

add metrics

update

Revert "update"

This reverts commit f93af050886f9547f90cbadfdc84a207f21410e1.

update

update metric name

Merge pull request #22 from wish/mjudsr

add metrics

Add nohint plugin

Created at 5 days ago

add metrics

update

Revert "update"

This reverts commit f93af050886f9547f90cbadfdc84a207f21410e1.

update

update metric name

Add nohint plugin

Created at 5 days ago

Add nohint plugin

Created at 5 days ago

closed issue
Queries fail immediately when all targets in one group are down

I have Promxy set up with two server groups, each containing two VictoriaMetrics targets. One goal of this setup is that the user doesn't need to know which of the target groups contains the data they're looking for; the other is that for some metrics it's possible to provide some redundancy across both groups.

However, Promxy effectively goes down if all targets in one group go down: all queries fail immediately. A Grafana dashboard will fail to load even if it doesn't need any data from the unavailable group. This feels unnecessary.

The culprit might be the requiredCount in MultiAPI, which is designed to fail as early as possible:

				// If there aren't enough outstanding requests to possibly succeed, no reason to wait
				if (outstandingRequests[ret.ls] + successMap[ret.ls]) < m.requiredCount {
					return nil, warnings.Warnings(), ret.err
				}

and

	// Verify that we hit the requiredCount for all of the buckets
	for k := range outstandingRequests {
		if successMap[k] < m.requiredCount {
			return nil, warnings.Warnings(), errors.Wrap(lastError, "Unable to fetch from downstream servers")
		}
	}

It would be nice to be able to configure this requiredCount, or to otherwise influence this behaviour.
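
For illustration only, here is a minimal standalone sketch of what a configurable threshold could look like. The names below (MultiClient, MinSuccess, fetchOne) are hypothetical and are not promxy's actual types or API:

package main

import (
	"context"
	"errors"
	"fmt"
)

// MultiClient is a hypothetical fan-out client: it queries every target and
// succeeds once MinSuccess of them have answered, instead of a hard-coded
// requiredCount.
type MultiClient struct {
	targets    []string
	MinSuccess int
}

func (m *MultiClient) Query(ctx context.Context, q string) ([]string, error) {
	var (
		results []string
		lastErr error
		success int
	)
	for i, t := range m.targets {
		r, err := fetchOne(ctx, t, q)
		if err != nil {
			lastErr = err
			// Early exit, mirroring the quoted check: if the targets left
			// can no longer bring us up to MinSuccess, stop immediately.
			remaining := len(m.targets) - i - 1
			if success+remaining < m.MinSuccess {
				return nil, fmt.Errorf("unable to fetch from downstream servers: %w", lastErr)
			}
			continue
		}
		results = append(results, r...)
		success++
	}
	if success < m.MinSuccess {
		if lastErr == nil {
			lastErr = errors.New("not enough targets configured")
		}
		return nil, fmt.Errorf("unable to fetch from downstream servers: %w", lastErr)
	}
	return results, nil
}

// fetchOne stands in for an actual HTTP query against one downstream target.
func fetchOne(ctx context.Context, target, q string) ([]string, error) {
	_ = ctx
	return []string{fmt.Sprintf("%s -> result for %q", target, q)}, nil
}

func main() {
	m := &MultiClient{targets: []string{"vm-1:8428", "vm-2:8428"}, MinSuccess: 1}
	res, err := m.Query(context.Background(), "up")
	fmt.Println(res, err)
}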

Alternatively, if the intended setup is a single server group covering both groups of targets, with one group acting as a (partial) fallback for the other, maybe this could be spelled out in the docs, for instance after this line in the README:

A ServerGroup is a set of prometheus hosts configured the same.

Created at 6 days ago
issue comment
Queries fail immediately when all targets in one group are down

FWIW I've been using promxy for alerting since ~2018 without major issue -- so definitely an intended use-case. The main driver for that was to support global aggregate alerts (e.g. error rate across all shards, latency across all shards, etc.).

It sounds like for this issue we're all buttoned up, so I'll go ahead and close this out. If you have any other issues please feel free to re-open or create another issue!

Created at 6 days ago
delete branch
jacksontj delete branch cleanup_trace
Created at 1 week ago

Remove legacy clroot tracing from mongoengine

this tracing is long dead

Ref https://github.com/ContextLogic/clroot/pull/66352

Created at 1 week ago
pull request closed
Remove legacy clroot tracing from mongoengine

this tracing is long dead

Ref https://github.com/ContextLogic/clroot/pull/66352

Created at 1 week ago
closed issue
relabel_configs doesn't do relabeling

My metrics look like this:

metricA{host="ip-172-16-25-14.eu-west-2.compute.internal", input="cloudwatch", sg="promxy"}
metricA{host="ip-172-16-25-15.eu-west-2.compute.internal", input="cloudwatch", sg="promxy"}

What I want to do is get rid of the second, duplicate metric, which comes in through a different Telegraf output; since its host label is different, the series appears distinct and is therefore not deduplicated.

The problem is that I need to relabel host only when another label, input, is set to "cloudwatch". My relabel_configs therefore looks like this:

relabel_configs:
   - source_labels: [input]
     regex: "cloudwatch"
     target_label: host
     replacement: "telegraf"
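
To make the intended semantics concrete, here is a standalone Go sketch of the behaviour I expect from that rule (plain Go for illustration, not promxy or Prometheus library code; relabelHost is a hypothetical helper):

package main

import (
	"fmt"
	"regexp"
)

// relabelHost mirrors the rule above: if the "input" label matches
// "cloudwatch", overwrite "host" with "telegraf" so the two series become
// identical and can be deduplicated.
func relabelHost(labels map[string]string) map[string]string {
	// Prometheus anchors relabel regexes, hence ^...$ here.
	cloudwatch := regexp.MustCompile(`^cloudwatch$`)
	if cloudwatch.MatchString(labels["input"]) {
		labels["host"] = "telegraf" // target_label: host, replacement: "telegraf"
	}
	return labels
}

func main() {
	series := map[string]string{
		"host":  "ip-172-16-25-14.eu-west-2.compute.internal",
		"input": "cloudwatch",
		"sg":    "promxy",
	}
	fmt.Println(relabelHost(series)) // host is rewritten to "telegraf"
}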

As described in https://github.com/jacksontj/promxy/issues/260, relabel_configs should go in the server_groups section of the promxy config. However, this doesn't work. My full promxy config looks like this:

global:
  scrape_interval: 10s
  external_labels:
    source: vm

promxy:
  server_groups:
    - static_configs:
        - targets:
            - 172.16.1.1:8428
            - 172.16.2.1:8428
      labels:
        sg: promxy
      anti_affinity: 10s
      remote_read: false
      query_params:
        nocache: 1
      scheme: http
      http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
      ignore_error: true

      relabel_configs:
        - source_labels: [input]
          regex: "cloudwatch"
          target_label: host
          replacement: "telegraf"

I tried to make sense of relabel_configs vs. metric_relabel_configs from https://github.com/jacksontj/promxy/issues/136, but still no luck. What am I doing wrong? Can the functionality I need be implemented?

Created at 1 week ago
issue comment
relabel_configs doesn't do relabeling

Closing as a dupe of #258

Created at 1 week ago
issue comment
Queries fail immediately when all targets in one group are down

Yeah, it's a bit of a philosophical question. For "plain" data fetching, missing data is maybe okay -- but if you are relying on this for alerting, you get a false sense of security. For example, if the alert were based on error rate and a bunch of nodes were missing metrics (because an SG was missing), the alert would stay green (because the data was missing) instead of going red (because it was unable to fetch).

Given that context, my decision was to have promxy err on the side of visibility here (since it is effectively an error fetching data). So I definitely caution against using ignore_error -- but in some use cases that is the desired behavior.

Created at 1 week ago
issue comment
No result when using relative time_range

This definitely seems like a unique use-case. Given that your issue is with the "shortterm" metrics missing, maybe it's some sort of caching issue in VM (https://github.com/jacksontj/promxy/blob/master/cmd/promxy/config.yaml#L54). If that doesn't solve it, we'd need some more debugging -- ideally a repro case, but failing that, a pcap or trace logs should be sufficient.

Created at 1 week ago
issue comment
Queries fail immediately when all targets in one group are down

@nemobis sorry, I mis-sent before -- I just finished editing the comment with the rest of the context :)

Created at 1 week ago
issue comment
Queries fail immediately when all targets in one group are down

Thanks for reaching out!

From reading over your issue it sounds like there may just be some misunderstanding/miscommunication around the servergroups.

As you quoted, a servergroup is a set of Prometheus hosts that are configured the same. So based on your description of 2 servergroups with 2 targets each, I expect you have 4 VM nodes in total:

  • SG1 - VM1
  • SG1 - VM2
  • SG2 - VM3
  • SG2 - VM4
Created at 1 week ago

Update to Alpine 3.16.2

Created at 1 week ago
pull request closed
Update to Alpine 3.16.2

@jacksontj Could I have a +1 here?

Created at 1 week ago
pull request opened
Remove legacy clroot tracing from mongoengine

this tracing is long dead

Ref https://github.com/ContextLogic/clroot/pull/66352

Created at 1 week ago
create branch
jacksontj create branch cleanup_trace
Created at 1 week ago
issue comment
Fix goroutine leak on config reload

@camathieu A friendly reminder here -- this should work after a rebase (to include the fix for CI) :)

Created at 1 month ago
pull request opened
Add nohint plugin
Created at 1 month ago
create branch
jacksontj create branch nohint
Created at 1 month ago