gouthamve
Repos: 160 · Followers: 461 · Following: 27

The Prometheus time series database layer.
★ 831 · Forks: 176

The Prometheus monitoring system and time series database.
★ 45715 · Forks: 7394

Contrib repository for the OpenTelemetry Collector
★ 1285 · Forks: 1070

Events

pull request opened
[otlp] Highlight common issues with OTLP --> Prometheus

Signed-off-by: Goutham Veeramachaneni gouthamve@gmail.com

Created at 5 hours ago
create branch
gouthamve create branch add-otlp-conversion-docs
Created at 5 hours ago
Span name for http server spans should include http.method.

The current spec defines the span name as the route, for example /users/:userID?.

But having the http.method in the span name helps a lot, imo: GET /users/:userID?.

Created at 1 day ago
OTLP/HTTP: Retry or no on status code 401, 403?

Also see #2993

Created at 1 day ago
Semantic Conventions: default to seconds for duration units

based on your comments, I think we should leave the specification for milliseconds as the conventional unit for HTTP durations

I don't yet see it, but I might be mistaken. From what I understood, this is the algorithm to pick the scale: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#exponential-bucket-histogram-aggregation (with the default of 160 buckets, for example).

If the typical requests are around 500ms-2000ms, then picking seconds as unit would mean more accuracy. If they are around 1-100ms, then picking milliseconds would be better. Or am I completely off-base here somehow?
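To make the trade-off concrete, here is a rough sketch of that scale-picking idea (my reading of the linked spec; `pick_scale` is a hypothetical helper, not SDK code). Because exponential buckets have a constant relative width, the scale that fits a value range depends essentially on the max/min ratio, not on the unit:

```python
import math

def pick_scale(min_val, max_val, max_buckets=160):
    # Exponential-bucket histograms index a value v into bucket
    # floor(log2(v) * 2**scale), i.e. base = 2**(2**-scale).
    # Pick the largest (most precise) scale whose bucket count for
    # [min_val, max_val] still fits within max_buckets.
    for scale in range(20, -11, -1):
        lo = math.floor(math.log2(min_val) * 2**scale)
        hi = math.floor(math.log2(max_val) * 2**scale)
        if hi - lo + 1 <= max_buckets:
            return scale
    return -10

# The same 500ms-2000ms latency range, expressed in seconds and in ms,
# ends up at the same scale (and hence the same relative accuracy):
print(pick_scale(0.5, 2.0))   # -> 6
print(pick_scale(500, 2000))  # -> 6
```

If this holds, the unit choice would mostly affect readability and compatibility rather than accuracy.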

Prometheus uses this unit, and the Statsd Timing measurement uses milliseconds as well.

Prometheus uses seconds unfortunately :(

Created at 1 day ago
issue comment
Prometheus query expression 0/0 and 1-0/0 returns 0

Hi, I'm sorry. I haven't tested this on 9.2.5. This definitely is a bug in Grafana, and it exists in 9.2.6 and 9.2.7 as well, unfortunately.

I can confirm it has been fixed in 9.3.0.

Created at 1 day ago
issue comment
Prometheus query expression 0/0 and 1-0/0 returns 0

Hi, what Prometheus backend are you using?

Because Prometheus returns NaN for 0/0 and also for 1 - 0/0, not 0. It looks like the backend is returning 0, which seems like a bad bug.

Created at 1 day ago
closed issue
[limits] ingester.instance-limits.max-series returns 500 errors and not 429

Describe the bug

Hitting -ingester.instance-limits.max-series returns a 500 error and not a 429.

To Reproduce

Set the limit to be 5 and then send more than 5 series.

Expected behavior

It should return a 429 so the data is not retried and remote_write is not blocked.

Additional Context

This is because we set the limit in the PreCreate hook, which is called during Append(), so Append() fails. That failure is treated as a generic Append() error, not as a rate limit.
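The expected behavior could be sketched roughly like this (hypothetical names such as `LimitError` and `map_append_error`, not the actual ingester code):

```python
class LimitError(Exception):
    """Raised when a per-instance limit (e.g. max-series) is hit."""

def map_append_error(err):
    # A limit error should surface as 429 so the client drops the data
    # instead of retrying; anything else stays a generic 500.
    return 429 if isinstance(err, LimitError) else 500

print(map_append_error(LimitError()))    # -> 429
print(map_append_error(RuntimeError()))  # -> 500
```

The bug described above is that the limit error is currently wrapped as a generic Append() error, so it falls through to the 500 branch.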

Created at 1 day ago
issue comment
[limits] ingester.instance-limits.max-series returns 500 errors and not 429

Interesting, I see. Happy to close this then.

Created at 1 day ago
Inconsistent rate-limiting behavior between OTLP/HTTP and OTLP/gRPC

OTLP HTTP Throttling Spec

If the server receives more requests than the client is allowed or the server is overloaded the server SHOULD respond with HTTP 429 Too Many Requests or HTTP 503 Service Unavailable and MAY include "Retry-After" header with a recommended time interval in seconds to wait before retrying.

The client SHOULD honour the waiting interval specified in "Retry-After" header if it is present. If the client receives HTTP 429 or HTTP 503 response and "Retry-After" header is not present in the response then the client SHOULD implement an exponential backoff strategy between retries.

OTLP gRPC Throttling Spec

The client SHOULD interpret RESOURCE_EXHAUSTED code as retryable only if the server signals that the recovery from resource exhaustion is possible. This is signalled by the server by returning a status containing RetryInfo. In this case the behavior of the server and the client is exactly as described in OTLP/gRPC Throttling section. If no such status is returned then the RESOURCE_EXHAUSTED code SHOULD be treated as non-retryable.


In gRPC, if RetryInfo is not present, we drop the data. But in HTTP if Retry-After is not present, we retry the data. This doesn't seem right.

Further, I don't see a mechanism for the HTTP server to specify, "I'm overloaded, please drop the data".
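Reducing the two quoted rules to a sketch (hypothetical helper names) makes the asymmetry visible:

```python
def should_retry_http(status, retry_after=None):
    # OTLP/HTTP: 429/503 are retryable; if Retry-After is missing the
    # client falls back to exponential backoff, i.e. it still retries.
    return status in (429, 503)

def should_retry_grpc(code, has_retry_info):
    # OTLP/gRPC: RESOURCE_EXHAUSTED is retryable only when the server
    # attaches RetryInfo; otherwise the data is dropped.
    return code == "RESOURCE_EXHAUSTED" and has_retry_info

# Same condition (throttled, no retry hint), opposite outcomes:
print(should_retry_http(429))                          # -> True
print(should_retry_grpc("RESOURCE_EXHAUSTED", False))  # -> False
```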

Created at 1 day ago
opened issue
[limits] ingester.instance-limits.max-series returns 500 errors and not 429

Describe the bug

Hitting -ingester.instance-limits.max-series returns a 500 error and not a 429.

To Reproduce

Set the limit to be 5 and then send more than 5 series.

Expected behavior

It should return a 429 so the data is not retried and remote_write is not blocked.

Additional Context

This is because we set the limit in the PreCreate hook, which is called during Append(), so Append() fails. That failure is treated as a generic Append() error, not as a rate limit.

Created at 1 day ago

Update builder README instructions to include required modules (#4664)

Change configmapprovider.Provider to accept a location for retrieve (#4657)

This is a preliminary change in order to support single/multiple local/remote config sources. This PR changes the Provider to accept a URI as the resource identifier.

The current way is still fully supported and will continue to be supported, but there are more ways to specify the file:

  • Command line (currently supported): ./otelcol --config=local.yaml
  • Command line (currently supported): ./otelcol --config=/path/to/local.yaml
  • Command line (new support): ./otelcol --config=file:/path/to/local.yaml
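The URI dispatch could look roughly like this (a sketch of the idea only; `resolve_config_location` is hypothetical, not the Collector's Provider API):

```python
from urllib.parse import urlparse

def resolve_config_location(loc):
    # Values without a scheme stay plain file paths (the currently
    # supported form); otherwise dispatch on the URI scheme.
    parsed = urlparse(loc)
    if not parsed.scheme:
        return ("file", loc)
    return (parsed.scheme, parsed.path)

print(resolve_config_location("local.yaml"))                # -> ('file', 'local.yaml')
print(resolve_config_location("file:/path/to/local.yaml"))  # -> ('file', '/path/to/local.yaml')
```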

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Remove unnecessary import comment from test file (#4667)

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Change Properties Provider to be a converter, does not match the Provider interface (#4666)

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Move service.ConfigMapConverterFunc to config.MapConverterFunc (#4673)

The drop of the Config prefix from the func name was forced by go lint, which does not like to repeat the package name as the prefix for types.

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Include request headers in client.Info (#4547)

  • Include request headers in client.Info

  • Add GRPC metadata

  • Add tests

  • Add changelog

  • Clone headers

  • Add config option

  • Update config/confighttp/clientinfohandler.go

Co-authored-by: Juraci Paixão Kröhling juraci.github@kroehling.de

  • Update config/confighttp/confighttp.go

Co-authored-by: Juraci Paixão Kröhling juraci.github@kroehling.de

  • Update config/configgrpc/configgrpc.go

Co-authored-by: Juraci Paixão Kröhling juraci.github@kroehling.de

  • oops

  • Fix the test

  • Immutable metadata

  • Really immutable metadata

  • Really return copy, add test

  • Optimize metadata getter

  • Move changelog entry to unreleased

  • Add experimental remarks per CR

Co-authored-by: Kemal Hadimli disq@users.noreply.github.com Co-authored-by: Juraci Paixão Kröhling juraci.github@kroehling.de

Add context to config.MapConverterFunc (#4678)

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

fix: move compression.go into confighttp.go (#4651)

  • fix: move compression.go into confighttp.go

  • chore: add changelog

  • fix: create configcompression

  • fix: lint error

  • fix: internalized compressionNone and compressionEmpty

  • fix: remove compression from each compression method

  • fix: separated helper functions

  • fix: lint error

  • fix: internalize none and empty

Use config to setup skip-compilation instead of the flag (#4645)

Skip-compilation should only be supplied as a CLI flag. Previously, it was possible to specify it in the YAML file, contrary to the original intention.

change from hashing go.sum and go.mod to only go.sum (#4680)

dependabot updates Mon Jan 17 13:58:51 PST 2022 (#4690)

  • dependabot updates Mon Jan 17 13:58:51 PST 2022
    Bump github.com/klauspost/compress from 1.13.6 to 1.14.1
    Bump github.com/mostynb/go-grpc-compression from 1.1.15 to 1.1.16
    Bump github.com/ory/go-acc from 0.2.6 to 0.2.7 in /internal/tools

  • fix lint

Update otel-config.yaml (#4683)

Correctly clone windows logging encoder (#4686)

Description: Fixes #4685. When With is called, clone the original encoder instead of creating a new one. Fields were being dropped because fields are part of the encoder, and each new call to With was creating a new encoder.

Link to tracking issue: https://github.com/open-telemetry/opentelemetry-collector/issues/4685

Testing: Tested manually with a windows service.

[Builder] Remove deprecated config option module::core (#4693)

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Add external process execution guidelines (#4652)

The guidelines are necessary to explain how we want to generally approach external process execution. This was recently brought up in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6512

If we accept these guidelines the following should happen:

  • The prometheusexecreceiver should be modified to allow only a hard-coded list of exporters.
  • The fluentbitextension should be either removed or significantly limited in terms of what locations and what executable file names it can allow.
  • https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6512 will be rejected, possibly substituted by a plugin system that @zenmoto referred to in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6512#issuecomment-996922645, if we find it useful to have such a plugin system.

Remove warning about already removed config value (#4696)

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Fix error message in memorylimiter processor related to ballast extension (#4697)

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Remove deprecated flags --metrics-level and --metrics-addr (#4695)

  • Usages of --metrics-level={VALUE} can be replaced by --set=service.telemetry.metrics.level={VALUE};
  • Usages of --metrics-addr={VALUE} can be replaced by --set=service.telemetry.metrics.address={VALUE};
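The migration can be expressed as a small rewrite rule (illustrative sketch; `migrate_flag` is not part of the Collector):

```python
DEPRECATED = {
    "--metrics-level": "service.telemetry.metrics.level",
    "--metrics-addr": "service.telemetry.metrics.address",
}

def migrate_flag(flag):
    # "--metrics-level=detailed" -> "--set=service.telemetry.metrics.level=detailed";
    # flags that are not deprecated pass through unchanged.
    name, _, value = flag.partition("=")
    key = DEPRECATED.get(name)
    return f"--set={key}={value}" if key else flag

print(migrate_flag("--metrics-level=detailed"))
# -> --set=service.telemetry.metrics.level=detailed
```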

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

Include valid factories in error when unknown type in config (#4684)

  • Include valid factories in error when unknown type in config

  • Simplify by passing reflect.Values and remove no-longer-needed test

Add max_request_body_size to confighttp.HTTPServerSettings (#4677)

  • add max_recv_size to otlpreceiver

  • lint

  • update doc

  • gosec warns G107 for http.Post but not http.NewRequest?

  • rename to max_request_body_size

  • also update changelog

  • also update doc to mention default 0 value

  • rename unexposed name from maxRecvSize to maxRequestBodySize

  • remove unnecessary maxRequestBodySize field in toServerOptions

Created at 2 days ago
Semantic Conventions: default to seconds for duration units

I was just thinking about @reyang's point:

Certain backends might prefer integer (which might be related to history, faster processing - e.g. int is in general faster than float, better storage efficiency - e.g. delta encoding)

I want to understand the storage reasoning, because the value for each bucket is an integer anyway. In Prometheus, the _bucket and _count samples are integer values, as they are all counts, and only _sum is a float. So only ~10% of the samples would be floats with seconds compared to milliseconds. I would think the storage benefits are small.
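The ~10% figure follows from the shape of a Prometheus histogram (a sketch; `float_sample_fraction` is a hypothetical helper): each scrape yields one _bucket sample per bucket (including +Inf) plus _count and _sum, and only _sum would become a float:

```python
def float_sample_fraction(num_buckets):
    # num_buckets _bucket samples (incl. +Inf) + _count + _sum per scrape;
    # only the _sum sample is a float value.
    return 1 / (num_buckets + 2)

print(round(float_sample_fraction(10), 3))  # -> 0.083, i.e. ~8% for 10 buckets
```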

Another thought: this is slightly tangential to HTTP, but we won't be able to measure sub-millisecond durations if we use an integer.

Created at 2 days ago
issue comment
[otlp] Fix panic in dropped count (again!)

It's a little strange to me that the tenant check is only done in case of errors. Should we do it before we even try to convert metrics?

It is on an authenticated route so the check essentially NEVER fails. It only exists to load the userID value.

Added unit tests don't seem to be checking for dropped metric at all. In other words, they are unrelated to the PR. 🤔

I first wrote the unit tests to reproduce the panic and then proceeded to fix it, which is why the unit tests seem unrelated.

Created at 3 days ago

[otlp] Fix panic in dropped count (again!)

This doesn't accurately count the dropped samples. For example, if a single metric with multiple samples is faulty, we get a single error rather than an error per sample.

But I believe it's the best best-effort measurement.

Before, we used to do DatapointCount() - samplesInMap().

The problem is the following:

  1. target_info is a synthetic metric added in Prometheus, so the final sample count could be higher.
  2. A single histogram datapoint in OTLP corresponds to many samples in Prometheus.
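To illustrate point 2 (a sketch with a hypothetical helper, not the actual translation code): one OTLP histogram datapoint with n explicit bounds becomes n+3 Prometheus samples, which is why a naive DatapointCount() - samplesInMap() difference misleads:

```python
def prom_samples_per_histogram_point(num_explicit_bounds):
    # n explicit bounds -> n+1 cumulative _bucket samples (incl. +Inf),
    # plus one _count and one _sum sample.
    return num_explicit_bounds + 3

print(prom_samples_per_histogram_point(10))  # -> 13
```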

Signed-off-by: Goutham Veeramachaneni gouthamve@gmail.com

Created at 3 days ago

issue comment
[otlp] Fix panic in dropped count (again!)

cc @pstibrany @replay

Created at 3 days ago

pull request opened
[otlp] Fix panic in dropped count (again!)

This doesn't accurately count the dropped samples. For example, if a single metric with multiple samples is faulty, we get a single error rather than an error per sample.

But I believe it's the best best-effort measurement.

Before, we used to do DatapointCount() - samplesInMap().

The problem is the following:

  1. target_info is a synthetic metric added in Prometheus, so the final sample count could be higher.
  2. A single histogram datapoint in OTLP corresponds to many samples in Prometheus.

Checklist

  • [x] Tests updated
  • [ ] ~Documentation added~
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
Created at 3 days ago
create branch
gouthamve create branch fix-otel-panics-once-and-forall
Created at 3 days ago
Semantic Conventions: default to seconds for duration units

Currently Istio, Dapr, and linkerd report request duration in milliseconds

I don't see it being a big problem. It's easy to convert from milliseconds to seconds in the fixed-bucket histograms that they export. And they specify the unit everywhere, which means we can convert it internally.

If we land on seconds as the unit for exponential histograms, then when these projects implement them, they can choose seconds. Switching from fixed-bucket to exponential histograms is considered a breaking change in most projects, so they could change the unit when they make the switch.

Created at 1 week ago
Semantic Conventions: default to seconds for duration units

This also causes inconsistencies when ingesting metrics from both Prometheus sources and OTel sources into a single db. Half the exponential histograms would be in seconds and the other half would be in milliseconds with no good way to reconcile/aggregate the two.

It is possible to convert with explicit bucket histograms, hence initially I didn't mind the unit being milliseconds; but when I realised that it's not possible with exponential histograms, I wanted to propose this change.

Created at 1 week ago
Semantic Conventions: default to seconds for duration units

Hrm, so while converting from OTLP to Prometheus, we can always convert to _seconds in the case of explicit bucket histograms. It's simply a matter of dividing the bucket boundaries by 1000.

But this is not possible with exponential histograms. Can we make it milliseconds for exponential histograms?
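For explicit-bucket histograms the conversion really is a pure rescaling (sketch, hypothetical helper). For exponential histograms, multiplying values by 1/1000 shifts bucket indexes by log2(1000) * 2**scale, which is not an integer, so the counts cannot be remapped exactly:

```python
def ms_buckets_to_seconds(bounds_ms, counts):
    # Explicit-bucket case: divide each boundary by 1000; the per-bucket
    # counts are unchanged.
    return [b / 1000 for b in bounds_ms], counts

bounds, counts = ms_buckets_to_seconds([5, 10, 25, 50], [1, 4, 2, 0])
print(bounds)  # -> [0.005, 0.01, 0.025, 0.05]
```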

Created at 1 week ago
Semantic Conventions: Default to seconds for duration units

cc @jmacd @jsuereth

Created at 1 week ago