
GitLab Runner Research

Dev Environment

  • Runner
  • Server: This project on GitLab.com SaaS with the registered runner.

Runner Jobs from Server

.gitlab-ci.yml provides matrix builds which can be triggered and visualized with the runner binary.

Tip: Just retry the latest pipeline in the https://gitlab.com/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/-/pipelines view after starting the GitLab Runner.

Ideas

  1. Start with https://opentelemetry.io/docs/instrumentation/go/getting-started/
  2. Log traces to stdout / txt file
  3. Add collector options later in config.toml
  4. Define trace scopes
  5. Research traces, contexts, and metrics

Analysis

Existing Job Traces Are the Job Log

Runner uses a concept of job traces, which send partial log updates to the central coordinator. This is a different approach from OpenTelemetry instrumentation, but it partially overlaps. func (n *GitLabClient) PatchTrace( in network/gitlab.go implements the API calls, building the REST API upload target with uri := fmt.Sprintf("jobs/%d/trace", id).

https://docs.gitlab.com/ee/api/jobs.html#get-a-log-file then provides access to the job log ("trace").
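For orientation, a hedged sketch of such a partial log upload as a plain HTTP request. The URI pattern comes from the code above; the helper name and header handling are illustrative assumptions, since the real PatchTrace additionally deals with job tokens, content ranges, and retries.

import (
    "bytes"
    "fmt"
    "net/http"
)

// patchTraceSketch is a hypothetical helper, not the runner implementation:
// it PATCHes one chunk of the job log to the coordinator's REST API.
func patchTraceSketch(baseURL string, jobID int64, chunk []byte, offset int) error {
    uri := fmt.Sprintf("%s/api/v4/jobs/%d/trace", baseURL, jobID)
    req, err := http.NewRequest(http.MethodPatch, uri, bytes.NewReader(chunk))
    if err != nil {
        return err
    }
    // Assumption: the runner authenticates with a job token and marks the
    // uploaded byte range; the header name here is for illustration only.
    req.Header.Set("Content-Range", fmt.Sprintf("%d-%d", offset, offset+len(chunk)-1))
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
    return nil
}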

OpenTelemetry Go SDK

The Go SDK in OpenTelemetry heavily relies on Context Propagation.

Context propagation makes it possible to attach additional local context (requests, blocked resources, etc.) to OpenTelemetry traces and spans in the code. Consider mentioning this detail in future MRs for review.
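A minimal sketch of what context propagation means in practice; the function names are hypothetical, only the tracer name matches the one used later in this document. The span created in the caller travels down through the context.Context argument, so spans started further down automatically nest under it.

import (
    "context"

    "go.opentelemetry.io/otel"
)

func handleRequest(ctx context.Context) {
    // Start a span; the returned ctx now carries it.
    ctx, span := otel.Tracer("gitlab-org/gitlab-runner").Start(ctx, "handleRequest")
    defer span.End()

    doWork(ctx)
}

func doWork(ctx context.Context) {
    // This span is parented to handleRequest's span because it is
    // created from the propagated context.
    _, span := otel.Tracer("gitlab-org/gitlab-runner").Start(ctx, "doWork")
    defer span.End()
}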

https://discuss.newrelic.com/t/opentelemetry-troubleshooting-framework-enable-logging/178668#go-4 shows how to enable debug logging for the SDK, as does https://github.com/open-telemetry/opentelemetry-go/blob/main/example/namedtracer/main.go. TODO: Add this to the documentation.

Edit: Also documented in https://docs.honeycomb.io/getting-data-in/opentelemetry/go/#printing-to-the-console

import (
  // ...
  "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
  "github.com/go-logr/stdr"
  // ...
)

func (mr *RunCommand) setupOpenTelemetryInstrumentation() {

    //TODO: Debug logging for the SDK
    stdr.SetVerbosity(5)
    // ...

With verbosity enabled, the SDK logs its internal configuration on startup:
2022/04/20 22:09:51 internal_logging.go:48: "level"=1 "msg"="TracerProvider created" "config"={"SpanProcessors":[{"Type":"BatchSpanProcessor","SpanExporter":{"Type":"stdout","WithTimestamps":true},"Config":{"MaxQueueSize":2048,"BatchTimeout":"5s","ExportTimeout":"30s","MaxExportBatchSize":512,"BlockOnQueueFull":false}}],"SamplerType":"trace.parentBased","IDGeneratorType":"*trace.randomIDGenerator","SpanLimits":{"AttributeValueLengthLimit":-1,"AttributeCountLimit":128,"EventCountLimit":128,"LinkCountLimit":128,"AttributePerEventCountLimit":128,"AttributePerLinkCountLimit":128},"Resource":{"Attributes":{"telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.6.3","environment":"otel","service.name":"gitlab-org/gitlab-runner","service.version":"14.11.0~beta.6.g2134b4b6"},"SchemaURL":"https://opentelemetry.io/schemas/1.7.0"}}
Setting OpenTelemetry Tracer Provider               builds=0
2022/04/20 22:09:51 internal_logging.go:48: "level"=1 "msg"="Tracer created" "name"="gitlab-org/gitlab-runner" "version"="" "schemaURL"=""

Unfortunately, this doesn't log anything about the traces and spans themselves.

OpenTelemetry Go SDK Init and Destroy

Do not initialize and immediately defer/destroy the TracerProvider

    // First, create the console exporter
    exporter, err := newOpenTelemetryStdoutExporter()
    if err != nil {
        mr.log().Fatal("cannot create OpenTelemetry Console Exporter: %s", err.Error())
    }

    // Second, install the tracer provider
    tracerProvider := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.AlwaysSample()), //TODO: Remove for production?
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(newOpenTelemetryResource()),
    )

    //TODO: Figure out where to initialize and shutdown the tracer provider.
    // https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/github.com/Shopify/sarama/otelsarama/example/producer/producer.go#L42
    /*
        defer func() {
            if err := tracerProvider.Shutdown(context.Background()); err != nil {
                mr.log().Fatal("cannot shutdown tracer provider: %s", err.Error())
            }
        }()
    */

After the function ends, the deferred shutdown executes immediately - and thus no traces/spans are created. Inspired by reading https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/github.com/Shopify/sarama/otelsarama/example/producer/producer.go#L42, which keeps the InitTracer() function in a separate call. TODO: The upstream getting started guide doesn't explain that.

Update: Use the following pattern, found in https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/otlptrace/otlptracehttp/example_test.go#L67

func (mr *RunCommand) setupOpenTelemetryInstrumentation() func() {

//...
    return func() {
        // Shutdown will flush any remaining spans and shut down the exporter.
        tracerProvider.Shutdown(ctx)
    }
}

Then, in main():

    shutdown := mr.setupOpenTelemetryInstrumentation()
    defer shutdown()

Labkit in Runner?

https://gitlab.com/gitlab-org/labkit/tree/master/tracing implements OpenTracing, which will need to be replaced with OpenTelemetry in the future.

TODO: Figure out whether it can be used for the current purpose of CI/CD observability. It is neither integrated nor used inside GitLab Runner yet.

Trace Scopes

  • Runner => Jobs from Server
  • Executors => environment specific insights

MVC:

  1. Run - Execute - Start Job - Do something - End Job - Return

Development: OpenTelemetry

go get go.opentelemetry.io/otel \
       go.opentelemetry.io/otel/trace

go get go.opentelemetry.io/otel/sdk \
         go.opentelemetry.io/otel/exporters/stdout/stdouttrace       

Runner: Instrument Run Command

commands/multi.go implements the run command. Follow the Go instrumentation example but directly instrument the workflow in run.

Ignore the recommended coding style with a global name variable and hardcode the strings instead. (TODO: Discuss with OpenTelemetry upstream whether the global variable is a good idea - it can be overwhelming when learning.)

Follow the path with logging the traces to stdout, and once it works, log them to a file.
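A possible sketch of the newOpenTelemetryStdoutExporter() helper referenced in the setup code above, assuming the stdouttrace exporter; logging to a file later would only mean swapping the writer passed to WithWriter().

import (
    "os"

    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
)

// Sketch only: pretty-printed span JSON on stdout.
func newOpenTelemetryStdoutExporter() (*stdouttrace.Exporter, error) {
    return stdouttrace.New(
        stdouttrace.WithWriter(os.Stdout),
        stdouttrace.WithPrettyPrint(),
    )
}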

make runner-bin-host

./out/binaries/gitlab-runner --debug run -c $HOME/dev/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/dev-config.toml

The learning progress can be followed in https://gitlab.com/everyonecancontribute/observability/gitlab-runner-opentelemetry/-/merge_requests/1

RequestJob

commands/multi.go holds the doJobRequest() function where we can start a new tracer in the current context.

mr.network.RequestJob() is an interface method, implemented by network/gitlab.go > RequestJob().

Important: Extract the span from the current context, following https://opentelemetry.io/docs/instrumentation/go/manual/#get-the-current-span; this requires importing "go.opentelemetry.io/otel/trace".

https://opentelemetry.io/docs/instrumentation/go/manual/#events shows AddEvent() as a small logging interface for a single span. This can be useful to test the span for the job request towards the GitLab HTTP API.
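A hedged sketch of how the events and the httpStatusCode attribute in the span below could be recorded inside RequestJob(); the function signature and variable names are illustrative.

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// requestJobSketch shows only the span handling; the real RequestJob() in
// network/gitlab.go performs the actual jobs/request API call.
func requestJobSketch(ctx context.Context, statusCode int, statusText string) {
    // Get the span started by the caller (doJobRequest) from the context.
    span := trace.SpanFromContext(ctx)

    span.AddEvent("Requesting jobs from GitLab server")

    // ... perform the jobs/request HTTP call here ...

    span.AddEvent("Analysing response for jobs")
    span.AddEvent(fmt.Sprintf("Response status %d with text: '%s'", statusCode, statusText))
    span.SetAttributes(attribute.Int("httpStatusCode", statusCode))
}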

{
    "Name": "Request Job",
    "SpanContext": {
        "TraceID": "c1f88f92c6a65011b41b0752419e3dab",
        "SpanID": "b843b966dee3e006",
        "TraceFlags": "01",
        "TraceState": "",
        "Remote": false
    },
    "Parent": {
        "TraceID": "c1f88f92c6a65011b41b0752419e3dab",
        "SpanID": "86721c0d6cefa1ca",
        "TraceFlags": "01",
        "TraceState": "",
        "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2022-04-20T23:44:09.346493+02:00",
    "EndTime": "2022-04-20T23:44:09.561581312+02:00",
    "Attributes": [
        {
            "Key": "httpStatusCode",
            "Value": {
                "Type": "INT64",
                "Value": 204
            }
        }
    ],
    "Events": [
        {
            "Name": "Requesting jobs from GitLab server",
            "Attributes": null,
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:09.346503+02:00"
        },
        {
            "Name": "Analysing response for jobs",
            "Attributes": null,
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:09.561533+02:00"
        },
        {
            "Name": "Response status 204 with text: '204 No Content'",
            "Attributes": null,
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:09.56154+02:00"
        }
    ],
    "Links": null,
    "Status": {
        "Code": "Unset",
        "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
        {
            "Key": "environment",
            "Value": {
                "Type": "STRING",
                "Value": "otel"
            }
        },
        {
            "Key": "service.name",
            "Value": {
                "Type": "STRING",
                "Value": "gitlab-org/gitlab-runner"
            }
        },
        {
            "Key": "service.version",
            "Value": {
                "Type": "STRING",
                "Value": "14.11.0~beta.6.g2134b4b6"
            }
        },
        {
            "Key": "telemetry.sdk.language",
            "Value": {
                "Type": "STRING",
                "Value": "go"
            }
        },
        {
            "Key": "telemetry.sdk.name",
            "Value": {
                "Type": "STRING",
                "Value": "opentelemetry"
            }
        },
        {
            "Key": "telemetry.sdk.version",
            "Value": {
                "Type": "STRING",
                "Value": "1.6.3"
            }
        }
    ],
    "InstrumentationLibrary": {
        "Name": "gitlab-org/gitlab-runner",
        "Version": "",
        "SchemaURL": ""
    }
}

processRunner

Start a new span, and add events and attributes.
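A sketch of how processRunner could produce the events and attributes seen in the output below; the function signature and variables are illustrative, not the actual implementation.

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

func processRunnerSketch(ctx context.Context, executor, runnerShortToken string, workerID int, buildID int64) {
    ctx, span := otel.Tracer("gitlab-org/gitlab-runner").Start(ctx, "Process Runner")
    defer span.End()

    span.AddEvent("Got executor provider",
        trace.WithAttributes(attribute.String("executor", executor)))
    span.AddEvent("Acquired executor provider",
        trace.WithAttributes(attribute.String("executor", executor)))
    span.AddEvent("Acquired build",
        trace.WithAttributes(
            attribute.String("runner", runnerShortToken),
            attribute.Int("worker", workerID)))

    // ... request and process the build job, passing ctx down ...
    span.AddEvent("Requested build job")
    span.AddEvent("Processing build job",
        trace.WithAttributes(attribute.Int64("build-id", buildID)))
}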

{
    "Name": "Process Runner",
    "SpanContext": {
        "TraceID": "f545535d1a41020093d316d9afdc07d7",
        "SpanID": "86ccb837a7b33135",
        "TraceFlags": "01",
        "TraceState": "",
        "Remote": false
    },
    "Parent": {
        "TraceID": "00000000000000000000000000000000",
        "SpanID": "0000000000000000",
        "TraceFlags": "00",
        "TraceState": "",
        "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2022-04-20T23:44:04.186972+02:00",
    "EndTime": "2022-04-20T23:44:07.988076847+02:00",
    "Attributes": null,
    "Events": [
        {
            "Name": "Got executor provider",
            "Attributes": [
                {
                    "Key": "executor",
                    "Value": {
                        "Type": "STRING",
                        "Value": "docker"
                    }
                }
            ],
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:04.186991+02:00"
        },
        {
            "Name": "Acquired executor provider",
            "Attributes": [
                {
                    "Key": "executor",
                    "Value": {
                        "Type": "STRING",
                        "Value": "docker"
                    }
                }
            ],
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:04.186996+02:00"
        },
        {
            "Name": "Acquired build",
            "Attributes": [
                {
                    "Key": "runner",
                    "Value": {
                        "Type": "STRING",
                        "Value": "hAe2pVEf"
                    }
                },
                {
                    "Key": "worker",
                    "Value": {
                        "Type": "INT64",
                        "Value": 2
                    }
                }
            ],
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:04.187002+02:00"
        },
        {
            "Name": "Requested build job",
            "Attributes": null,
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:04.711085+02:00"
        },
        {
            "Name": "Processing build job",
            "Attributes": [
                {
                    "Key": "build-id",
                    "Value": {
                        "Type": "INT64",
                        "Value": 2357407093
                    }
                }
            ],
            "DroppedAttributeCount": 0,
            "Time": "2022-04-20T23:44:04.711198+02:00"
        }
    ],
    "Links": null,
    "Status": {
        "Code": "Unset",
        "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
        {
            "Key": "environment",
            "Value": {
                "Type": "STRING",
                "Value": "otel"
            }
        },
        {
            "Key": "service.name",
            "Value": {
                "Type": "STRING",
                "Value": "gitlab-org/gitlab-runner"
            }
        },
        {
            "Key": "service.version",
            "Value": {
                "Type": "STRING",
                "Value": "14.11.0~beta.6.g2134b4b6"
            }
        },
        {
            "Key": "telemetry.sdk.language",
            "Value": {
                "Type": "STRING",
                "Value": "go"
            }
        },
        {
            "Key": "telemetry.sdk.name",
            "Value": {
                "Type": "STRING",
                "Value": "opentelemetry"
            }
        },
        {
            "Key": "telemetry.sdk.version",
            "Value": {
                "Type": "STRING",
                "Value": "1.6.3"
            }
        }
    ],
    "InstrumentationLibrary": {
        "Name": "gitlab-org/gitlab-runner",
        "Version": "",
        "SchemaURL": ""
    }
}

OpenTelemetry Exporter

Follow the documentation to add an OTLP exporter:

go get -u go.opentelemetry.io/otel/exporters/otlp/otlptrace

https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/otlptrace/otlptracehttp/example_test.go

TODO: This is not documented in https://github.com/open-telemetry/opentelemetry-go/tree/main/exporters/otlp/otlptrace#configuration

go get -u go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
go get -u go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
make runner-bin-host
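A minimal sketch of wiring the OTLP HTTP exporter into the tracer provider; the endpoint and the insecure option are assumptions for the local collector setup described next.

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Sketch only: replaces the stdout exporter with OTLP over HTTP.
func newOTLPTracerProviderSketch(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("localhost:4318"), // host:port, no scheme
        otlptracehttp.WithInsecure(),                 // local collector without TLS
    )
    if err != nil {
        return nil, err
    }
    return sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(newOpenTelemetryResource()),
    ), nil
}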

Local Docker Compose Setup

Started on 2022-05-05.

Follow dev-env-otel.md to bring up Jaeger AIO and the OpenTelemetry collector.

cd docker-compose
docker-compose up -d

# Extract the port where 4317 is mapped to, use below for the OTEL_EXPORTER_OTLP_ENDPOINT variable
docker-compose ps

Export the gRPC target, run the GitLab Runner binary, and trigger the CI/CD pipeline on the server.

export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:49175"


./out/binaries/gitlab-runner --debug run -c $HOME/dev/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/dev-config.toml

Open http://localhost:16686/ with Jaeger Query UI and search for the service.

WOHOOOOOOOO first trace.

Jaeger shows OTLPResourceNoServiceName: Process Runner. But why? A quick search found https://seankhliao.com/blog/12021-02-12-otel-go-v0.16.0-tracing/, which explains under "resources":

"While not strictly necessary, it's a good idea to set some attributes, at the very least ServiceName, or your trace shows up as from OTLPResourceNoServiceName (even if you set a named tracer 🤷). Some standard keys and values are defined in semconv."

Docs: resource semconv https://pkg.go.dev/go.opentelemetry.io/otel/sdk@v0.16.0/resource https://pkg.go.dev/go.opentelemetry.io/otel@v0.16.0/semconv

res, err := resource.New(ctx,
        resource.WithAttributes(
                semconv.ServiceNameKey.String("service-a"),
        ),
)
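Applied to the runner, the newOpenTelemetryResource() helper used earlier with sdktrace.WithResource() could look roughly like the following, reproducing the attributes seen in the stdout spans; a sketch, not the actual implementation. Since Jaeger shows OTLPResourceNoServiceName, presumably this resource is not attached to the tracer provider used for the OTLP exporter, and passing it there should fix the service name.

import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/sdk/resource"
    semconv "go.opentelemetry.io/otel/semconv/v1.7.0"
)

// Sketch of the resource helper producing the attributes seen in the
// stdout span output (environment, service.name, service.version).
func newOpenTelemetryResource() *resource.Resource {
    return resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceNameKey.String("gitlab-org/gitlab-runner"),
        semconv.ServiceVersionKey.String("14.11.0~beta.6.g2134b4b6"),
        attribute.String("environment", "otel"),
    )
}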

Kubernetes Deployment (WIP) Notes 2022-05-04

Get the OpenTelemetry Collector host and port from the dev-env-otel.md Kubernetes deployment.

  • 4317 for gRPC
  • 4318 for HTTP

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="212.2.243.155:4318"
export OTEL_EXPORTER_OTLP_ENDPOINT="212.2.243.155:4318"


./out/binaries/gitlab-runner --debug run -c $HOME/dev/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/dev-config.toml


OTEL_EXPORTER_OTLP_ENDPOINT: %s212.2.243.155:4318   builds=0
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: %s212.2.243.155:4318  builds=0
2022/05/04 21:51:44 internal_logging.go:48: "level"=1 "msg"="TracerProvider created" "config"={"SpanProcessors":[{"Type":"BatchSpanProcessor","SpanExporter":{"Type":"otlptrace","Client":{"Type":"otlphttphttp","Endpoint":"localhost:4318","Insecure":false}},"Config":{"MaxQueueSize":2048,"BatchTimeout":"5s","ExportTimeout":"30s","MaxExportBatchSize":512,"BlockOnQueueFull":false}}],"SamplerType":"trace.alwaysOnSampler","IDGeneratorType":"*trace.randomIDGenerator","SpanLimits":{"AttributeValueLengthLimit":-1,"AttributeCountLimit":128,"EventCountLimit":128,"LinkCountLimit":128,"AttributePerEventCountLimit":128,"AttributePerLinkCountLimit":128},"Resource":{"Attributes":{},"SchemaURL":""}}
2022/05/04 21:51:44 internal_logging.go:48: "level"=1 "msg"="Tracer created" "name"="gitlab-org/gitlab-runner" "version"="" "schemaURL"=""

Next: Debug why the environment variables are not used.
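One workaround to sketch while debugging: read the variable explicitly and pass it to the exporter instead of relying on the SDK picking it up. Note that WithEndpoint expects host:port without a scheme; whether the env var should carry a scheme is exactly what needs checking here.

import (
    "context"
    "os"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// Sketch: force the endpoint from the environment; otherwise the default
// localhost:4318 seen in the TracerProvider log above stays in effect.
func newOTLPExporterFromEnv(ctx context.Context) (*otlptrace.Exporter, error) {
    endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
    if endpoint == "" {
        endpoint = "localhost:4318"
    }
    return otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint(endpoint),
        otlptracehttp.WithInsecure(),
    )
}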

https://docs.honeycomb.io/getting-data-in/opentelemetry/go/

Debug why nothing is received by the OpenTelemetry Collector in Kubernetes.

https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md

service:
  telemetry:
    logs:
      level: "debug"

Applied inside the OpenTelemetryCollector resource, together with a LoadBalancer Service to expose the OTLP ports:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
  labels:
    component: otel-collector
spec:
  config: |
    # OTel receiver - GitLab Runner writing into
    # https://opentelemetry.io/docs/collector/configuration/#receivers
    receivers:
      jaeger:
        protocols:
          grpc:
      otlp:
        protocols:
          grpc:
          http:
    # OTel exporter writing to Jaeger+Clickhouse backend
    # kubectl get svc -n observability
    # https://opentelemetry.io/docs/collector/configuration/#exporters
    exporters:
      jaeger:
        endpoint: "jaeger-clickhouse-collector-headless.observability:14250"
        tls:
          insecure: true
      logging:
        loglevel: debug

    processors:
      batch:

    extensions:
      health_check:
      pprof:
      zpages:

    # https://opentelemetry.io/docs/collector/configuration/#service
    service:
      extensions: [health_check, pprof, zpages]
      telemetry:
        logs:
          level: debug
          initial_fields:
            service: otel-gitlab
      pipelines:
        traces:
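          # NOTE: zipkin is referenced in this pipeline but not defined under
          # receivers above, which makes the collector reject the configuration.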
          receivers: [jaeger,otlp,zipkin]
          processors: [batch]
          exporters: [jaeger,logging]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
  labels:
    app: otel
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/component: opentelemetry-collector

  # Expose ports as load balancer
  ports:
  - name: grpc-otlp
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: http-otlp
    port: 4318
    protocol: TCP
    targetPort: 4318

kubectl rollout restart deployment -n observability

Posting with HTTP/JSON: https://github.com/open-telemetry/opentelemetry-collector/blob/399132257fb5c72efb5aea8542f41e452f1f7183/receiver/otlpreceiver/README.md#writing-with-httpjson

Always receiving an empty response / EOF.
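A small Go sketch for the same connectivity check, posting an (empty) OTLP/JSON payload to the receiver's /v1/traces path; the endpoint argument is a placeholder for the collector address.

import (
    "bytes"
    "fmt"
    "net/http"
)

// Sketch: e.g. checkOTLPHTTP("http://<collector-lb>:4318") reproduces the
// empty response / EOF seen above.
func checkOTLPHTTP(endpoint string) error {
    // An empty ExportTraceServiceRequest is valid JSON for this endpoint.
    body := []byte(`{"resourceSpans":[]}`)
    resp, err := http.Post(endpoint+"/v1/traces", "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
    return nil
}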

Maybe TLS problems with verification? https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configtls/README.md#tls--mtls-configuration

    receivers:
      jaeger:
        insecure: true
        protocols:
          grpc:
          http:
      otlp:
        insecure: true
        protocols:
          grpc:
          http:


kubectl rollout restart deployment -n observability otel-collector-collector

TODO: Follow discussion to extend the OpenTelemetryCollector CRD with host:port https://github.com/open-telemetry/opentelemetry-operator/issues/252

gRPC also does not work. Returns EOF.

brew install grpcurl

grpcurl 212.2.244.84:4317 helloworld.Greeter/SayHello
Failed to dial target host "212.2.244.84:4317": EOF

Switched the GitLab Runner code to gRPC.

2022/05/05 00:34:59 max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: EOF"

For testing, try an http:// scheme on the endpoint.

export OTEL_EXPORTER_OTLP_ENDPOINT="http://212.2.244.84:4317"

Nope.

Next try: Skip the load balancer and use port-forwarding.

kubectl port-forward deployment/otel-collector 4317 -n observability

export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:4317"

TODO: The Service config is broken.

With port-forwarding, the error changes to:

2022/05/05 00:51:29 internal_logging.go:62: "level"=5 "msg"="exporting spans" "count"=5 "total_dropped"=0
2022/05/05 00:51:29 max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: first record does not look like a TLS handshake"

Change connection protocol to http.

export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"

Error while parsing the address, kinda expected:

2022/05/05 00:53:47 internal_logging.go:62: "level"=5 "msg"="exporting spans" "count"=1 "total_dropped"=0
2022/05/05 00:53:47 max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: address http://localhost:4317: too many colons in address"

Next try: WithInsecure() in the gRPC exporter code (see the sketch after the commands below), and localhost port-forwarding.

kubectl port-forward deployment/otel-collector 4317 -n observability

export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:4317"

No more errors, but also no connection attempts logged in the pod.

kubectl port-forward svc/jaeger-clickhouse-query 16686