GitLab Runner Research¶
Dev Environment¶
- Runner
- Server: This project on GitLab.com SaaS with the registered runner.
Runner Jobs from Server¶
.gitlab-ci.yml provides matrix builds which can be triggered and visualized with the runner binary.
Tip: Just retry the latest pipeline in the https://gitlab.com/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/-/pipelines view after starting the GitLab Runner.
Ideas¶
- Start with https://opentelemetry.io/docs/instrumentation/go/getting-started/
- Log traces to stdout / txt file
- Add collector options later in config.toml
- Define trace scopes
- Research traces, contexts, and metrics
Analysis¶
Existing Job Traces Are Logs¶
Runner uses a concept of job traces that sends partial log updates to the central coordinator. This is a different approach than OpenTelemetry instrumentation, but it partially overlaps. func (n *GitLabClient) PatchTrace( in network/gitlab.go implements the API calls with uri := fmt.Sprintf("jobs/%d/trace", id) towards the REST API upload.
https://docs.gitlab.com/ee/api/jobs.html#get-a-log-file then provides access to the job log ("trace").
OpenTelemetry Go SDK¶
The Go SDK in OpenTelemetry heavily relies on Context Propagation.
- https://levelup.gitconnected.com/context-in-golang-98908f042a57
- https://www.digitalocean.com/community/tutorials/how-to-use-contexts-in-go
This method could bring in more local contexts (in requests, blocked resources, etc.) with OpenTelemetry traces and spans in the code. Consider mentioning this detail in future MRs for review.
https://discuss.newrelic.com/t/opentelemetry-troubleshooting-framework-enable-logging/178668#go-4 shows how to enable debug logging for the SDK, as does https://github.com/open-telemetry/opentelemetry-go/blob/main/example/namedtracer/main.go. TODO Add to the documentation.
Edit: Also documented in https://docs.honeycomb.io/getting-data-in/opentelemetry/go/#printing-to-the-console
import (
// ...
"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
"github.com/go-logr/stdr"
// ...
)
func (mr *RunCommand) setupOpenTelemetryInstrumentation() {
//TODO: Debug logging for the SDK
stdr.SetVerbosity(5)

This produces SDK-internal log output like the following:

2022/04/20 22:09:51 internal_logging.go:48: "level"=1 "msg"="TracerProvider created" "config"={"SpanProcessors":[{"Type":"BatchSpanProcessor","SpanExporter":{"Type":"stdout","WithTimestamps":true},"Config":{"MaxQueueSize":2048,"BatchTimeout":"5s","ExportTimeout":"30s","MaxExportBatchSize":512,"BlockOnQueueFull":false}}],"SamplerType":"trace.parentBased","IDGeneratorType":"*trace.randomIDGenerator","SpanLimits":{"AttributeValueLengthLimit":-1,"AttributeCountLimit":128,"EventCountLimit":128,"LinkCountLimit":128,"AttributePerEventCountLimit":128,"AttributePerLinkCountLimit":128},"Resource":{"Attributes":{"telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.6.3","environment":"otel","service.name":"gitlab-org/gitlab-runner","service.version":"14.11.0~beta.6.g2134b4b6"},"SchemaURL":"https://opentelemetry.io/schemas/1.7.0"}}
Setting OpenTelemetry Tracer Provider builds=0
2022/04/20 22:09:51 internal_logging.go:48: "level"=1 "msg"="Tracer created" "name"="gitlab-org/gitlab-runner" "version"="" "schemaURL"=""

Unfortunately, this only logs the provider and tracer setup; it does not print the traces and spans themselves.
OpenTelemetry Go SDK Init and Destroy¶
Lesson learned: Do not initialize and then immediately defer/destroy the TracerProvider.
// First, create the console exporter
exporter, err := newOpenTelemetryStdoutExporter()
if err != nil {
mr.log().Fatalf("cannot create OpenTelemetry Console Exporter: %s", err.Error())
}
// Second, install the tracer provider
tracerProvider := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()), //TODO: Remove for production?
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(newOpenTelemetryResource()),
)
//TODO: Figure out where to initialize and shutdown the tracer provider.
// https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/github.com/Shopify/sarama/otelsarama/example/producer/producer.go#L42
/*
defer func() {
if err := tracerProvider.Shutdown(context.Background()); err != nil {
mr.log().Fatalf("cannot shutdown tracer provider: %s", err.Error())
}
}()
*/
After the function ends, the deferred shutdown executes immediately, and thus no traces/spans are ever exported. Inspired by reading https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/github.com/Shopify/sarama/otelsarama/example/producer/producer.go#L42, where the InitTracer() function is a separate call. TODO The upstream getting-started guide doesn't explain that.
Update Use the following pattern found in https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/otlptrace/otlptracehttp/example_test.go#L67
func (mr *RunCommand) setupOpenTelemetryInstrumentation() func() {
//...
return func() {
// Shutdown will flush any remaining spans and shut down the exporter.
tracerProvider.Shutdown(ctx)
}
}
In main():
shutdown := mr.setupOpenTelemetryInstrumentation()
defer shutdown()
Labkit in Runner?¶
https://gitlab.com/gitlab-org/labkit/tree/master/tracing implements OpenTracing, which will need to be replaced with OpenTelemetry in the future.
TODO: Figure out whether it can be used for the current purpose of CI/CD Observability. It is not implemented nor used inside GitLab Runner yet.
Trace Scopes¶
- Runner => Jobs from Server
- Executors => environment specific insights
MVC:
- Run
- Execute
- Start Job
- Do something
- End Job
- Return
Development: OpenTelemetry¶
go get go.opentelemetry.io/otel \
go.opentelemetry.io/otel/trace
go get go.opentelemetry.io/otel/sdk \
go.opentelemetry.io/otel/exporters/stdout/stdouttrace
Runner: Instrument Run Command¶
commands/multi.go implements the run command. Follow the Go instrumentation example, but directly instrument the workflow in run. Ignore the recommended coding style with a global name variable and hardcode the strings. (TODO: Discuss with OpenTelemetry upstream whether the global variable is a good idea - it can be overwhelming when learning.)
Start by logging the traces to stdout; once that works, log them to a file.
make runner-bin-host
./out/binaries/gitlab-runner --debug run -c $HOME/dev/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/dev-config.toml
The learning progress can be followed in https://gitlab.com/everyonecancontribute/observability/gitlab-runner-opentelemetry/-/merge_requests/1
RequestJob¶
commands/multi.go holds the doJobRequest() function, where we can start a new tracer span in the current context. mr.network.RequestJob() is an interface method implemented by RequestJob() in network/gitlab.go.
Important: Extract the span from the current context, following https://opentelemetry.io/docs/instrumentation/go/manual/#get-the-current-span. This requires importing "go.opentelemetry.io/otel/trace".
https://opentelemetry.io/docs/instrumentation/go/manual/#events shows AddEvent() as a small logging interface for a single span. This is useful to test the span for the job request towards the GitLab HTTP API.
{
"Name": "Request Job",
"SpanContext": {
"TraceID": "c1f88f92c6a65011b41b0752419e3dab",
"SpanID": "b843b966dee3e006",
"TraceFlags": "01",
"TraceState": "",
"Remote": false
},
"Parent": {
"TraceID": "c1f88f92c6a65011b41b0752419e3dab",
"SpanID": "86721c0d6cefa1ca",
"TraceFlags": "01",
"TraceState": "",
"Remote": false
},
"SpanKind": 1,
"StartTime": "2022-04-20T23:44:09.346493+02:00",
"EndTime": "2022-04-20T23:44:09.561581312+02:00",
"Attributes": [
{
"Key": "httpStatusCode",
"Value": {
"Type": "INT64",
"Value": 204
}
}
],
"Events": [
{
"Name": "Requesting jobs from GitLab server",
"Attributes": null,
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:09.346503+02:00"
},
{
"Name": "Analysing response for jobs",
"Attributes": null,
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:09.561533+02:00"
},
{
"Name": "Response status 204 with text: '204 No Content'",
"Attributes": null,
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:09.56154+02:00"
}
],
"Links": null,
"Status": {
"Code": "Unset",
"Description": ""
},
"DroppedAttributes": 0,
"DroppedEvents": 0,
"DroppedLinks": 0,
"ChildSpanCount": 0,
"Resource": [
{
"Key": "environment",
"Value": {
"Type": "STRING",
"Value": "otel"
}
},
{
"Key": "service.name",
"Value": {
"Type": "STRING",
"Value": "gitlab-org/gitlab-runner"
}
},
{
"Key": "service.version",
"Value": {
"Type": "STRING",
"Value": "14.11.0~beta.6.g2134b4b6"
}
},
{
"Key": "telemetry.sdk.language",
"Value": {
"Type": "STRING",
"Value": "go"
}
},
{
"Key": "telemetry.sdk.name",
"Value": {
"Type": "STRING",
"Value": "opentelemetry"
}
},
{
"Key": "telemetry.sdk.version",
"Value": {
"Type": "STRING",
"Value": "1.6.3"
}
}
],
"InstrumentationLibrary": {
"Name": "gitlab-org/gitlab-runner",
"Version": "",
"SchemaURL": ""
}
}
processRunner¶
Start a new span, and add events and attributes.
{
"Name": "Process Runner",
"SpanContext": {
"TraceID": "f545535d1a41020093d316d9afdc07d7",
"SpanID": "86ccb837a7b33135",
"TraceFlags": "01",
"TraceState": "",
"Remote": false
},
"Parent": {
"TraceID": "00000000000000000000000000000000",
"SpanID": "0000000000000000",
"TraceFlags": "00",
"TraceState": "",
"Remote": false
},
"SpanKind": 1,
"StartTime": "2022-04-20T23:44:04.186972+02:00",
"EndTime": "2022-04-20T23:44:07.988076847+02:00",
"Attributes": null,
"Events": [
{
"Name": "Got executor provider",
"Attributes": [
{
"Key": "executor",
"Value": {
"Type": "STRING",
"Value": "docker"
}
}
],
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:04.186991+02:00"
},
{
"Name": "Acquired executor provider",
"Attributes": [
{
"Key": "executor",
"Value": {
"Type": "STRING",
"Value": "docker"
}
}
],
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:04.186996+02:00"
},
{
"Name": "Acquired build",
"Attributes": [
{
"Key": "runner",
"Value": {
"Type": "STRING",
"Value": "hAe2pVEf"
}
},
{
"Key": "worker",
"Value": {
"Type": "INT64",
"Value": 2
}
}
],
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:04.187002+02:00"
},
{
"Name": "Requested build job",
"Attributes": null,
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:04.711085+02:00"
},
{
"Name": "Processing build job",
"Attributes": [
{
"Key": "build-id",
"Value": {
"Type": "INT64",
"Value": 2357407093
}
}
],
"DroppedAttributeCount": 0,
"Time": "2022-04-20T23:44:04.711198+02:00"
}
],
"Links": null,
"Status": {
"Code": "Unset",
"Description": ""
},
"DroppedAttributes": 0,
"DroppedEvents": 0,
"DroppedLinks": 0,
"ChildSpanCount": 0,
"Resource": [
{
"Key": "environment",
"Value": {
"Type": "STRING",
"Value": "otel"
}
},
{
"Key": "service.name",
"Value": {
"Type": "STRING",
"Value": "gitlab-org/gitlab-runner"
}
},
{
"Key": "service.version",
"Value": {
"Type": "STRING",
"Value": "14.11.0~beta.6.g2134b4b6"
}
},
{
"Key": "telemetry.sdk.language",
"Value": {
"Type": "STRING",
"Value": "go"
}
},
{
"Key": "telemetry.sdk.name",
"Value": {
"Type": "STRING",
"Value": "opentelemetry"
}
},
{
"Key": "telemetry.sdk.version",
"Value": {
"Type": "STRING",
"Value": "1.6.3"
}
}
],
"InstrumentationLibrary": {
"Name": "gitlab-org/gitlab-runner",
"Version": "",
"SchemaURL": ""
}
}
OpenTelemetry Exporter¶
Follow the documentation to add an OTLP exporter:
- https://opentelemetry.io/docs/instrumentation/go/exporting_data/#otlp-exporter
- https://github.com/open-telemetry/opentelemetry-go/tree/main/exporters/otlp/otlptrace
- https://github.com/open-telemetry/opentelemetry-go/tree/main/example/otel-collector
TODO: This is not documented in https://github.com/open-telemetry/opentelemetry-go/tree/main/exporters/otlp/otlptrace#configuration
go get -u go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
go get -u go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
Local Docker Compose Setup¶
Started on 2022-05-05.
Follow dev-env-otel.md to bring up Jaeger AIO and the OpenTelemetry collector.
cd docker-compose
docker-compose up -d
# Extract the port where 4317 is mapped to, use below for the OTEL_EXPORTER_OTLP_ENDPOINT variable
docker-compose ps
Export the gRPC target, run the GitLab runner binary and trigger the CI/CD pipeline on the server.
export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:49175"
./out/binaries/gitlab-runner --debug run -c $HOME/dev/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/dev-config.toml
Open http://localhost:16686/ with Jaeger Query UI and search for the service.
WOHOOOOOOOO first trace.
Jaeger shows OTLPResourceNoServiceName: Process Runner. But why? Let's Google. Found https://seankhliao.com/blog/12021-02-12-otel-go-v0.16.0-tracing/ which explains:
resources: "While not strictly necessary, it's a good idea to set some attributes, at the very least ServiceName, or your trace shows up as from OTLPResourceNoServiceName (even if you set a named tracer 🤷). Some standard keys and values are defined in semconv."
Docs: resource https://pkg.go.dev/go.opentelemetry.io/otel/sdk@v0.16.0/resource and semconv https://pkg.go.dev/go.opentelemetry.io/otel@v0.16.0/semconv
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String("service-a"),
),
)
Kubernetes Deployment (WIP) Notes 2022-05-04¶
Get the OpenTelemetry Collector host and port from the dev-env-otel.md Kubernetes deployment.
- 4317 for gRPC
- 4318 for HTTP
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="212.2.243.155:4318"
export OTEL_EXPORTER_OTLP_ENDPOINT="212.2.243.155:4318"
./out/binaries/gitlab-runner --debug run -c $HOME/dev/everyonecancontribute/observability/opentelemetry-gitlab-runner-playground/dev-config.toml
OTEL_EXPORTER_OTLP_ENDPOINT: %s212.2.243.155:4318 builds=0
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: %s212.2.243.155:4318 builds=0
2022/05/04 21:51:44 internal_logging.go:48: "level"=1 "msg"="TracerProvider created" "config"={"SpanProcessors":[{"Type":"BatchSpanProcessor","SpanExporter":{"Type":"otlptrace","Client":{"Type":"otlphttphttp","Endpoint":"localhost:4318","Insecure":false}},"Config":{"MaxQueueSize":2048,"BatchTimeout":"5s","ExportTimeout":"30s","MaxExportBatchSize":512,"BlockOnQueueFull":false}}],"SamplerType":"trace.alwaysOnSampler","IDGeneratorType":"*trace.randomIDGenerator","SpanLimits":{"AttributeValueLengthLimit":-1,"AttributeCountLimit":128,"EventCountLimit":128,"LinkCountLimit":128,"AttributePerEventCountLimit":128,"AttributePerLinkCountLimit":128},"Resource":{"Attributes":{},"SchemaURL":""}}
2022/05/04 21:51:44 internal_logging.go:48: "level"=1 "msg"="Tracer created" "name"="gitlab-org/gitlab-runner" "version"="" "schemaURL"=""
Next: Debug why the environment variables are not used.
https://docs.honeycomb.io/getting-data-in/opentelemetry/go/
Debug why nothing is received by the OpenTelemetry Collector in Kubernetes.
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
name: otel-collector
namespace: observability
labels:
component: otel-collector
spec:
config: |
# OTel receiver - GitLab Runner writing into
# https://opentelemetry.io/docs/collector/configuration/#receivers
receivers:
jaeger:
protocols:
grpc:
otlp:
protocols:
grpc:
http:
# OTel exporter writing to Jaeger+Clickhouse backend
# kubectl get svc -n observability
# https://opentelemetry.io/docs/collector/configuration/#exporters
exporters:
jaeger:
endpoint: "jaeger-clickhouse-collector-headless.observability:14250"
tls:
insecure: true
logging:
loglevel: debug
processors:
batch:
extensions:
health_check:
pprof:
zpages:
# https://opentelemetry.io/docs/collector/configuration/#service
service:
extensions: [health_check, pprof, zpages]
telemetry:
logs:
level: debug
initial_fields:
service: otel-gitlab
pipelines:
traces:
receivers: [jaeger,otlp,zipkin]
processors: [batch]
exporters: [jaeger,logging]
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: observability
labels:
app: otel
spec:
type: LoadBalancer
selector:
app.kubernetes.io/component: opentelemetry-collector
# Expose ports as load balancer
ports:
- name: grpc-otlp
port: 4317
protocol: TCP
targetPort: 4317
- name: http-otlp
port: 4318
protocol: TCP
targetPort: 4318
Posting with HTTP/JSON: https://github.com/open-telemetry/opentelemetry-collector/blob/399132257fb5c72efb5aea8542f41e452f1f7183/receiver/otlpreceiver/README.md#writing-with-httpjson
Always receiving an empty response / EOF.
Maybe TLS problems with verification? https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configtls/README.md#tls--mtls-configuration
receivers:
jaeger:
insecure: true
protocols:
grpc:
http:
otlp:
insecure: true
protocols:
grpc:
http:
kubectl rollout restart deployment -n observability otel-collector-collector
TODO: Follow discussion to extend the OpenTelemetryCollector CRD with host:port https://github.com/open-telemetry/opentelemetry-operator/issues/252
gRPC also does not work. Returns EOF.
brew install grpcurl
grpcurl 212.2.244.84:4317 helloworld.Greeter/SayHello
Failed to dial target host "212.2.244.84:4317": EOF
Switched the GitLab Runner code to gRPC.
2022/05/05 00:34:59 max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: EOF"
For testing, switched the protocol to plain HTTP. Nope.
Next try: Skip load balancer, use portfowarding.
kubectl port-forward deployment/otel-collector 4317 -n observability
export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:4317"
TODO Service config is broken.
With port-forwarding, the error changes to
2022/05/05 00:51:29 internal_logging.go:62: "level"=5 "msg"="exporting spans" "count"=5 "total_dropped"=0
2022/05/05 00:51:29 max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: first record does not look like a TLS handshake"
Change connection protocol to http.
Error with parsing the address, kinda expected.
2022/05/05 00:53:47 internal_logging.go:62: "level"=5 "msg"="exporting spans" "count"=1 "total_dropped"=0
2022/05/05 00:53:47 max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: address http://localhost:4317: too many colons in address"
Next try: WithInsecure() in the gRPC code, and localhost port-forwarding.
kubectl port-forward deployment/otel-collector 4317 -n observability
export OTEL_EXPORTER_OTLP_ENDPOINT="localhost:4317"
No more errors, but also no connection attempts logged in the pod.