
Finding and Fixing a Memory Leak in OpenTelemetry: A Debugging Walkthrough

8 mins

TL;DR: A Cloud Run Function started crashing with out-of-memory errors. After systematic debugging, I discovered a memory leak in OpenTelemetry’s Python SDK caused by strong references in the MeterProvider class. Here’s how I tracked it down and contributed the fix.


1. Problem Statement #

Observability has become the cornerstone of modern software operations, with OpenTelemetry emerging as the industry standard for collecting telemetry data across distributed systems. As organizations increasingly adopt it to gain visibility into their applications, they often encounter unexpected challenges that test both their debugging skills and their understanding of the observability stack.

During my time working on an API gateway management platform for a prominent Bavarian automotive corporation, I was responsible for building and maintaining a Kubernetes-based system that multiplexed logs and metrics from Apigee, forwarding them to various collectors including Splunk, Dynatrace, and Prometheus with Grafana. Our architecture leveraged Google Cloud Run Functions for both scheduled operations and message-driven processing.

The Mystery: Persistent Memory Growth #

One of our scheduled Cloud Run Functions began exhibiting persistent memory growth that defied initial assumptions. Memory usage would steadily climb during execution, never returning to baseline levels, eventually causing out-of-memory errors and function restarts after a few executions.

Figure 1: Odd-looking memory usage graph

The function’s logic appeared straightforward, implementing a classic ETL pattern, sketched below:

  • Extract: Reading metrics data from Apigee
  • Transform: Converting data to OpenTelemetry Protocol (OTLP) format
  • Load: Sending converted metrics to an OTLP exporter
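
For orientation, here is a minimal sketch of that shape. All names are illustrative placeholders, not the real implementation, which reads from Apigee and ships metrics through an OTLP exporter:

from typing import Any

def extract() -> list[dict[str, Any]]:
    # Read raw metrics data from Apigee (stub).
    ...

def transform(raw: list[dict[str, Any]]) -> list[Any]:
    # Convert raw records to OTLP metric data using the OpenTelemetry SDK (stub).
    ...

def load(otlp_metrics: list[Any]) -> None:
    # Hand the converted metrics to an OTLP exporter (stub).
    ...

def handler(event: dict[str, Any], context: Any) -> str:
    # One scheduled invocation performs one full ETL pass.
    raw = extract()
    otlp_metrics = transform(raw)  # the leak turned out to live in this step
    load(otlp_metrics)
    return "ok"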

Local testing showed no obvious memory retention issues, making this a classic production-only problem.

What You’ll Learn from This Post #

This post presents the complete debugging journey from symptom detection to root cause analysis and resolution. You’ll discover:

  • Systematic debugging techniques for identifying memory leaks in serverless environments
  • The specific OpenTelemetry bug that caused our memory leak and why it occurred
  • Practical investigation methods using profiling tools and memory analysis
  • The fix implementation and lessons learned for future prevention

Whether you’re dealing with similar memory issues, working with OpenTelemetry, or simply want to strengthen your production debugging toolkit, this deep dive will provide actionable insights for your own troubleshooting adventures.

2. Investigation Methodology #

Step 1: Checked Our Code #

The OpenTelemetry dependency wasn’t our first suspect. Since our function used multithreading, we initially focused there, especially since Google Cloud Run recommends avoiding background activities when using request-based billing.

Our debugging approach included:

  • Manual code analysis - systematically reviewing code sections that could potentially leak memory.
  • Python profiling - using Python’s built-in profiling tools to analyze memory allocation and usage patterns (see the sketch below).
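
For the memory side of this, tracemalloc from the standard library (mentioned again in the lessons at the end) is a natural fit. Below is a minimal sketch of the snapshot comparison approach - illustrative, not our exact harness; run_one_iteration is a placeholder for the code under suspicion:

import tracemalloc

tracemalloc.start(25)  # keep 25 frames per allocation for useful tracebacks

before = tracemalloc.take_snapshot()
run_one_iteration()    # placeholder: one execution of the suspect code path
after = tracemalloc.take_snapshot()

# Rank code locations by how much memory they gained between the snapshots.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)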

Despite our thorough investigation, we found nothing suspicious in our codebase. This approach had reached a dead end.

Step 2: Narrow the Problem Down #

With our initial troubleshooting efforts unsuccessful, we pivoted to a different strategy. Since our function had three distinct components (Extract/Transform/Load), we decided to isolate each step and monitor memory usage independently.
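
In practice that meant running one phase at a time in a loop and watching the process’s resident memory between iterations. A rough sketch of the idea, assuming psutil is available for the RSS readout (any equivalent works); the commented-out probe calls stand in for our real phase functions:

import gc
import os

import psutil

PROCESS = psutil.Process(os.getpid())

def rss_mib() -> float:
    return PROCESS.memory_info().rss / (1024 * 1024)

def probe(step_name: str, step_fn) -> None:
    # Run a single phase repeatedly; a healthy phase plateaus, a leaky one keeps climbing.
    for i in range(50):
        step_fn()
        gc.collect()
        print(f"{step_name} iteration {i}: {rss_mib():.1f} MiB")

# probe("extract", run_extract_once)
# probe("transform", run_transform_once)  # this is the phase that kept climbing for us
# probe("load", run_load_once)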

This targeted approach proved invaluable. By testing each component separately, we discovered that the memory leak originated specifically in the Transform section - the code responsible for converting data to OTLP format using the OpenTelemetry library.

The evidence now pointed clearly toward a potential bug in the official Python OpenTelemetry library itself.

Step 3: Minimal Reproducible Code Snippet #

Armed with this knowledge, I stripped away all non-essential code and created a minimal reproducible example that demonstrated the memory leak.
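
Simplified (and not the exact snippet I shared with the maintainers), the repro boiled down to a loop that creates short-lived readers and providers and then checks whether the readers can actually be collected:

import gc
import weakref

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

collected, alive = 0, 0
for _ in range(100):
    reader = InMemoryMetricReader()
    provider = MeterProvider(metric_readers=[reader])
    provider.get_meter("leak-check").create_counter("requests").add(1)
    provider.shutdown()

    ref = weakref.ref(reader)
    del reader, provider
    gc.collect()
    if ref() is None:
        collected += 1
    else:
        alive += 1  # the reader survived: something still holds a strong reference to it

print(f"collected: {collected}, still alive: {alive}")
# On affected SDK versions the readers pile up; after the fix they can be collected.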

To validate our findings, I reached out to the OpenTelemetry maintainers through the CNCF Slack community. Their response confirmed our suspicions - we had indeed uncovered a legitimate bug in the library.

This validation was crucial, as it not only confirmed our diagnosis but also opened the path toward finding a proper solution.

3. Root Cause Analysis #

The root cause of our memory leak was OpenTelemetry’s improper use of strong references in the MeterProvider class. Understanding this requires grasping how Python’s reference system works.

The OpenTelemetry Bug #

The culprit was the _all_metric_readers field in OpenTelemetry’s MeterProvider class. This field used a regular set() to track metric readers, creating strong references that prevented garbage collection.

Each function execution created new metric readers that accumulated in memory and were never cleaned up, until the function crashed from memory exhaustion.

Strong vs Weak References Explained #

Python supports two types of object references with very different memory management behaviors:

Strong References (e.g. set()) keep objects alive for as long as the reference exists:

  • Objects remain in memory even after you delete your own references to them
  • Python’s garbage collector cannot reclaim them while the container still holds them
  • In long-running applications, this creates memory leaks

Weak References (e.g. weakref.WeakSet()) allow automatic cleanup:

  • Objects can be garbage collected when no other code needs them
  • The WeakSet automatically removes dead references
  • No memory leaks occur

Demonstrating the Problem

Here’s the problematic behavior:

import weakref
import gc

class MetricReader:
    def __init__(self, name):
        self.name = name

class MeterProviderWithSet:
    def __init__(self):
        self._readers = set()  # Strong references cause memory leak!

    def add_reader(self, reader):
        self._readers.add(reader)

# Test the memory leak
provider = MeterProviderWithSet()
reader = MetricReader("ProblematicReader")
reader_weakref = weakref.ref(reader)

provider.add_reader(reader)
del reader
gc.collect()

# The reader still exists in memory - the strong reference inside the set keeps it alive
assert reader_weakref() is not None  # Memory leak confirmed

Fixed Implementation

Here’s the fixed version using weak references:

class MeterProviderWithWeakSet:
    def __init__(self):
        self._readers = weakref.WeakSet()  # Allows garbage collection!

    def add_reader(self, reader):
        self._readers.add(reader)

# Test the fixed behavior
provider = MeterProviderWithWeakSet()
reader = MetricReader("FixedReader")
reader_weakref = weakref.ref(reader)

provider.add_reader(reader)
del reader
gc.collect()

# SUCCESS: Reader was properly garbage collected
assert reader_weakref() is None  # This passes - no memory leak

When to Use Each Approach #

Use set() when you OWN the objects:

  • You control the object lifecycle (create and destroy)
  • Objects should persist until you explicitly remove them
  • Examples: Connection pools, shopping carts, plugin registries

Use weakref.WeakSet() when you OBSERVE objects:

  • Objects have independent lifecycles
  • You’re just tracking or monitoring them
  • Automatic cleanup when objects are destroyed elsewhere
  • Examples: Event observers, metrics collection, caching systems

OpenTelemetry’s MeterProvider was designed to observe metric readers, not own them. By using a regular set(), it inadvertently took ownership, preventing garbage collection and causing our memory leak.

Why This Mattered in Our Case

In our serverless environment, where functions are frequently created and destroyed, this distinction became critical. Each function execution created new metric readers that accumulated in memory, never being released, until the function eventually crashed from memory exhaustion.

4. Benchmarking the Issue #

Since our initial profiling efforts hadn’t clearly identified the problem, and now that we understood the root cause, I wanted to create a reliable way to detect similar issues in the future. The key was building a comprehensive benchmark that could demonstrate both the problem and the solution.

Building the Benchmark #

I developed a benchmark consisting of three core components (sketched below):

  • LeakyReader: Simulates an object that holds large data structures in memory, representing our real-world metric readers.
  • ProblematicProvider: Mimics the original OpenTelemetry behavior by keeping strong references to readers, preventing garbage collection.
  • FixedProvider: Implements the solution using weakref.WeakSet, allowing proper cleanup of unused readers.
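
Condensed to its essentials (field and method names here are illustrative; the gist has the full, instrumented version), the benchmark looks roughly like this:

import weakref

class LeakyReader:
    # Stands in for a real metric reader; carries a deliberately large payload.
    def __init__(self) -> None:
        self.payload = bytearray(5 * 1024 * 1024)  # ~5 MiB per reader

class ProblematicProvider:
    # Mirrors the original behavior: a plain set keeps strong references.
    def __init__(self) -> None:
        self._readers = set()

    def register(self, reader: LeakyReader) -> None:
        self._readers.add(reader)

class FixedProvider:
    # Mirrors the fix: a WeakSet only observes readers, so they can be collected.
    def __init__(self) -> None:
        self._readers = weakref.WeakSet()

    def register(self, reader: LeakyReader) -> None:
        self._readers.add(reader)

def scenario(provider_cls, iterations: int = 100) -> None:
    provider = provider_cls()             # one long-lived provider
    for _ in range(iterations):
        provider.register(LeakyReader())  # readers are meant to be short-lived
        # With ProblematicProvider the payloads pile up; with FixedProvider they are freed.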

Test Scenarios #

The benchmark runs two distinct scenarios to highlight the difference:

  • Leaking Scenario: Memory usage steadily increases over time. LeakyReader instances accumulate in memory and are never garbage collected due to strong references, creating the same issue we experienced in production.
  • Fixed Scenario: Memory usage remains stable. LeakyReader instances are properly garbage collected at the end of each iteration, demonstrating how the fix resolves the memory leak.

You can find the complete benchmark code in this GitHub Gist.

Profiling with Memray #

For profiling the simulation, I used Bloomberg’s memray library, which provides excellent memory profiling capabilities for Python applications. The gist includes a comprehensive README explaining how to run the benchmark and generate the profiling data.
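
The short version of that README: capture a run, then render the capture. The benchmark can be driven through the memray CLI (memray run, then memray flamegraph), or from Python via memray’s Tracker API as sketched below; file names are placeholders:

from memray import Tracker

# Record all allocations made inside the block into a capture file,
# then render it with:  memray flamegraph leaking_scenario.bin
with Tracker("leaking_scenario.bin"):
    scenario(ProblematicProvider)  # from the benchmark sketch above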

The results speak for themselves:

Leaking Scenario Results

Figure 2: Memory continuously grows as objects accumulate without being garbage collected

Fixed Scenario Results

Figure 3: Memory usage remains stable as objects are properly cleaned up after each iteration

The dramatic difference between these two charts clearly demonstrates both the severity of the original problem and the effectiveness of the solution.

5. The Solution #

With a clear understanding of the issue and solid evidence from our benchmarking, the path forward was straightforward.

Contributing the Fix #

I took the following steps to address the problem:

  1. Reported the Issue: Filed issue #4220 in the OpenTelemetry Python repository, documenting the memory leak with our findings
  2. Implemented the Fix: Created pull request #4224 with the necessary changes to replace strong references with weak references (the essence of the change is sketched below)
  3. Collaborated with Maintainers: Worked with the OpenTelemetry team to review and approve the changes
  4. Waited for Release: The fix was merged and included in a subsequent release
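
Stripped of tests and surrounding plumbing (see the pull request for the real diff), the essence of the change is swapping the container that tracks metric readers from a plain set to a WeakSet:

import weakref

# Before: a plain set keeps every registered metric reader alive indefinitely.
_all_metric_readers = set()

# After: a WeakSet merely observes readers, so they can be garbage collected
# once nothing else references them.
_all_metric_readers = weakref.WeakSet()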

Immediate Results #

Once the updated OpenTelemetry library was deployed to our production environment, the impact was immediate: the Cloud Run Function’s memory usage finally behaved the way we had expected it to from the beginning.

Figure 4: Stabilized memory usage after applying the OpenTelemetry fix

The cyclical pattern of memory growth and crashes disappeared entirely, replaced by stable, predictable memory consumption that remained consistent across function executions.

6. Lessons Learned #

This debugging journey provided several valuable takeaways that extend beyond this specific issue.

Profile Early and Often

Use tools like tracemalloc, objgraph, or memray in Python to identify reference leaks before they become production problems. These tools provide concrete data about memory allocation patterns that code review might miss. Establishing baseline memory profiles during development can save significant debugging time later.
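
As a concrete example, objgraph can surface growth in per-type object counts with two calls, which is cheap enough to drop into a test; run_one_iteration below is a placeholder for the code under test:

import objgraph

objgraph.show_growth(limit=5)  # first call records a baseline of object counts per type
run_one_iteration()            # placeholder: exercise the code under test
objgraph.show_growth(limit=5)  # second call prints the types whose counts grew the most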

Monitor Production Continuously

Implement comprehensive monitoring that can detect gradual memory growth patterns. Without proper monitoring, we wouldn’t have noticed this problem early, creating a longer feedback loop and potential user impact.

Third-Party Libraries Aren’t Immune

Even widely-adopted, well-maintained open source libraries can contain edge-case bugs. Don’t assume the problem is always in your code - systematic elimination of components can reveal issues in dependencies.

The Environment Matters

Production environments behave differently than local ones. Serverless functions have different memory pressure patterns than long-running applications, making certain types of memory leaks more critical and harder to detect during development.

Author
Mateusz Soltysik
Software Engineer | Python • Go • TS | Cloud Native (AWS/GCP/K8s).