How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)

2022-04-18

As part of Uber's broader effort to reach profitability through engineering, our team has recently focused on reducing compute costs by improving efficiency. Some of the most impactful work has been around GOGC optimization. In this blog post we want to share our experience with an efficient, low-risk, large-scale, semi-automated Go garbage collection (GC) tuning mechanism.

Uber's technology stack is made up of thousands of microservices running on a cloud-native, scheduler-based infrastructure. Most of these services are written in Go. Our team, the Maps engineering team, has previously played an important role in significantly improving the efficiency of several Java services by tuning their GC. In early 2021 we explored whether the same could be done for Go-based services. We ran several CPU profiles to assess the current state and found that GC was the largest CPU consumer for most mission-critical services. Below are some representative CPU profiles in which GC (identified by the runtime.scanobject method) consumes a significant portion of the allocated compute resources. Encouraged by this finding, we started tuning GC for the affected services. To our delight, Go's GC implementation and its simple tuning knob allowed us to automate most of the detection and tuning mechanism. We describe our approach and its impact in the sections below.

GOGC Tuner

The Go runtime invokes the concurrent garbage collector at periodic intervals unless a triggering event occurs first. Triggering events are based on memory back pressure, so GC-heavy Go services benefit from additional memory, because it reduces how often the GC must run. In addition, we realized that our host-level CPU-to-memory ratio is 1:5 (1 core : 5 GB RAM), whereas most Go services are configured with a ratio between 1:1 and 1:2. We therefore concluded that we could spend more memory to reduce the GC's CPU impact. This is a service-agnostic mechanism that can have a large impact when applied well.

A deep dive into Go's garbage collection is beyond the scope of this article, but here is what matters for this work: garbage collection in Go is concurrent, and it analyzes all objects to determine which ones are still reachable. We refer to the reachable objects as the "live dataset." Go exposes only one knob, GOGC, expressed as a percentage of the live dataset, to control garbage collection. The GOGC value acts as a multiplier on that dataset. The default GOGC is 100%, which means the Go runtime reserves as much memory for new allocations as the live dataset currently occupies. For example:

hard_target = live_dataset + live_dataset * (GOGC / 100)

The pacer is then responsible for predicting the best time to trigger a collection so that the hard (and soft) targets are not exceeded.

Figure 3: Example heap memory with the default configuration.

Dynamic and Diverse: No One-Size-Fits-All

We found that tuning to a fixed GOGC value was not suitable for Uber's services. Some of the challenges: a fixed value is unaware of the maximum memory assigned to the container, which can lead to out-of-memory problems, and our microservices have significantly different memory utilization profiles. For example, a sharded system can have very different live datasets across instances. We ran into this in one of our services, where the p99 instance used 1 GB while the p1 instance used 100 MB, so the 100 MB instances suffered a huge GC impact.

These pain points, together with the desire to automate the tuning, are what motivated the concept of GOGCTuner.
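To make the formula above concrete, here is a small illustrative Go snippet (not from the original post) that applies it to the two instance sizes from the sharding example:

```go
package main

import "fmt"

// hardTarget applies the formula above: the heap size at which the next
// collection is triggered is the live dataset plus GOGC percent of it.
func hardTarget(liveMB float64, gogc int) float64 {
	return liveMB + liveMB*float64(gogc)/100
}

func main() {
	// The default GOGC=100 simply doubles the live dataset.
	fmt.Printf("p1 instance:  %.0f MB\n", hardTarget(100, 100))  // 100 MB live -> 200 MB heap target
	fmt.Printf("p99 instance: %.0f MB\n", hardTarget(1024, 100)) // ~1 GB live  -> ~2 GB heap target
}
```

With one fixed GOGC across the fleet, the small instances collect far more often than they need to (burning CPU), while the large instances drift toward the container memory limit. That is the gap a dynamic tuner has to close.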
The GOGCTuner library simplifies the process of tuning garbage collection for service owners and adds a layer of reliability on top of it. GOGCTuner dynamically computes the correct GOGC value based on the container's memory limit (or an upper limit set by the service owner) and sets it using Go's runtime API. Its capabilities in detail:

Simplified configuration, for easier reasoning and deterministic calculations. GOGC at 100% is not intuitive for Go beginners, because it is still relative to the live dataset; a 70% limit, on the other hand, guarantees that the service always uses at most 70% of the heap space.

OOM (out-of-memory) protection: the library reads the memory limit from the cgroup and uses a default hard limit of 70%, which has proven safe in our experience. Note that this protection has a limit: the tuner can only adjust the buffer allocation, so if your service's live objects exceed the tuner's limit, the tuner falls back to a default lower bound of 1.25× the live object usage.

Higher GOGC values are allowed for corner cases such as the following. As mentioned above, a manual GOGC setting is not deterministic; it still depends on the size of the live dataset. What if the live dataset doubles compared to our last peak? GOGCTuner would enforce the same memory limit at the cost of more CPU, whereas a manual setting could cause an out-of-memory crash. For these scenarios, service owners used to provision a large memory buffer. See the following example:

Normal traffic (live dataset is 150 MB). Figure 4: Normal operation. Default configuration on the left, manually tuned on the right.

Double the traffic (live dataset is 300 MB). Figure 5: Double the load. Default configuration on the left, manually tuned on the right.

Double the traffic with GOGCTuner at 70% (live dataset is 300 MB). Figure 6: Double the load, but with the tuner. Default configuration on the left, GOGCTuner-adjusted on the right.

Services that use the MADV_FREE memory policy report incorrect memory metrics. For example, our observability metric showed 50% memory usage when in fact 20 percentage points of that 50% had already been freed. Service owners were then tuning GOGC against this "inaccurate" metric.

Observability

We found that we lacked some key metrics that would give us more insight into garbage collection for each service:

Interval between garbage collections: tells us whether there is still room to tune. For example, if your service still shows high GC impact but this interval is already at 120 seconds, GOGC can no longer help you; you need to optimize your allocations instead. Figure 7: Graph of intervals between GCs.

GC CPU impact: lets us know which services are most affected by GC. Figure 8: Graph of p99 GC CPU consumption.

Live dataset size: helps us identify memory leaks. Service owners would notice an increase in memory utilization and worry; to show them there was no leak, we added a "live usage" metric that showed steady utilization. Figure 9: Graph of p99 live dataset size.

GOGC value: useful for understanding the effect of the tuner. Figure 10: Graph of the min, p50, and p99 GOGC values the tuner assigns to the application.

Our initial approach was to run a ticker once per second to read the heap metrics and adjust the GOGC value accordingly. The downside is that the overhead becomes considerable, because reading the heap metrics (ReadMemStats) requires a stop-the-world pause, and the reading is also imprecise, since there can be more than one garbage collection per second. Fortunately, we found a better alternative: Go has finalizers (SetFinalizer), functions that run when an object is about to be garbage collected. They are mainly used to clean up memory allocated by C code or other external resources. We can use a self-referencing finalizer that re-registers itself on every GC, which lets us react to every collection cycle while adding essentially no CPU overhead. Figure 11 shows sample code for this GC-triggered event.
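Since that figure is not reproduced here, the sketch below illustrates the self-re-registering finalizer pattern just described, combined with the kind of GOGC recalculation such a tuner performs. The finalizerHandler name and the 70% default come from the text above; the cgroup path, helper names, and overall structure are assumptions for illustration, not Uber's actual implementation:

```go
package gogctuner

import (
	"os"
	"runtime"
	"runtime/debug"
	"strconv"
	"strings"
)

// finalizer holds a self-referencing sentinel object; every time the GC
// collects the sentinel, its finalizer runs and re-registers itself,
// giving us a cheap "on every GC" hook.
type finalizer struct {
	ref *finalizerRef
}

type finalizerRef struct {
	parent *finalizer
}

// finalizerHandler runs once per GC cycle: it re-adjusts GOGC and then
// re-arms the finalizer so it fires again on the next cycle.
func finalizerHandler(f *finalizerRef) {
	adjustGOGC()
	runtime.SetFinalizer(f, finalizerHandler)
}

// adjustGOGC recomputes GOGC so that the heap target stays near 70% of the
// container memory limit. HeapAlloc is used here as a rough proxy for the
// live dataset; the real tuner has more guards than this sketch.
func adjustGOGC() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // called once per GC cycle here, not on a fixed timer
	live := float64(m.HeapAlloc)
	target := 0.7 * float64(cgroupMemoryLimit()) // default hard limit: 70% of container memory

	// hard_target = live + live*GOGC/100  =>  GOGC = (target - live) / live * 100
	gogc := 25 // fallback: target 1.25x the live dataset when live usage already exceeds the limit
	if live > 0 && target > live {
		gogc = int((target - live) / live * 100)
	}
	debug.SetGCPercent(gogc)
}

// cgroupMemoryLimit reads the container memory limit (cgroup v1 path shown,
// purely as an assumption for illustration).
func cgroupMemoryLimit() uint64 {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0
	}
	v, _ := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	return v
}

// NewTuner arms the finalizer-based hook.
func NewTuner() *finalizer {
	f := &finalizer{}
	f.ref = &finalizerRef{parent: f}
	runtime.SetFinalizer(f.ref, finalizerHandler)
	f.ref = nil // drop the only strong reference so the sentinel is collected on the next GC
	return f
}
```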
Calling runtime.SetFinalizer(f, finalizerHandler) inside finalizerHandler is what allows the handler to run on every GC: it essentially keeps the reference from dying, and that is not an expensive resource to hold (it is just a pointer).

Impact

After deploying GOGCTuner across dozens of our services, we dug into a few that showed significant, double-digit improvements in CPU utilization. These services alone have produced a cumulative saving of roughly 70K cores. Here are two such examples:

Figure 12: An observability service with a high standard deviation of the live dataset (the maximum was 10× the minimum), running on thousands of compute cores, showing a reduction of about 65% in p99 CPU utilization.

Figure 13: A mission-critical Uber Eats service running on thousands of compute cores, showing a reduction of about 30% in p99 CPU utilization.

The resulting reduction in CPU utilization improves p99 latency tactically (and the associated SLAs and user experience) and cost strategically (since services are scaled based on their utilization).

Conclusion

Garbage collection is one of the most elusive and underestimated factors affecting application performance. Go's strong GC mechanism and simplified tuning, our diverse and large-scale Go service footprint, and our strong internal platform (Go, compute, observability) together allowed us to make such a large-scale impact. The problem space itself keeps evolving as the technology and our capabilities change, and we intend to keep improving how GC tuning is done.

To reiterate what we said in the introduction: there is no one-size-fits-all solution. We believe GC performance will remain variable in cloud-native environments, because of the highly variable performance of the public cloud and of the containerized workloads running in it. Combined with the fact that the vast majority of the CNCF landscape projects we use (Kubernetes, Prometheus, Jaeger, etc.) are written in Golang, this means that any large-scale deployment elsewhere can benefit from this work as well.

Cristian Velazquez is a Senior Engineer II on Uber's Maps engineering team. He is responsible for several efficiency initiatives spanning multiple organizations, the most relevant of which is garbage collection tuning for Java and Go.