Speeding up processing with Goroutines

How I increased the speed of my image processing application by 8x

I realize I still need to do a deeper dive on my Go program that I used to make the images displayed here, but until then, let's chat concurrency!

All you need to know is, I have a program that performs actions on every pixel in an image. I'd heard that goroutines could be used to do some of this work in parallel. What if I could process every pixel at once? While I wasn't able to get to that level of parallelism, I did cut total processing time several-fold - the exact numbers are in the benchmarks below.

My original function

```go
func blendImages(image1 image.Image, image2 image.Image, modFunction func(pixel1 color.RGBA, pixel2 color.RGBA) color.RGBA) image.Image {
	img1, img2 := resizeImages(image1, image2)
	bounds := img1.Bounds()
	width, height := bounds.Max.X, bounds.Max.Y
	newImage := image.NewRGBA(image.Rect(0, 0, width, height))
	for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
		for x := bounds.Min.X; x < bounds.Max.X; x++ {
			pixel1 := img1.At(x, y)
			pixel2 := img2.At(x, y)
			rgba := modFunction(color.RGBAModel.Convert(pixel1).(color.RGBA), color.RGBAModel.Convert(pixel2).(color.RGBA))

			newImage.Set(x, y, rgba)
		}
	}
	return newImage
}
```

It takes in two images, resizes them to the same size, loops through the width and height ranges, and passes the pixel at each coordinate of both images to modFunction, which returns a new pixel as a result. This pixel processing happens one pixel at a time, which is where I saw the opportunity to use goroutines.
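
For a sense of what a modFunction might look like, here is a hypothetical one that just averages the two pixels channel by channel (an illustration on my part, not one of my actual blend functions like replaceHue):

```go
// averageBlend is a hypothetical modFunction: it averages each channel of the
// two pixels. It uses image/color's RGBA type, same as the real blend functions.
func averageBlend(pixel1, pixel2 color.RGBA) color.RGBA {
	return color.RGBA{
		R: uint8((uint16(pixel1.R) + uint16(pixel2.R)) / 2),
		G: uint8((uint16(pixel1.G) + uint16(pixel2.G)) / 2),
		B: uint8((uint16(pixel1.B) + uint16(pixel2.B)) / 2),
		A: 255,
	}
}

// usage: blended := blendImages(img1, img2, averageBlend)
```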

The Goroutine Refactor

After much back and forth with ChatGPT, I landed on the following -

```go
func blendImagesConcurrently(image1, image2 image.Image, modFunction func(pixel1, pixel2 color.RGBA) color.RGBA) image.Image {
	img1, img2 := resizeImages(image1, image2)
	bounds := img1.Bounds()
	width, height := bounds.Max.X, bounds.Max.Y
	newImage := image.NewRGBA(image.Rect(0, 0, width, height))

	var wg sync.WaitGroup
	numGoroutines := 500
	rowsPerGoroutine := height / numGoroutines

	for i := 0; i < numGoroutines; i++ {
		wg.Add(1)

		go func(startRow, endRow int) {
			defer wg.Done()
			for y := startRow; y < endRow; y++ {
				for x := bounds.Min.X; x < bounds.Max.X; x++ {
					pixel1 := img1.At(x, y)
					pixel2 := img2.At(x, y)

					rgba := modFunction(color.RGBAModel.Convert(pixel1).(color.RGBA), color.RGBAModel.Convert(pixel2).(color.RGBA))
					newImage.Set(x, y, rgba)
				}
			}
		}(i*rowsPerGoroutine, min((i+1)*rowsPerGoroutine, height))
	}
	wg.Wait()
	return newImage
}

// min returns the smaller of x or y.
func min(x, y int) int {
	if x < y {
		return x
	}
	return y
}
```

Basically, it divides the image's rows into 500 chunks and hands each chunk to its own goroutine to process those pixels. I don't know exactly what's happening under the hood, per se, but I can confirm that it dramatically sped up performance. But how can I know for sure?
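
To see the pattern without all the image plumbing, here's a minimal, self-contained sketch of the same chunk-and-wait idea applied to a plain slice (a toy example I wrote for illustration, not code from my program):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	work := make([]int, 1000) // stand-in for 1000 rows of pixels
	numGoroutines := 4
	chunkSize := len(work) / numGoroutines

	var wg sync.WaitGroup
	for i := 0; i < numGoroutines; i++ {
		wg.Add(1) // register one more goroutine with the WaitGroup
		go func(start, end int) {
			defer wg.Done() // mark this chunk finished when the goroutine returns
			for j := start; j < end; j++ {
				work[j] = j * 2 // stand-in for "process this row"
			}
		}(i*chunkSize, (i+1)*chunkSize)
	}
	wg.Wait() // block until every chunk has reported Done
	fmt.Println(work[0], work[999]) // 0 1998
}
```

Each goroutine writes to its own disjoint range of the slice, which is why no locking is needed around the writes.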

Performance benchmarking with b *testing.B

I'd never seen anything like this before Go, but Go makes it incredibly straightforward to measure how fast a function executes.

I set up the following tests -

```go
package main

import (
	"image"
	"image/color"
	"testing"
)

func getTwoTestImages(dimension int) (image.Image, image.Image) {
	img1 := CreateNewImage(dimension, dimension, func(height, width int) color.RGBA {
		return color.RGBA{
			R: 200, G: 200, B: 200, A: 255,
		}
	})

	img2 := CreateNewImage(dimension, dimension, func(height, width int) color.RGBA {
		return color.RGBA{
			R: 100, G: 100, B: 100, A: 255,
		}
	})
	return img1, img2
}

func getTestDimension() int {
	return 1000
}

func BenchmarkModifyImageConcurrent(b *testing.B) {
	img, img2 := getTwoTestImages(getTestDimension())
	b.ResetTimer() // Reset the timer to exclude the setup time
	for i := 0; i < b.N; i++ {
		_ = blendImagesConcurrently(img, img2, replaceHue)
	}
}

func BenchmarkModifyImage(b *testing.B) {
	img, img2 := getTwoTestImages(getTestDimension())
	b.ResetTimer() // Reset the timer to exclude the setup time
	for i := 0; i < b.N; i++ {
		_ = blendImages(img, img2, replaceHue)
	}
}
```

The value returned from getTestDimension dictates the size of the test squares - so a value of 1000 means a 1000px x 1000px image.
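
If I wanted to compare several sizes in a single run instead of editing getTestDimension by hand, sub-benchmarks with b.Run would be one way to do it. A sketch of what that might look like (not currently in my test file, and it would also need "fmt" added to the imports):

```go
// BenchmarkModifyImageSizes sweeps a few image sizes and runs both the
// concurrent and linear implementations as named sub-benchmarks.
func BenchmarkModifyImageSizes(b *testing.B) {
	for _, dim := range []int{100, 300, 1000} {
		img, img2 := getTwoTestImages(dim) // setup happens outside the timed sub-benchmarks
		b.Run(fmt.Sprintf("concurrent-%dpx", dim), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				_ = blendImagesConcurrently(img, img2, replaceHue)
			}
		})
		b.Run(fmt.Sprintf("linear-%dpx", dim), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				_ = blendImages(img, img2, replaceHue)
			}
		})
	}
}
```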

Performance results

When I run with getTestDimension set to 1000, I get the following results. In the test time period, the concurrent function ran 63 times, while the linear function ran 8 times - just shy of 8 times faster. Though when I do the division of 128556984 nanoseconds / 23612317 nanoseconds I get 5.44 - so... either way it's faster? (The ns/op ratio is probably the number to trust: the iteration counts are just however many runs the benchmark framework could fit into its roughly one-second time budget, so they're a coarser signal.)

```
goos: darwin
goarch: amd64
pkg: pioneer
cpu: VirtualApple @ 2.50GHz
BenchmarkModifyImageConcurrent-8        63     23612317 ns/op
BenchmarkModifyImage-8                   8    128556984 ns/op
PASS
```

Though when I drop the test image size to 300x300, the results become:

```
goos: darwin
goarch: amd64
pkg: pioneer
cpu: VirtualApple @ 2.50GHz
BenchmarkModifyImageConcurrent-8      6544       177636 ns/op
BenchmarkModifyImage-8                 150      7855518 ns/op
```

Suddenly the concurrent implementation runs about 43x faster? And this time, both the iteration count and the nanoseconds per operation land much closer to the same ratio.

How many goroutines should I run?

I don't know! I landed on 500 because that's where the code ran fastest at 300x300; above 500, performance started to dip, presumably due to the overhead of orchestrating the goroutines themselves. I look forward to developing a deeper understanding of what's going on.
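
One heuristic I'd like to try is tying the count to the machine rather than hard-coding 500. Here's a sketch of what that might look like (untested guesswork on my part, using runtime.NumCPU from the standard library):

```go
// chooseNumGoroutines picks a chunk count based on the machine's logical CPUs
// (runtime.NumCPU), capped so we never ask for more chunks than there are rows.
// Requires "runtime" in the file's imports.
func chooseNumGoroutines(height int) int {
	n := runtime.NumCPU()
	if n > height {
		n = height
	}
	return n
}

// usage: numGoroutines := chooseNumGoroutines(height)
```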

In conclusion

This exercise has inspired more questions than answers - namely -

  • Why is the goroutine function so much faster at smaller image sizes, but not as much faster at larger image sizes?
  • Can I graph performance by goroutine count for a bunch of image sizes to determine if there is an optimal goroutine count for different size bands?
  • Does the optimal goroutine count vary by what computer is running the program?

Questions aside, I'm thrilled my application runs so much faster!!

a blurry and colorful image made with the Go image processing program - now at least 8ish to 44ish times faster