How to find performance bottlenecks in your application

This is a part of a KCachegrind window

This is a part of a KCachegrind window

Profiling your application to find performance bottlenecks is easier then you may think, I’m using gperftools and KCachegrind to profile Laustracker that makes heavy use of OpenCV functions.

Before you can start profiling you need to install the following tools:

[chris@thinkpad ~]$ sudo yum install pprof gperftools-libs kdesdk-kcachegrind

Have a look at the files that get installed by gperftools-libs, later we will need libprofiler.so.

[chris@thinkpad ~]$ rpm -ql gperftools-libs
/usr/lib64/libprofiler.so.0
/usr/lib64/libprofiler.so.0.3.0
/usr/lib64/libtcmalloc.so.4
/usr/lib64/libtcmalloc.so.4.1.0
/usr/lib64/libtcmalloc_and_profiler.so.4
/usr/lib64/libtcmalloc_and_profiler.so.4.1.0
/usr/lib64/libtcmalloc_debug.so.4
/usr/lib64/libtcmalloc_debug.so.4.1.0
/usr/lib64/libtcmalloc_minimal.so.4
/usr/lib64/libtcmalloc_minimal.so.4.1.0
/usr/lib64/libtcmalloc_minimal_debug.so.4
/usr/lib64/libtcmalloc_minimal_debug.so.4.1.0
[chris@thinkpad ~]$

We will start with an easy application that you may know:

[chris@thinkpad ~]$ which top
/usr/bin/top
[chris@thinkpad ~]$ env LD_PRELOAD="/usr/lib64/libprofiler.so" CPUPROFILE=top.prof /usr/bin/top
...
PROFILE: interrupts/evictions/bytes = 3/0/344
[chris@thinkpad ~]$ pprof --callgrind /usr/bin/top top.prof > top.callgrind
Using local file /usr/bin/top.
Using local file top.prof.
Removing killpg from all stack traces.
[chris@thinkpad ~]$ pprof --text /usr/bin/top top.prof
Using local file /usr/bin/top.
Using local file top.prof.
Removing killpg from all stack traces.
Total: 3 samples
      1  33.3%  33.3%        1  33.3% __getdents64
      1  33.3%  66.7%        1  33.3% __open_nocancel
      1  33.3% 100.0%        1  33.3% __xstat64
      0   0.0% 100.0%        2  66.7% 00000000004034e5
      0   0.0% 100.0%        3 100.0% 0000000000404a9c
      0   0.0% 100.0%        1  33.3% 00000000004090f8
      0   0.0% 100.0%        2  66.7% 0000000000409190
      0   0.0% 100.0%        3 100.0% __libc_start_main
      0   0.0% 100.0%        1  33.3% _init
      0   0.0% 100.0%        1  33.3% _init (inline)
      0   0.0% 100.0%        1  33.3% readdir64
      0   0.0% 100.0%        3 100.0% readproc
      0   0.0% 100.0%        3 100.0% tty_to_dev
[chris@thinkpad ~]$ kcachegrind top.callgrind

Since I haven’t written the top program I don’t know exactly what these functions do, readproc probably reads process information from /proc. You see that it is not strongly necessary to modify the application to get profiling information but if you use a debug build the output will include file and line number information and kcachegrind will show an annotated source code view.

KCachegrind top.callgrind

KCachegrind top.callgrind

Profiling Laustracker

You will find the source code of Laustracker here:

From http://stackoverflow.com/questions/7622672/opencv-better-performance:

Traditional techniques for improving image analysis:

  • Reduce the image to a monochrome sample.
  • Reduce the range of samples, e.g. from 8-bit monochrome to 4-bit monochrome.
  • Reduce the size of the image, e.g. 1024x1924 to 64x64.
  • Reduce the frame rate, e.g 60fps to 5fps.
  • Perform a higher level function to guess where the target area is with say a lower resolution, then perform the regular analysis on the cropped output, e.g. perform image recognition to locate the hand before determining the gesture.

But first compile Laustracker with debugging informations:

[chris@thinkpad laustracker]$ rm -r build/
[chris@thinkpad laustracker]$ mkdir build
[chris@thinkpad laustracker]$ cd build/
[chris@thinkpad build]$ cmake .. -DCMAKE_BUILD_TYPE=Debug
[chris@thinkpad build]$ make

Get profiling data

[chris@thinkpad cli]$ env LD_PRELOAD="/usr/lib64/libprofiler.so" CPUPROFILE=laustracker-cli.prof ../../bin/laustracker-cli --robots-img=../../testdata/aoi_robots_dark.jpg --walls-img=../../testdata/aoi_labyrinth_empty.jpg --benchmark
...
FPS: 9.368
FPS: 20.53
FPS: 19.04
FPS: 20.06
FPS: 20.45
FPS: 20.05
FPS: 20.26
FPS: 19.94
FPS: 19.29
FPS: 19.89
FPS: 20.57
FPS: 20.41
FPS: 20.09
FPS: 19.74
PROFILE: interrupts/evictions/bytes = 694/77/45320
[chris@thinkpad cli]$

I’m using the --benchmark option only to run at maximum FPS.

Analyse the data

[chris@thinkpad cli]$ pprof --text ../../bin/laustracker-cli laustracker-cli.prof | head -n 25
Using local file ../../bin/laustracker-cli.
Using local file laustracker-cli.prof.
Removing killpg from all stack traces.
Total: 694 samples
    149  21.5%  21.5%      149  21.5% cv::CvtColorLoop_Invoker::operator
    123  17.7%  39.2%      192  27.7% cv::warpPerspectiveInvoker::operator
      67   9.7%  48.8%       67   9.7% cv::inRange8u
      65   9.4%  58.2%       65   9.4% cv::RemapVec_8u::operator
      63   9.1%  67.3%       63   9.1% cv::MorphFilter::operator
      59   8.5%  75.8%      124  17.9% cv::remapBilinear
      24   3.5%  79.3%       24   3.5% cvFindNextContour
      23   3.3%  82.6%       27   3.9% gdk_region_spans_intersect_foreach
      16   2.3%  84.9%       16   2.3% __memcpy_ssse3_back
      15   2.2%  87.0%       83  12.0% cv::inRange
      12   1.7%  88.8%      142  20.5% cv::RemapInvoker::operator
      7   1.0%  89.8%        7   1.0% icvCvt_BGR2RGB_8u_C3R
      6   0.9%  90.6%        6   0.9% cv::MorphRowFilter::operator
      6   0.9%  91.5%        6   0.9% cv::split8u
      5   0.7%  92.2%        6   0.9% cv::FillConvexPoly [clone .constprop.70]
      5   0.7%  92.9%        5   0.7% cv::vBinOp8
      4   0.6%  93.5%        4   0.6% cv::Mat::Mat
      3   0.4%  93.9%        3   0.4% cv::MorphColumnFilter::operator
      2   0.3%  94.2%        2   0.3% __poll
      2   0.3%  94.5%        3   0.4% _int_malloc
      2   0.3%  94.8%        2   0.3% brk
      2   0.3%  95.1%       77  11.1% cv::FilterEngine::proceed
      2   0.3%  95.4%        3   0.4% cvStartFindContours
      2   0.3%  95.7%        3   0.4% do_lookup_x
[chris@thinkpad cli]$
[chris@thinkpad cli]$ pprof --callgrind ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.callgrind
Using local file ../../bin/laustracker-cli.
Using local file laustracker-cli.prof.
Removing killpg from all stack traces.
[chris@thinkpad cli]$ kcachegrind laustracker-cli.callgrind
[chris@thinkpad cli]$
KCachegrind laustracker-cli.callgrind

KCachegrind laustracker-cli.callgrind

[chris@thinkpad cli]$ pprof --gv ../../bin/laustracker-cli laustracker-cli.prof
[chris@thinkpad cli]$ pprof --gv --nodecount=30 ../../bin/laustracker-cli laustracker-cli.prof
[chris@thinkpad cli]$ pprof --gif --nodecount=30 ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.prof.gif
Output image of: pprof --gif --nodecount=30 ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.prof.gif

Output image of pprof --gif --nodecount=30 ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.prof.gif

What does this tell us

A lot of time is spent in cv::warpPerspectiveInvoker::operator and cv::MorphFilter::operator and since I’ve written the application I know that it should also do its job if I remove them at some positions.

Laustracker CLI with warpPerspective

[chris@thinkpad cli]$ ../../bin/laustracker-cli --robots-img=../../testdata/aoi_robots_dark.jpg --walls-img=../../testdata/aoi_labyrinth_empty.jpg --benchmark
...
FPS: 20.31
FPS: 19.91
FPS: 20.46
FPS: 19.94
FPS: 19.15
FPS: 20.34
FPS: 20.28
FPS: 20.01
FPS: 20.41
FPS: 19.36
FPS: 20.28
FPS: 19.93
FPS: 20.45
FPS: 18.92
FPS: 19.51
FPS: 21.45
FPS: 19.83
FPS: 19.18
FPS: 18.96
FPS: 20.12
FPS: 20.26
[chris@thinkpad cli]$
Laustracker CLI with warpPerspective

Laustracker CLI with warpPerspective

Laustracker CLI without warpPerspective

I’ve added the command line option --no-wrap to test this:

[chris@thinkpad cli]$ ../../bin/laustracker-cli --robots-img=../../testdata/aoi_robots_dark.jpg --walls-img=../../testdata/aoi_labyrinth_empty.jpg --benchmark --no-wrap
...
FPS: 26.49
FPS: 26.75
FPS: 26.75
FPS: 25.44
FPS: 25.19
FPS: 26.34
FPS: 24.89
FPS: 25.48
FPS: 25.69
FPS: 25.5
FPS: 23.91
FPS: 25.88
FPS: 25.7
FPS: 26.33
FPS: 26.89
FPS: 26.34
FPS: 26.55
FPS: 25.65
FPS: 26.49
FPS: 23.47
FPS: 26.48
FPS: 25.29
FPS: 26.64
[chris@thinkpad cli]$
Laustracker CLI without warpPerspective

Laustracker CLI without warpPerspective

You see the image without is only a little bit warped but the frame rate has improved drastically.

Reduce the usage of the MorphFilter

I’ve added the command line option --fast to test this:

[chris@thinkpad cli]$ ../../bin/laustracker-cli --robots-img=../../testdata/aoi_robots_dark.jpg --walls-img=../../testdata/aoi_labyrinth_empty.jpg --benchmark --no-wrap --fast
...
FPS: 29.49
FPS: 27.71
FPS: 29.52
FPS: 30.69
FPS: 29.54
FPS: 30.41
FPS: 28.55
FPS: 29.95
FPS: 30.66
FPS: 28.1
FPS: 29.95
FPS: 28.25
FPS: 28.79
FPS: 30.58
FPS: 29.92
FPS: 28.94
FPS: 27.95
FPS: 31.05
FPS: 30.02
FPS: 30.76
[chris@thinkpad cli]$

And you can see the frame rate has improved again.

Changed some functions and avoided copying image data

Now I’ve changed some more functions and avoided copying image data where possible:

[chris@thinkpad cli]$ ../../bin/laustracker-cli --robots-img=../../testdata/aoi_robots_dark.jpg --walls-img=../../testdata/aoi_labyrinth_empty.jpg --benchmark --fast --no-wrap
...
FPS: 15.19
FPS: 35.12
FPS: 33.84
FPS: 33.7
FPS: 33.65
FPS: 34.09
FPS: 32.7
FPS: 33.5
FPS: 34.45
FPS: 32.19
FPS: 34.27
FPS: 33.59
FPS: 33.22
FPS: 33.28
FPS: 33.6
FPS: 33.32
FPS: 33.89
FPS: 34.79
FPS: 33.78
FPS: 34.17
FPS: 33.59
FPS: 36.7
FPS: 34.76
FPS: 33.11
[chris@thinkpad cli]$

Set a image ROI

Setting a region of interest in OpenCV will reduce the image size that needs to be processed and in turn will improve the performance.

bool use_image_roi = labyrinth_pos_ready && fastmode;

if (use_image_roi)
{
    // define image ROI
    img = image(Rect(rob_labyrinth_pos[0], rob_labyrinth_pos[2]));
    if (debuglevel > 3) laustracker_imshow("Robots: Region of Interest", img);
}
else
{
    img = image;
}

The result

[chris@thinkpad cli]$ env LD_PRELOAD="/usr/lib64/libprofiler.so" CPUPROFILE=laustracker-cli.prof ../../bin/laustracker-cli --robots-img=../../testdata/aoi_robots_dark.jpg --walls-img=../../testdata/aoi_labyrinth_empty.jpg --benchmark --fast --no-wrap
...
FPS: 42.49
FPS: 41.41
FPS: 45.8
FPS: 41.3
FPS: 41.05
FPS: 43.66
FPS: 36.82
FPS: 41.83
FPS: 38.45
FPS: 41.46
FPS: 41.84
FPS: 41.52
FPS: 41.79
FPS: 39.71
FPS: 42.16
FPS: 43.21
FPS: 40.21
FPS: 43.37
FPS: 45.75
FPS: 41.38
FPS: 41.76
FPS: 42.56
FPS: 39.63
FPS: 38.53
FPS: 39.75
FPS: 39.88
FPS: 44.67
FPS: 41.93
FPS: 40.62
FPS: 43.35
FPS: 40.7
FPS: 40.53
FPS: 41.23
FPS: 39.45
FPS: 36.61
FPS: 40.99
FPS: 40.61
FPS: 40.68
FPS: 44.28
FPS: 41.59
PROFILE: interrupts/evictions/bytes = 913/26/38808
[chris@thinkpad cli]$ pprof --text ../../bin/laustracker-cli laustracker-cli.prof | head -n 25
Using local file ../../bin/laustracker-cli.
Using local file laustracker-cli.prof.
Removing killpg from all stack traces.
Total: 913 samples
    202  22.1%  22.1%      202  22.1% cv::CvtColorLoop_Invoker::operator
    177  19.4%  41.5%      177  19.4% cv::inRange8u
    114  12.5%  54.0%      114  12.5% cv::RemapVec_8u::operator
      90   9.9%  63.9%      204  22.3% cv::remapBilinear
      68   7.4%  71.3%       85   9.3% gdk_region_spans_intersect_foreach
      65   7.1%  78.4%      243  26.6% cv::inRange
      42   4.6%  83.0%       42   4.6% cvFindNextContour
      25   2.7%  85.8%       25   2.7% icvCvt_BGR2RGB_8u_C3R
      24   2.6%  88.4%      235  25.7% cv::RemapInvoker::operator
      19   2.1%  90.5%       19   2.1% __memcpy_ssse3_back
      9   1.0%  91.5%        9   1.0% cv::MorphRowFilter::operator
      8   0.9%  92.3%        8   0.9% cv::copyMask8u
      7   0.8%  93.1%        7   0.8% cv::Mat::Mat
      5   0.5%  93.6%        5   0.5% __poll
      4   0.4%  94.1%        4   0.4% recv
      4   0.4%  94.5%        4   0.4% writev
      3   0.3%  94.9%        3   0.3% cv::MorphColumnFilter::operator
      3   0.3%  95.2%        3   0.3% cv::ThresholdRunner::operator
      2   0.2%  95.4%        2   0.2% __GI_memset
      2   0.2%  95.6%        3   0.3% cv::Line2
      2   0.2%  95.8%        2   0.2% cv::LineAA
      2   0.2%  96.1%        2   0.2% icvHoughLinesStandard
      2   0.2%  96.3%        2   0.2% pthread_cond_signal@@GLIBC_2.3.2
      2   0.2%  96.5%        2   0.2% std::basic_fstream::close
[chris@thinkpad cli]$ pprof --gif ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.prof.gif
Using local file ../../bin/laustracker-cli.
Using local file laustracker-cli.prof.
Removing killpg from all stack traces.
Dropping nodes with <= 4 samples; edges with <= 0 abs(samples)
[chris@thinkpad cli]$
New output image of pprof --gif ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.prof.gif

New output image of pprof --gif ../../bin/laustracker-cli laustracker-cli.prof > laustracker-cli.prof.gif

This should be fast enough as I won’t get more than 25 FPS from the camera.

Update: I get about 54 FPS after I’ve upgraded to Fedora 19 Alpha.