While you wait for that to finish, can I interest you in parallel processing?

parallel processing

Max Kuhn


January 17, 2018

caret has been able to utilize parallel processing for some time (before it was on CRAN in October 2007) using slightly different versions of the package. Around September of 2011, caret started using the foreach package was used to “harmonize” the parallel processing technologies thanks to a super smart guy named Steve Weston.

I’ve done a few benchmarks to quantify the benefits to parallel processing over the years. I just did another set since I just got a new iMacPro. This time, I put the code and results in a GitHub repo. The main questions that I wanted to answer were:

The task I used was to tune a boosted classification tree model (via the xgboost) package over 25 random values of the seven tuning parameters. The data were simulated and contained 2,000 training set points and 115 numeric predictors. Five repeats of 10-fold cross-validation was used so that 1,250 separate models could be run in parallel. An example of the code is here.

I tested this using three machines:

All the computers have >=8GB of memory. The DIY machine was running Ubuntu 16.04 as well as Windows 10. All of the builds used R 3.4.2 or higher.

The methods for parallelization were:

This JSS paper from 2009 might be helpful.

Once nice thing about parallelism in caret is that you do not have to modify the package code. With foreach, you simply “register” a parallel backend and the caret functions automatically utilize the workers.

To be clear, the forking/socket approaches split the 1,250 model fits across the workers while the threading affects each individual model fit (one-at-a-time).

In each case, different number of workers were applied to the problems and the execution time was measured. Since these are all recent Intel processors, hyper-threading can be used. This means that, even though you have X physical cores, there are 2X “logical cores” that can perform independent operations. For that reason, we can add more workers than physical cores. The order of all runs were randomized and run in clear R sessions.

First, let’s look at the non-threading results. The raw execution times were:

When converted to speed-ups, we get:

The speed-up is the sequential time divided by the parallel time; a speed-up of 3 means that the parallel execution took a third less time than the sequential code.

My observations:

The doMC results are strange. When the number of workers is a multiple of 5, there is a performance hit. At first I thought this was an aberration until it was apparent that it occurred over two operating systems and three computers. Also, it is specific to doMC; forking using doParallel does not have the same issue. I’ve notified the doMC package maintainer about it.

What about threading? These runs had foreach run sequentially and allowed xgb.train to run in parallel by adjusting the nthread argument. The results:

This isn’t so great and really surprising (or fair). We are starting and stopping threads 1,250 times across the model tuning process. In general, we want to parallelize the longest running part of the algorithm. I have no doubt of the utility of threading the computations for a single model but it can be counterproductive when model is called repeatedly.

The code and results can be found here if you are interested in more details.

(This article was originally posted at http://appliedpredictivemodeling.com)