[Programming Study] Big Data Platform Fundamentals Course Summary: Implementing Multithreading in R
Here is the link to the instructor's course slides:
Since the course was taught in English, I was too lazy to translate, so the notes below are written directly in English~~~
Let's first check how many cores you can use from R:
```r
library(parallel)

# detectCores() reports the number of cores available to this session
detectCores()
```
PARALLELIZE USING `parallel`, `multicore`, `snow`, `foreach`
The `parallel` package was introduced in 2011 to unify two popular parallelisation packages: `snow` and `multicore`. The `multicore` package was designed to parallelise via the fork mechanism, on Linux machines. The `snow` package was designed to parallelise via other mechanisms. R processes started with `snow` are not forked, so they will not see the parent's data; data has to be copied to the child processes. The good news: `snow` can start R processes on Windows machines, or on remote machines in a cluster.
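As a quick illustration of that difference, here is a minimal sketch (the object name `big` is just a placeholder): a forked child inherits the parent's objects for free, while a socket worker starts with an empty workspace and needs the data shipped to it. On Windows, the forked branch would fail, which is exactly the limitation described above.

```r
library(parallel)

big <- rnorm(1e6)  # an object living in the parent R process

# fork-based (multicore-style, Unix only): children inherit `big`
mclapply(1:2, function(i) sum(big), mc.cores = 2)

# socket-based (snow-style, works anywhere): workers start empty,
# so `big` must be copied to them explicitly first
cl <- makeCluster(2)
clusterExport(cl, "big")
parLapply(cl, 1:2, function(i) sum(big))
stopCluster(cl)
```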
- The `parallel` library can be used to send tasks (encoded as function calls) to each of the processing cores on your machine in parallel.
- The most popular function, `mclapply()`, essentially parallelizes calls to `lapply()` (a toy sketch follows this list). `mclapply` gathers up the responses from each of these function calls, and returns a list of responses that is the same length as the list or vector of input data (one return per input item).
REMARK ON `mclapply`
- The `mclapply()` function (and the related `mc*` functions) works via the fork mechanism on Unix-style operating systems.
- Briefly, your R session is the main process, and when you call a function like `mclapply()`, you fork a series of sub-processes that operate independently from the main process (although they share a few low-level features).
- These sub-processes then execute your function on their subsets of the data, presumably on separate cores of your CPU. Once the computation is complete, each sub-process returns its results and is then killed. The parallel package manages the logistics of forking the sub-processes and handling them once they've finished. (A small demonstration follows this list.)
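One way to watch the forking happen is to ask each worker for its operating-system process ID; a small sketch (Unix-style systems only, per the remark above):

```r
library(parallel)

# the parent R session runs in one process...
Sys.getpid()

# ...while mclapply forks workers, each reporting its own PID
unlist(mclapply(1:4, function(i) Sys.getpid(), mc.cores = 2))
```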
Example
- Let's start with a simple example. We select `Sepal.Length` and `Species` from the iris dataset, subset it to 100 observations, and then iterate across 10,000 trials, each time resampling the observations with replacement.
- Run a logistic regression fitting species as a function of length, and record the coefficients for each trial to be returned.
```r
# built a function first
library(parallel)

# keep Sepal.Length and Species, dropping one species so that the
# response is binary; this leaves 100 observations
x <- iris[iris$Species != "setosa", c("Sepal.Length", "Species")]
x$Species <- droplevels(x$Species)
trials <- 10000

# one trial: resample the rows with replacement, fit a logistic
# regression of species on length, and return the coefficients
boot_fx <- function(trial) {
  ind <- sample(nrow(x), nrow(x), replace = TRUE)
  fit <- glm(Species ~ Sepal.Length, data = x[ind, ], family = binomial)
  coef(fit)
}
```

```r
# benchmark
system.time(res_serial <- lapply(seq_len(trials), boot_fx))
```

```r
# mclapply
system.time(res_mc <- mclapply(seq_len(trials), boot_fx, mc.cores = detectCores()))
```
`Elapsed` is the wall-clock time the call takes to run, as reported by `system.time()`. As you can see from this case study, `mclapply` results in a huge improvement.
PARALLELIZE WITH `parLapply`
- Using the forking mechanism on your computer is one way to execute parallel computation, but it's not the only way the parallel package offers.
- Another way to build a "cluster" using the multiple cores on your computer is via sockets.
- A socket is simply a mechanism with which multiple processes or applications running on your computer (or on different computers, for that matter) can communicate with each other.
- With parallel computation, data and results need to be passed back and forth between the parent and child processes, and sockets can be used for that purpose.
Example
- Building a socket cluster is simple to do in R with the `makeCluster()` function.

```r
library(snow)

# start a socket cluster with 4 worker R processes
cl <- makeCluster(4, type = "SOCK")

# socket workers start with empty workspaces, so ship over the
# objects the task needs before running it
clusterExport(cl, c("x", "boot_fx"))
system.time(res_sock <- parLapply(cl, seq_len(trials), boot_fx))

# shut the workers down when done
stopCluster(cl)
```
OTHER PACKAGES FOR PARALLEL COMPUTING IN R
Many R packages come with parallel features (or arguments). You may refer to this link for more details. Examples are:

- `future` provides a lightweight and unified Future API for sequential and parallel processing of R expressions via futures (a minimal sketch follows this list).
- `data.table` is a venerable and powerful package written primarily by Matt Dowle. It is a high-performance implementation of R's data frame construct, with an enhanced syntax. There have been innumerable benchmarks showcasing the power of the data.table package. It provides a highly optimized tabular data structure for most common analytical operations.
- The `caret` package (Classification And REgression Training) is a set of functions that streamline the process of creating predictive models. The package contains tools for data splitting, preprocessing, feature selection, model tuning using resampling, variable importance estimation, and other functionality.
- `multidplyr` is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()`, and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimizes time spent moving data around, and maximizes parallel performance.
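To give a flavour of one of these, here is a minimal sketch of the `future` API; the expression being evaluated is arbitrary:

```r
library(future)

# choose a parallel backend: multisession launches background R sessions
plan(multisession, workers = 2)

# a future starts evaluating its expression asynchronously;
# value() blocks until the result is ready
f <- future(sum(rnorm(1e6)))
value(f)
```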