Title: ddR README
Authors: Edward Ma, Indrajit Roy, Michael Lawrence
Date: 2015-10-22

The 'ddR' package aims to provide a unified R interface for writing parallel and distributed applications. Our goal is to ensure that R programs written using the 'ddR' API work across different distributed backends, thereby reducing the effort users need to understand and program on each backend. Currently, 'ddR' programs can be executed on R's default 'parallel' package as well as on the open source HP Distributed R. We plan to add support for SparkR. This package is the outcome of feedback and collaboration across different companies and R-core members.

With funding from the R Consortium, this package is under active development through the summer of 2016. See the mailing list for the latest discussions.

'ddR' is an API, bundled with a default execution engine, for expressing and executing distributed applications. Users can declare distributed objects (i.e., dlist, dframe, darray) and execute parallel operations on these data structures using R-style apply functions. Different backends (those that support ddR and have ddR "drivers" written for them) can also be activated dynamically in the user's R environment to execute applications.

Please refer to the user guide under vignettes/ for a detailed description of how to use the package.

Some quick examples

library(ddR)

By default, the parallel backend is used with all the cores present on the machine. You can switch backends or specify the number of cores to use with the useBackend function. For example, you can specify that the parallel backend should be used with only 4 cores by executing useBackend(parallel, executors=4).

Initializing a distributed list (dlist):

a <- dmapply(function(x) { x }, rep(3,5))
collect(a)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 3
## 
## [[5]]
## [1] 3

Printing a:

a
## 
## ddR Distributed Object
## Type: dlist
## # of partitions: 5
## Partitions per dimension: 5x1
## Partition sizes: [1], [1], [1], [1], [1]
## Length: 5
## Backend: parallel

a is a distributed object in ddR. Note that we did not specify the number of partitions of the output; by default it equals the length of the inputs (5). Use the parameter nparts to specify how the output should be partitioned.

The code below adds 1 to the first element of a, 2 to the second, and so on. The syntax of dmapply is similar to that of R's standard mapply function.

b <- dmapply(function(x,y) { x + y }, a, 1:5,nparts=1)
b
## 
## ddR Distributed Object
## Type: dlist
## # of partitions: 1
## Partitions per dimension: 1x1
## Partition sizes: [5]
## Length: 5
## Backend: parallel

Since we specified nparts=1 in dmapply, b has only one partition, which holds all 5 elements. The argument nparts is optional and can always be omitted.

collect(b)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 7
## 
## [[5]]
## [1] 8
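
For comparison, here is a rough base R counterpart of the dmapply call that produced b, using mapply on an ordinary list. It is only a sketch of the shared calling convention: the names a_local and b_local are made up for this illustration, nothing is distributed or partitioned, and SIMPLIFY = FALSE is set so that mapply returns a list, as collect(b) does.

```r
# Base R only: an ordinary list standing in for the distributed list `a`.
a_local <- as.list(rep(3, 5))

# Same function and same two inputs as the dmapply call, but evaluated
# locally by mapply. SIMPLIFY = FALSE keeps the result as a list.
b_local <- mapply(function(x, y) { x + y }, a_local, 1:5, SIMPLIFY = FALSE)

# Show the structure of the result.
str(b_local)
```

The element values match those returned by collect(b); what dmapply adds on top is the choice of backend and the partitioning of the result.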

Some other operations:

Adding a to b, and then subtracting a constant value:

addThenSubtract <- function(x,y,z) {
  x + y - z
}
c <- dmapply(addThenSubtract,a,b,MoreArgs=list(z=5))
collect(c)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
## 
## [[5]]
## [1] 6
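
The role of MoreArgs can be seen in base R's mapply, which has an argument of the same name. The sketch below uses plain lists and hypothetical names (a_local, b_local, c_local, whole); it is not ddR code. The arguments x and y are iterated element by element, while everything in MoreArgs is handed unchanged to every call.

```r
add_then_subtract <- function(x, y, z) {
  x + y - z
}

# Ordinary lists standing in for the distributed lists `a` and `b`.
a_local <- as.list(rep(3, 5))
b_local <- as.list(4:8)

# x and y walk through a_local and b_local in step; z is 5 in every call.
c_local <- mapply(add_then_subtract, a_local, b_local,
                  MoreArgs = list(z = 5), SIMPLIFY = FALSE)
str(c_local)

# A vector placed in MoreArgs is not iterated: each call sees all of it,
# so length(z) is 2 in every one of the three calls.
whole <- mapply(function(x, z) length(z), 1:3, MoreArgs = list(z = c(10, 20)))
print(whole)
```

An argument that should vary per element belongs among the iterated inputs; one that should stay fixed across calls belongs in MoreArgs.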

We can also process distributed objects partition by partition. Below is an example where we calculate the length of each partition:

d <- dmapply(function(x) length(x),parts(a))
collect(d)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1
## 
## [[5]]
## [1] 1

a has 5 elements spread across 5 partitions, so the length of each partition is 1.

However, b only had one partition, so that one partition should be of length 5:

e <- dmapply(function(x) length(x),parts(b))
collect(e)
## [[1]]
## [1] 5
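
The two results above can be pictured with a small base R sketch. The helper split_into_parts is hypothetical and not part of ddR; it simply cuts an ordinary list into contiguous chunks, and it makes no claim about how ddR or any backend actually lays out partitions. It only illustrates why the same function reports five lengths of 1 in one case and a single length of 5 in the other.

```r
# Hypothetical helper (not part of ddR): split a list into `nparts`
# contiguous chunks of near-equal size.
split_into_parts <- function(x, nparts) {
  groups <- ceiling(seq_along(x) * nparts / length(x))
  unname(split(x, groups))
}

# An ordinary list standing in for the five-element distributed lists.
a_local <- as.list(rep(3, 5))

# Five chunks of one element each, analogous to parts(a).
lengths_5 <- lapply(split_into_parts(a_local, 5), length)

# One chunk holding all five elements, analogous to parts(b).
lengths_1 <- lapply(split_into_parts(a_local, 1), length)

str(lengths_5)
str(lengths_1)
```

In both cases the function is applied once per chunk rather than once per element, which is the difference between passing parts(a) and passing a itself.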

Note that parts() and non-parts arguments can be combined freely in a call to dmapply. parts(dobj) returns a list of the partitions of the distributed object dobj, which can be passed into dmapply like any other list. parts(dobj,index), where index is a list, vector, or scalar, returns a specific partition or range of partitions of dobj.

We also support darrays and dframes. See vignettes/ for how to use them.

For more interesting parallel machine learning algorithms, you may view (and run) the example scripts under /examples.

Using the Distributed R backend

To use the Distributed R library for ddR, first install distributedR.ddR and then load it:

library(distributedR.ddR)
## Loading required package: distributedR
## Loading required package: Rcpp
## Loading required package: RInside
## Loading required package: XML
## Loading required package: ddR
## 
## Attaching package: 'ddR'
## 
## The following objects are masked from 'package:distributedR':
## 
##     darray, dframe, dlist, is.dlist
useBackend(distributedR)

Now you can try the list examples shown earlier for the 'parallel' backend.

How to Contribute

You can help us in different ways:

  1. Reporting issues.
  2. Contributing code and sending a Pull Request.

To contribute to the code base of this project, you must agree to the Developer Certificate of Origin (DCO) 1.1 for this project under GPLv2+:

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the 
    right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my 
    knowledge, is covered under an appropriate open source license and I 
    have the right under that license to submit that work with modifications, 
    whether created in whole or in part by me, under the same open source 
    license (unless I am permitted to submit under a different license), 
    as indicated in the file; or
(c) The contribution was provided directly to me by some other person who 
    certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and
    that a record of the contribution (including all personal information I submit 
    with it, including my sign-off) is maintained indefinitely and may be 
    redistributed consistent with this project or the open source license(s) involved.

To indicate acceptance of the DCO, add a Signed-off-by line to every commit. For example:

Signed-off-by: John Doe <john.doe@hisdomain.com>

To add that line automatically, use the -s switch when running git commit:

$ git commit -s





