バイオインフォマティクス、勉強中 (Bioinformatics, Still Learning): R: Mapping with Map()

Map the values with Map().

Map() belongs to the apply family of functions. It is a thin wrapper around mapply() that always returns a list.

Map() is convenient because it accepts any number of vectors and passes their corresponding elements to the function, one set per call. A multicore version, mcMap(), is available in the parallel package.
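As a small illustration of the wrapper relationship, the sketch below checks that Map() gives the same list as mapply() called with SIMPLIFY = FALSE. The vectors a and b and the function add2 are made up for this example and do not appear elsewhere in this post.

```r
# Hypothetical inputs, used only for this illustration.
a <- c(1, 2, 3)
b <- c(10, 20, 30)
add2 <- function(p, q) p + q

# Map() calls add2(a[1], b[1]), add2(a[2], b[2]), ... and returns a list.
res.Map <- Map(add2, a, b)

# mapply() does the same when simplification is switched off.
res.mapply <- mapply(add2, a, b, SIMPLIFY = FALSE)

identical(res.Map, res.mapply)
```

With the default SIMPLIFY = TRUE, mapply() would try to collapse the result into a vector or matrix instead of a list.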

#######################################################
Suppose we have a data frame like this.

x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
                strB = c("a", "b", "c", "d", "e", "f"),
                strC = c("1", "2", "3", "4", "5", "6"))

> x
  strA strB strC
1    A    a    1
2    B    b    2
3    C    c    3
4    D    d    4
5    E    e    5
6    F    f    6

For each row, we want to concatenate the strings in the columns strA, strB, and strC, joined by a separator of our choice (here, "-").

Naturally, this can be done with a for() loop.

x$strABC <- ""
# seq_len() is safe even when the data frame has zero rows
for(i in seq_len(nrow(x))){
  strABC <- paste(x$strA[i], x$strB[i], x$strC[i], sep = "-")
  x$strABC[i] <- strABC
}

The result of the for() loop:

> x
  strA strB strC strABC
1    A    a    1  A-a-1
2    B    b    2  B-b-2
3    C    c    3  C-c-3
4    D    d    4  D-d-4
5    E    e    5  E-e-5
6    F    f    6  F-f-6

However, the loop gets slower as the number of rows grows. Map() does the same job faster.

To use Map(), first define the function that Map() will apply.

get.strABC <- function(strA, strB, strC){
  return(paste(strA, strB, strC, sep = "-"))
}

Calling the function directly gives this result.

> get.strABC("A", "a", "1")
[1] "A-a-1"

Using this function inside the for() loop above gives the same result as before.

x$strABC <- ""
for(i in seq_len(nrow(x))){
  x$strABC[i] <- get.strABC(x$strA[i], x$strB[i], x$strC[i])
}

Now run get.strABC() with Map(). Map() returns a list, so the result must be flattened with unlist() before it can be stored as a column.

x$strABC <- unlist(Map(f = get.strABC,
                       strA = x$strA,
                       strB = x$strB,
                       strC = x$strC))

> x
  strA strB strC strABC
1    A    a    1  A-a-1
2    B    b    2  B-b-2
3    C    c    3  C-c-3
4    D    d    4  D-d-4
5    E    e    5  E-e-5
6    F    f    6  F-f-6
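To make the list-then-flatten step concrete, here is a minimal sketch on two short vectors. It redefines get.strABC() so that it runs on its own. The use.names = FALSE argument is an addition not used in the code above; it is included on the assumption that a plain, unnamed character vector is what we want in the column.

```r
get.strABC <- function(strA, strB, strC){
  return(paste(strA, strB, strC, sep = "-"))
}

# Map() returns a list with one element per set of inputs.
res <- Map(f = get.strABC,
           strA = c("A", "B"),
           strB = c("a", "b"),
           strC = c("1", "2"))
is.list(res)

# unlist() flattens the list into a character vector.
# use.names = FALSE drops any names carried over from the inputs.
flat <- unlist(res, use.names = FALSE)
flat
```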

mcMap(), the multicore version of Map(), can be used in the same way. You need to load the parallel package first.

library(parallel)
x$strABC <- unlist(mcMap(f = get.strABC,
                         strA = x$strA,
                         strB = x$strB,
                         strC = x$strC))

The number of cores to use is controlled with the mc.cores argument of mcMap().
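A minimal sketch of passing mc.cores follows. The variable n.cores and the Windows check are my own additions: mcMap() relies on forking, which is not available on Windows, so the sketch assumes a fallback to a single core there.

```r
library(parallel)

get.strABC <- function(strA, strB, strC){
  return(paste(strA, strB, strC, sep = "-"))
}

# Assumption: more than one core is only usable where forking works,
# so use 1 core on Windows and 2 elsewhere.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- unlist(mcMap(f = get.strABC,
                    strA = c("A", "B", "C"),
                    strB = c("a", "b", "c"),
                    strC = c("1", "2", "3"),
                    mc.cores = n.cores),
              use.names = FALSE)
res
```

detectCores() from the same package reports how many cores the machine has, which can help when choosing a value for mc.cores.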

#######################################################
Compare for() and Map() with more rows.

x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
                strB = c("a", "b", "c", "d", "e", "f"),
                strC = c("1", "2", "3", "4", "5", "6"))

for(i in 1:12) x <- rbind(x, x)

> nrow(x)
[1] 24576
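The row count follows from the doubling: each rbind(x, x) doubles the number of rows, so 12 passes turn 6 rows into 6 * 2^12 = 24576. A quick self-contained check (n0 is a name introduced here for the starting row count):

```r
x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
                strB = c("a", "b", "c", "d", "e", "f"),
                strC = c("1", "2", "3", "4", "5", "6"))

n0 <- nrow(x)                       # 6 rows to start with
for(i in 1:12) x <- rbind(x, x)     # each pass doubles the rows

nrow(x) == n0 * 2^12
```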

The for() loop takes 4.286 seconds.

x$strABC <- ""
system.time(
  for(i in seq_len(nrow(x))){
    x$strABC[i] <- get.strABC(x$strA[i], x$strB[i], x$strC[i])
  }
)

   user  system elapsed
  4.169   0.137   4.286
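The times quoted in the text are the "elapsed" (wall-clock) column of the system.time() output; "user" and "system" are CPU times. The sketch below shows how to pull out the elapsed value programmatically. The loop being timed is a throwaway workload chosen only for illustration.

```r
# Time an arbitrary small workload (hypothetical, for illustration only).
tm <- system.time(for(i in 1:1e5) sqrt(i))

# The returned object has five named components; printing shows three columns.
names(tm)

# Wall-clock seconds, the figure quoted in the text of this post.
tm[["elapsed"]]
```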

Map() takes 0.940 seconds, faster than for().

system.time(
  x$strABC <- unlist(Map(f = get.strABC,
                         strA = x$strA,
                         strB = x$strB,
                         strC = x$strC))
)

   user  system elapsed
  0.892   0.009   0.940

mcMap() takes 0.711 seconds. The difference from Map() is small.

system.time(
  x$strABC <- unlist(mcMap(f = get.strABC,
                           strA = x$strA,
                           strB = x$strB,
                           strC = x$strC, mc.cores = 4))
)

   user  system elapsed
  1.697   0.155   0.711

#######################################################
Compare Map() and mcMap() with even more rows.

x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
                strB = c("a", "b", "c", "d", "e", "f"),
                strC = c("1", "2", "3", "4", "5", "6"))

for(i in 1:16) x <- rbind(x, x)

> nrow(x)
[1] 393216

Map() takes 16.926 seconds. I am too scared to try for().

system.time(
  x$strABC <- unlist(Map(f = get.strABC,
                         strA = x$strA,
                         strB = x$strB,
                         strC = x$strC))
)

   user  system elapsed
 16.596   0.107  16.926

mcMap(mc.cores = 4) takes 8.939 seconds, about half the time of Map().

system.time(
  x$strABC <- unlist(mcMap(f = get.strABC,
                           strA = x$strA,
                           strB = x$strB,
                           strC = x$strC, mc.cores = 4))
)

   user  system elapsed
  1.104   0.172   8.939

mcMap(mc.cores = 20) takes 3.412 seconds. That is quite fast.

system.time(
  x$strABC <- unlist(mcMap(f = get.strABC,
                           strA = x$strA,
                           strB = x$strB,
                           strC = x$strC, mc.cores = 20))
)

   user  system elapsed
 80.537   3.933   3.412

Tuesday, July 12, 2016
