Studying Bioinformatics

Friday, July 21, 2017

An encoding problem

I happened to obtain a file that I could not open, which cost me some effort.
Checking it with mi showed that its encoding was Windows-1252.

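Here is how to solve this in R.
Open the file through a connection that declares the encoding, read it with readLines(), and write it back out as UTF-8 with writeLines(). Note that the encoding argument of readLines() only marks the strings and does not convert them, so the encoding is declared on the connection instead.

con <- file("XXX.gff3", encoding = "windows-1252")
gff <- readLines(con)
close(con)
writeLines(enc2utf8(gff), "XXX_UTF8.gff3", useBytes = TRUE)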

This works around the problem for now.
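As a small self-contained check of the idea, the sketch below converts a single Windows-1252 string to UTF-8 with iconv(). The sample bytes are made up for illustration, and "CP1252" is assumed to be an encoding name that your platform's iconv accepts (iconvlist() shows the available names).

```r
# Hypothetical sample: "caf" followed by byte 0xE9, which is "é" in Windows-1252.
cp1252_line <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))

# Convert the bytes from Windows-1252 to UTF-8.
utf8_line <- iconv(cp1252_line, from = "CP1252", to = "UTF-8")

# The original bytes are not valid UTF-8; the converted string is.
validUTF8(cp1252_line)
validUTF8(utf8_line)

# In UTF-8, "é" is stored as the two bytes c3 a9.
charToRaw(utf8_line)
```

The same iconv() call works on a whole character vector, so it could be applied to lines read from a file.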

Thursday, July 6, 2017

homebrew: tbl2asn installation failed

What to do when installing tbl2asn through Homebrew fails.

Error: SHA256 mismatch
Expected: 429d63ee3c36d1f2f6322c62c6089d5ee8a8b089e5cc9373e298e017bcbbb9ec
Actual: bfaff2308fd58b94730639b9c6b520ca1884ede9979389cfc9bc4902ae75702c
Archive: /Users/XXX/Library/Caches/Homebrew/tbl2asn-25.3.gz
To retry an incomplete download, remove the file above.

Open the following .rb file in a text editor and replace the sha256 value as shown.

File location:
/usr/local/Homebrew/Library/Taps/homebrew/homebrew-science/tbl2asn.rb

### original
# if OS.mac?
# url "ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/mac.tbl2asn.gz"
# sha256 "429d63ee3c36d1f2f6322c62c6089d5ee8a8b089e5cc9373e298e017bcbbb9ec"

### fixed
# if OS.mac?
# url "ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/mac.tbl2asn.gz"
# sha256 "bfaff2308fd58b94730639b9c6b520ca1884ede9979389cfc9bc4902ae75702c"

Wednesday, August 3, 2016

R: pdflatex is not available

How to deal with the error "pdflatex is not available" in R.
I used this site as a reference.

First, check where pdflatex is located from the terminal, using which.

$ which pdflatex
/usr/texbin/pdflatex

Incidentally, R is located here.

$ which r
/usr/bin/r

This confirms that pdflatex is installed, so the problem lies with R rather than with the Mac.

Whether R can find a program can be checked with Sys.which().
For example, to look for R:

> Sys.which("r")
r
"/usr/bin/r"

and so on. This matches what the terminal showed.

Looking for pdflatex while the error above is occurring gives:

> Sys.which("pdflatex")
pdflatex
""

So R does not find it.

This is a PATH problem. To inspect the PATH that R sees, use Sys.getenv().

> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"

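Several directories are on the PATH, but /usr/texbin is missing.
This is the cause of the error.

The next step is to add it to the PATH, using Sys.setenv().
Joining the existing PATH and /usr/texbin with paste() feels very R-like.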

Sys.setenv(PATH = paste(Sys.getenv("PATH"), "/usr/texbin", sep=":"))

The directory has now been added to the PATH.

> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/texbin"

R can now find pdflatex.

> Sys.which("pdflatex")
pdflatex
"/usr/texbin/pdflatex"

With this, the "pdflatex is not available" error no longer appears.
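The steps above can be wrapped in a small helper so that the directory is appended only when it is missing. This is a minimal sketch: add_to_path() is a hypothetical name, not a base R function, and a change made with Sys.setenv() lasts only for the current R session.

```r
# Hypothetical helper: append `dir` to PATH unless it is already listed.
add_to_path <- function(dir) {
  sep <- .Platform$path.sep  # ":" on macOS and Linux
  parts <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!dir %in% parts) {
    Sys.setenv(PATH = paste(c(parts, dir), collapse = sep))
  }
  invisible(Sys.getenv("PATH"))
}

add_to_path("/usr/texbin")
Sys.which("pdflatex")
```

Calling add_to_path() repeatedly is harmless, which is convenient if the call sits at the top of a script that is sourced many times.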

Friday, July 22, 2016

R: Thinking about object sizes

The function object.size() reports the size of an object. When working with large data, it is worth keeping size in mind.

To start, make some character vectors and look at their sizes.

# > x <- ""
# > nchar(x)
# [1] 0
# > object.size(x)
# 96 bytes

# > object.size("1")
# 96 bytes

# > object.size("1234567")
# 96 bytes

# > object.size("12345678")
# 104 bytes

# > object.size("123456789012345")
# 104 bytes

# > object.size("1234567890123456")
# 120 bytes

# > object.size("1234567890123456789012345678901")
# 120 bytes

# > object.size("12345678901234567890123456789012")
# 136 bytes

Longer strings are awkward to type out, so grow the string with for(), confirm the length with nchar(), and inspect it with object.size().

# > x <- ""
# > for(i in 1:47) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 47
# > object.size(x)
# 136 bytes

# > x <- ""
# > for(i in 1:48) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 48
# > object.size(x)
# 152 bytes

# > x <- ""
# > for(i in 1:63) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 63
# > object.size(x)
# 152 bytes

# > x <- ""
# > for(i in 1:64) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 64
# > object.size(x)
# 216 bytes

The number of characters and the size appear to be related as follows.
0~7: 96 bytes
8~15: 104 bytes (+8)
16~31: 120 bytes (+16)
32~47: 136 bytes (+16)
48~63: 152 bytes (+16)
64~?: 216 bytes (+64)

At first glance there seems to be a pattern.
So let's plot it with the following commands.

main <- "Relationship between nchar and object.size"
sub <- c("64(blue), 128(green), 136(red), 144(orange)")
y <- ""
plot(x = nchar(y), y = object.size(y),
sub = sub, main = main,
xlim = c(0,550), ylim = c(0,700), pch = 20)
for(i in 1:550){
y <- paste(y, "y", sep = "")
points(x = nchar(y), y = object.size(y), pch = 20)
}
abline(v = c(64, 128, 136, 144),
col = c("blue", "green", "red", "orange"))

This turned out to be a really odd plot.
Apparently strings are handled differently once they exceed 127 characters.

# > x <- ""
# > for(i in 1:127) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 127
# > object.size(x)
# 216 bytes

# > x <- ""
# > for(i in 1:128) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 128
# > object.size(x)
# 224 bytes

# > x <- ""
# > for(i in 1:135) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 135
# > object.size(x)
# 224 bytes

# > x <- ""
# > for(i in 1:136) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 136
# > object.size(x)
# 232 bytes

# > x <- ""
# > for(i in 1:143) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 143
# > object.size(x)
# 232 bytes

# > x <- ""
# > for(i in 1:144) x <- paste(x, "a", sep = "")
# > nchar(x)
# [1] 144
# > object.size(x)
# 240 bytes

In summary, the number of characters and the size are related as follows.
0~7: 96 bytes
8~15: 104 bytes (+8)
16~31: 120 bytes (+16)
32~47: 136 bytes (+16)
48~63: 152 bytes (+16)
64~127: 216 bytes (+64)
128~135: 224 bytes (+8)
136~143: 232 bytes (+8)
144~151: 240 bytes (+8)
152~xxx: the size increases by 8 bytes for every 8 characters.

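In principle one character is stored in one byte, but the memory occupied by the object does not grow character by character. Memory is allocated according to predefined rules, sometimes in blocks of 16 characters and sometimes 64. Moreover, from the 128th character onward, the allocation appears to grow in steps of 8 characters.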
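Rather than testing lengths one by one, the thresholds can be found by scanning a range of lengths and keeping only the points where the size changes. This is a sketch using strrep() (available in R 3.3.0 and later); the exact byte counts depend on the R version and platform, so no output is shown.

```r
# Measure object.size() for strings of 0 to 160 characters.
n <- 0:160
sizes <- vapply(n, function(k) as.numeric(object.size(strrep("a", k))), numeric(1))

# Keep the first length and every length at which the size changes.
jumps <- c(TRUE, diff(sizes) != 0)
data.frame(nchar = n[jumps], bytes = sizes[jumps])
```

Each row of the resulting data frame marks the first string length that falls into a new size step.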

I found another curious behavior.
As running the commands below shows, when one vector holds two strings and those strings are identical, the object size is smaller than expected. Perhaps they are managed like a factor?

x <- ""
for(i in 1:10) x <- paste(x, "a", sep = "")
y <- c(x, x)
z <- c(x, paste(x, "b", sep = ""))
w <- c(paste(x, "b", sep = ""), paste(x, "b", sep = ""))
x
y
z
w
nchar(x)
nchar(y)
nchar(z)
nchar(w)
object.size(x)
object.size(y)
object.size(z)
object.size(w)
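A shorter way to see the same effect is sketched below. My understanding is that this is not a factor-like mechanism but that object.size() counts a string occurring more than once in a character vector only a single time; treat that explanation as an assumption to verify against the R documentation. The variable names are made up for illustration.

```r
x <- strrep("a", 10)

same <- c(x, x)                    # two identical strings
different <- c(x, paste0(x, "b"))  # two distinct strings

object.size(same)
object.size(different)

# The vector holding the repeated string is reported as smaller.
as.numeric(object.size(same)) < as.numeric(object.size(different))
```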

Wednesday, July 20, 2016

R: Removing the border in lattice levelplot()

When you want to see the data in a table or matrix at a glance, a level plot is useful. It can be drawn with levelplot() in the lattice package.

However, with the usual approach the function scales the plot horizontally and vertically, which makes it hard to output an image of exactly the expected size in pixels.

By setting the margins to zero so that only the plot region remains, removing the border of the plot, and removing the color key, you can place exactly one value in each pixel.

# Make a test matrix from the absolute values of random numbers.
x <- rnorm(20, mean = 0.5, sd = 0.3)
x <- abs(x)
m <- matrix(x, nrow = 4)

# Load the package.
library(lattice)

# Define a theme with zero margins (based on a reference site).
theme.x <- list(
  layout.heights = list( # set the margins to zero
    top.padding = 0,
    main.key.padding = 0,
    key.axis.padding = 0,
    axis.xlab.padding = 0,
    xlab.key.padding = 0,
    key.sub.padding = 0,
    bottom.padding = 0),
  layout.widths = list(
    left.padding = 0,
    key.ylab.padding = 0,
    ylab.axis.padding = 0,
    axis.key.padding = 0,
    right.padding = 0
  ),
  axis.line = list(lwd = 0) # remove the border
)

# Change to the directory where the file will be saved.
setwd("test")

# Draw a level plot and save it as a PNG.
width <- nrow(m)
height <- ncol(m)
filename <- "m.png"
png(filename = filename, width = width, height = height)
levelplot(m, par.settings = theme.x, xlab = NULL, ylab = NULL,
scales = list(draw = FALSE), colorkey = FALSE)
dev.off()

The resulting PNG file has no X or Y axes and no color key, so the mapping between colors and values is not shown. However, because each pixel holds exactly one value, it is easy to work with.

Or so I thought: pasting the PNG into Keynote ver. 6 or later degrades the image quality. To avoid this, using pdf() instead of png() seems to work better. pdf() adds a thin outer border, but it is probably negligible.

# Save the level plot as a PDF.
# Note that pdf() takes sizes in inches, not pixels.
width <- nrow(m) / 72
height <- ncol(m) / 72
filename <- "m.pdf"
pdf(file = filename, width = width, height = height)
levelplot(m, par.settings = theme.x, xlab = NULL, ylab = NULL,
          scales = list(draw = FALSE), colorkey = FALSE, useRaster = FALSE)
dev.off()

When using pdf(), passing useRaster = TRUE to levelplot() makes the image look wrong when placed in Keynote.

Tuesday, July 12, 2016

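R: Splitting a string on a period with strsplit()

strsplit() splits a string at a given delimiter.
The delimiter is specified with the split argument.

The problem this time is that a period "." is interpreted as a wildcard, because split is treated as a regular expression. If you have a string containing a period and want to split it into two strings at the period, the period must be escaped.

# Consider, for example, the string "ABC.01".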
x <- "ABC.01"

# If the period is passed as is, it acts as a wildcard.
strsplit(x, split = ".")[[1]]

# Escaping it with "\\" makes the period be treated as the delimiter.
strsplit(x, split = "\\.")[[1]]
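Two alternatives to backslash escaping are worth knowing; the sketch below shows them on the same string. fixed = TRUE tells strsplit() to treat split as a literal string rather than a regular expression, and a bracket expression "[.]" matches a literal period inside a regular expression.

```r
x <- "ABC.01"

# Treat the delimiter as a literal string, not a regular expression.
strsplit(x, split = ".", fixed = TRUE)[[1]]
# [1] "ABC" "01"

# Inside a bracket expression the period loses its wildcard meaning.
strsplit(x, split = "[.]")[[1]]
# [1] "ABC" "01"
```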

R: Mapping values with Map()

Map() belongs to the apply family of functions and is a wrapper around mapply().

Map() is handy because it accepts several vectors of arguments and walks through them together. There is also a multicore version, mcMap().

#######################################################
Suppose you have a data frame like this.

x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
strB = c("a", "b", "c", "d", "e", "f"),
strC = c("1", "2", "3", "4", "5", "6"))

> x
strA strB strC
1 A a 1
2 B b 2
3 C c 3
4 D d 4
5 E e 5
6 F f 6

Suppose that, for each row, you want a string that concatenates strA, strB and strC, joined with a separator of your choice (here "-").

Naturally, this can be done with a for() loop.

x$strABC <- ""
for(i in 1:nrow(x)){
strABC <- paste(x$strA[i], x$strB[i], x$strC[i], sep = "-")
x$strABC[i] <- strABC
}

The result of the for() loop.

> x
strA strB strC strABC
1 A a 1 A-a-1
2 B b 2 B-b-2
3 C c 3 C-c-3
4 D d 4 D-d-4
5 E e 5 E-e-5
6 F f 6 F-f-6

However, the more rows there are, the longer the loop takes. Map() does this faster.

To use Map(), first define the function that Map() will apply.

get.strABC <- function(strA, strB, strC){
return(paste(strA, strB, strC, sep = "-"))
}

Calling the function gives a result like this.

> get.strABC("A", "a", "1")
[1] "A-a-1"

Using this function inside the for() loop above gives the same result.

x$strABC <- ""
for(i in 1:nrow(x)){
x$strABC[i] <- get.strABC(x$strA[i], x$strB[i], x$strC[i])
}

Now run get.strABC() through Map(). Map() returns a list, so unlist() is needed to flatten the result before assigning it to the column.

x$strABC <- unlist(Map(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC))

> x
strA strB strC strABC
1 A a 1 A-a-1
2 B b 2 B-b-2
3 C c 3 C-c-3
4 D d 4 D-d-4
5 E e 5 E-e-5
6 F f 6 F-f-6

mcMap(), the multicore version of Map(), can be used as well. You need to load the parallel package to use it.

library(parallel)
x$strABC <- unlist(mcMap(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC))

The number of cores to use can be controlled with the mc.cores argument of mcMap().
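For this particular task it is worth noting that paste() is itself vectorized, so the whole column can be built in one call without Map(). The sketch below checks that both approaches give the same strings; it uses a shortened copy of the example data with stringsAsFactors = FALSE, which is an assumption made here so that the columns are plain character vectors.

```r
x <- data.frame(strA = c("A", "B", "C"),
                strB = c("a", "b", "c"),
                strC = c("1", "2", "3"),
                stringsAsFactors = FALSE)

get.strABC <- function(strA, strB, strC) {
  paste(strA, strB, strC, sep = "-")
}

# Element by element with Map(); drop the names that Map() attaches.
via_map <- unlist(Map(f = get.strABC,
                      strA = x$strA, strB = x$strB, strC = x$strC),
                  use.names = FALSE)

# All rows at once with vectorized paste().
via_paste <- paste(x$strA, x$strB, x$strC, sep = "-")

identical(via_map, via_paste)
```

Map() remains useful when the function being applied is not vectorized, which is the situation this post is really about.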

#######################################################
Increase the number of rows and compare for() with Map().

x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
strB = c("a", "b", "c", "d", "e", "f"),
strC = c("1", "2", "3", "4", "5", "6"))

for(i in 1:12) x <- rbind(x, x)

> nrow(x)
[1] 24576

With the for() loop, it takes 4.286 seconds.

x$strABC <- ""
system.time(
for(i in 1:nrow(x)){
x$strABC[i] <- get.strABC(x$strA[i], x$strB[i], x$strC[i])
}
)

user system elapsed
4.169 0.137 4.286

With Map(), it takes 0.940 seconds, which is faster than for().

system.time(
x$strABC <- unlist(Map(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC))
)

user system elapsed
0.892 0.009 0.940

With mcMap(), it takes 0.711 seconds. The difference from Map() is small.

system.time(
x$strABC <- unlist(mcMap(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC, mc.cores = 4))
)

user system elapsed
1.697 0.155 0.711

#######################################################
Increase the number of rows further and compare Map() with mcMap().

x <- data.frame(strA = c("A", "B", "C", "D", "E", "F"),
strB = c("a", "b", "c", "d", "e", "f"),
strC = c("1", "2", "3", "4", "5", "6"))

for(i in 1:16) x <- rbind(x, x)

> nrow(x)
[1] 393216

With Map(), it takes 16.926 seconds. I am too scared to try for().

system.time(
x$strABC <- unlist(Map(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC))
)

user system elapsed
16.596 0.107 16.926

With mcMap(mc.cores = 4), it takes 8.939 seconds, about half the time of Map().

system.time(
x$strABC <- unlist(mcMap(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC, mc.cores = 4))
)

user system elapsed
1.104 0.172 8.939

With mcMap(mc.cores = 20), it takes 3.412 seconds. Quite fast.

system.time(
x$strABC <- unlist(mcMap(f = get.strABC,
strA = x$strA,
strB = x$strB,
strC = x$strC, mc.cores = 20))
)

user system elapsed
80.537 3.933 3.412
