Five tips to improve your R code

Contents

1. More fun to sequence from 1
2. vector() what you c()
3. Ditch the which()
4. factor that factor!
5. First you get the $, then you get the power
Sign off

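@drsimonj here with five simple tricks I find myself sharing all the time with fellow R users to improve their code!

1. More fun to sequence from 1

Next time you use the colon operator to create a sequence from 1 (like 1:n), try seq().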

# Sequence a vector
x <- runif(10)
seq(x)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence an integer
seq(nrow(mtcars))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32

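The colon operator can produce unexpected results that cause all sorts of problems without you noticing! Look at what happens when you sequence along the length of an empty vector: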

# Empty vector
x <- c()

1:length(x)
#> [1] 1 0

seq(x)
#> integer(0)

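You'll also notice that this saves you from using functions like length(). When applied to an object of a given length, seq() automatically creates a sequence from 1 to the length of the object. (The one exception is a single number: seq(7) counts from 1 to 7.)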
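The single-number case is easy to trip over when a vector happens to contain one element. A minimal sketch of the difference, using base R's seq_along() and seq_len() as stricter alternatives (they are swapped in here for illustration and are not part of the original tip):

```r
# A length-one numeric vector: seq() treats the value as an upper bound
x <- 7
seq(x)          # counts from 1 up to 7, not from 1 to length(x)
seq_along(x)    # always sequences along the positions of x

# seq_len() builds a sequence from a length and handles zero safely
empty <- c()
seq_len(length(empty))
seq_along(empty)
```

seq_along() takes the object itself and seq_len() takes a length, so neither depends on what the values inside the vector are.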

2. vector() what you c()

Next time you create an empty vector with c(), try replacing it with vector("type", length).

# A numeric vector with 5 elements
vector("numeric", 5)
#> [1] 0 0 0 0 0

# A character vector with 3 elements
vector("character", 3)
#> [1] "" "" ""

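Doing this improves memory usage and increases speed! You often know up front what type of values will go into a vector and how long the vector will be. Using c() means R has to slowly work both of these things out. So help give it a boost with vector()!

A good example of this value is in a for loop. People often write loops by declaring an empty vector and growing it with c(), like this: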

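x <- c()
for (i in seq(5)) {
  x <- c(x, i)
  # Print the vector at each step to show it growing
  cat("x at step", i, ":", paste(x, collapse = ", "), "\n")
}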

#> x at step 1 : 1
#> x at step 2 : 1, 2
#> x at step 3 : 1, 2, 3
#> x at step 4 : 1, 2, 3, 4
#> x at step 5 : 1, 2, 3, 4, 5

Instead, predefine the type and length with vector(), and reference positions by index, like this:

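n <- 5
x <- vector("integer", n)
for (i in seq(n)) {
  x[i] <- i
  # Print the vector at each step to show positions being filled
  cat("x at step", i, ":", paste(x, collapse = ", "), "\n")
}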

#> x at step 1 : 1, 0, 0, 0, 0
#> x at step 2 : 1, 2, 0, 0, 0
#> x at step 3 : 1, 2, 3, 0, 0
#> x at step 4 : 1, 2, 3, 4, 0
#> x at step 5 : 1, 2, 3, 4, 5

Here's a quick speed comparison:

n <- 1e5

x_empty <- c()
system.time(for(i in seq(n)) x_empty <- c(x_empty, i))
#>    user  system elapsed 
#>  15.238   2.327  17.650

x_zeros <- vector("integer", n)
system.time(for(i in seq(n)) x_zeros[i] <- i)
#>    user  system elapsed 
#>   0.007   0.000   0.007

That should be convincing enough!
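Base R also has type-specific constructors that do the same preallocation with a shorter spelling. A brief sketch, with variable names chosen only for illustration:

```r
n <- 5

# integer(), numeric() and character() preallocate just like vector()
identical(integer(n), vector("integer", n))
identical(numeric(n), vector("numeric", n))
identical(character(n), vector("character", n))

# Fill a preallocated numeric vector of squares by index
squares <- numeric(n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}
squares
```

Which spelling to use is a matter of taste; the point in both cases is that the type and length are fixed before the loop starts.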

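3. Ditch the which()

Next time you use which(), try ditching it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.

Getting the vector elements greater than 5: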

x <- 3:7

# Using which (not necessary)
x[which(x > 5)]
#> [1] 6 7

# No which
x[x > 5]
#> [1] 6 7

Or counting the number of values greater than 5:

# Using which
length(which(x > 5))
#> [1] 2

# Without which
sum(x > 5)
#> [1] 2

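Why should you ditch which()? It's usually unnecessary, and a boolean vector is all you need.

For example, R lets you select the elements flagged as TRUE in a boolean vector: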

condition <- x > 5
condition
#> [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
#> [1] 6 7

Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

sum(condition)
#> [1] 2
mean(condition)
#> [1] 0.4

which() tells you the indices of the TRUE values:

which(condition)
#> [1] 4 5

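While the results are not wrong, the extra step is not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():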

x <- c(1, 2, 12)

# Using `which()` and `length()` to test if any values are greater than 10
if (length(which(x > 10)) > 0)
  print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Wrapping a boolean vector with `any()`
if (any(x > 10))
  print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Using `which()` and `length()` to test if all values are positive
if (length(which(x > 0)) == length(x))
  print("All values are positive")
#> [1] "All values are positive"

# Wrapping a boolean vector with `all()`
if (all(x > 0))
  print("All values are positive")
#> [1] "All values are positive"

Oh, and it saves you a little time...

x <- runif(1e8)

system.time(x[which(x > .5)])
#>    user  system elapsed 
#>   1.156   0.522   1.686

system.time(x[x > .5])
#>    user  system elapsed 
#>   1.071   0.442   1.662
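One difference to keep in mind when dropping which(): the two forms treat missing values differently. which() skips NA in the condition, while a bare boolean index returns an NA for every NA in the condition. A small sketch with a made-up vector:

```r
# A vector with a missing value
y <- c(3, NA, 7)

with_which <- y[which(y > 5)]   # NA in the condition is skipped
no_which   <- y[y > 5]          # NA in the condition yields an NA element

with_which
no_which

# Counting with sum() needs na.rm = TRUE when NAs are present
sum(y > 5, na.rm = TRUE)
```

If your data can contain NA, decide which behaviour you want before removing which().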

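4. factor that factor!

Ever removed values from a factor and found you're stuck with old levels that no longer exist? I see all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.

This example creates a factor with four levels ("a", "b", "c" and "d"):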

# A factor with four levels
x <- factor(c("a", "b", "c", "d"))
x
#> [1] a b c d
#> Levels: a b c d

plot(x)

If we drop all cases of one level ("d"), the level is still recorded in the factor:

# Drop all values for one level
x <- x[x != "d"]

# But we still have this level!
x
#> [1] a b c
#> Levels: a b c d

plot(x)

A super simple way to drop it is to use factor() again:

x <- factor(x)
x
#> [1] a b c
#> Levels: a b c

plot(x)

This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!
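Base R also provides droplevels(), which removes unused levels and can be applied to a whole data frame at once. A minimal sketch, offered as an alternative rather than as the approach recommended above; the data frame d is made up for illustration:

```r
# Same situation as above: a factor with a leftover level
f <- factor(c("a", "b", "c", "d"))
f <- f[f != "d"]
levels(droplevels(f))

# droplevels() on a data frame cleans every factor column
d <- data.frame(group = factor(c("a", "b", "c", "d")), value = 1:4)
d <- d[d$group != "d", ]
d <- droplevels(d)
levels(d$group)
```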

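5. First you get the $, then you get the power

Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.

Say you want the horsepower (hp) for cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these: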

# rows first, column second - not ideal
mtcars[mtcars$cyl == 4, ]$hp
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

# column first, rows second - much better
mtcars$hp[mtcars$cyl == 4]
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

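The tip here is to use the second approach.

But why is that?

First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4, ]$hp. When you specify the column first, you are now referring to a vector, and do not need the comma!

Second reason: speed! Let's test it out on a larger data frame: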

# Simulate a data frame...
n <- 1e7
d <- data.frame(
  a = seq(n), b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
#>    user  system elapsed 
#>   0.497   0.126   0.629

# column first, rows second - much better
system.time(d$a[d$b > .5])
#>    user  system elapsed 
#>   0.089   0.017   0.107

Worth it, right?
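To check that the faster form is a drop-in replacement, you can compare the two results directly. A small sketch using mtcars, with variable names chosen only for illustration:

```r
# Both orderings from the example above
rows_first   <- mtcars[mtcars$cyl == 4, ]$hp
column_first <- mtcars$hp[mtcars$cyl == 4]
identical(rows_first, column_first)

# Naming the column inside [ ] also returns a plain vector
bracket_form <- mtcars[mtcars$cyl == 4, "hp"]
identical(bracket_form, column_first)
```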

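Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website, or really learn the ropes with an online course (for example, srcmini's Data Manipulation in R with dplyr).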

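Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you'd like the code that produced this blog, check out the blogR GitHub repository.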
