Filter by using %like% between two columns of a data.table

Question:

Hello Stack Overflow,

I wonder whether I can use the %like% operator row-wise between two columns of the same data.table.

The following reproducible example should make this clearer.

First, prepare the data:

library(data.table)

iris <- as.data.table(iris)
iris <- iris[seq.int(from = 1, to = 150, length.out = 5)]
iris[, Species2 := c("set", "set|vers", "setosa", "nothing", "virginica")]

The dataset then looks as follows:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  Species2
1:          5.1         3.5          1.4         0.2     setosa       set
2:          4.9         3.6          1.4         0.1     setosa  set|vers
3:          6.4         2.9          4.3         1.3 versicolor    setosa
4:          6.4         2.7          5.3         1.9  virginica   nothing
5:          5.9         3.0          5.1         1.8  virginica virginica

I would like to use something like the following command row-wise.

iris[Species %like% Species2]

However, this does not apply the match row by row. Is that possible?
The result should be rows 1, 2, and 5.


Answers:

Answer 1:

One way would be to group by row:

iris[, .SD[Species %like% Species2], by = 1:5]
#   : Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Species2
#1: 1          5.1         3.5          1.4         0.2    setosa       set
#2: 2          4.9         3.6          1.4         0.1    setosa  set|vers
#3: 5          5.9         3.0          5.1         1.8 virginica virginica

Or, as per @docendodiscimus's comment, in case there are duplicate entries, you can group by the two columns instead:

iris[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)]
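
The by = 1:5 grouping above hard-codes the number of rows. A minimal sketch of the same idea for a table of any length is shown below. It assumes a copy of the example data named dt and a grouping column named row_id; both names are illustrative and do not come from the answer.

```r
library(data.table)

# Rebuild the five-row example; rows 1, 38, 75, 112 and 150 of iris
# are the rows shown in the question's printed table.
dt <- as.data.table(iris)[c(1L, 38L, 75L, 112L, 150L)]
dt[, Species2 := c("set", "set|vers", "setosa", "nothing", "virginica")]

# One group per row, so %like% always receives a length-1 pattern.
# row_id is a hypothetical name for the grouping column.
res <- dt[, .SD[Species %like% Species2], by = .(row_id = seq_len(nrow(dt)))]
print(res)
```

The row_id column in the result records which original rows matched, which may be handy for checking the filter.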

Answer comments:

In case there are duplicate entries, I'd go for iris[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)] instead of by-row grouping.

Very nice solution.

Good call @docendodiscimus, thanks. I'll add this.

Answer 2:

You can't pass a vector to the pattern argument of %like%, since it calls grepl/grep, which are not vectorized over the pattern. You could use mapply to call %like% once per row to get what you want:

iris[mapply(function(x,y) x %like% y, Species, Species2) ]

#   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Species2
#1:          5.1         3.5          1.4         0.2    setosa       set
#2:          4.9         3.6          1.4         0.1    setosa  set|vers
#3:          5.9         3.0          5.1         1.8 virginica virginica
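
For comparison, the same per-row matching can be sketched by calling grepl directly inside mapply, which makes the pattern/text pairing explicit. This is only an illustration under assumptions: dt and keep are hypothetical names, and the factor column is converted to character up front.

```r
library(data.table)

# Rebuild the five-row example; rows 1, 38, 75, 112 and 150 of iris
# are the rows shown in the question's printed table.
dt <- as.data.table(iris)[c(1L, 38L, 75L, 112L, 150L)]
dt[, Species2 := c("set", "set|vers", "setosa", "nothing", "virginica")]

# Pair each pattern with the text from the same row: grepl(pattern, x).
# USE.NAMES = FALSE keeps the result a plain, unnamed logical vector.
keep <- mapply(grepl, dt$Species2, as.character(dt$Species), USE.NAMES = FALSE)

res <- dt[keep]
print(res)
```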


Answer 3:

%like% is just a wrapper around grepl, so the pattern (right-hand side) can only be length 1. You should be seeing a warning about this.

The stringi package lets you vectorize the pattern argument.

library(stringi)

iris[stri_detect_regex(Species, Species2)]

If you prefer the operator style to the function call, you can define your own:

`%vlike%` <- function(x, y) {
  stri_detect_regex(x, y)
}

iris[Species %vlike% Species2]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  Species2
# 1:          5.1         3.5          1.4         0.2    setosa       set
# 2:          4.9         3.6          1.4         0.1    setosa  set|vers
# 3:          5.9         3.0          5.1         1.8 virginica virginica
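
Note that Species2 is interpreted as a regular expression here, which is why "set|vers" matches "setosa". If the second column were meant to hold literal substrings instead, stringi's stri_detect_fixed could be swapped in. The sketch below assumes a copy of the example data named dt (an illustrative name) and is a variation on the answer, not part of it.

```r
library(data.table)
library(stringi)

# Rebuild the five-row example; rows 1, 38, 75, 112 and 150 of iris
# are the rows shown in the question's printed table.
dt <- as.data.table(iris)[c(1L, 38L, 75L, 112L, 150L)]
dt[, Species2 := c("set", "set|vers", "setosa", "nothing", "virginica")]

# Literal, element-wise substring test: "|" is an ordinary character here,
# so "set|vers" no longer matches "setosa".
res <- dt[stri_detect_fixed(as.character(Species), Species2)]
print(res)
```

With literal matching, row 2 drops out, so whether to use the regex or the fixed variant depends on what Species2 is meant to contain.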


Original source:

https://stackoverflow.com/questions/47755000/filter-by-using-like-between-two-columns-of-the-data-table
