This issue is difficult to describe. Installing and using the development data.table build does not fix the bug.
# Minimal reproducible example
Here are steps to reproduce. Requires DT1.tsv.gz
R --vanilla
library(data.table)
DT1 = fread('DT1.tsv')
DT2 = DT1[!b%in%c('qm27','qm29')] # to reproduce the bug, there must be no occurrences of these in column b
# using DT2 = copy(DT1[!b%in%c('qm27','qm29')]) instead fixes the bug
indices(DT1) # "b"; caused by previous row selection and assignment
nrow(DT1[b=='qm105']) # 133705 (correct)
# adding setindex(DT1,NULL) here fixes the bug
# adding setindex(DT1,NULL); setindex(DT1,b) here has no effect; the bug still occurs
setkey(DT2,a)
nrow(DT1[b=='qm105']) # 1 (incorrect)
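
The workarounds noted in the comments above can be combined into one sketch. It uses a small synthetic table as a stand-in for DT1.tsv (the column contents are made up), so it is not expected to reproduce the wrong count; it only shows where each workaround goes. Toggling the `datatable.use.index` option is an extra safeguard I am assuming would help and was not tested against the failing data.

```r
library(data.table)

# Synthetic stand-in for DT1.tsv (hypothetical data, not the file from this report)
set.seed(1)
n = 100000L
DT1 = data.table(a = sample(n), b = sample(c('qm101', 'qm105', 'qm110'), n, replace = TRUE))

# Reference count from a plain vector comparison, which does not use data.table indices
expected = sum(DT1$b == 'qm105')

# Workaround 1: deep-copy the subset so DT2 is independent of DT1
DT2 = copy(DT1[!b %in% c('qm27', 'qm29')])

# Workaround 2: drop any index left on DT1 before keying DT2
setindex(DT1, NULL)
setkey(DT2, a)

n_indexed = nrow(DT1[b == 'qm105'])

# Assumed extra safeguard: disable index use for this query, then restore the old setting
old = options(datatable.use.index = FALSE)
n_scan = nrow(DT1[b == 'qm105'])
options(old)
```

Comparing `n_indexed` and `n_scan` against `expected` is a quick way to check whether a stale index is affecting a query on the real data.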
# Output of sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.3
loaded via a namespace (and not attached):
[1] bit_4.0.4 compiler_4.1.0 bit64_4.0.5
Also reproduced with different machine, OS, R, and data.table versions:
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin20.3.0 (64-bit)
Running under: macOS Big Sur 11.5.1
Matrix products: default
BLAS/LAPACK: /opt/local/lib/libopenblas-r1.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.0
loaded via a namespace (and not attached):
[1] compiler_4.0.4