Best/fastest way to read a chunk of lines from a text file that are separated by identifiers in R

Question:

I have a text file where each line begins with a known character identifier, like this (* is the delimiter):

AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
.
.
.
ZZZ*123456789*.*.*.

The problem is that, even though the information is organized this way, every run of lines from AAA to ZZZ represents one record in this particular data. After the ZZZ line, the data goes back to AAA and runs up to ZZZ again.

Is there a way, other than using a for loop and processing line by line, to take the chunk of lines from AAA to ZZZ and put it on one line, so that I can then split each line by the delimiter?

Or let me know if you have any other suggestions on processing this kind of data.

Thanks,

Question comments:

    
Try tapply(lines, cumsum(grepl("^AAA", lines)), FUN = paste, collapse="")
    
I think I saw an elegant solution to a similar question a while ago that used read.dcf, but I can't find it. A general approach is to use readLines, then split(lines, cumsum(grepl('^AAA', lines))), make a named list or data frame from each element, and then call do.call(rbind, ...) or an equivalent on the result. For a full answer, edit the question to include more representative example data.
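The general approach described in this comment could be sketched roughly as follows. This is only an illustration, not the commenter's code: the sample values and the helper name record_to_row are made up, and it assumes every record contains the same identifiers with the same number of fields per line.

```r
# Hypothetical sample input: two records of three lines each.
# With a real file this would be lines <- readLines("myfile").
lines <- c("AAA*1*a", "BBB*2*b", "CCC*3*c",
           "AAA*4*d", "BBB*5*e", "CCC*6*f")

# A new record starts at every line that begins with "AAA".
record_id <- cumsum(grepl("^AAA", lines))

# Hypothetical helper: turn one record (a character vector of lines)
# into a one-row data frame.
record_to_row <- function(rec) {
  fields <- strsplit(rec, "*", fixed = TRUE)    # split on the literal "*"
  tags <- vapply(fields, `[`, character(1), 1)  # first field is the identifier
  values <- lapply(fields, `[`, -1)             # remaining fields are the data
  names(values) <- tags
  flat <- unlist(values)                        # names become AAA1, AAA2, BBB1, ...
  as.data.frame(as.list(flat), stringsAsFactors = FALSE)
}

# One row per record, one column per identifier/field combination.
records <- do.call(rbind, lapply(split(lines, record_id), record_to_row))
```

If records can differ in which identifiers they contain, the plain rbind above will fail, and the rows would need to be aligned by column name first.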

Answers:

Answer 1:

Using the sample data in the Note at the end, read it into a data frame, create a grouping variable g that increments at every AAA line, and then use reshape to convert the result from long to wide form, giving one row per record. No packages are used. If the input comes from a file, replace text = Lines with a filename, e.g. "myfile".

DF <- read.table(text = Lines, sep = "*", as.is = TRUE, strip.white = TRUE)
DF$g <- cumsum(DF$V1 == "AAA")
reshape(DF, direction = "wide", idvar = "g", timevar = "V1")

Note:

Lines <- "AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*."

Answer 2:

We can use tapply to paste together the lines of each record:

tapply(lines, cumsum(grepl("^AAA", lines)), FUN = paste, collapse="")

No packages are used here either.

Data:

lines <- readLines(textConnection("AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
ZZZ*123456789*.*.*.
AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
ZZZ*123456789*.*.*."))

Original source:

https://stackoverflow.com/questions/47756867/best-fastest-way-to-read-a-chunk-of-lines-from-a-text-file-that-are-separated-by
