Best way to split each row of a fixed-width dataset into delimited fields
I have a dataset that contains 485k rows (1.1 GB). Each line is about 700 characters long and holds about 250 variables (1-16 characters per variable), but there are no delimiters between them. The length of each variable is known. What is the best way to split the data and mark the field boundaries with a separator such as a comma?
For example, I have lines like:
0123456789012... 1234567890123...
and an array of lengths:
5,3,1,4,...
and I need to produce:
01234,567,8,9012,... 12345,678,9,0123,...
Can anyone help me with this? I would prefer Python or R.
In R, read.fwf will work:
# inputs
x <- c("0123456789012...", "1234567890123... ")
widths <- c(5,3,1,4)
read.fwf(textConnection(x), widths, colClasses = "character")
giving:
     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123
If you want numeric rather than character columns, drop the colClasses argument. Note that numeric conversion discards leading zeros (01234 becomes 1234).
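Since the question also mentions Python, here is a minimal sketch of the same idea using only the standard library. It assumes the widths are available as a list and uses just the first four widths from the example; the helper name split_fixed_width is made up for illustration and does not come from any library.

```python
import csv
import io
from itertools import accumulate


def split_fixed_width(line, widths):
    """Cut `line` into consecutive fields of the given widths (hypothetical helper)."""
    ends = list(accumulate(widths))   # cumulative end offsets: 5, 8, 9, 13
    starts = [0] + ends[:-1]          # matching start offsets: 0, 5, 8, 9
    return [line[s:e] for s, e in zip(starts, ends)]


# Assumption: only the first four widths from the question are used here.
widths = [5, 3, 1, 4]
lines = ["0123456789012", "1234567890123"]

# Write comma-separated rows to an in-memory buffer; a real run would
# write to an open output file instead.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in lines:
    writer.writerow(split_fixed_width(line, widths))

result = out.getvalue()
print(result)
```

For the full 1.1 GB file, the same loop could iterate over an open file handle instead of a list, so that only one line is held in memory at a time. Fields stay as strings, so leading zeros are preserved.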