Using the colClasses and select arguments of fread at the same time
I am trying to load a small number of fields from a tab-delimited file with many unused fields, using fread from the data.table package.
For this purpose, I am using the select argument, which works well for reading in only the columns I need.
However, when I do not specify the classes of the different fields, the automatic type detection does not work (most or all of the numeric variables end up being read as minuscule numbers like 1.896916e-316).
To fix this, my first instinct was to change the code:
DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"))
to
DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"),
            colClasses = c("numeric", ..., "character"))
i.e., to match the character vector select with a character vector colClasses of equal length, with (obviously) the type of the i-th selected field equal to the i-th element of colClasses.
However, fread doesn't seem to like this - even when select is used, colClasses expects a character vector with as many fields as the WHOLE file:
Error in fread("data.txt", select = c("V1", "V2", ..., "Vn", :
  colClasses is unnamed and length 25 but there are 256 columns. See ?data.table for colClasses usage.
This would be fine if I only had to do it with one file - I would just pad the rest of the vector with "character" (or any other type), because those columns get dropped anyway.
However, I plan to repeat this process 13 times or so on files corresponding to different years. They have the same column names, but the columns appear in different orders (and the number of columns differs from year to year), which destroys the ability to loop (and also takes much longer).
The following worked, but hardly seems efficient (coding-wise):
DT <- fread("data.txt", select = c("V1", "V2", "V3"),
            colClasses = c(V1 = "factor", V2 = "character", V3 = "numeric"))
This is a pain because I am selecting 25 columns, so it takes a huge block of code just to declare the column types. I cannot use rep to save space, for example:
colClasses = c(rep("character", times = 3), rep("numeric", times = 20))
Any suggestions for making this more concise or more efficient?
Here's a preview of the data for reference:
         LEAID FIPST                                                   NAME SCHLEV AGCHRT CCDNF GSLO   V33  TOTALREV  TFEDREV
    1: 0100002    01                                 ALABAMA YOUTH SERVICES      N      3     1   03     0        -2       -2
    2: 0100005    01                                       ALBERTVILLE CITY     03      3     1   PK  4143  38394000  6326000
    3: 0100006    01                                        MARSHALL COUNTY     03      3     1   PK  5916  58482000 11617000
    4: 0100007    01                                            HOOVER CITY     03      3     1   PK 13232 154703000 10184000
    5: 0100008    01                                           MADISON CITY     03      3     1   PK  8479  89773000  6648000
   ---
18293: 5680180    56                                NORTHEAST WYOMING BOCES     07      3     1    N    -2        -2       -2
18294: 5680250    56                                         REGION V BOCES     07      3     1    N    -2        -2       -2
18295: 5680251    56                  WYOMING DEPARTMENT OF FAMILY SERVICES     02      3     1   KG    82        -2       -2
18296: 5680252    56 YOUTH EMERGENCY SERVICES, INC. - ADMINISTRATION OFFICE      N      3     1   07    29        -1       -1
18297: 5680253    56                           WYOMING BEHAVIORAL INSTITUTE      N      N     1   01     0        -2       -2
I actually found the solution by reading this illustration of the drop / select / colClasses options by Mr. Dowle more closely:
DT <- fread("data.txt", select = c("V1", "V2", "V3"),
            colClasses = list(character = c("char_names"),
                              factor = c("factor_names"),
                              numeric = c("numeric_names")))
I had not gotten this to work before because my earlier fread attempts were failing due to poor formatting of my CSV file.
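A minimal, self-contained sketch of the list form described above. The temporary file, its contents, and the `classes` object are made up for illustration (only the column names are borrowed from the data preview); the point is that the types are keyed by column name, so the same list can also supply `select`:

```r
library(data.table)

# Hypothetical tab-delimited file, written to a temp path just for this demo
tmp <- tempfile(fileext = ".txt")
writeLines(c("LEAID\tFIPST\tNAME\tCCDNF\tV33\tTOTALREV",
             "0100005\t01\tALBERTVILLE CITY\t1\t4143\t38394000",
             "0100006\t01\tMARSHALL COUNTY\t1\t5916\t58482000"), tmp)

# One entry per type; each entry lists the column names that get that type
classes <- list(character = c("LEAID", "FIPST", "NAME"),
                numeric   = c("V33", "TOTALREV"))

# select is derived from the same list, so the two can never get out of sync;
# CCDNF is not listed and is therefore not read
DT <- fread(tmp, sep = "\t", header = TRUE,
            select = unlist(classes, use.names = FALSE),
            colClasses = classes)
str(DT)
```

Because nothing here depends on column position, the same `classes` list should carry over to files whose columns come in a different order.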
However, I would still call it a bug that the more natural approach doesn't work:
DT <- fread("data.txt", select = c("V1", ..., "Vn"),
            colClasses = c("type1", ..., "typen"))
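For what it is worth, later data.table releases appear to cover this use case directly: as far as I know, the fread documentation there says that `select` may carry the types itself, as `colname = type` pairs. The sketch below assumes such a version is installed (older versions will not accept it), and the temporary file is invented for the demo:

```r
library(data.table)

# Hypothetical tab-delimited file, written to a temp path just for this demo
tmp <- tempfile(fileext = ".txt")
writeLines(c("LEAID\tFIPST\tNAME\tV33",
             "0100005\t01\tALBERTVILLE CITY\t4143",
             "0100006\t01\tMARSHALL COUNTY\t5916"), tmp)

# Assumption: the installed data.table accepts colname = type pairs in select.
# Names pick the columns to keep, values give the type for each of them.
DT <- fread(tmp, sep = "\t", header = TRUE,
            select = c(LEAID = "character", V33 = "numeric"))
str(DT)
```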
Perhaps something like this:
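# Read the header line and split it into column names (tab-delimited file)
varnames <- strsplit(readLines("data.txt", n = 1), "\t", fixed = TRUE)[[1]]
valid <- c("LEAID", "FIPST", "NAME", "SCHLEV", "AGCHRT", "CCDNF", "GSLO", "V33", "TOTALREV", "TFEDREV")
colC <- varnames %in% valid
# Start with every column dropped ("NULL"), then fill in types for the wanted ones
colCchar <- rep("NULL", length(varnames))
# Types are assigned in file order, so this assumes the wanted columns appear in the order of 'valid'
colCchar[colC] <- c(rep("numeric", 2), rep("character", 2),
                    rep("numeric", 2), "character",
                    rep("numeric", 3))
dt <- fread("data.txt", colClasses = colCchar)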
Obviously this is untested, since the first line of the file (with its 200+ column names) was not provided. It will not be robust to reordering of the variables in the target files, but your description of the problem "left a lot to be desired": I can't see how the column names could be the same and yet be different. You may need match to get the ordering of the required variables.
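One way to sidestep the ordering problem is to key the types by column name instead of by position. The sketch below is only an illustration under assumptions: the `wanted` type map, the `read_year` helper, and the two temporary files are all hypothetical. It reads just the header of each file, keeps whichever wanted columns are present, and passes a named colClasses vector (the form the question reports as working together with select):

```r
library(data.table)

# Hypothetical type map, keyed by column name rather than by position
wanted <- c(LEAID = "character", NAME = "character", V33 = "numeric")

# Hypothetical helper: read only the wanted columns that exist in this file
read_year <- function(path, wanted) {
  header <- names(fread(path, sep = "\t", header = TRUE, nrows = 0))  # names only
  keep <- intersect(names(wanted), header)  # tolerate columns missing in some years
  fread(path, sep = "\t", header = TRUE,
        select = keep, colClasses = wanted[keep])
}

# Two made-up yearly files with the same columns in different orders
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("LEAID\tFIPST\tNAME\tV33",
             "0100005\t01\tALBERTVILLE CITY\t4143"), f1)
writeLines(c("V33\tNAME\tEXTRA\tLEAID",
             "5916\tMARSHALL COUNTY\tx\t0100006"), f2)

# Stack the years, matching columns by name
all_years <- rbindlist(lapply(c(f1, f2), read_year, wanted = wanted),
                       use.names = TRUE, fill = TRUE)
str(all_years)
```

Reading the header with `nrows = 0` costs one extra pass over the first line of each file, which should be negligible next to reading the data itself.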