Using the colClasses and select arguments of fread at the same time

I am trying to load a small number of fields from a tab-delimited file that has many more unused fields, using fread from the data.table package.

For this purpose, I am using the select argument, which works great for reading in only the columns I need.

However, when I do not specify the classes of the different fields, the automatic type detection does not work: most or all of the numeric variables end up being read in as tiny numbers like 1.896916e-316.

To fix this, my first instinct was to change the code:

DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"))

      

to

DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"),
            colClasses = c("numeric", ..., "character"))

      

i.e., match a character vector select with a character vector colClasses of equal length, with (obviously) the type of the i-th field picked by select equal to the i-th element of colClasses.

However, fread doesn't seem to like this. Even when select is used, colClasses expects a character vector with as many elements as there are columns in the WHOLE file:

Error in fread("data.txt", select = c("V1", "V2", ..., "Vn",  :
  colClasses is unnamed and length 25 but there are 256 columns. See ?data.table for colClasses usage.

This would be fine if I only had to do it with one file: I would just pad out the rest of the vector with "character" (or any other type), because those columns get dropped anyway.

However, I plan to repeat this process 13 or so times on files corresponding to different years. The files share the same column names, but the columns appear in different orders (and the number of columns varies from year to year), which destroys the ability to loop (and also takes much longer).

The following works, but hardly seems efficient (coding-wise):

DT <- fread("data.txt", select=c("V1", "V2", "V3"),
            colClasses = c(V1 = "factor", V2 = "character", V3 = "numeric"))

      

This is a pain because I am reading 25 columns, so a huge block of code is taken up just declaring the column types. I also cannot use rep to save space, e.g.:

colClasses = c(rep("character", times = 3), rep("numeric", times = 20))

      

Any suggestions for making this cleaner or more efficient?

Here's a preview of the data for reference:

         LEAID FIPST                                                   NAME SCHLEV AGCHRT CCDNF GSLO   V33  TOTALREV  TFEDREV
    1: 0100002    01                                 ALABAMA YOUTH SERVICES      N      3     1   03     0        -2       -2
    2: 0100005    01                                       ALBERTVILLE CITY     03      3     1   PK  4143  38394000  6326000
    3: 0100006    01                                        MARSHALL COUNTY     03      3     1   PK  5916  58482000 11617000
    4: 0100007    01                                            HOOVER CITY     03      3     1   PK 13232 154703000 10184000
    5: 0100008    01                                           MADISON CITY     03      3     1   PK  8479  89773000  6648000
---                                                                                                                       
18293: 5680180    56                                NORTHEAST WYOMING BOCES     07      3     1    N    -2        -2       -2
18294: 5680250    56                                         REGION V BOCES     07      3     1    N    -2        -2       -2
18295: 5680251    56                  WYOMING DEPARTMENT OF FAMILY SERVICES     02      3     1   KG    82        -2       -2
18296: 5680252    56 YOUTH EMERGENCY SERVICES, INC. - ADMINISTRATION OFFICE      N      3     1   07    29        -1       -1
18297: 5680253    56                           WYOMING BEHAVIORAL INSTITUTE      N      N     1   01     0        -2       -2

      



2 answers


I actually found a solution by reading this illustration of the drop / select / colClasses options by Mr. Dowle more closely:

DT <- fread("data.txt", select = c("V1", "V2", "V3"),
            colClasses = list(character = c("char_names"),
                              factor = c("factor_names"),
                              numeric = c("numeric_names")))

      

I couldn't get this to work before because my fread attempts were failing for a different reason: my .csv file was poorly formatted.
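
A minimal, self-contained sketch of the list form applied across several files, added here for illustration. The two temporary files, their contents, and the read_year helper are hypothetical stand-ins for the yearly files described in the question; the point is only that classes keyed by column name do not depend on column position.

```r
library(data.table)

# Two hypothetical yearly files: same column names, different order and count
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("LEAID\tNAME\tTOTALREV\tEXTRA",
             "0100005\tALBERTVILLE CITY\t38394000\tx"), f1)
writeLines(c("TOTALREV\tUNUSED\tLEAID\tNAME\tOTHER",
             "58482000\ty\t0100006\tMARSHALL COUNTY\tz"), f2)

# Classes keyed by type, so the position of a column in a file does not matter
classes <- list(character = c("LEAID", "NAME"),
                numeric   = "TOTALREV")
keep <- unlist(classes, use.names = FALSE)

# Hypothetical helper: read only the wanted columns with the wanted types
read_year <- function(path) {
  fread(path, sep = "\t", header = TRUE, select = keep, colClasses = classes)
}

# Stack the yearly tables, matching columns by name rather than position
DT <- rbindlist(lapply(c(f1, f2), read_year), use.names = TRUE)
print(DT)
```

Because both select and colClasses are built from the same classes list, adding or retyping a column means editing one place only.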



However, I would still call it a bug that the more natural approach doesn't work:

DT <- fread("data.txt", select = c("V1", ..., "Vn"),
            colClasses = c("type1", ..., "typen"))
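
For readers on a more recent data.table: as far as I know, newer releases accept name = type pairs directly in select, which comes close to the natural approach above. The sketch below assumes such a version is installed; the file and its contents are made up for illustration.

```r
library(data.table)

# Hypothetical three-column file; only two of the columns are wanted
f <- tempfile(fileext = ".txt")
writeLines(c("V1\tV2\tV3",
             "007\t1.5\tdropme"), f)

# Assumption: the installed data.table accepts name = type pairs in select=
DT <- fread(f, sep = "\t", header = TRUE,
            select = c(V1 = "character", V2 = "numeric"))
print(DT)
```

If the installed version rejects a named select, the list form of colClasses shown earlier in this answer remains the fallback.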

      



Perhaps something like this:

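 # Read only the header line and split it into column names (tab-delimited file)
 varnames <- strsplit(readLines("data.txt", n = 1), "\t", fixed = TRUE)[[1]]
 valid <- c("LEAID", "FIPST", "NAME", "SCHLEV", "AGCHRT", "CCDNF", "GSLO", "V33", "TOTALREV", "TFEDREV")
 # TRUE for the columns we want to keep
 colC <- varnames %in% valid
 # Start with "NULL" everywhere so that unwanted columns are dropped
 colCchar <- rep("NULL", length(varnames))
 # Classes are filled in file order, so they must line up with the kept columns as they appear in the file
 colCchar[colC] <- c( rep("numeric", 2), rep("character", 2),
                      rep("numeric", 2), "character",
                      rep("numeric", 3) )
 dt <- fread("data.txt", colClasses = colCchar)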

      



Obviously this is untested, since the header line with its 200+ column names was not provided. It will not be robust to the variables appearing in a different order in the target files, but your description of the problem leaves a lot to be desired: I can't figure out how the column names could be the same and yet differ from file to file. You may need match to get the order of the required variables right.
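
The match idea could be sketched as follows. This is only an illustration under stated assumptions: the temporary file and its contents are invented, and the wanted vector (a name-to-class lookup) is a hypothetical stand-in for the real list of required variables.

```r
library(data.table)

# Hypothetical file with the wanted columns in an arbitrary order
f <- tempfile(fileext = ".txt")
writeLines(c("TOTALREV\tUNUSED\tNAME\tLEAID",
             "38394000\tx\tALBERTVILLE CITY\t0100005"), f)

# Desired class for each wanted column, keyed by column name
wanted <- c(LEAID = "character", NAME = "character", TOTALREV = "numeric")

# Column names as they appear in this particular file
varnames <- strsplit(readLines(f, n = 1), "\t", fixed = TRUE)[[1]]

# For every file column, its position in `wanted` (NA if the column is unwanted)
idx <- match(varnames, names(wanted))

# Full-length class vector in file order; unwanted columns get "NULL" (dropped)
colCchar <- ifelse(is.na(idx), "NULL", wanted[idx])

dt <- fread(f, sep = "\t", header = TRUE, colClasses = colCchar)
print(dt)
```

Since the class vector is rebuilt from each file's own header, the same few lines can be run in a loop over files whose columns are ordered differently.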
