Parse multiple XBRL files stored in a zip file

I downloaded some zip files from the Companies House website. Each zip file contains multiple html and xml files (~100K each).

I could extract the files manually and then analyze them, but I would like to be able to do this in R if possible.

Sample file (sorry, it is quite large): the code below, taken from the previous question, downloads one zip file.

library(XML)

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)

myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles) [[1]]

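# Create temp/ and temp/hmrcCache/ (the second argument of dir.create is showWarnings, not a subdirectory)
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
# mode = "wb" requests a binary transfer, which matters for zip files on Windows
download.file(fileURLS, destfile = file.path("temp", myfiles), mode = "wb")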

      

I can parse the files with the XBRL package if I extract them manually. This can be done as follows:

library(XBRL)
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir="temp/hmrcCache", prefix.out=NULL, verbose=TRUE)

      

I am struggling with how to extract these files from the zip folder and parse each one in, say, a loop using R, without manually extracting them. I tried to get started but don't know how to progress from here. Thanks for any advice.

# Get names of files
lst <- unzip(file.path("temp", myfiles), list=TRUE)
dim(lst) # 118626

# unzip  and extract first file
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename=nms)
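
For reference, unz() does not extract anything: it returns a connection to a single entry inside the archive, which then has to be read. A minimal sketch, assuming the entry is a text file; read_zip_entry is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: read one entry of a zip archive as text
# without extracting the archive to disk.
read_zip_entry <- function(zip_path, entry) {
  con <- unz(zip_path, entry, open = "r")  # connection to a single entry
  on.exit(close(con))                      # always release the connection
  readLines(con, warn = FALSE)
}

# Usage with the objects from the question (assumes the zip was downloaded):
# first_file <- read_zip_entry(file.path("temp", myfiles), nms)
# head(first_file)
```

Since xbrlDoAll() is called with a file path above, this is probably more useful for inspecting an entry than for parsing it.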

      

I am using Windows 8.1

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32 / x64 (64-bit)



1 answer


Using Karsten's suggestion in the comments, I unzipped the files to a temporary directory and then parsed each file. I used the snow package to speed things up.

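  library(snow)

  # Parse one zip file to start ("temp" is the download directory from the question)
  fls <- list.files("temp", pattern = "\\.zip$")[[1]]

  # Unzip into the session's temporary directory
  tmp <- tempdir()
  lst <- unzip(file.path("temp", fls), exdir=tmp)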

  # Only parse first 10 records
  inst <- lst[1:10]

  # Start to parse - in parallel
  cl <- makeCluster(parallel::detectCores())
  clusterCall(cl, function() library(XBRL))

  # Start
  st <- Sys.time()

  out <- parLapply(cl, inst, function(i) 
                                  xbrlDoAll(i, 
                                            cache.dir="temp/hmrcCache", 
                                            prefix.out=NULL, verbose=TRUE) )

  stopCluster(cl)

  Sys.time() - st

      



(I'm not sure if I'm using tempdir() correctly, as this seems to store large amounts of data in the directory Local\Temp. I'd welcome comments if I'm approaching this incorrectly, thanks.)
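
On the tempdir() concern: the extracted files stay in the session's temporary directory until they are deleted or the R session ends, so extracting a whole archive at once can take a lot of space. One possible way around this, sketched under assumptions, is to extract the archive in small batches into a dedicated subdirectory and delete each batch once it has been parsed. The helper name process_zip_in_batches, the parse_fun argument and the batch size are all hypothetical choices for illustration, not part of XBRL or snow:

```r
# Hypothetical helper: extract a zip archive in small batches, apply
# parse_fun to each extracted file, and delete the files afterwards.
process_zip_in_batches <- function(zip_path, parse_fun, batch_size = 100) {
  # Names of all entries in the archive (nothing is extracted yet)
  entries <- unzip(zip_path, list = TRUE)$Name

  # Dedicated subdirectory of tempdir(), so cleanup only touches our files
  work_dir <- tempfile("xbrl_batch_")
  dir.create(work_dir)
  on.exit(unlink(work_dir, recursive = TRUE), add = TRUE)

  # Split the entry names into consecutive batches
  batches <- split(entries, ceiling(seq_along(entries) / batch_size))

  results <- list()
  for (batch in batches) {
    # Extract only the entries of this batch
    paths <- unzip(zip_path, files = batch, exdir = work_dir)
    res <- lapply(paths, parse_fun)
    names(res) <- basename(paths)
    results <- c(results, res)
    unlink(paths)  # free disk space before the next batch
  }
  results
}

# Possible usage with the call from the answer (assumes XBRL is installed):
# out <- process_zip_in_batches(
#   file.path("temp", fls),
#   function(f) XBRL::xbrlDoAll(f, cache.dir = "temp/hmrcCache",
#                               prefix.out = NULL, verbose = FALSE))
```

This version is sequential; the lapply() call inside the loop is where parLapply() could be substituted, at the cost of keeping one batch on disk while the workers run.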
