Creating a data.frame with all US county unemployment data
I am trying to create a data.frame with all US unemployment data from the US Bureau of Labor Statistics:
http://www.bls.gov/lau/#cntyaa
The data come in one file per year, from 1990 to 2013.
I originally planned to use the BLS API, but it looks like each query counts as a separate request, and the total number of requests would exceed their threshold. Now I think it would be easier to just scrape the data from the TXT files they host, but I am having trouble parsing them with R. Here is an example of a data page:
http://www.bls.gov/lau/laucnty90.txt # 90 = 1990
I first tried to parse the file with the rvest package. But since all the data sits inside a single <p> tag, I believe there is not enough HTML structure for that tool to work with.
Then I tried download.file and read.table. But again, the data doesn't seem to be in the right format for these tools: there are extra lines at the top and bottom, and the "separator" is just whitespace, which confuses R when county names contain spaces.
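For example, splitting the file's first data line on whitespace shows the problem (just the tokens, not read.table itself):

```r
# First data line from laucnty90.txt, with spacing collapsed
txt <- "CN0100100000000 01 001 Autauga County, AL 1990 16,875 15,853 1,022 6.1"

# Whitespace splitting yields 11 tokens where only 9 fields were intended,
# because "Autauga County, AL" breaks into three pieces
length(strsplit(txt, "\\s+")[[1]])
# 11
```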
At the end of the day, I just need a data.frame with 3 pieces of data from each file: the state FIPS code, the county FIPS code, and the unemployment rate.
Right now I think the easiest way to create this data.frame might be to download the Excel files, remove the columns I don't need, remove the extra text at the top and bottom, export to CSV, and then read that into R.
I could, of course, do this for each of the 24 years. But I would lose some reproducibility that way: other people would not be able to easily verify that I made no mistakes during the import process.
Does anyone see an easier way to create this data.frame?
You need a "ruler" to figure out where to split the columns:
cat(">",paste0(rep(c(1:9,"+"),14),collapse=""))
> 123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
cat(">",paste0(sprintf("%08d0/",1:14),collapse=""))
> 000000010/000000020/000000030/000000040/000000050/000000060/000000070/000000080/000000090/000000100/000000110/000000120/000000130/000000140/
# and paste in the first line of data
> CN0100100000000 01 001 Autauga County, AL 1990 16,875 15,853 1,022 6.1
This lets you figure out where to put the splits; taking the difference of those cut positions (shifted by one) gives the widths. Skip the first six header lines, and read the data as character to avoid the hassle of factors. Remove the commas before converting to numeric.
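For example, the cut positions read off the ruler convert to field widths like this (the same positions used in the read.fwf call):

```r
# End column of each field, read off the ruler; prepend 0, shift by one,
# and diff() yields the widths read.fwf needs
pos <- c(0, 16, 21, 29, 80, 86, 100, 115, 125, 132) + 1
diff(pos)
# 16  5  8 51  6 14 15 10  7
```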
> url <- "http://www.bls.gov/lau/laucnty90.txt"
> dat <- read.fwf(url,
       widths=diff(c(0,16,21,29,80,86,100,115,125,132)+1),
       skip=6, colClasses="character")
> str(dat)
'data.frame': 3219 obs. of 9 variables:
$ V1: chr "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
$ V2: chr " 01 " " 01 " " 01 " " 01 " ...
$ V3: chr " 001 " " 003 " " 005 " " 007 " ...
$ V4: chr " Autauga County, AL " " Baldwin County, AL " " Barbour County, AL " " Bibb County, AL " ...
$ V5: chr " 1990 " " 1990 " " 1990 " " 1990 " ...
$ V6: chr " 16,875 " " 46,773 " " 11,458 " " 7,408 " ...
$ V7: chr " 15,853 " " 44,492 " " 10,619 " " 6,776 " ...
$ V8: chr " 1,022 " " 2,281 " " 839 " " 632 " ...
$ V9: chr " 6.1" " 4.9" " 7.3" " 8.5" ...
dat[6:8] <- lapply( dat[6:8],
function(col) as.numeric( gsub("[,]", "", col)) )
> str(dat)
'data.frame': 3219 obs. of 9 variables:
$ V1: chr "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
$ V2: chr " 01 " " 01 " " 01 " " 01 " ...
$ V3: chr " 001 " " 003 " " 005 " " 007 " ...
$ V4: chr " Autauga County, AL " " Baldwin County, AL " " Barbour County, AL " " Bibb County, AL " ...
$ V5: chr " 1990 " " 1990 " " 1990 " " 1990 " ...
$ V6: num 16875 46773 11458 7408 19130 ...
$ V7: num 15853 44492 10619 6776 18001 ...
$ V8: num 1022 2281 839 632 1129 ...
$ V9: chr " 6.1" " 4.9" " 7.3" " 8.5" ...
dat[[9]] <- as.numeric( dat[[9]])
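From here, the three-column data.frame the question asks for (state FIPS, county FIPS, unemployment rate) is just trimming and subsetting. A minimal sketch, assuming dat as read above; the column names fips_state/fips_county/rate are made up for illustration:

```r
# From the str() output above: V2 = state FIPS, V3 = county FIPS,
# V9 = unemployment rate (already numeric after the conversion)
unemp <- data.frame(fips_state  = trimws(dat$V2),
                    fips_county = trimws(dat$V3),
                    rate        = dat$V9,
                    stringsAsFactors = FALSE)
head(unemp)
```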
Perhaps this could be improved by using "NULL" entries in colClasses to drop the columns you don't need.
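For instance (a sketch, not tested against every year's file), marking the unneeded fields as "NULL" drops them at read time, since read.fwf passes colClasses through to read.table:

```r
url <- "http://www.bls.gov/lau/laucnty90.txt"
# Keep only state FIPS, county FIPS, and the unemployment rate
dat <- read.fwf(url,
                widths = diff(c(0, 16, 21, 29, 80, 86, 100, 115, 125, 132) + 1),
                skip = 6,
                colClasses = c("NULL", "character", "character", "NULL", "NULL",
                               "NULL", "NULL", "NULL", "character"))
```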
Here's an option using the XLConnect package to read the spreadsheets directly, avoiding all the work of defining column boundaries.
get.file <- function(url) {
require(XLConnect)
download.file(url,"temp.xlsx",mode="wb")
wb <- loadWorkbook("temp.xlsx")
ws <- readWorksheet(wb,1,startRow=7,header=FALSE,drop=list(6))
ws[!is.na(ws$Col2),] # remove empty rows at the end
}
pfx <- "http://www.bls.gov/lau/laucnty"
urls <- paste0(pfx,c(90:99,formatC(0:13,width=2,flag=0)),".xlsx")
result <- do.call(rbind,lapply(urls,get.file))
head(result)
# Col1 Col2 Col3 Col4 Col5 Col7 Col8 Col9 Col10
# 1 CN0100100000000 01 001 Autauga County, AL 1990 16875 15853 1022 6.1
# 2 CN0100300000000 01 003 Baldwin County, AL 1990 46773 44492 2281 4.9
# 3 CN0100500000000 01 005 Barbour County, AL 1990 11458 10619 839 7.3
# 4 CN0100700000000 01 007 Bibb County, AL 1990 7408 6776 632 8.5
# 5 CN0100900000000 01 009 Blount County, AL 1990 19130 18001 1129 5.9
# 6 CN0101100000000 01 011 Bullock County, AL 1990 4381 3869 512 11.7
In the call to readWorksheet(...) we start at row 7, skipping the header rows, and we drop column 6 because it is empty. Then we delete any rows that contain NA in column 2 (the last few rows are notes). Finally, we use lapply(...) to create a list of data.frames, one per downloaded file, and do.call(rbind,...) to combine them row-wise.
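A quick sanity check on the URL vector built above: 1990-1999 use the two-digit suffixes 90-99, while 2000-2013 use zero-padded 00-13, for 24 files in all.

```r
pfx  <- "http://www.bls.gov/lau/laucnty"
# 90:99 for the 1990s, then "00".."13" (zero-padded) for 2000-2013
urls <- paste0(pfx, c(90:99, formatC(0:13, width = 2, flag = "0")), ".xlsx")
length(urls)
# 24
urls[c(1, 11, 24)]
# "http://www.bls.gov/lau/laucnty90.xlsx" "http://www.bls.gov/lau/laucnty00.xlsx"
# "http://www.bls.gov/lau/laucnty13.xlsx"
```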
Note that all the columns are character. There is some more cleaning you will need to do. As is typical of datasets like this, some data is missing, and the code for "missing" is not always the same (sometimes "NA", sometimes "N.A.", etc.).
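A hedged sketch of that cleanup, assuming only the two missing-value codes mentioned above (there may well be others in the real files); clean_num is a made-up helper name:

```r
# Convert a character column to numeric: unify missing-value codes,
# then strip thousands separators before conversion
clean_num <- function(x) {
  x[trimws(x) %in% c("NA", "N.A.")] <- NA
  as.numeric(gsub(",", "", x))
}

clean_num(c("16,875", "6.1", "N.A."))
# 16875.0     6.1      NA
```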