Creating a data.frame with all US county unemployment data

I am trying to create a data.frame with all of the US county unemployment data from the US Bureau of Labor Statistics:

http://www.bls.gov/lau/#cntyaa

There is one file per year, from 1990 to 2013.

I originally planned to use the BLS API, but it looks like each series counts as a separate request, and the total number of requests would exceed their threshold. Now I think it would be easier to just scrape the data from the TXT files they host, but I am having trouble parsing them with R. Here is an example of a data file:

http://www.bls.gov/lau/laucnty90.txt # 90 = 1990

I first tried to parse the page with the rvest package, but since all the data sits inside a single <p> tag, I believe there isn't enough HTML structure for that tool to work with.

Then I tried download.file and read.table, but again the data doesn't seem to be in the right format for these tools: there are extra lines at the top and bottom, and the "separator" is just whitespace, which confuses R when county names themselves contain spaces.

At the end of the day, I just need a data.frame with 3 pieces of data from each file: the state FIPS code, the county FIPS code, and the unemployment rate.

Now I think the easiest way to build this data.frame might be to download the Excel files, remove the columns I don't need, delete the extra text at the top and bottom, export to CSV, and then read that into R.

I could, of course, do this for each of the 24 years, but I would lose some reproducibility that way: other people would not be able to easily verify that I made no mistakes in the import process.

Does anyone see an easier way to create this data.frame?


3 answers


You need a "ruler" to figure out where the column boundaries are:

cat(">",paste0(rep(c(1:9,"+"),14),collapse=""))
cat(">",paste0(sprintf("%08s0/",1:14),collapse=""))
> 123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
cat(">",paste0(sprintf("%08s0/",1:14),collapse=""))
> 000000010/000000020/000000030/000000040/000000050/000000060/000000070/000000080/000000090/000000100/000000110/000000120/000000130/000000140/
# and paste in the first line of data
> CN0100100000000   01     001   Autauga County, AL                                1990        16,875       15,853      1,022      6.1


This lets you see where to put the splits; add one to each cut point and take successive differences to get the field widths. Skip the first six lines, and read the data as character to avoid the usual trouble with factors. Remove the commas before converting to numeric.
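For example, using the cut points read off the ruler (the same positions used in the call below):

cuts <- c(0,16,21,29,80,86,100,115,125,132)  # last column of each field, per the ruler
diff(cuts + 1)                               # successive differences give the widths
# [1] 16  5  8 51  6 14 15 10  7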



> url <- "http://www.bls.gov/lau/laucnty90.txt"
> dat = read.fwf(url, 
                 widths=diff(c(0,16,21,29,80,86,100,115,125,132)+1), 
                 skip=6,                  # skip the header lines
                 colClasses="character")  # read everything as character for now
> str(dat)
'data.frame':   3219 obs. of  9 variables:
 $ V1: chr  "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
 $ V2: chr  "  01 " "  01 " "  01 " "  01 " ...
 $ V3: chr  "    001 " "    003 " "    005 " "    007 " ...
 $ V4: chr  "  Autauga County, AL                               " "  Baldwin County, AL                               " "  Barbour County, AL                               " "  Bibb County, AL                                  " ...
 $ V5: chr  " 1990 " " 1990 " " 1990 " " 1990 " ...
 $ V6: chr  "       16,875 " "       46,773 " "       11,458 " "        7,408 " ...
 $ V7: chr  "      15,853   " "      44,492   " "      10,619   " "       6,776   " ...
 $ V8: chr  "   1,022  " "   2,281  " "     839  " "     632  " ...
 $ V9: chr  "    6.1" "    4.9" "    7.3" "    8.5" ...

dat[6:8] <- lapply( dat[6:8],   # labor force, employed and unemployed counts
                    function(col) as.numeric( gsub("[,]", "", col)) )  # drop commas first

> str(dat)
'data.frame':   3219 obs. of  9 variables:
 $ V1: chr  "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
 $ V2: chr  "  01 " "  01 " "  01 " "  01 " ...
 $ V3: chr  "    001 " "    003 " "    005 " "    007 " ...
 $ V4: chr  "  Autauga County, AL                               " "  Baldwin County, AL                               " "  Barbour County, AL                               " "  Bibb County, AL                                  " ...
 $ V5: chr  " 1990 " " 1990 " " 1990 " " 1990 " ...
 $ V6: num  16875 46773 11458 7408 19130 ...
 $ V7: num  15853 44492 10619 6776 18001 ...
 $ V8: num  1022 2281 839 632 1129 ...
 $ V9: chr  "    6.1" "    4.9" "    7.3" "    8.5" ...

dat[[9]] <- as.numeric( dat[[9]])   # the unemployment rate has no commas to strip


This could probably be improved with some "NULL" entries in colClasses to drop the columns you don't need.
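A minimal sketch of that idea (reusing the url and widths from above; the column names are hypothetical), keeping only the three columns the question asks for:

# "NULL" in colClasses makes read.fwf/read.table skip that column entirely
dat <- read.fwf(url,
                widths=diff(c(0,16,21,29,80,86,100,115,125,132)+1),
                skip=6,
                colClasses=c("NULL","character","character","NULL","NULL",
                             "NULL","NULL","NULL","character"))
names(dat) <- c("fips_state","fips_county","unemployment_rate")  # hypothetical names
dat$unemployment_rate <- as.numeric(dat$unemployment_rate)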



This is just a fixed-width data file. I don't have time to give you perfect code, but tweaking something like this should get you what you need:



url = 'http://www.bls.gov/lau/laucnty90.txt'
w = c(17, 8, 6, 50, 12, 13, 12, 11)   # approximate field widths; adjust as needed
dat = read.fwf(url, w, skip=3)        # skip the lines above the data
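Building on this, here is a sketch that loops over all 24 files and keeps just the three columns the question needs (the widths and skip value follow the ruler answer above and should be verified against each file):

read.year <- function(u) {
  d <- read.fwf(u, widths=diff(c(0,16,21,29,80,86,100,115,125,132)+1),
                skip=6, colClasses="character")
  data.frame(fips_state        = trimws(d$V2),   # strip fixed-width padding
             fips_county       = trimws(d$V3),
             unemployment_rate = as.numeric(d$V9),
             stringsAsFactors  = FALSE)
}
years <- c(90:99, formatC(0:13, width=2, flag="0"))   # 1990-2013 file suffixes
urls  <- paste0("http://www.bls.gov/lau/laucnty", years, ".txt")
result <- do.call(rbind, lapply(urls, read.year))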




Here's an option using the XLConnect package to read the spreadsheets directly, which avoids all this fiddling with column boundaries.

get.file <- function(url) {
  require(XLConnect)
  download.file(url,"temp.xlsx",mode="wb")
  wb <- loadWorkbook("temp.xlsx")
  ws <- readWorksheet(wb,1,startRow=7,header=FALSE,drop=list(6))
  ws[!is.na(ws$Col2),]   # remove empty rows at the end
}
pfx  <- "http://www.bls.gov/lau/laucnty"
urls <- paste0(pfx,c(90:99,formatC(0:13,width=2,flag=0)),".xlsx")
result <- do.call(rbind,lapply(urls,get.file))
head(result)
#              Col1 Col2 Col3               Col4 Col5  Col7  Col8 Col9 Col10
# 1 CN0100100000000   01  001 Autauga County, AL 1990 16875 15853 1022   6.1
# 2 CN0100300000000   01  003 Baldwin County, AL 1990 46773 44492 2281   4.9
# 3 CN0100500000000   01  005 Barbour County, AL 1990 11458 10619  839   7.3
# 4 CN0100700000000   01  007    Bibb County, AL 1990  7408  6776  632   8.5
# 5 CN0100900000000   01  009  Blount County, AL 1990 19130 18001 1129   5.9
# 6 CN0101100000000   01  011 Bullock County, AL 1990  4381  3869  512  11.7


In the readWorksheet(...) call we start at row 7 (startRow=7) to skip the header rows, and we drop column 6 because it is empty. Then we remove any rows with NA in column 2 (the last few rows are notes). Finally, we use lapply(...) to create a list with one data.frame per file, and do.call(rbind,...) to combine them row-wise.

Note that all the columns are character. There's some more cleaning you will need to do. As is typical of datasets like this, some data is missing, and the code for "missing" is not always the same (sometimes "NA", sometimes "N.A.", etc.).
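A sketch of that cleanup (the exact set of missing-value codes and the column names here are assumptions to check against the real files):

to.num <- function(x) {
  x[x %in% c("NA","N.A.")] <- NA    # normalize the "missing" codes you find
  as.numeric(gsub(",", "", x))      # strip any thousands separators
}
result[6:9] <- lapply(result[6:9], to.num)   # Col7-Col10: labor force, employed,
                                             # unemployed, unemployment rate
names(result) <- c("series_id","fips_state","fips_county","county_name","year",
                   "labor_force","employed","unemployed","unemployment_rate")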
