How to scan correctly?
I've been working and messing around with Nokogiri, REXML and Ruby for a month now. I have a gigantic database that I am trying to build. What I am reading are HTML links and XML files.
There are exactly 43,612 XML files that I want to scan and store in a CSV file.
My script works if I scan around 500 XML files, but any more than that takes too long and it hangs or something.
Here I have split the code into parts for easy reading, all script / code is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do it all in Nokogiri, and I personally find REXML easier to use.
My question is: how do I fix this so that crawling all of it won't take me a week? How do I make it run faster?
HERE IS MY SCRIPT:
Require the required lib:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of storage arrays to hold the data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all XML links from the spec site and store them in an array called @urls
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end
Loop through the @urls array and grab every element node I want using XPath.
@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  end
Then save them as a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end
It is difficult to pinpoint the exact problem because of the way the code is structured. Here are some tips to speed it up and to structure your program so it's easier to find what's blocking you.
Libraries
You are using more libraries here than you probably need. You use both REXML and Nokogiri, and they do the same job; Nokogiri is also much faster (benchmark).
Use hashes
Instead of storing data at the same index across 15 parallel arrays, use one set of hashes.
For example,
require 'set'

items = Set.new

doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end
items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  ...
end
Collect data then write to file
Now that you have your items set built up, you can iterate over it and write the file in one pass. This is much faster than writing as you go.
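A minimal sketch of that one-pass write, with made-up items and an assumed set of columns (the real hashes would be filled in by the crawl loop):

```ruby
require 'csv'

# Hypothetical items collected during the crawl
items = [
  { :id => '1', :title_sv => 'Matematik', :title_en => 'Mathematics' },
  { :id => '2', :title_sv => 'Fysik',     :title_en => 'Physics' }
]

columns = [:id, :title_sv, :title_en]

# Open the file once and write every row in a single pass, instead of
# keeping parallel arrays in sync by index
CSV.open('education_sample.csv', 'wb') do |csv|
  csv << columns.map(&:to_s)                            # header row
  items.each { |item| csv << item.values_at(*columns) } # one row per item
end
```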
Be DRY
In your original code, you repeat the same thing several dozen times. Instead of copying and pasting, try abstracting the common code into a method instead.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
  m = e.text.to_s
  next if m.empty?
  @titleSv << m
end
Move the common code into a method:
def get_value(xml, path)
  xml.elements.each(path) do |e|
    text = e.text.to_s
    return text unless text.empty?  # return the first non-empty match
  end
  ''
end
And move anything constant to another hash
xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}
You can now combine these techniques to make much cleaner code.
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
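Putting the pieces together, each item can be filled in with one short loop over the path hash. A sketch using the method above, with an inline sample document standing in for open(item[:url]).read and only two paths shown (namespace variants omitted):

```ruby
require 'rexml/document'

# The helper from above, repeated so this sketch is self-contained
def get_value(xml, path)
  xml.elements.each(path) do |e|
    text = e.text.to_s
    return text unless text.empty?  # return the first non-empty match
  end
  ''
end

xml_paths = {
  :title_sv => '/educationInfo/titles/title[1]',
  :title_en => '/educationInfo/titles/title[2]'
}

# Stand-in document; the real script would parse open(item[:url]).read
xml = REXML::Document.new(<<~XML)
  <educationInfo id="42">
    <titles>
      <title>Matematik</title>
      <title>Mathematics</title>
    </titles>
  </educationInfo>
XML

item = { :id => xml.root.attributes['id'] }
xml_paths.each { |key, path| item[key] = get_value(xml, path) }
# item[:id] => "42", item[:title_sv] => "Matematik"
```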
Hope this helps!
It won't get much faster without real changes. I believe you should do as @Ian Bishop said and refactor your parsing code:
require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'

class Links < Pioneer::Base
  include REXML

  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end

  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end
class Crawler < Pioneer::Base
  include REXML

  def locations
    Links.new.start.flatten
  end

  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end

Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
If you really want to speed it up, you have to go in parallel.
One of the easiest ways is to install JRuby and then run the application with one slight modification: install either the peach or pmap gem, then change items.each to items.peach(n) (parallel each), where n is the number of threads. You will need at least one thread per CPU core, but if there is I/O in your loop you will want more.
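If JRuby is not an option, plain MRI threads can still help, since this loop is dominated by network I/O and the interpreter lock is released while waiting on sockets. A rough sketch with the standard library, using placeholder URLs instead of real fetching:

```ruby
urls = (1..20).map { |i| "http://example.com/doc#{i}.xml" }  # placeholder URLs

queue = Queue.new
urls.each { |u| queue << u }

results = Queue.new
n_threads = 4  # at least one per core; more when the work is I/O-bound

workers = Array.new(n_threads) do
  Thread.new do
    loop do
      url = begin
              queue.pop(true)    # non-blocking pop; raises when empty
            rescue ThreadError
              break              # queue drained, this worker is done
            end
      # A real worker would fetch and parse the XML here
      results << url
    end
  end
end
workers.each(&:join)

puts results.size  # => 20
```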
Also, use Nokogiri, it's much faster. If you need to solve something specific with Nokogiri, ask it as a separate question. I'm sure it can do what you need.