How to scan correctly?

I've been working and messing around with Nokogiri, REXML and Ruby for a month now. I'm trying to build a gigantic database by reading HTML pages for links and then scanning the XML files they point to.

There are exactly 43612 XML files that I want to scan and store in a CSV file.

My script works when the scan is limited to around 500 XML files, but anything more takes too long and it hangs or something.

Here I have split the code into parts for easy reading, all script / code is here: https://gist.github.com/1981074

I am using two libraries because I couldn't find a way to do this all in Nokogiri, and I personally find REXML easier to use.

My question is: how do I fix this so that it won't take me a week to crawl everything? How do I make it run faster?

HERE IS MY SCRIPT:

Require the required lib:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML


Create a bunch of storage arrays to hold the grabbed data:

@urls = Array.new 
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new


Grab all XML links from the spec site and store them in an array called @urls

htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))

htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end


Loop through the @urls array and grab every element node I want using XPath.

@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Grab the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  end
  # ... the remaining fields are grabbed the same way (full code in the gist)
end


Then save them as a CSV file.

CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end


3 answers


It is difficult to pinpoint the exact problem because of the way the code is structured. Here are some tips to speed it up and to structure your program so it is easier to find what is blocking you.

Libraries

You are using a lot of libraries here that you probably don't need.

You use both REXML and Nokogiri. They both do the same job, and Nokogiri is much faster (benchmark).

Use hashes

Instead of storing data in 15 indexed arrays, use one set of hashes.

For example,

require 'set'

items = Set.new

doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))

  item[:id] = xml.root['id']
  ...
end


Collect data then write to file

Now that you have your items set populated, you can iterate over it and write everything to the file. This is much faster than doing it one row at a time.
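A minimal sketch of that pattern, assuming each item ends up as a hash (the field names and rows here are invented for illustration):

```ruby
require 'csv'

# Hypothetical items collected during the scraping phase.
items = [
  { id: '1', title_sv: 'Kurs A', title_en: 'Course A' },
  { id: '2', title_sv: 'Kurs B', title_en: 'Course B' }
]

# Open the file once and write every collected row in a single pass,
# instead of reopening or appending inside the scraping loop.
CSV.open('eduction_normal.csv', 'wb') do |csv|
  csv << %w[id title_sv title_en]  # header row
  items.each { |item| csv << item.values_at(:id, :title_sv, :title_en) }
end

puts CSV.read('eduction_normal.csv').length  # => 3 (header + 2 rows)
```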



Be DRY

In your original code, you repeat the same thing several dozen times. Instead of copying and pasting, try abstracting the common code into methods.

xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
  m = e.text.to_s
  next if m.empty?
  @titleSv << m
end


Move the common code into a method:

def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    text = e.text.to_s
    str = text unless text.empty?
  end
  str
end


And move the constant XPath strings into another hash:

xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}


You can now combine these techniques to make much cleaner code.

item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])


Hope this helps!



It won't work exactly as-is without your fixes, and I believe you should do as @Ian Bishop said and refactor your parsing code.



require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'

class Links < Pioneer::Base
  include REXML
  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end

  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end

class Crawler < Pioneer::Base
  include REXML
  def locations
    Links.new.start.flatten
  end

  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end

Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)




If you really want to speed it up, you have to run it in parallel.

One of the easiest ways is to install JRuby and then launch the application with one slight modification: install either the peach or pmap gem and then change items.each to items.peach(n) (parallel each), where n is the number of threads. You will need at least one thread per processor core, but if you have I/O in your loop, you will want more.
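If JRuby is not an option, the same idea can be sketched with plain Ruby threads, which still helps when the loop is dominated by network I/O. The URLs and worker count below are placeholders:

```ruby
# Simple thread-pool sketch: workers pull URLs off a queue until it is empty.
urls  = ['http://example.com/a.xml', 'http://example.com/b.xml']
queue = Queue.new
urls.each { |u| queue << u }

results = Queue.new  # Queue is thread-safe, unlike Array
n = 4                # roughly one thread per core; more if the loop is I/O-bound

workers = n.times.map do
  Thread.new do
    while (url = queue.pop(true) rescue nil)  # non-blocking pop; nil when empty
      # open(url) and the XML parsing would happen here;
      # this sketch just records the URL it would have fetched.
      results << url
    end
  end
end
workers.each(&:join)

puts results.size  # => 2
```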

Also, use Nokogiri; it's much faster. If you need to solve something specific with Nokogiri, ask a separate question. I'm sure it can do what you need.







