How to grab the content of an element with Nokogiri's SAX parser
I want to parse a couple thousand XML files from a website (I have permission) and need to use SAX to avoid loading each file into memory. The extracted data should then be saved to a CSV file.
The XML files look like this:
<?xml version="1.0" encoding="UTF-8"?><educationInfo xmlns="http://skolverket.se/education/info/1.2" xmlns:ct="http://skolverket.se/education/commontypes/1.2" xmlns:nya="http://vhs.se/NyA-emil-extensions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" expires="2013-08-01" id="info.uh.su.HIA80D" lastEdited="2011-10-13T10:10:05" xsi:schemaLocation="http://skolverket.se/education/info/1.2 educationinfo.xsd">
<titles>
<title xml:lang="sv">Arkivvetenskap</title>
<title xml:lang="en">Archival science</title>
</titles>
<identifier>HIA80D</identifier>
<educationLevelDetails>
<typeOfLevel>uoh</typeOfLevel>
<typeOfResponsibleBody>statlig</typeOfResponsibleBody>
<academic>
<course>
<type>avancerad</type>
</course>
</academic>
</educationLevelDetails>
<credits>
<exact>60</exact>
</credits>
<degrees>
<degree>Ingen examen</degree>
</degrees>
<prerequisites>
<academic>uh</academic>
</prerequisites>
<subjects>
<subject>
<code source="vhs">10.300</code>
</subject>
</subjects>
<descriptions>
<ct:description xml:lang="sv">
<ct:text>Arkivvetenskap rör villkoren för befintliga arkiv och modern arkivbildning med fokus på arkivarieyrkets arbetsuppgifter: bevara, tillgängliggöra och styra information. Under ett år behandlas bl a informations- och dokumenthantering, arkivredovisning, gallring, lagstiftning och arkivteori. I kursen ingår praktik, där man under handledning får arbeta med olika arkivarieuppgifter.</ct:text>
</ct:description>
</descriptions>
</educationInfo>
I am using this code pattern; my questions are in the comments:
class InfoData < Nokogiri::XML::SAX::Document
  def initialize
    # do one-time setup here, called as part of Class.new
    # But what should I use: hashes or arrays?
  end

  def start_element(name, attributes = [])
    # check the element name here and create an ActiveRecord object if appropriate
    # How do I grab a specific element like ct:text?
    # How do I grab the root element?
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your ActiveRecord object now
  end
end

parser = Nokogiri::XML::SAX::Parser.new(InfoData.new)
# How do I parse every XML link?
parser.parse_file('')
I wrote this method to collect the links, but I don't know where in the class to use it, or whether it belongs there at all:
require 'set'
require 'open-uri'
require 'nokogiri'

@items = Set.new

def get_links(url)
  doc = Nokogiri::HTML(URI.open(url))
  # collect the href of every <a> element on the page
  doc.xpath('//a/@href').each do |href|
    @items << { url: href.content }
  end
end
2 answers
require 'nokogiri'

class LinkGrabber < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    # attrs is an array of [name, value] pairs
    puts Hash[attrs]['href'] if name == 'a'
  end
end

parser = Nokogiri::XML::SAX::Parser.new(LinkGrabber.new)
# read the file given on the command line in binary mode
parser.parse(File.read(ARGV[0], mode: 'rb'))
Now you can use this in your pipeline:
find . -name "*.xml" -print0 | xargs -P 20 -0 -L 1 ruby parse.rb > links
But this launches a new Ruby process for every file, so you're better off handling all the files in one process, for example with JRuby (which speeds things up anyway) and threads:
require 'threach'
require 'find'
require 'nokogiri'

class LinkGrabber < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts Hash[attrs]['href'] if name == 'a'
  end
end

# let's hope it's thread-safe
parser = Nokogiri::XML::SAX::Parser.new(LinkGrabber.new)
Find.find(ARGV[0]).threach do |path|
  next unless File.file?(path)
  parser.parse(File.read(path))
end