Bash Parsing XML into Array
I'm writing a mixed-language script whose parent script is bash (don't ask why; it's a long story). Part of the script pulls the source of an XML page into a variable. I want to use bash to split the XML in that variable into multiple arrays. The XML is structured as follows:
<event>
<id>34287352</id>
<what>New Post</what>
<when>1 Minute Ago 03:50 PM</when>
<title>This is a title</title>
<preview>sdfasd</preview>
<poster>
<![CDATA[ USERNAME ]]>
</poster>
<threadid>2346566</threadid>
<postid>34287352</postid>
<lastpost>1360021837</lastpost>
<userid>3291696</userid>
<forumid>2</forumid>
<forumname>General Discussion</forumname>
<views>201,913</views>
<replies>6,709</replies>
<statusicon>images/statusicon/thread.gif</statusicon>
</event>
There are 20 <event> elements in the XML file. I want to pull the title and preview out of the XML and put each into its own array.
I followed an example here on Stack Overflow:
for tag in what title preview
do
OUT=`grep $tag $source | tr -d '\t' | sed 's/^<.*>\([^<].*\)<.*>$/\1/' `
# This is what I call the eval_trick, difficult to explain in words.
eval ${tag}=`echo -ne \""${OUT}"\"`
done
W_ARRAY=( `echo ${what}` )
T_ARRAY=( `echo ${title}` )
P_ARRAY=( `echo ${preview}` )
echo ${W_ARRAY[0]}
echo ${T_ARRAY[0]}
echo ${P_ARRAY[0]}
But with the above, my script chokes and repeatedly prints grep: <part of the xml>: No such file or directory
Thoughts?
EDIT:
Well, this is ugly, but I managed to get the pseudo-XML into arrays:
windex=0
tindex=0
pindex=0
while read -r line
do
WHAT=$(echo ${line} | awk -F "</?what>" '{ print $2 }')
if [ "$WHAT" != "" ]; then
W_ARRAY[$windex]=$WHAT
let windex+=1
fi
TITLE=$(echo ${line} | awk -F "</?title>" '{ print $2 }')
if [ "$TITLE" != "" ]; then
T_ARRAY[$tindex]=$TITLE
let tindex+=1
fi
PREVIEW=$(echo ${line} | awk -F "</?preview>" '{ print $2 }')
if [ "$PREVIEW" != "" ]; then
P_ARRAY[$pindex]=$PREVIEW
let pindex+=1
fi
done <<< "$source"
Here is everything, and it works. For anyone planning to do something like this, here's what I ended up with. The AppleScript:
on run argv
set region to item 1 of argv
set XML_URL to "http://" & region & ".<URL REMOVED>.com/board/vaispy-secret.php?do=xml"
try
tell application "Safari"
set URL of tab 1 of front window to XML_URL
my waitforload()
--delay 5
-- Get page source
set currentTab to current tab of front window
set currentSource to currentTab source
return currentSource
end tell
on error err
log "Could not retrieve source."
log err
display dialog err
--return "NULL"
end try
end run
on waitforload()
--check if page has loaded
local loadflag, zarg, test_html
set loadflag to 0
repeat until loadflag is 1
delay 0.5
tell application "Safari"
set test_html to source of document 1
end tell
try
set zarg to text ((count of characters in test_html) - 10) thru (count of characters in test_html) of test_html
if "</events>" is in text ((count of characters in test_html) - 10) thru (count of characters in test_html) of test_html then
set loadflag to 1
end if
end try
end repeat
end waitforload
The bash script:
#!/bin/bash
clear
if [ "$1" == "na" ]; then
region="na"
elif [ "$1" == "eu" ]; then
region="euw"
else
echo "FRcli requires an argument."
echo "usage: [eu|na]"
echo "[eu scans EUW & EUNE]"
echo "[na scans NA]"
exit 1
fi
while true; do
clear
echo "Region: $region"
echo "...Importing Naughty"
declare -a NAUGHTY=()
nindex=0
while read line
do
NAUGHTY[$nindex]=$line
let nindex+=1
done < $HOME/Desktop/naughty.txt
NC=${#NAUGHTY[@]}
let NC-=1
echo "...Pulling Source"
source=$(osascript FRcli.scpt $region)
echo "...Extracting Arrays"
windex=0
tindex=0
pindex=0
dindex=0
while read -r line
do
#WHAT=$(echo ${line} | awk -F "</?what>" '{ print $2 }')
WHAT=$(echo ${line} | sed -n 's/^.*<what>\([^<]*\).*/\1/p')
if [ "$WHAT" != "" ]; then
W_ARRAY[$windex]=$WHAT
let windex+=1
fi
#TITLE=$(echo ${line} | awk -F "</?title>" '{ print $2 }')
TITLE=$(echo ${line} | sed -n 's/^.*<title>\([^<]*\).*/\1/p')
if [ "$TITLE" != "" ]; then
T_ARRAY[$tindex]=$TITLE
let tindex+=1
fi
#PREVIEW=$(echo ${line} | awk -F "</?preview>" '{ print $2 }')
#PREVIEW=$(echo ${line} | sed -n '/<preview*/,/<\/preview>/p')
PREVIEW=$(echo ${line} | sed -n 's/^.*<preview>\([^<]*\).*/\1/p')
if [ "$PREVIEW" != "" ]; then
P_ARRAY[$pindex]=$PREVIEW
let pindex+=1
fi
POSTID=$(echo ${line} | sed -n 's/^.*<postid>\([^<]*\).*/\1/p')
if [ "$POSTID" != "" ]; then
D_ARRAY[$dindex]=$POSTID
let dindex+=1
fi
done <<< "$source"
echo "What: ${#W_ARRAY[@]}"
echo "Title: ${#T_ARRAY[@]}"
echo "Preview: ${#P_ARRAY[@]}"
echo "PostID: ${#D_ARRAY[@]}"
for ((i=0; i <= 19; i++))
do
found=0
fpid=""
if [ "${W_ARRAY[$i]}" = "New Thread" ]; then
echo "Scanning Thread"
scan=$(echo ${T_ARRAY[$i]} ${P_ARRAY[$i]})
echo "Title: ${T_ARRAY[$i]}"
echo "Post: ${P_ARRAY[$i]}"
else
echo "Scanning Post"
scan=$(echo ${P_ARRAY[$i]})
echo "Post: ${scan}"
fi
sleep .5
for ((n=0; n<=$NC; n++))
do
nw=${NAUGHTY[$n]}
a=$(echo "${scan}" | tr '[:lower:]' '[:upper:]')
b=$(echo "${nw}" | tr '[:lower:]' '[:upper:]')
echo "Checking: $b"
#echo "$a"
if [[ $a == *$b* ]]; then
## Change != to == in release
echo "Found: $b"
found=1
echo "...Loading PID"
declare -a PID=()
pindex=0
while read line
do
PID[$pindex]=$line
let pindex+=1
done < $HOME/Desktop/pid.txt
PIDC=${#PID[@]}
for (( p=0; p<=$PIDC ; p++))
do
lpid=${PID[$p]}
if [ "$region ${D_ARRAY[$i]}" == "$lpid" ]; then
echo "Found: $lpid"
echo "Ignoring Flag"
fpid=1
elif [ "$region ${D_ARRAY[$i]}" != "$lpid" ]; then
echo "$region ${D_ARRAY[$i]} $lpid"
echo "PID not found, opening URL."
fpid=0
break
else
echo "Hi"
fpid=1
fi
done
if [ "$found" == "1" -a "$fpid" == "0" ]; then
FFURL="http://$region.<URL REMOVED>.com/board/showthread.php?p=${D_ARRAY[$i]}&highlight=$nw"
open -a Firefox "$FFURL"
echo $region ${D_ARRAY[$i]} >> $HOME/Desktop/pid.txt
found=0
fpid=""
fi
fi
done
sleep .5
done
if [ "$1" == "eu" ]; then
if [ "$region" == "euw" ]; then
region="eune"
else
region="euw"
fi
fi
clear
done
I'm sure there are much more efficient ways to do this. Using cURL in the bash script would turn this into a single-script affair (I can't with this script because of the security on these iSpy boards). But it works and it's pretty zippy. It averages only 32.7 in memory use and, as far as I can tell, has no memory leaks (unlike my 100% version of this app).
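As one possible tightening of the extraction loop above, here is a minimal sketch that uses bash's built-in regex matching instead of spawning echo and sed for every line. The sample data and the names source_xml and re_* are made up for illustration; it assumes each tag opens and closes on the same line, as in the question's XML.

```shell
#!/bin/bash
# Hypothetical sample standing in for the page source returned by osascript.
source_xml='<events>
<event>
<what>New Post</what>
<title>This is a title</title>
<preview>sdfasd</preview>
<postid>34287352</postid>
</event>
<event>
<what>New Thread</what>
<title>Another title</title>
<preview>hello world</preview>
<postid>34287353</postid>
</event>
</events>'

W_ARRAY=(); T_ARRAY=(); P_ARRAY=(); D_ARRAY=()

# Patterns live in variables so [[ =~ ]] treats them as regexes, not literals.
re_what='<what>([^<]*)</what>'
re_title='<title>([^<]*)</title>'
re_preview='<preview>([^<]*)</preview>'
re_postid='<postid>([^<]*)</postid>'

while IFS= read -r line; do
    # BASH_REMATCH[1] holds the text captured by the parenthesised group.
    if [[ $line =~ $re_what ]];    then W_ARRAY+=("${BASH_REMATCH[1]}"); fi
    if [[ $line =~ $re_title ]];   then T_ARRAY+=("${BASH_REMATCH[1]}"); fi
    if [[ $line =~ $re_preview ]]; then P_ARRAY+=("${BASH_REMATCH[1]}"); fi
    if [[ $line =~ $re_postid ]];  then D_ARRAY+=("${BASH_REMATCH[1]}"); fi
done <<< "$source_xml"

echo "What: ${#W_ARRAY[@]}"
echo "Title: ${#T_ARRAY[@]}"
echo "Preview: ${#P_ARRAY[@]}"
echo "PostID: ${#D_ARRAY[@]}"
```

Unlike the loop above, this version also stores empty values (for example an empty <preview></preview>), which keeps the four arrays aligned by index. It is still line-based pattern matching rather than real XML parsing, so CDATA sections or tags split across lines would need separate handling.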
I had to do something similar, so, parsing-wise, here is a hacked-together version.
I am using xsltproc (it is available on Ubuntu, but I don't remember whether I had to install it).
Command line:
xsltproc tfile.xslt tfile.xml
tfile.xml is your example copied 3 times and wrapped in <events> tags, i.e.
<events>
<event> ... </event>
<event> ... </event>
<event> ... </event>
</events>
tfile.xslt:
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method='text'/>
<!-- ================================================================== -->
<xsl:template match="/">
<xsl:apply-templates select="//event"/>
</xsl:template>
<xsl:template match="event">
<xsl:text>event[</xsl:text><xsl:value-of select="position()"/><xsl:text>]['id']=</xsl:text>
<xsl:value-of select="id"/> <xsl:text> </xsl:text>
<xsl:text>event[</xsl:text><xsl:value-of select="position()"/><xsl:text>]['what']=</xsl:text>
<xsl:value-of select="what"/><xsl:text> </xsl:text>
<xsl:text>event[</xsl:text><xsl:value-of select="position()"/><xsl:text>]['preview']=</xsl:text>
<xsl:value-of select="preview"/><xsl:text> </xsl:text>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Output
event[1]['id']=34287352 event[1]['what']=New Post event[1]['preview']=sdfasd
event[2]['id']=34287353 event[2]['what']=New Post3 event[2]['preview']=sdfasd
event[3]['id']=34287354 event[3]['what']=New Post4 event[3]['preview']=sdfasd
I hope you know a bit of XSLT; change the output as you like.
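To get from that text output to bash arrays, one option is to have the stylesheet print one event per line with a delimiter between fields and split it with read. This is only a sketch: it assumes the stylesheet has been edited to emit id|what|preview, and the emit_events function is a hypothetical stand-in for the real xsltproc call so the example runs on its own.

```shell
#!/bin/bash
# Stand-in for: xsltproc tfile.xslt tfile.xml
# ASSUMPTION: the stylesheet was changed to print "id|what|preview" per event.
emit_events() {
    printf '%s|%s|%s\n' 34287352 'New Post'  'sdfasd'
    printf '%s|%s|%s\n' 34287353 'New Post3' 'sdfasd'
    printf '%s|%s|%s\n' 34287354 'New Post4' 'sdfasd'
}

ids=(); whats=(); previews=()

# A non-whitespace IFS character keeps empty fields; the last variable
# (preview) receives the rest of the line, including any further '|'.
while IFS='|' read -r id what preview; do
    ids+=("$id")
    whats+=("$what")
    previews+=("$preview")
done < <(emit_events)

for i in "${!ids[@]}"; do
    echo "event $i: id=${ids[$i]} what=${whats[$i]} preview=${previews[$i]}"
done
```

In real use, replace the emit_events call inside the process substitution with the xsltproc command line. The loop reads from process substitution rather than a pipe so that the arrays are filled in the current shell and remain available afterwards.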
Well, this is completely useless right now, but I'm currently working on an XML string parser. If it were finished (it would be already if I hadn't been distracted by the TopCoder marathon match ...) you could write it like this:
eval $(echo "$source" | xidel - -e '<event>
<what>{$W_ARRAY}</what>
<title>{$T_ARRAY}</title>
<preview>{$P_ARRAY}</preview>
</event>*' --output-format bash)
Looks awesome, doesn't it?
To summarize my comments, here's what went wrong with your code:
1- Since $source is a variable, not a filename, grep should read its contents from a pipe:
OUT=`echo "$source" | grep "$tag" | tr -d '\t' | sed 's/^<.*>\([^<].*\)<.*>$/\1/' `
2- Your tr command deletes all tabs from the XML in your variable. However, your variable does not contain tabs; the lines are indented with 4 spaces instead.
So you need this instead (note that it also removes spaces inside the values):
... | tr -d ' ' | ...
3- An alternative solution could be:
OUT=`echo "$source" | grep "$tag" | sed 's/<.*>\([^<].*\)<.*>$/\1/' `
(note that the ^ in the sed expression is removed)
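Putting these points together, a version of your loop without the eval trick might look like the sketch below. It is an illustration under assumptions, not a drop-in fix: source_xml and extract_tag are names invented here, the sample data is made up, and each tag is assumed to sit on its own line. The arrays are filled line by line because W_ARRAY=( `echo ${what}` ) splits on every space, which would turn "New Post" into two elements.

```shell
#!/bin/bash
# Hypothetical sample; in the real script this would be the fetched page source.
source_xml='<events>
    <event>
        <what>New Post</what>
        <title>This is a title</title>
        <preview>sdfasd</preview>
    </event>
    <event>
        <what>New Thread</what>
        <title>Second title here</title>
        <preview>more text</preview>
    </event>
</events>'

# extract_tag TAG XML: print the text of every <TAG>...</TAG> line, one per line.
# The variable is piped in (not passed as a filename) and kept quoted so that
# its newlines survive; leading indentation is skipped by the sed pattern.
extract_tag() {
    printf '%s\n' "$2" |
        sed -n "s/^[[:space:]]*<$1>\([^<]*\)<\/$1>[[:space:]]*\$/\1/p"
}

W_ARRAY=(); T_ARRAY=(); P_ARRAY=()
while IFS= read -r v; do W_ARRAY+=("$v"); done < <(extract_tag what "$source_xml")
while IFS= read -r v; do T_ARRAY+=("$v"); done < <(extract_tag title "$source_xml")
while IFS= read -r v; do P_ARRAY+=("$v"); done < <(extract_tag preview "$source_xml")

echo "${W_ARRAY[0]}"
echo "${T_ARRAY[0]}"
echo "${P_ARRAY[0]}"
```

Because sed anchors on the exact opening and closing tag, a tag name such as title no longer matches unrelated lines that merely contain the word, which plain grep $tag would.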