Parsing a local HTML file with Windows PowerShell

I would like to build an array from an HTML file using PowerShell.

I am using a script that downloads an HTML file from the Mozilla Firefox Developer Edition site (I download an index file) to a local path, and I would like to parse it to get the values of the option elements inside the select element whose id is id_country.

I was advised to use XPath for this, but I can't figure out how to parse the file and build an array from the result. Perhaps using a regex could be a workaround.

The HTML file is located here:

http://pastebin.com/b8cShFLA

And I would like to get all the option element values from here:

<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
   <option value="as">American Samoa</option>
   <option value="ad">Andorra</option>

      

...

I am new to PowerShell so I am not aware of the various solutions I could use. I need something pretty fast as part of a package installation.

Basically, the script will try to figure out whether there is an installer that matches the language of the user's computer, and otherwise fall back to the English default. That is why I need to get the values from that list: to check which locales are available for Firefox Developer Edition.

Regards, Oh

+3




3 answers


I don't see any sample code posted yet, so I'll provide some.

If it were remote HTML I would use Invoke-WebRequest, but it doesn't work very well with local files.

For local files, I would recommend using the HTML Agility Pack to parse the HTML and then XPath to get the options you are looking for. Example:



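# Load the HTML Agility Pack assembly (adjust the path to where the DLL lives)
Add-Type -Path .\HTMLAgilityPack\HtmlAgilityPack.dll
$path = (Get-Item .\b8cShFLA.html).FullName

$doc = New-Object HtmlAgilityPack.HtmlDocument
# -Raw (PowerShell 3.0+) reads the file as one string instead of an array of lines
$doc.LoadHtml((Get-Content -Path $path -Raw))

# Create hashtable to store data in
$langs = @{}

$doc.DocumentNode.SelectSingleNode("//select[@name='country']").SelectNodes("option") | ForEach-Object {
    # Read the value attribute by name rather than by position
    $short = $_.GetAttributeValue('value', '')
    # This relies on the parser treating <option> as an empty element,
    # so the country name is the text node that follows it
    $long = $_.NextSibling.InnerText.Trim()

    # Store data in hashtable
    $langs[$short] = $long
}

$langs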

      

Output:

Name                           Value
----                           -----
rw                             Rwanda
tv                             Tuvalu
to                             Tonga
pn                             Pitcairn
bh                             Bahrain
lc                             Saint Lucia   
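
Since the question asks for an array, here is a minimal sketch of how the hashtable could be turned into one and queried. The three-entry $langs table and the $wanted code are hypothetical stand-ins, not values taken from the real page.

```powershell
# Hypothetical stand-in for the $langs hashtable built above (three sample entries only)
$langs = @{ af = 'Afghanistan'; al = 'Albania'; dz = 'Algeria' }

# Hashtable enumeration order is not guaranteed, so sort the keys to get a stable array
$codes = @($langs.Keys | Sort-Object)

# Hypothetical code to look for; a real script would derive this from the user's settings
$wanted = 'al'

if ($langs.ContainsKey($wanted)) {
    $message = "Found: $($langs[$wanted])"
} else {
    $message = 'Not found, falling back to the default'
}

$codes
$message
```

Hashtables created with @{} compare keys case-insensitively, so 'AL' would match as well.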

      

+5




If you are using PS 3.0 or higher, you can use Invoke-WebRequest for pages that exist on the Internet. If you are working with a local file, this can be a little tedious.

Invoke-WebRequest returns an HtmlWebResponseObject with a ParsedHtml property. That object has a method called getElementById, which we can use because we know the id of your select tag is id_country. From there, it's simple to filter for the option tags and return only the properties we want: "Text" and "Value".

The following example produces a custom object containing the country name and country code:


# I'm using your raw pastebin endpoint for this example
$result = Invoke-WebRequest "http://pastebin.com/raw.php?i=b8cShFLA"

# Only return specific properties from the elements you're looking for
$countries = $result.ParsedHtml.getElementById("id_country") | 
    Where tagName -eq "option" | 
    Select -Property Text, Value

# Country name and code are stored to this variable
$countries

      



Output:

text                                                        value
----                                                        -----
Afghanistan                                                 af
Albania                                                     al
Algeria                                                     dz
American Samoa                                              as
Andorra                                                     ad
...                                                         ...

      

Then you can use the country name and code like any other property of PowerShell objects.
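
As a rough illustration of that, the sketch below filters and projects objects shaped like the ones in the output above. The three sample objects are hypothetical stand-ins for $countries, and [pscustomobject] assumes PowerShell 3.0 or higher.

```powershell
# Hypothetical stand-in for $countries, mirroring the text/value columns shown above
$countries = @(
    [pscustomobject]@{ text = 'Afghanistan'; value = 'af' }
    [pscustomobject]@{ text = 'Albania';     value = 'al' }
    [pscustomobject]@{ text = 'Algeria';     value = 'dz' }
)

# Plain array containing only the codes
$codes = @($countries | ForEach-Object { $_.value })

# Look up a single entry by its code and read the name from it
$match = $countries | Where-Object { $_.value -eq 'dz' }
$name = $match.text

$codes
$name
```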

As for the internet endpoint, it looks like you could modify this script to point to the Mozilla source page that you are pulling this HTML from?

+5




For most HTML, another option is to load the file as XML and work with it that way. See an example in my PowerShell download script:

https://github.com/jefflomax/powershell-download-tumbler-images
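
A minimal sketch of the XML approach, not taken from the linked repository: it assumes the markup is well-formed XML, which real-world HTML pages often are not, in which case the [xml] cast throws. The fragment is the sample from the question, with the closing select tag added by hand.

```powershell
# Sample fragment from the question; </select> was added so that it parses as XML
$fragment = @'
<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
   <option value="as">American Samoa</option>
   <option value="ad">Andorra</option>
</select>
'@

# Cast the string to an XmlDocument; this fails if the markup is not well-formed
$xml = [xml]$fragment

# XPath: every option directly under the select with id="id_country"
$options = $xml.SelectNodes("//select[@id='id_country']/option")

# Build an array of the value attributes
$codes = @($options | ForEach-Object { $_.GetAttribute('value') })

$codes
```

For a file on disk, the string could be read with Get-Content -Raw before the cast, under the same well-formedness assumption.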

0








