Invoke-WebRequest: Parsing HTML Webpages with PowerShell

In PowerShell 3.0 you can directly access and parse HTML web pages on the Internet. A special cmdlet, Invoke-WebRequest, was introduced for this purpose. It covers many scenarios: downloading or uploading a file from or to any website via HTTP/HTTPS/FTP, parsing HTML pages, monitoring web services, and filling in and submitting web forms. In general, the cmdlet provides all the methods needed to navigate the DOM tree of an HTML document. In this article, we'll cover basic examples of using the Invoke-WebRequest cmdlet in PowerShell.

Tip. The Invoke-WebRequest cmdlet is available starting with Windows PowerShell 3.0, so before you begin, make sure that you are using this version or newer.

Using the Invoke-WebRequest Cmdlet

The Invoke-WebRequest cmdlet (alias wget) can send HTTP, HTTPS and FTP requests and process the response returned by the web server. The received response is a collection of forms, links, images and other important elements of an HTML document.

Run the following command:

Invoke-WebRequest -Uri "http://woshub.com"

Tip. If you're connected to the Internet via a proxy server, then for PowerShell cmdlets to work correctly, use the tips from the article: “”.

As you can see, the cmdlet returned more than plain HTML code: you can see various properties of a web document. The Invoke-WebRequest cmdlet, like most other PowerShell cmdlets, works with objects. Invoke-WebRequest returns an object of type HtmlWebResponseObject. Let's look at all the properties of this object:

$WebResponseObj = Invoke-WebRequest -Uri "http://woshub.com"
$WebResponseObj | Get-Member

To get the raw HTML code of the web page that is contained in the HtmlWebResponseObject object, run:

$WebResponseObj.Content

You can list the HTML code along with the HTTP headers returned by the web server:

$WebResponseObj.rawcontent

You can check only the web server HTTP status code and the HTTP headers of the HTML page:

$WebResponseObj.StatusCode
$WebResponseObj.Headers

As you can see, the server has returned the response 200, i.e. the request was successful, and the web server is available and works correctly.

How to Extract a List of HTML Links on the Web Page

Let's open the main page of our website and get the list of the HTML links on it:

$SiteAdress = "http://woshub.com"
$HttpContent = Invoke-WebRequest -URI $SiteAdress
$HttpContent.Links | ForEach-Object { $_.href }

To get the link text itself (contained in the InnerText element), you can use the following command:

$HttpContent.Links | fl innerText, href

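You can select only links with a specific CSS class (the class name "page-numbers" below is only an illustrative placeholder):

$HttpContent.Links | Where-Object { $_.class -eq "page-numbers" } | fl innerText, href

Or links with specific text in the URL address (the "*powershell*" pattern is likewise a placeholder):

$HttpContent.Links | Where-Object { $_.href -like "*powershell*" } | fl innerText, href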
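
The filtering by CSS class and by URL text described above can be tried without a network connection. The following is a minimal sketch, not taken from the original article: it builds a small, made-up link collection (the example.com URLs and the "nav" class are assumptions for illustration) shaped like the objects in $HttpContent.Links, then applies Where-Object to it.

```powershell
# Hypothetical sample data that mimics the objects in $HttpContent.Links
$SampleLinks = @(
    [pscustomobject]@{ innerText = 'Home';       href = 'http://example.com/';                class = 'nav' }
    [pscustomobject]@{ innerText = 'PowerShell'; href = 'http://example.com/powershell/';     class = 'nav' }
    [pscustomobject]@{ innerText = 'Manual';     href = 'http://example.com/docs/manual.pdf'; class = '' }
)

# Links with a specific CSS class
$ByClass = @($SampleLinks | Where-Object { $_.class -eq 'nav' })

# Links whose URL contains specific text (-like is case-insensitive)
$ByUrl = @($SampleLinks | Where-Object { $_.href -like '*powershell*' })

$ByClass | Format-List innerText, href
$ByUrl | Format-List innerText, href
```

Wrapping each pipeline in @( ) keeps the result an array even when only one link matches, which makes .Count and indexing behave predictably.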

Parsing and Scraping HTML Web Content Using PowerShell

The Invoke-WebRequest cmdlet allows you to quickly and conveniently parse the content of any web page. When an HTML page is processed, collections of links, web forms, images, scripts, etc., are created.

Let's get the content of the home page of our website using PowerShell:

$Img = Invoke-WebRequest "http://woshub.com/"

Then display a list of all images on this page:

$Img.Images

Create a collection of full URL paths to these images:

$images = $Img.Images | select src

Initialize a new instance of the WebClient class:

$wc = New-Object System.Net.WebClient

And download all the image files from the page (with their original filenames) to the c:\too1s\ folder:

$images | foreach { $wc.DownloadFile($_.src, ("c:\too1s\" + [io.path]::GetFileName($_.src))) }
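
Image src attributes are often relative (for example, /img/logo.png), and WebClient.DownloadFile needs an absolute address. The following is a minimal sketch, assuming a hypothetical helper named Resolve-ImageUrl (it is not part of PowerShell or of the original article), that uses the .NET System.Uri class to combine the page address with a src value and to derive the file name. The example.com addresses are placeholders.

```powershell
# Hypothetical helper: turns a possibly relative src into an absolute URL
# and extracts the file name without any query string.
function Resolve-ImageUrl {
    param(
        [Parameter(Mandatory)] [string] $BaseUrl,
        [Parameter(Mandatory)] [string] $Src
    )
    # Uri(Uri, string) resolves $Src against $BaseUrl; an absolute $Src is kept as is
    $absolute = New-Object System.Uri -ArgumentList ([System.Uri]$BaseUrl), $Src
    [pscustomobject]@{
        Url      = $absolute.AbsoluteUri
        FileName = [System.IO.Path]::GetFileName($absolute.LocalPath)
    }
}

# Example with placeholder values
$resolved = Resolve-ImageUrl -BaseUrl 'http://example.com/blog/' -Src '/img/logo.png'
$resolved.Url
$resolved.FileName
```

The Url and FileName properties of the result can then be passed to $wc.DownloadFile in place of the raw src value.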

Downloading Files Using HTTP with PowerShell

Invoke-WebRequest can work like Wget or cURL for Windows, allowing you to download files from a web page or an FTP site. Suppose you need to download a file via HTTP using PowerShell (in this case, the installation file of Mozilla Firefox). Run this command:

Invoke-WebRequest "https://download.mozilla.org/?product=firefox-34.0.5-SSL&os=win&lang=en-US" -OutFile "c:\too1s\firefox setup 34.0.5.exe"

This cmdlet downloads a file from the specified URL and saves it to the c:\too1s folder under the name "firefox setup 34.0.5.exe". If you need to download a file from FTP, just replace http:// with ftp://.

Thus, you can easily find all the links on a specific web page that meet certain criteria (link class, file name extension, URL address, etc.) and download the files behind those links. For example, there is a certain website with a bunch of links to PDF documents. Your task is to download all these files to your computer. The backbone of the PowerShell script for bulk file downloads over HTTP may look like this:

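$OutDir = "C:\docs\download\PDF"
$SiteAdress = "https://sometechdocs.com/pdf"
$HttpContent = Invoke-WebRequest -URI $SiteAdress
# Download every link that ends in .pdf and save it under a random file name
$HttpContent.Links | Where-Object { $_.href -like "*.pdf" } | % {
    Invoke-WebRequest -Uri $_.href -OutFile (Join-Path $OutDir "$(Get-Random 200000).pdf")
}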

As a result of the script, all PDF files from the page will be downloaded to the target directory. Each file is saved under a random name.

In PowerShell 6.1, the Invoke-WebRequest cmdlet supports resume mode. Thus, using the command Invoke-WebRequest -Uri $Uri -OutFile $OutFile -Resume, you can resume downloading the file after a connection or server failure.

Filling and Submitting HTML Forms via PowerShell

Many web services require various data to be entered into HTML forms. With Invoke-WebRequest, you can access any HTML form, fill in the necessary fields and submit the filled form back to the server. In this example we'll show how to log on to Facebook via its standard web form using PowerShell.

With the following command, save the information about connection cookies in a separate session variable:

$fbauth = Invoke-WebRequest https://www.fb.com/login.php -SessionVariable session

Using the next command, display the list of the fields to be filled in the login HTML form (login_form):

$fbauth.Forms["login_form"].Fields

Assign the required values to all fields:

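# "user@example.com" is a placeholder address; substitute your own login
$fbauth.Forms["login_form"].Fields["email"] = "user@example.com"

# Single quotes keep PowerShell from expanding $wd as a variable
$fbauth.Forms["login_form"].Fields["pass"] = 'Coo1P$wd'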

Etc.

To submit (send) the filled form to the server, call the action attribute of the HTML form.

$Log = Invoke-WebRequest -method POST -URI ("https://www.fb.com/login.php" + $fbauth.Forms["login_form"].Action) -Body $fbauth.Forms["login_form"].Fields -WebSession $session

Disadvantages of the Invoke-WebRequest cmdlet

One significant drawback of the Invoke-WebRequest cmdlet is its rather low performance. When a file is downloaded over HTTP, the whole stream is buffered in memory, and it is saved to a local drive only after the full download is completed. Thus, when downloading large files using Invoke-WebRequest, you may run out of RAM.

Another problem is that the Invoke-WebRequest cmdlet is closely tied to Internet Explorer. For example, in Windows Server Core editions in which IE is not installed, the Invoke-WebRequest cmdlet cannot parse pages through the IE engine and fails unless you add the -UseBasicParsing parameter.

If an invalid (for example, self-signed) SSL certificate is used on an HTTPS website, the Invoke-WebRequest cmdlet refuses to receive data from it. To ignore an invalid SSL certificate, use the following code:

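add-type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy {
    // Accept every certificate, whatever the validation result
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem) {
        return true;
    }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy
$result = Invoke-WebRequest -Uri "https://somesite.web"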
