HTML-Scraping with RegEx

October 20, 2011

Scraping Website Data with PowerShell

To scrape information from a website with PowerShell, you can download the page's HTML code and then use a regular expression to extract what you are after. That’s not hard. Here is a sample:

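# Download the page; DownloadString already returns a single string
$webclient = New-Object System.Net.WebClient
$html = $webclient.DownloadString('http://www.cnn.com')

# (?i) ignores case, (?s) lets . span line breaks,
# [^>]* also accepts attributes inside the opening tag
$headerpattern = '(?is)<h1[^>]*>(.*?)</h1>'

# Collect the text captured between each pair of <h1> tags
$header = ([regex]$headerpattern).Matches($html) |
  ForEach-Object { $_.Groups[1].Value }
$header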


Downloads the HTML Content

The sample downloads the HTML content from www.cnn.com and then extracts the text of all <h1>…</h1> headers. That way, you get a quick overview of the headlines on the page.
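To try the extraction step without a network connection, you can run a pattern against a small HTML string. The following is only a sketch: the sample markup is made up for illustration, and the pattern is a slightly more tolerant variant that assumes headers may carry attributes or wrap across lines.

```powershell
# Hypothetical sample markup standing in for a downloaded page
$html = @'
<html><body>
<h1>First headline</h1>
<p>Some text</p>
<H1 class="top">Second
headline</H1>
</body></html>
'@

# (?i) ignores case, (?s) lets . match line breaks,
# [^>]* tolerates attributes inside the opening tag
$pattern = '(?is)<h1[^>]*>(.*?)</h1>'

# Pull out group 1 of every match and collapse whitespace runs to one space
$headers = [regex]::Matches($html, $pattern) |
  ForEach-Object { ($_.Groups[1].Value -replace '\s+', ' ').Trim() }

$headers
```

Keep in mind that regular expressions only cope with fairly simple markup. Headers that contain nested tags will come back with those tags still inside the captured text.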
