Iris Classon
Iris Classon - In Love with Code

Background jobs in PowerShell: a screen scraping example

Scroll down to skip the life-of-me stuff ;)

Getting back to blogging turns out to be harder than I thought. It’s not that I don’t want to, or that I lack the time. It’s rather that things have changed so much, people have changed me, and I’m still trying to figure out how to find my way back to me and who that might be. And more importantly, can I trust again? Deep shit. Anyway, I’ll stop with that and share some fun things instead. And please give me some more time to reply to you all. I’m taking my time to make sure each reply is personal and not rushed.

The stones here have it easy. Hanging out at the beach each day doing nothing, yet in perfect shape. *Badush*

I’m in Corfu, giddy about Saturday’s dives, which will bring me over the 60 dives I need for my divemaster, and I’m coding every evening with umbrella drinks and studying math in the mornings in between swims, long walks and yoga.

Developers love cats, right? This one is keeping me company this evening as I’m writing this. The ears indicate he doesn’t appreciate the camera :P

Obviously this is a vacation :D One that I’m fortunate to enjoy with my mother who is busy coding next to me. I’m recovering, rethinking and living. It’s perfect. And you know what else is perfect?

PowerShell.

I love it. Just love it. Okay, maybe not truly perfect but the amount of time and trouble it has saved me makes me want to publicly declare my love for it.

Stop scrolling here.

Here is an example. A terrible, terrible bad thing to do, but I’m feeling naughty. For unknown reasons, say that a friend (it’s always a friend) wants to get some information from a website that spreads the information across several pages. There is no API, and our friend gets the brilliant idea of scraping the pages, desperate for that glorious information.

PoSh to the rescue.

The scraping itself is rather straightforward: he/she knows the class names and the graph structure of the objects. Using PowerShell, our friend quickly does a request in a loop (knowing in advance how many pages there are). With the markup at hand it’s easy to grab whatever he/she wants, hoping not to get blocked at the Xth call and end up on a bot blacklist. But. It’s slow. One page takes shit loads of time. And this friend is an impatient one. So why not run the calls async? Each job representing one page? Piping the output to separate files, and once the work is all done, just merge them all together.

What a terrible idea. But PoSh is great. Hello beloved background jobs <3

When a cmdlet runs as a background job, the work is done asynchronously in its own thread separate from the pipeline thread that the cmdlet is using. From the user perspective, when a cmdlet runs as a background job, the command prompt returns immediately even if the job takes an extended amount of time to complete, and the user can continue without interruption while the job runs. - MSDN
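To make the quote a bit more concrete, here is a minimal sketch of the life cycle of a single job: start it, get the prompt back, wait, and collect the output. The sleep and the returned string are stand-ins I made up for a slow page request; nothing here is taken from the friend's script.

```powershell
# Start a job; control returns right away while the work runs in the background.
$job = Start-Job -ScriptBlock {
    Start-Sleep -Seconds 2      # hypothetical stand-in for a slow page request
    'page scraped'              # whatever the script block emits becomes the job output
}

$job.State                      # usually 'Running' at this point

# Block until the job has finished, then collect its output.
$result = $job | Wait-Job | Receive-Job
$result                         # page scraped

# Jobs stay in the session until they are removed.
Remove-Job -Job $job
```

Between `Start-Job` and `Wait-Job` the console is free, which is the whole point when every page takes ages.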

Here is the code. Our friend was too lazy to refactor, so feel free to comment and improve. I’ll make sure to pass on the info ;)

Do notice the $using:myVariable inside the job; this is how we access the values declared outside of the job scope. Start-Job has an alias: sajb
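Before the full script, a tiny sketch of what `$using:` buys you. `Start-Job` runs the script block in a separate PowerShell process, so a plain variable from the console simply does not exist in there. The variable name `$greeting` is made up for the illustration.

```powershell
# Hypothetical variable that lives only in the calling session.
$greeting = 'hello from the caller'

# Without $using: the job has no $greeting of its own, so it emits nothing.
$jobWithout = Start-Job -ScriptBlock { $greeting }

# With $using: the value is copied into the job when it starts.
$jobWith = Start-Job -ScriptBlock { $using:greeting }

$without = $jobWithout | Wait-Job | Receive-Job
$with    = $jobWith    | Wait-Job | Receive-Job

$null -eq $without   # True
$with                # hello from the caller

Remove-Job -Job $jobWithout, $jobWith
```

The value is a copy, so changing it inside the job does nothing to the variable in the console.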

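# Placeholders: fill these in before running.
$p = "the site"
$f = "where to save file\" + "file name prefix"
$c = "class name"

# One background job per page; $i is the page number.
for ($i = 1; $i -le 101; $i++) {
    Start-Job {
        # $using: copies the caller's variables into the job
        $res = Invoke-WebRequest "$using:p$using:i.html"
        # ParsedHtml needs Windows PowerShell; it is not available in PowerShell 6+
        $html = $res.ParsedHtml
        $d = $html.getElementsByTagName('div') | ? { $_.className -eq $using:c }
        $d.innerText | % {
            # In Windows PowerShell this splits on each CR and LF character;
            # fields 8 and 10 are specific to the layout of the scraped page
            $a = $_.Split("`r`n")
            $l = '{0},{1}' -f $a[8], $a[10]
            $l >> "$using:f$using:i.csv"
        }
    }
}

# Merging all files (run once the jobs have finished, e.g. after Get-Job | Wait-Job)
# ls *prefix*.csv | % {gc $_ | % { $_ >> allNew.csv}}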
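The merge only makes sense once every job is done, so here is a hedged sketch of the "wait, then merge" step on its own. The temp folder, the `part` prefix and the fake rows are all assumptions for the demo; swap in the real prefix from the script above.

```powershell
# Hypothetical scratch folder and file prefix, just for this sketch.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'job-merge-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$prefix = Join-Path $dir 'part'

# One job per "page"; each writes its own file, like the script above.
$jobs = foreach ($i in 1..3) {
    Start-Job -ScriptBlock {
        $page = $using:i
        $path = '{0}{1}.csv' -f $using:prefix, $page
        "row from page $page" | Set-Content -Path $path
    }
}

# Wait for every job before touching the files, otherwise pages may be missing.
$jobs | Wait-Job | Out-Null
$jobs | Receive-Job      # surfaces any errors the jobs ran into
$jobs | Remove-Job

# Merge the parts into one file. Sorting by name is lexical, which is fine for
# three files; with 10+ pages you would want zero-padded numbers in the names.
$merged = Join-Path $dir 'allNew.csv'
Get-ChildItem -Path $dir -Filter 'part*.csv' |
    Sort-Object Name |
    Get-Content |
    Set-Content -Path $merged

Get-Content $merged      # three lines, one per page
```

Two small design choices: the sketch uses a full path because a job may not start in the same working directory as your console, and it writes the merged file with `Set-Content` rather than `>>` so a second run overwrites the result instead of appending duplicates.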

Comments

James Billings
8/13/2015 1:52:25 PM
I remember Greek islands had a *lot* of random cats walking about... good to hear you're doing better now :) 
Jernej Jerin
8/17/2015 1:45:27 PM
Nice to see PowerShell being used for scraping. But may I recommend the Scrapy framework? It is written in Python and has strong DOM query features such as XPath and CSS selectors. 
Jessee
8/17/2015 4:40:53 PM
Wow Iris, I didn't know you were into diving also. You sure are an interesting person :) I couldn't imagine my mom trying to code...sounds like you are having a good time. Enjoy 


Last modified on 2015-08-13
