2009-12-23

scraping web pages with embedded javascript in perl

recently i had to download various pdf files from a vendor website, which lists them according to different criteria. since i'm lazy, i wanted to write a script that would download all those files for me.

that would have been quite easy, but the list was generated by some ajax whenever i changed the criteria... so, what to do to work around this problem?

there are some solutions out there for the perl programmer.

although www::mechanize's author has stated that he does not want to bother with integrating javascript support, someone wrote a javascript plugin for mechanize. it's still experimental at the time of writing, though.

the same author, leveraging that experience, also wrote www::scripter, which is meant to achieve the same result.
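
for the curious, here is a minimal sketch of what that could look like. i haven't tested this; it assumes www::scripter and its javascript plugin are installed, and the url is made up:

    use WWW::Scripter;

    my $w = WWW::Scripter->new;
    $w->use_plugin('JavaScript');    # needs WWW::Scripter::Plugin::JavaScript
    $w->get('http://www.example.com/downloads/');

    # www::scripter inherits from www::mechanize, so links that scripts
    # added to the page show up like any other link
    print $_->url, "\n" for $w->links;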

but before diving into those complex beasts, it may be worth trying to understand what the javascript is doing, and whether the hidden url that returns the wanted list of files is easy to guess. to do that, you can go the hard way and read the javascript... or the easy way and wiretap the network!

enter http::proxy, a configurable proxy written in perl by book. among its various examples, it comes with a logger.pl script that displays all the urls accessed by your browser (along with cookies). run it as is, configure your browser to use a proxy on localhost port 3128, and check the output of logger.pl while tinkering with your ajaxy application.
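
the core of such a logger is short. here is a trimmed-down version written from memory, not the actual logger.pl shipped with the distribution (which also dumps the cookie headers):

    use HTTP::Proxy;
    use HTTP::Proxy::HeaderFilter::simple;

    my $proxy = HTTP::Proxy->new( port => 3128 );

    # print the method and url of every request going through the proxy
    $proxy->push_filter(
        request => HTTP::Proxy::HeaderFilter::simple->new(
            sub {
                my ( $self, $headers, $message ) = @_;
                print STDERR $message->method, ' ', $message->uri, "\n";
            }
        )
    );

    $proxy->start;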

chances are that you'll see a url such as http://www.example.com/path/to/script/list/?crit1=foo&crit2=bar&limit=10

well, that was the case for me, and i could just download the list according to my criteria, parse it with the excellent html::tree, and download each of the files separately... \o/
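
to give an idea, here is roughly what the final script boiled down to. the url, criteria and pdf-link pattern are made up for illustration; the parsing is done by html::treebuilder, from the html-tree distribution:

    use LWP::UserAgent;
    use HTML::TreeBuilder;
    use URI;

    my $list = 'http://www.example.com/path/to/script/list/?crit1=foo&crit2=bar&limit=10';
    my $ua   = LWP::UserAgent->new;

    my $res = $ua->get($list);
    die $res->status_line unless $res->is_success;

    # parse the returned html and grab every link pointing to a pdf
    my $tree = HTML::TreeBuilder->new_from_content( $res->decoded_content );
    for my $link ( $tree->look_down( _tag => 'a', href => qr/\.pdf$/i ) ) {
        my $url = URI->new_abs( $link->attr('href'), $list );
        ( my $file = $url->path ) =~ s{.*/}{};    # keep only the file name
        $ua->mirror( $url, $file );
    }
    $tree->delete;

the nice bit is $ua->mirror, which only downloads a file if the local copy is missing or out of date, so the script can be re-run safely.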

so the title of this blog post is wrong: no javascript web scraping (although i gave some hints on how to do it), just a cheap trick to achieve the same result! :-)
