"The Grid. A digital frontier. I tried to picture clusters of information as they move through the computer. What did they look like? Ships? Motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I’d never see. And then, one day, I got in." — Tron: Legacy

2013-05-06

Fetching images from any Web page using Clojure

As the title says, this post is about downloading images from a given page link. To make this task a little easier I'm using the Enlive library, available here: https://github.com/cgrand/enlive

There are four main steps:
- read the page source from a given address
- parse it
- find all <img> tags and store their src (image link) values (see the short Enlive sketch right after this list)
- use these links to fetch the images directly and save them to disk
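
Steps one to three map almost directly onto Enlive's html-resource and select. A minimal sketch, assuming a hypothetical img-src-list helper that is not part of the final code:

(require '[net.cgrand.enlive-html :as html])

; return the raw src attribute of every <img> on the page
(defn img-src-list [page-url]
  (map #(get-in % [:attrs :src])
       (html/select (html/html-resource (java.net.URL. page-url))
                    [:img])))

The final version below does the same thing, but also completes relative links and derives a file name for each image.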

Because the function that reads an image from a URL needs a full address like "http://...", I'm adding one additional step:
- check whether the image URL has a valid root

This means that if an image link starts with "/", I have to prepend the Web page's root URL.
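
The root itself is easy to recover: split the page URL on "/" and keep the first three pieces. A quick REPL check of the idea (the example URL is just for illustration):

(clojure.string/split "http://google.com/alias1/2/3" #"/")
; => ["http:" "" "google.com" "alias1" "2" "3"]

(clojure.string/join "/" (take 3 (clojure.string/split "http://google.com/alias1/2/3" #"/")))
; => "http://google.com"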

This is how it looks in Clojure. Remember to add the [enlive "1.1.1"] dependency to project.clj if you're using Leiningen. Also, create the target directory if it doesn't exist.
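
For reference, a minimal project.clj could look like this (the project name and Clojure version are just examples):

(defproject fetcher "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [enlive "1.1.1"]])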


I hope the code is self-explanatory.


(ns fetcher.core
  (:require [clojure.java.io :as io]
            [clojure.string])
  (:use [net.cgrand.enlive-html]))

(defn fetch-all-images-from-url [src-url dest-folder]
  (let [; get the root from src-url: http://google.com/alias1/2/3 ---> http://google.com
        root-url (clojure.string/join "/" (take 3 (clojure.string/split src-url #"/")))

        ; if an image url starts with "/" (the / character is \/ in Clojure, like \a, \b, ...),
        ; prepend the root url
        complete-url (fn [url]
                       (let [t (first url)]
                         (if (not= t \/)
                           url
                           (str root-url url))))

        ; get the page source
        html-src (html-resource (java.net.URL. src-url))

        ; parse the html into maps of image url and image name; a set avoids duplicates
        image-list (set (map #(let [url (complete-url (:src (:attrs %)))
                                    img-name (last (clojure.string/split url #"/"))]
                                {:url url :img-name img-name})
                             (select html-src #{[:img]})))

        ; copy a single url to a file
        fetch-to-file (fn [url file]
                        (with-open [in (io/input-stream url)
                                    out (io/output-stream file)]
                          (io/copy in out)))]
    ; actual work here
    (dorun (map #(do (println "Fetching" (:url %) "...")
                     (fetch-to-file (:url %) (str dest-folder "/" (:img-name %))))
                image-list))))


; running:
(time (fetch-all-images-from-url "http://www.reddit.com" "/tmp/imgs"))

Output:
... cut ...
Fetching 1flDE6_4AZvmq7SE.png ...
Fetching 2fK5Sh_g6f2--4qm.jpg ...
Fetching zut90T1zjCO_R1D8.jpg ...
Fetching ynsO-YoyYCeK4_e6.png ...
"Elapsed time: 3124.432108 msecs"

You can improve this code a little by fetching in parallel with pmap. Just edit this line:

(dorun (pmap #(do (println "Fetching"...
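
For clarity, the complete expression then becomes:

(dorun (pmap #(do (println "Fetching" (:url %) "...")
                  (fetch-to-file (:url %) (str dest-folder "/" (:img-name %))))
             image-list))

pmap is lazy, so the dorun is still needed to force the downloads.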


Output:
...
Fetching WvL8v5ZLqPpNK3Ww.jpg ...
Fetching zut90T1zjCO_R1D8.jpg ...
Fetching xxJOeJRdwhrF63PF.jpg ...
Fetching B43z7slN_Tpo9nf-.jpg ...
"Elapsed time: 1658.57191 msecs"

Almost twice as fast. :)
