"The Grid. A digital frontier. I tried to picture clusters of information as they move through the computer. What did they look like? Ships? Motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I’d never see. And then, one day, I got in." — Tron: Legacy

2013-05-06

Fetching images from any Web page using Clojure

As the title says, this post is about downloading images from a given page link. To make this task a little easier I'm using the Enlive library, available here: https://github.com/cgrand/enlive

There are four main steps:
- read the page source from a given address
- parse it
- find all <img> tags and store their src (image link) values (see the short Enlive sketch right after this list)
- use these links to fetch the images directly and save them to disk
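
Steps one to three map almost directly onto Enlive's html-resource and select. A minimal sketch, assuming a hypothetical img-src-list helper that is not part of the final code:

(require '[net.cgrand.enlive-html :as html])

; return the raw src attribute of every <img> on the page
(defn img-src-list [page-url]
  (map #(get-in % [:attrs :src])
       (html/select (html/html-resource (java.net.URL. page-url))
                    [:img])))

The final version below does the same thing, but also completes relative links and derives a file name for each image.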

Because the function that reads an image from a URL needs a full address like "http://...", I'm adding one additional step:
- check whether the image URL has a valid root

This means that if an image link starts with "/", I have to prepend the Web page's root URL.
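
The root itself is easy to recover: split the page URL on "/" and keep the first three pieces. A quick REPL check of the idea (the example URL is just for illustration):

(clojure.string/split "http://google.com/alias1/2/3" #"/")
; => ["http:" "" "google.com" "alias1" "2" "3"]

(clojure.string/join "/" (take 3 (clojure.string/split "http://google.com/alias1/2/3" #"/")))
; => "http://google.com"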

This is how it looks in Clojure. Remember to add the [enlive "1.1.1"] dependency to project.clj if you're using Leiningen. Also, create the target directory if it doesn't exist.
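
For reference, a minimal project.clj could look like this (the project name and Clojure version are just examples):

(defproject fetcher "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [enlive "1.1.1"]])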


I hope the code is self-explanatory.


(ns fetcher.core
  (:require [clojure.java.io :as io]
            [clojure.string])
  (:use [net.cgrand.enlive-html]))

(defn fetch-all-images-from-url [src-url dest-folder]
  (let [; get the root from src-url: http://google.com/alias1/2/3 ---> http://google.com
        root-url (clojure.string/join "/" (take 3 (clojure.string/split src-url #"/")))

        ; if an image url starts with "/" (the / character is \/ in Clojure, like \a, \b, ...),
        ; prepend the root url
        complete-url (fn [url]
                       (let [t (first url)]
                         (if (not= t \/)
                           url
                           (str root-url url))))

        ; get the page source
        html-src (html-resource (java.net.URL. src-url))

        ; parse the html into maps of image url and image name; a set avoids duplicates
        image-list (set (map #(let [url (complete-url (:src (:attrs %)))
                                    img-name (last (clojure.string/split url #"/"))]
                                {:url url :img-name img-name})
                             (select html-src #{[:img]})))

        ; copy a single url to a file
        fetch-to-file (fn [url file]
                        (with-open [in (io/input-stream url)
                                    out (io/output-stream file)]
                          (io/copy in out)))]
    ; actual work here
    (dorun (map #(do (println "Fetching" (:url %) "...")
                     (fetch-to-file (:url %) (str dest-folder "/" (:img-name %))))
                image-list))))


; running:
(time (fetch-all-images-from-url "http://www.reddit.com" "/tmp/imgs"))

Output:
... cut ...
Fetching 1flDE6_4AZvmq7SE.png ...
Fetching 2fK5Sh_g6f2--4qm.jpg ...
Fetching zut90T1zjCO_R1D8.jpg ...
Fetching ynsO-YoyYCeK4_e6.png ...
"Elapsed time: 3124.432108 msecs"

You can improve this code a little by fetching in parallel with pmap. Just edit this line:

(dorun (pmap #(do (println "Fetching"...
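
For clarity, the complete expression then becomes:

(dorun (pmap #(do (println "Fetching" (:url %) "...")
                  (fetch-to-file (:url %) (str dest-folder "/" (:img-name %))))
             image-list))

pmap is lazy, so the dorun is still needed to force the downloads.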


Output:
...
Fetching WvL8v5ZLqPpNK3Ww.jpg ...
Fetching zut90T1zjCO_R1D8.jpg ...
Fetching xxJOeJRdwhrF63PF.jpg ...
Fetching B43z7slN_Tpo9nf-.jpg ...
"Elapsed time: 1658.57191 msecs"

Almost twice as fast. :)
