There are four main steps:
- read page source from given address
- parse it
- find all <img> tags and store their src (image link) values
- use these links to fetch images directly and store them to disk
Because the function that reads an image from a URL needs a full address such as "http://...", I'm adding one more step:
- check if image URL has a valid root
This means that if an image link starts with "/", I must prepend the root of the web page URL (scheme and host) to it.
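As an aside, the same completion step can also be sketched with java.net.URI, which resolves a link against the page address. This is not what the code in this post does; absolute-url is a hypothetical helper, and it assumes the link is a syntactically valid URI (a link containing spaces, for example, would throw).

```clojure
;; absolute-url is a hypothetical helper, not part of the original code.
;; It resolves `link` against `page-url` using java.net.URI.
(defn absolute-url
  [^String page-url ^String link]
  (str (.resolve (java.net.URI. page-url) link)))

;; root-relative link: the host of the page is prepended
(absolute-url "http://google.com/alias1/2/3" "/img/logo.png")
;; => "http://google.com/img/logo.png"

;; already absolute link: returned unchanged
(absolute-url "http://google.com/alias1/2/3" "http://example.com/pic.jpg")
;; => "http://example.com/pic.jpg"
```

Unlike a plain string check for a leading "/", this approach also copes with links that are relative to the current path.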
This is how it looks in Clojure. Remember to add the
[enlive "1.1.1"] dependency to project.clj if you're using Leiningen. Also, create the target directory if it doesn't exist.
I hope the code is self-explanatory.
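For reference, a minimal project.clj could look roughly like this. Only the enlive entry comes from this post; the project name, its version and the Clojure version are assumptions, so adjust them to your setup.

```clojure
;; project.clj (project name, version and Clojure version are assumptions)
(defproject fetcher "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [enlive "1.1.1"]])
```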
(ns fetcher.core
  (:require [clojure.java.io :as io]
            [clojure.string :as string]
            [net.cgrand.enlive-html :as html]))

(defn fetch-all-images-from-url [src-url dest-folder]
  (let [;; get root from src-url: http://google.com/alias1/2/3 ---> http://google.com
        root-url (string/join "/" (take 3 (string/split src-url #"/")))

        ;; function: if image url starts with "/" (the / character is written \/
        ;; in Clojure, like \a, \b etc.) prepend the root url
        complete-url (fn [url]
                       (if (not= (first url) \/)
                         url
                         (str root-url url)))

        ;; get page source
        html-src (html/html-resource (java.net.URL. src-url))

        ;; parse html, creating a list of maps with url and image name;
        ;; <img> tags without a src are skipped, and a set avoids duplicates
        image-list (set
                    (map #(let [url (complete-url (:src (:attrs %)))
                                img-name (last (string/split url #"/"))]
                            {:url url :img-name img-name})
                         (filter (comp :src :attrs)
                                 (html/select html-src [:img]))))

        ;; save to file function
        fetch-to-file (fn [url file]
                        (with-open [in (io/input-stream url)
                                    out (io/output-stream file)]
                          (io/copy in out)))]
    ;; actual work here
    (dorun (map #(do (println "Fetching" (:url %) "...")
                     (fetch-to-file (:url %)
                                    (str dest-folder "/" (:img-name %))))
                image-list))))
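The function assumes that the target directory already exists. A minimal sketch for creating it from Clojure follows; ensure-dir is a hypothetical helper name, not something the code above defines.

```clojure
(require '[clojure.java.io :as io])

;; ensure-dir is a hypothetical helper: it creates dir together with any
;; missing parent directories and returns true if the directory exists afterwards.
(defn ensure-dir [dir]
  (let [f (io/file dir)]
    (.mkdirs f)
    (.isDirectory f)))

(ensure-dir "/tmp/imgs")
```

Calling it when the directory is already there is harmless, so it could run right before the fetch.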
; running:
(time (fetch-all-images-from-url "http://www.reddit.com" "/tmp/imgs"))
Output:
... cut ...
Fetching 1flDE6_4AZvmq7SE.png ...
Fetching 2fK5Sh_g6f2--4qm.jpg ...
Fetching zut90T1zjCO_R1D8.jpg ...
Fetching ynsO-YoyYCeK4_e6.png ...
"Elapsed time: 3124.432108 msecs"
You can improve this code a little and fetch the images in parallel by replacing map with pmap. Just edit the line:
(dorun (pmap #(do (println "Fetching"...
Output:
...
Fetching WvL8v5ZLqPpNK3Ww.jpg ...
Fetching zut90T1zjCO_R1D8.jpg ...
Fetching xxJOeJRdwhrF63PF.jpg ...
Fetching B43z7slN_Tpo9nf-.jpg ...
"Elapsed time: 1658.57191 msecs"
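To see the effect of pmap without touching the network, here is a small sketch in which Thread/sleep stands in for the download. Both fake-fetch and image-names are hypothetical and exist only for this illustration; the real timings depend on the network and the server.

```clojure
;; fake-fetch is a hypothetical stand-in for fetch-to-file:
;; it only sleeps for 200 ms and returns the name it was given.
(defn fake-fetch [img-name]
  (Thread/sleep 200)
  img-name)

(def image-names ["a.png" "b.jpg" "c.jpg" "d.png"])

;; sequential: the four sleeps run one after another
(time (dorun (map fake-fetch image-names)))

;; parallel: the sleeps overlap, so this should finish noticeably sooner
(time (dorun (pmap fake-fetch image-names)))
```

pmap runs its work on background threads from the same pool that futures use, so when the code runs as a script rather than from the REPL, a final (shutdown-agents) call lets the JVM exit promptly.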