Oneliners #1
posted on 2017-07-04 22:30:11 +01:00

So, let's replicate this blog post in CL. In case it's gone, we want to fetch all pictures from Reddit's /r/pics:

curl -s -H "User-Agent: cli:bash:v0.0.0 (by /u/codesharer)" \
  https://www.reddit.com/r/pics/.json \
  | jq '.data.children[].data.url' \
  | xargs -P 0 -n 1 -I {} bash -c 'curl -s -O {}'

First, we need an HTTP client (drakma being canonical) and a JSON parser (cl-json will suffice here, though you might want to use a different one depending on your parsing/serialisation needs).

(asdf:load-systems '#:drakma '#:cl-json '#:babel)

Next, we define our workflow: fetch the JSON content from a URL, extract the list of URLs, and download each one of them.

We need to download the JSON first, so let's start with that:

(defun scrape-sub (name)
  (drakma:http-request
   (format NIL "https://www.reddit.com/r/~A/.json" name)
   :user-agent "repl:abcl:v0.0.0 (by /u/nyir)"))

If we run it, we'll see that the output is just a byte vector:

#(123 34 107 ...)

That's actually fine for the JSON library, but to get something more readable we could either set a flag for Drakma, or convert it manually:

(babel:octets-to-string *)
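
Alternatively, the Drakma-side "flag": registering the JSON content type as text makes HTTP-REQUEST return a string directly. A small sketch, assuming the responses come back as application/json:

;; Treat application/json response bodies as text, so that
;; DRAKMA:HTTP-REQUEST returns a string instead of an octet vector.
(push '("application" . "json") drakma:*text-content-types*)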

The second step is parsing the JSON, so the extended function looks like this:

(defun scrape-sub (name)
  (cl-json:decode-json-from-source
   (babel:octets-to-string
    (drakma:http-request
     (format NIL "https://www.reddit.com/r/~A/.json" name)
     :user-agent "repl:abcl:v0.0.0 (by /u/nyir)"))))

The output now looks a bit different:

((:KIND . "Listing") (:DATA (:MODHASH . "") ...))

But it's already getting more manageable. Next we want the URL bit of the data. Unfortunately I don't know of a good library that would allow us to specify something like an XPath-style selector. So we'll go ahead and do it manually. The .data.children bit will be something like (cdr (assoc :children (cdr (assoc :data <json>)))), since cl-json returns an association list; children[] means we'll iterate over all children and collect the results; data.url is again the same kind of accessor, (cdr (assoc :url (cdr (assoc :data <json>)))):

(defun scrape-sub (name)
  (let ((json (cl-json:decode-json-from-source
               (babel:octets-to-string
                (drakma:http-request
                 (format NIL "https://www.reddit.com/r/~A/.json" name)
                 :user-agent "repl:abcl:v0.0.0 (by /u/nyir)")))))
    (mapcar (lambda (child)
              (cdr (assoc :url (cdr (assoc :data child)))))
            (cdr (assoc :children (cdr (assoc :data json)))))))

Now the output is just a list of strings:

("https://www.reddit.com/r/pics/comments/6ewxd6/may_2017_transparency_report/" "https://i.redd.it/luxhqoj95q5z.png" ...)

Here's one addition I'll put in: filtering for image file types. That might still be unreliable of course, but it'll already remove a whole bunch of potentially wrong links. For filtering, MAPCAR isn't suitable; we could either do it in multiple stages, use something like MAPCAN, or use an explicit iteration construct like LOOP/ITERATE. I'll go with MAPCAN here, meaning every element to collect needs to be wrapped in a list:

(defun scrape-sub (name)
  (let ((json (cl-json:decode-json-from-source
               (babel:octets-to-string
                (drakma:http-request
                 (format NIL "https://www.reddit.com/r/~A/.json" name)
                 :user-agent "repl:abcl:v0.0.0 (by /u/nyir)")))))
    (mapcan (lambda (child)
              (let ((url (cdr (assoc :url (cdr (assoc :data child))))))
                (and url
                     (member (pathname-type (pathname (puri:uri-path (puri:parse-uri url))))
                             '("jpg" "png")
                             :test #'string-equal)
                     (list url))))
            (cdr (assoc :children (cdr (assoc :data json)))))))

I'm happy with that and it now filters for two image types.

Last point: actually downloading all the scraped results. For this, we just iterate over them and fetch each one with Drakma as before:

(defun scrape-sub (name)
  (let* ((agent "repl:abcl:v0.0.0 (by /u/nyir)")
         (json (cl-json:decode-json-from-source
                (babel:octets-to-string
                 (drakma:http-request
                  (format NIL "https://www.reddit.com/r/~A/.json" name)
                  :user-agent agent))))
         (downloads
           (mapcan (lambda (child)
                     (let ((url (cdr (assoc :url (cdr (assoc :data child))))))
                       (when url
                         (let ((pathname (pathname (puri:uri-path (puri:parse-uri url)))))
                           (when (member (pathname-type pathname)
                                         '("jpg" "png")
                                         :test #'string-equal)
                             `((,url ,pathname)))))))
                   (cdr (assoc :children (cdr (assoc :data json)))))))
    (mapc (lambda (download)
            (destructuring-bind (url pathname) download
              (with-open-file (stream (merge-pathnames *default-pathname-defaults* pathname)
                                      :direction :output
                                      :element-type '(unsigned-byte 8))
                (write-sequence
                 (drakma:http-request url :user-agent agent)
                 stream))))
          downloads)))
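
Invoked for the /r/pics example from the top, that's just:

(scrape-sub "pics")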

And this works.

Now. This is decidedly not a demonstration of how the final result should look. In fact there are a whole lot of things to improve and to consider before you'd put this into a reusable script.

From a maintainability perspective, we'd put each functional part into its own component, be it a function or method, in order to make each bit easier to reason about and to test individually.
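
A rough sketch of that split, reusing the code from above unchanged (the function names are made up):

(defun fetch-listing (name agent)
  "Fetch and parse the listing JSON for the subreddit NAME."
  (cl-json:decode-json-from-source
   (babel:octets-to-string
    (drakma:http-request
     (format NIL "https://www.reddit.com/r/~A/.json" name)
     :user-agent agent))))

(defun image-urls (json)
  "Extract the image URLs from a parsed listing."
  (mapcan (lambda (child)
            (let ((url (cdr (assoc :url (cdr (assoc :data child))))))
              (and url
                   (member (pathname-type (pathname (puri:uri-path (puri:parse-uri url))))
                           '("jpg" "png")
                           :test #'string-equal)
                   (list url))))
          (cdr (assoc :children (cdr (assoc :data json))))))

(defun download-url (url agent)
  "Download URL into the current directory, as in the version above."
  (let ((pathname (pathname (puri:uri-path (puri:parse-uri url)))))
    (with-open-file (stream (merge-pathnames *default-pathname-defaults* pathname)
                            :direction :output
                            :element-type '(unsigned-byte 8))
      (write-sequence (drakma:http-request url :user-agent agent) stream))))

(defun scrape-sub (name &optional (agent "repl:abcl:v0.0.0 (by /u/nyir)"))
  (mapc (lambda (url) (download-url url agent))
        (image-urls (fetch-listing name agent))))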

From a performance perspective ... oh my, there's so much wrong with it, mostly slurping everything into memory multiple times, while Drakma does support streams as results as well as HTTP keep-alive; both would improve things. The JSON parser could in theory also operate on tokens, but that's rarely worth the hassle (the CXML API can be used for that, basically by converting JSON "events" into a stream of SAX events). Lastly, creating the output lists isn't necessary; this could all be done inline or in continuation-passing style, but that's worse for maintaining a nice split between functions.
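
As an example of the streaming part, a sketch of a download function using Drakma's WANT-STREAM option (untested, and the buffer size is arbitrary):

(defun download-url-streaming (url pathname agent)
  "Stream the response body directly into PATHNAME instead of slurping
it into memory first."
  (let ((in (drakma:http-request url
                                 :user-agent agent
                                 :want-stream T
                                 :force-binary T)))
    (unwind-protect
         (with-open-file (out pathname
                              :direction :output
                              :element-type '(unsigned-byte 8)
                              :if-exists :supersede)
           (let ((buffer (make-array 8192 :element-type '(unsigned-byte 8))))
             (loop for end = (read-sequence buffer in)
                   while (plusp end)
                   do (write-sequence buffer out :end end))))
      (close in))))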

From a correctness perspective, all the URLs might have weird characters in them that don't work well with pathnames and/or the local filesystem. In fact PURI might not be the best choice here either. Also, even if the URLs are different, more than one of them might have the same filename, meaning there should either be some error handling in there, or the URLs should be hashed and used as filenames, or some other scheme accomplishing the same thing. Lastly, the downloaded files should be checked for emptiness, wrong content (HTML bits would indicate a failed download too), broken images, etc.
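
For the filename clashes, a possible sketch of the hashing idea, assuming the Ironclad library as an extra dependency:

;; Derive the local filename from a hash of the URL itself, keeping the
;; original file type; assumes (asdf:load-system '#:ironclad).
(defun hashed-filename (url)
  (make-pathname
   :name (ironclad:byte-array-to-hex-string
          (ironclad:digest-sequence :sha1 (babel:string-to-octets url)))
   :type (pathname-type (pathname (puri:uri-path (puri:parse-uri url))))))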

Another nice thing to add would be xattr support for indicating where the file was downloaded from.
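
A minimal sketch for that on Linux, shelling out to setfattr and writing the user.xdg.origin.url attribute (the one wget's and curl's --xattr options use, if I remember correctly); purely hypothetical and not portable:

;; Record the source URL as an extended attribute on the downloaded file;
;; errors are ignored since this is optional metadata.
(defun record-origin-url (pathname url)
  (ignore-errors
    (uiop:run-program
     (list "setfattr" "-n" "user.xdg.origin.url" "-v" url
           (namestring pathname)))))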

Unless otherwise credited all material Creative Commons License by Olof-Joachim Frahm