Who is Reddit? (2014-06)
A recent AskReddit thread posed the question "What does Reddit look like?".
Interesting. I decided to take a shot at visualising the varied people of Reddit. From the 18166 comments I gathered 6887 images, ending up with a 165MB 8832x12288 PNG.
Details of the process undertaken
Building the montage required a few steps:
1. Downloading the comments
The most difficult aspect of this project (by far) was downloading the comment tree from Reddit.
Initially, the post was in contest mode, which completely prevents retrieval of all comments: each load returns a random subset. I messaged the moderators and asked if they could provide me with a dump of the comments in lieu of my retrieving them, but they had by then decided to simply disable contest mode on the post instead. No problem...
Now that contest mode was disabled, I started building some Python code to download all of the comments in the post via the Reddit API. I used PRAW, which worked well, but I was never able to retrieve the full comment tree: I kept hitting timeout exceptions or getting back only a subset of the comments.
To constrain resources, Reddit limits the number of comments shown in each page view to 200 by default, and provides a "load more comments" link that fetches the next batch of 200.
After killing a couple of hours with the Python script, I decided to go old school and manually click "load more comments" until my fingers bled. This sucked, so I whipped up a quick snippet of JS in the Safari console to automate it. The script took about four hours to complete, but in the end I had loaded (presumably) everything.
With all the comments open in Safari, I then saved the page...only to find that Safari doesn't save the in-memory page but instead re-requests the original URL, leaving me with just the initial 200 comments. Fail. Fortunately, saving a webarchive copy of the page does use the in-memory version, and I was off and running.
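As an aside, getting the fully loaded HTML back out of a .webarchive is straightforward: it's a binary property list, readable with Python's plistlib. A minimal sketch (the WebMainResource/WebResourceData keys come from the public webarchive format; the hard-coded UTF-8 decode is an assumption):

```python
import plistlib

def webarchive_html(path):
    """Extract the main page's HTML from a Safari .webarchive.

    A .webarchive is a binary plist; the top-level 'WebMainResource'
    dict holds the page itself, with its raw bytes stored under
    'WebResourceData'. Assumes the page is UTF-8 encoded.
    """
    with open(path, 'rb') as f:
        archive = plistlib.load(f)
    return archive['WebMainResource']['WebResourceData'].decode('utf-8')
```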
2. Extracting the list of image URLs
lxml + Python quickly produced a complete list of the URLs. Basically, I used something like the following:
from io import StringIO
import lxml.html

dom = lxml.html.parse(StringIO(html)).getroot()
for comment in dom.cssselect('.entry'):
    links = comment.cssselect('.usertext-body a')
3. Expanding URLs to direct image links
Quite a few of the links posted were to preview pages, rather than direct links. Where possible, I expanded these URLs to the final destination image using some basic heuristics.
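To illustrate the kind of heuristic involved, here's a sketch for Imgur pages. The regex and the .jpg guess are my assumptions, not the exact rules used in the project (Imgur has historically served the right image regardless of the guessed extension, but that too is an assumption):

```python
import re

# Matches Imgur page URLs like http://imgur.com/abc123, but not albums
# or galleries, which need real scraping to resolve.
IMGUR_PAGE = re.compile(r'https?://(?:www\.)?imgur\.com/(?!a/|gallery/)(\w+)/?$')

def expand_url(url):
    """Heuristically rewrite an Imgur page URL to a direct image link.

    The .jpg extension is a guess; anything unrecognised passes
    through unchanged.
    """
    m = IMGUR_PAGE.match(url)
    if m:
        return 'https://i.imgur.com/{}.jpg'.format(m.group(1))
    return url
```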
95% of uploaded images were on Imgur.
Notable sites where this was not possible without considerably more effort were photobucket.com and flickr.com.
4. Downloading the images
Python's concurrent.futures ThreadPoolExecutor combined with requests made short work of the downloads, resulting in 6887 images totalling 1.6GB.
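The shape of that downloader looks roughly like this. This is a sketch, not the original code: it swaps in the stdlib's urllib for requests so the snippet stands alone, and the paths and worker count are illustrative:

```python
import concurrent.futures
import os
import urllib.request

def local_name(url):
    """Derive a local filename from the last path segment of a URL."""
    return url.rsplit('/', 1)[-1]

def download(url, dest_dir='images'):
    """Fetch one image and write it into dest_dir; returns the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, local_name(url))
    with urllib.request.urlopen(url, timeout=30) as resp, open(path, 'wb') as out:
        out.write(resp.read())
    return path

def download_all(urls, workers=16):
    """Download every URL concurrently; failures are collected, not fatal."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download, u): u for u in urls}
        done, failed = [], []
        for fut in concurrent.futures.as_completed(futures):
            try:
                done.append(fut.result())
            except Exception:
                failed.append(futures[fut])
    return done, failed
```

Collecting failures rather than raising matters here: with ~7000 URLs scraped from comments, some are guaranteed to 404.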
5. Creating the montage
After a bit of Googling, I found metapixel. I basically just followed the instructions:
metapixel-prepare images library
metapixel --metapixel reddit-alien.png montage.jpg \
    -l ./library -s 6 -e global
This resulted in most images being used, though unfortunately some are missing because they don't fit within the tile dimensions of the final image.
The final image is a 165MB 8832x12288 PNG.
6. Configuring the large image viewer
PanoJS is a JS library for viewing extremely large images. Again I basically just followed the instructions, using imgcnv to generate the tilesets:
imgcnv -i montage.png -o ./viewer/whoisreddit/256.jpg \
    -t jpeg -tile 256 -options 'quality 90'