Community Data
by Shital Shah
data
Views: 970
Favorites: 1
Comments: 1
Dump of Hacker News stories and comments up to 2014-05-29. From the HN post "Downloading All of Hacker News Posts and Comments": https://news.ycombinator.com/item?id=7835605 http://shitalshah.com/p/downloading-all-of-hacker-news-posts-and-comments/
(1 review)
Topics: hackernews, archive, stories, comments
WARCZone: Outsider WARCs
data
Views: 197
Favorites: 2
Comments: 0
Dagobah is a large archive of ancient 4chan Flash animations, dating all the way back to 2008, when the site was first founded. Anyone can upload files to this site. Because its collection of 13099+ Flash animations dates from 4chan's earliest history, the Bibliotheca Anonoma is conducting a contingency archival of the site. We used custom-built Python scraping scripts to reduce strain on the server and to avoid the many pitfalls encountered by scraping an automatically generated...
The Dataset Collection
data
Views: 4,160
Favorites: 2
Comments: 0
I took the Reddit comment archive and converted all the JSON into one SQLite database using this program that I wrote: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163G....
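
The row-count check described above can be sketched in a few lines of Python. This is not the uploader's program (that lives in the linked gist); it is a minimal illustration that assumes newline-delimited JSON input with hypothetical `id` and `body` fields, and reuses only the table name `comment` and the two queries quoted in the description.

```python
import json
import sqlite3


def load_comments(conn, lines):
    """Insert one row per JSON record and return how many records were read.

    Assumption: each line is one JSON object with an "id" field and an
    optional "body" field. The real archive's schema may differ.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS comment (id TEXT, body TEXT)")
    count = 0
    with conn:  # commit all inserts as a single transaction
        for line in lines:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            conn.execute(
                "INSERT INTO comment (id, body) VALUES (?, ?)",
                (record["id"], record.get("body")),
            )
            count += 1
    return count


def row_counts(conn):
    """Run the two queries quoted in the description."""
    max_rowid = conn.execute("SELECT MAX(rowid) FROM comment").fetchone()[0]
    count_id = conn.execute("SELECT COUNT(id) FROM comment").fetchone()[0]
    return max_rowid, count_id


# Tiny in-memory sample standing in for the real JSON dump.
sample = [
    '{"id": "c1", "body": "first"}',
    '{"id": "c2", "body": "second"}',
    '{"id": "c3", "body": "third"}',
]
conn = sqlite3.connect(":memory:")
n_records = load_comments(conn, sample)
max_rowid, count_id = row_counts(conn)
print(n_records, max_rowid, count_id)
```

If all three numbers agree, every JSON record produced a row, no rowid gaps exist, and no `id` is NULL (since `COUNT(id)` skips NULL values). As the description says, this raises confidence in the import but does not verify the content of each row.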