LibGen's Bloat Problem
2022-08-21
-
LibGen and Sci-Hub are my two favorite resources on the Internet: a utopia made real through the herculean efforts of a few individuals taking massive risks and thousands of active contributors donating their time, bandwidth, and/or money.
-
The LibGen database dump I have at hand shows that the library holds around 3.16M non-fiction e-books totalling 51.50 TB.
-
The more the merrier, of course, although the collection's sheer size makes it harder to redistribute and decentralize.
-
Unfortunately, LibGen is full of duplicate books, low-quality oversized scans, and, worse, all sorts of binary data, from executables to formats you may have never heard of.
-
Without further ado: by filtering out any "books" (rather, files) larger than 30 MiB, we can reduce the total size of the collection from 51.50 TB to 18.91 TB, shaving off a whopping 32.59 TB. Excluding only 391.54k books, 12% of the entire collection, buys us a 63% saving in disk space.
-
Bear in mind that this is still before deduplication; we are merely filtering by file size.
-
I chose 30 MiB somewhat arbitrarily based on my personal e-book library, thinking "30 MiB ought to be enough for anyone"—you may adjust it as you'd like.
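If you want to experiment with a different cutoff, here is a minimal sketch of the calculation. It assumes you have exported the non-fiction metadata from the database dump to a CSV with a Filesize column in bytes; the file name and column name are placeholders you would adjust to match your own export.

    # Sketch: measure how much of the collection a size cutoff would drop.
    # Assumes a CSV export with a "Filesize" column in bytes (both the
    # path and the column name are assumptions -- adjust to your export).
    import csv

    THRESHOLD = 30 * 1024 * 1024  # 30 MiB; tweak to taste

    kept_bytes = dropped_bytes = kept_count = dropped_count = 0

    with open("libgen_nonfiction.csv", newline="") as f:  # hypothetical export
        for row in csv.DictReader(f):
            size = int(row["Filesize"] or 0)
            if size > THRESHOLD:
                dropped_bytes += size
                dropped_count += 1
            else:
                kept_bytes += size
                kept_count += 1

    TB = 1000 ** 4  # assuming the figures above are decimal terabytes
    print(f"kept:    {kept_count} files, {kept_bytes / TB:.2f} TB")
    print(f"dropped: {dropped_count} files, {dropped_bytes / TB:.2f} TB")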
-
Today we have 20 TB hard drives, so the entire LibGen (without its cruft) can be stored by anyone at home, at college, at university, or at work for around 500 USD.
-
Wouldn't that be marvellous?
-
Edit: I feel compelled to make a small clarification. I'm not arguing for purging all files larger than X MiB (though I still think LibGen has a non-negligible amount of junk and duplicate content, neither of which is easy to remove without significant manpower, and LibGen lacks mechanisms to crowdsource that work). Instead, I'm advocating a leaner and more "practical" version of LibGen that is easier to self-host and distribute. An extreme example of this is libgen-text, LibGen in text-only form.