
https://archive.ph/

We need to keep making more of these.



This article was archived 4 days ago. :-)

https://archive.ph/dSeku


A slightly different take - archive only the text, like "reader mode"

https://github.com/carterworks/yazzy
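The core of the "reader mode" approach is stripping a page down to its visible text. A minimal sketch of that step using only the standard library (real tools like yazzy apply proper readability heuristics on top of this; the sample HTML here is illustrative):

```python
# Keep only the visible text of a page, dropping markup plus the
# contents of <script> and <style> blocks.
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip = 0    # depth inside <script>/<style> tags
        self.chunks = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>Body text.</p>"
        "<script>var x=1;</script></body></html>")
p = TextOnly()
p.feed(page)
print(" ".join(p.chunks))  # -> Title Body text.
```

A text-only archive like this also sidesteps the tracking-pixel and fingerprinting concerns raised below, since no scripts or remote resources survive the conversion.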


https://web.archive.org

https://commoncrawl.org

I would prefer more of these.

Alas, archive.today (archive.ph, archive.is, archive.vn, etc.) is sometimes blocked in some countries, it sometimes serves CAPTCHAs, it tries to create a "fingerprint" using Javascript, and it contains a tracking pixel.

Neither Internet Archive nor Common Crawl do those things. (There are other archives I am not mentioning that do not do these things either.)

When it works, archive.today may seem like a perfect solution to "paywalls". And then it stops working. In truth, most paywalls can be bypassed by controlling HTTP headers like User-Agent and X-Forwarded-For, and by controlling Javascript and cookies. That control requires no third-party intermediary (middleman) like archive.today. Or the Internet Archive, for that matter.
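The header-control idea can be sketched with the standard library alone. Many soft paywalls only inspect the User-Agent, the apparent client IP, and metering cookies server-side; the URL and header values below are purely illustrative placeholders:

```python
# Sketch: build a request with crawler-like headers and no cookie jar.
import urllib.request

url = "https://news.example.com/article"  # placeholder, not a real site

req = urllib.request.Request(url, headers={
    # Some sites serve full text to search-engine crawlers
    "User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)",
    # Some sites meter articles by apparent client IP
    "X-Forwarded-For": "66.249.66.1",
})
# urllib keeps no cookie jar unless you install one, so every request
# looks like a first visit and no metering cookie accumulates.
# html = urllib.request.urlopen(req).read()  # uncomment to actually fetch
```

Whether any given site honors these headers varies, which is the commenter's point: it is a cat-and-mouse game, but one you can play without a third party.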

None of these archives are perfect and it's true the public could use more of them. But there are better ways to avoid "paywalls" which are just a means of collecting data about non-subscribers while deliberately annoying them with Javascript.


The Internet Archive is significantly less useful because it allows people to exclude their public social media accounts or websites. On a couple of occasions I have tried to find a source for old deleted statements using the IA, only to find that the data had been scrubbed. Fortunately archive.today still had a copy in one case, but in the other I was out of luck.


What were you looking for that was prone to scrubbing? Just curious because I have a collection of historical data to go through and don't know what to expect


In one case it was a personal website, the other was a Twitter account. Both got scrubbed from the IA.

Apparently they will comply with GDPR and DMCA requests, I'm not sure what precise mechanism was used in those cases.

https://www.reddit.com/r/privacy/comments/eut3na/can_i_get_p...

https://www.joshualowcock.com/guide/how-to-delete-your-site-...


The Internet Archive operates within the law (mostly), while archive.foo is blatantly illegal, which is why it has so many domain names, among other things. Think Anna's Archive vs Library of Congress.

The future is going to be some kind of bland corporate internet of useless corporate things (only people with a team of lawyers can afford to operate any service on this dark-forest light-network), paired with some kind of dark web full of very useful uncorporate things that corporations constantly try to hunt down but that everyone will use every day.


> The Internet Archive operates within the law

It most certainly does not. The archive is home to petabytes of pirated content, and Jason Scott himself has told people many times on many different platforms/interviews to intentionally upload copyrighted content, because "if we had to police everything, we would have no content... so upload first, then let the rightsholder deal with requesting takedowns".

All you have to do is click the "software" link at the top of the page, and you can find just about any copyrighted app or game ever released, on any platform, available to download instantly for free. Besides Usenet, it's the largest centralized cache of pirated content on the planet.

It's one thing to claim Section 230 because you are a service provider and you don't control what your users upload, but it's entirely another thing to publicly acknowledge that you're aware that people do it, you encourage them to do it, AND you don't care.

And regarding archive.foo, just because they have many domains doesn't make it illegal... it means they have enemies who are guilty of the Streisand Effect. Enemies who are known to attack registrars, DNS providers, upstream ISPs/hosting providers and anyone else who will entertain a false flag attempt at claiming a ToS violation in order to get a site taken offline.


The Internet Archive does turn a blind eye when it comes to pro-actively moderating uploads, but they're not required to do that. They do follow takedown requests as the law prescribes (including taking down lots of stuff that is legal and really shouldn't be taken down, because the takedown laws have no exceptions for it).

The Internet Archive tries to push boundaries sometimes - all corporations do. IA having a "software" link and then not pro-actively moderating that section is like Uber not getting medallions for the taxi drivers it calls contractors. It's not the same as the flagrant disobedience of archive.today. IA did flagrantly disobey one time, and it almost catastrophically deleted them from existence, to the detriment of everyone.


Or stop talking about them. No, but seriously, I always wonder how other sites and workarounds get taken down while nobody goes after archive.today. I just hope it continues to stay under the radar.


The only long term solution is to stop sharing paywalled content.


The dirty secret is that the news media needs archive.today in order to function. Anyone writing an article about subject Y needs to know what every paper wrote about it. Back in the 00's, word spread that you could log into almost any newspaper web site with "media/media", a practice that got clamped down on once it became widely known.

You'd think The New York Times could afford subscriptions to other newspapers for its reporters, but there is no way they could stoop so low as to admit being dependent on, or equal to, those papers in any way. Most smaller papers are such marginal operations that they couldn't afford it even for the writers behind the paywall. It's more ramshackle than you think, since even a lot of New York Times articles are written by freelancers who have no real connection with the organization, and that's even more true for all the papers surviving on a shoestring.

If archive.today didn't exist they'd have to make one.


Most newsrooms have access to LexisNexis or a similar service, so they can read other papers' stories. My old newsroom also directly paid for subscriptions to other newspapers; it's not uncommon.


Bit of a hassle, though? My college had access to a ton of publications through this or that subscription, and it was so, so much simpler to just hit up z-lib or whatever, especially when away from the library.



Or create a deeper underground where the masses do not get involved?


There’s no real way to tell if the content you share today will be paywalled tomorrow.


Screenshot FTW!



