• 4 Posts
  • 41 Comments
Joined 4 years ago
Cake day: November 8th, 2021


  • I don’t think it’s merely “reporting unfortunate news.” It’s about the Flipper Zero, not really about car theft per se, or about the shitty, evil car security systems where the dealer scams you as much as the thief does for a key.

    There’s really no reason we can’t use contactless smartcards for this, and no reason we can’t program them ourselves with open source software.

    The Flipper Zero itself is completely irrelevant to this. It’s just a generic ISM-band transceiver … only of note to the ignorant and technologically incompetent, but the journos have made it the centerpiece of the article.



  • Like Anubis, that’s not going to last. The point isn’t to hammer the web servers off the net, it’s to get the precious data. The more standardized and streamlined that access becomes, and as long as there’s no preferential treatment for certain players (OpenAI / Google / Facebook), the dumb scrapers will burn themselves out.

    One nice thing about Anubis and Nepenthes is that they’re going to burn out those dumb scrapers faster and force them to become more sophisticated and stealthy. That should resolve the DDoS problem on its own.

    For the truly public data sources, I think coordinated database dumps are the way to go. For hostile carriers, like Reddit and Facebook, it’s going to be scraper arms-race warfare like Cory Doctorow predicted.


  • The DDoS is caused by the gatekeeping; there was no such issue before the 2023 API wars. Fork over the goods and nobody gets hurt, it’s not complicated: if you want to publish information to the public, don’t scrunch it up behind diseased trackers and ad-infested pages which burn your CPU cycles. Or just put it in a big tarball torrent. The web is turning into a cesspool. How long until our browsers don’t even query websites at all, but a self-hosted crawler and search like SearXNG? At least then I won’t be catching cooties from your JavaScript cryptomining bots embedded in the pages!


  • Even if your server is a cell phone from 2015, if it’s operating correctly and the CPU is maxed out, that means it’s fully utilized and serving hundreds of megabits of information.

    You’ve decided to let the entire world read from your server; that indiscriminate policy lets people you don’t want getting your data get your data and use your resources.

    You want to correct that by making everyone who comes in solve a puzzle, thereby degrading their access in some way, so it’s not surprising that they’re going to complain. The other day I had to wait over 30 seconds at an Anubis puzzle page, when I know the AI scrapers have no problem getting through. Something on my computer, probably some anti-cryptomining protection, is getting triggered by it, and now I can’t NoScript the web either because of that thing, and it can’t even stop scrapers anyway!
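
    For context, the puzzle is just proof-of-work: the browser has to grind out a nonce whose hash clears a difficulty target before it gets let through. A rough sketch of the idea in Python (the hash scheme, difficulty, and cookie exchange here are assumptions for illustration, not Anubis’s actual parameters):

        import hashlib
        import itertools

        def solve_challenge(seed: str, difficulty_bits: int = 20) -> int:
            """Find a nonce so sha256(seed + nonce) has `difficulty_bits` leading zero bits."""
            target = 1 << (256 - difficulty_bits)
            for nonce in itertools.count():
                digest = hashlib.sha256(f"{seed}{nonce}".encode()).digest()
                if int.from_bytes(digest, "big") < target:
                    return nonce  # posted back to the server in exchange for an access cookie

        # A scraper farm amortizes this across fast cores and reused cookies;
        # an old phone or a browser with script protections eats the full cost every time.
        print(solve_challenge("example-challenge-seed"))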

    So Anubis is going to be left behind; all the real users are going to be annoyed for years and have their entire internet degraded by it, while the scrapers got it institutionally figured out in days.

    If it’s freely available public data, then the solution isn’t restricting access, playing a futile arms race with the scrapers, and throwing the real users to the dogs; it’s to have standardized, incremental, efficient database dumps so the scrapers stop assuming every website is interoperability-hostile and scraping them. Let Facebook and Xitter fight the scrapers; let anyone trying to leverage public (and especially user-contributed) data fight the scrapers.


  • You need to set your HTTP serving process to a priority below the administrative processes (in the place where you’re starting it, so assuming a Linux server that would be your init script or systemd service unit).
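
    For example, here’s a minimal sketch of how that could look in a systemd service unit (the unit name, binary path, and exact values are placeholders, not a drop-in config):

        # /etc/systemd/system/my-httpd.service  (hypothetical unit name)
        [Unit]
        Description=HTTP server, deprioritized below admin sessions

        [Service]
        ExecStart=/usr/local/bin/my-httpd
        # Lower CPU priority than the default nice 0 that sshd and your shell run at.
        Nice=10
        # Cgroup CPU and IO weights below the default of 100, so admin work wins under contention.
        CPUWeight=50
        IOWeight=50

        [Install]
        WantedBy=multi-user.target

    After a daemon-reload and restart, the scheduler should starve the web server before it starves your SSH session, so the box stays administrable even at 100% load.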

    An actual crash causing a reboot? Do you have faulty RAM maybe? That’s really not ever supposed to happen from anything happening in userland. That’s not AI; your stuff might be straight-up broken.

    The only thing that isn’t broken that could reboot a server is a watchdog timer.

    Your server shouldn’t crash, reboot, or become unreachable from the admin interface even at 100% load, and it shouldn’t overheat either. Temperatures should never exceed 80°C no matter what you do; that’s supposed to be impossible with thermal management, which all processors have had for decades.




  • That’s crazy; it makes no sense. It takes as much bandwidth and processing power on the scraper side to process and use the data as it takes to serve it.

    They also have an open API that makes scraping entirely unnecessary (see the sketch at the end of this comment).

    Here are the relevant quotes from the article you posted:

    “Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”

    “At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”

    “Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”

    And it’s Wikipedia! The entire data set is trained INTO the models already; it’s not like encyclopedic facts change that often to begin with!

    The only thing I can imagine is that it’s part of a larger ecosystem issue: Wikipedia is the rare case where a dump and API access exist, but dumps are so rare elsewhere, and so untrustworthy, that the scrapers just scrape everything rather than taking the time to save bandwidth by relying on them.

    Maybe it’s a consequence of the 2023 API wars, where it was made clear that data repositories would be leveraging their place as pools of knowledge to extract rent from search and AI, and places like Wikipedia and other wikis and forums are getting hammered as a result of this war.

    If the internet weren’t becoming a warzone, there really wouldn’t be a need for more than one scraper to scrape a site. Even if the site was hostile, like Facebook, it would only need to be scraped once, and then the data could be shared over a torrent swarm efficiently.
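
    To make the “open API” point above concrete, here’s a minimal sketch of pulling an article’s plain text through the public MediaWiki Action API instead of scraping rendered pages (the title and parameters are just an example; full database dumps are also published at dumps.wikimedia.org if you want everything at once):

        import json
        import urllib.request

        # MediaWiki Action API: one request returns the plain-text extract of an article,
        # no HTML scraping, no JavaScript, served from Wikimedia's own API endpoint.
        params = (
            "action=query&format=json&prop=extracts&explaintext=1"
            "&titles=Web%20scraping"  # example title, pick anything
        )
        url = f"https://en.wikipedia.org/w/api.php?{params}"

        # Wikimedia asks clients to identify themselves with a User-Agent.
        req = urllib.request.Request(url, headers={"User-Agent": "dump-example/0.1"})
        with urllib.request.urlopen(req) as resp:
            pages = json.load(resp)["query"]["pages"]

        for page in pages.values():
            print(page["title"], "-", len(page.get("extract", "")), "characters")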