Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

Pro@programming.dev · 11 hours ago

Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

zbyte64@awful.systems · 55 minutes ago

Is there nightshade but for text and code? Maybe my source headers should include a bunch of special characters that then give a prompt injection. And sprinkle some nonsensical code comments before the real code comment.

Spaz@lemmy.world · edit-2 1 hour ago

Is there a migration tool? If not would be awesome to migrate everything including issues and stuff. Bet even more people woild move.

londos@lemmy.world · 6 hours ago

Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.

raspberriesareyummy@lemmy.world · 5 hours ago

Did you just say use the words “useful” and “bitcoin” in the same sentence? o_O

polle@feddit.org · 4 hours ago

The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.

1rre@discuss.tchncs.de · 1 hour ago

At least LLMs produce something, even if it’s slop, all crypto does is… What does crypto even do again?

SeptugenarianSenate@leminal.space · 1 hour ago

Blockchain m8 gg

raspberriesareyummy@lemmy.world · 4 hours ago

ouch. I never made that comparison, but that is on point.

kameecoding@lemmy.world · 5 hours ago

Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.

londos@lemmy.world · edit-2 4 hours ago

You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.

londos@lemmy.world · 5 hours ago

I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.

raspberriesareyummy@lemmy.world · 4 hours ago

To be fair: it’s a great tool for scamming people (think ransomware) :/

DeathByBigSad@sh.itjust.works · 2 hours ago

Great for money laundering.

T156@lemmy.world · 5 hours ago

Not without making real users also mine bitcoin/avoiding the site because their performance tanked.

Wispy2891@lemmy.world · 4 hours ago

Question: those artificial stupidity bots want to steal the issues or want to steal the code? Because why they’re wasting a lot of resources scraping millions of pages when they can steal everything via SSH (once a month, not 120 times a second)

Passerby6497@lemmy.world · 4 hours ago

That would require having someone with real intelligence running the scraper.

oeuf@slrpnk.net · 7 hours ago

Crazy. DDoS attacks are illegal here in the UK.

rdri@lemmy.world · 5 hours ago

So, sue the attackers?

SufferingSteve@feddit.nu · edit-2 10 hours ago

There once was a dream of the semantic web, also known as web2. The semantic web could have enabled easy to ingest information of webpages, removing soo much of the computation required to get the information. Thus preventing much of the AI crawling cpu overhead.

What we got as web2 instead was social media. Destroying facts and making people depressed at a newer before seen rate.

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

What crypto gave us was fraud, expensive jpgs and scams. The term web is now even so eroded that it has lost much of its meaning. The information age gave way for the misinformation age, where everything is fake.

muusemuuse@sh.itjust.works · 5 hours ago

Sound like it went the same way everything else went. The less money is involved the more trustworthy it is.

kameecoding@lemmy.world · 5 hours ago

Web3 was about enabling us to securely transfer value between people digitally and without middlemen

I don’t think it ever was that, I think folding ideas has the best explanation of what it was meant to be, it was meant to be a way to grab power, away from those who already have it

https://youtu.be/YQ_xWvX1n9g

tourist@lemmy.world · 8 hours ago

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

It’s ironic that the middlemen showed up anyway and busted all the security of those transfers

You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins

oh btw every government on the planet showed up and dug through our insecure records. hope you weren’t actually buying shroom drugs on the slip rod

also we got hacked, you lost all your bipcoins sorry

At least, that’s my recollection of events. I was getting my illegal narcotics the old fashioned way.

raspberriesareyummy@lemmy.world · 5 hours ago

also we got hacked, you lost all your bipcoins sorry

aaaaaaaaand - it’s gone!

Chee_Koala@lemmy.world · 8 hours ago

the old fashioned way.

A whole swath of trained toads using a special made tube network?

tourist@lemmy.world · 6 hours ago

getting into a car with a stranger who said he was 15 minutes away two hours ago

Quetzalcutlass@lemmy.world · 8 hours ago

Nah, they clearly meant in liquid form.

very_well_lost@lemmy.world · 7 hours ago

Liquid narcotics, you say?

GreenShimada@lemmy.world · 7 hours ago

Mr. Internet, tear down these walls! (for all these walled gardens)

Return the internet to the wild. Let it run feral like dinosaurs on an island.

Let the grannies and idiots stick themselves in the reservations and asylums run by billionaires.

Let’s all make Neocities pages about our hobbies and dirtiest, innermost thoughts. With gifs all over.

Marshezezz@lemmy.blahaj.zone · 10 hours ago

Capitalism is grand, innit. Wait, not grand, I meant to say cancer

Serinus@lemmy.world · 18 minutes ago

I feel like half of the blame capitalism gets is valid, but the other half is just society. I don’t care what kind of system you’re under, you’re going to have to deal with other people.

Oh, and if you try the system where you don’t have to deal with people, that just means other people end up handling you.

hansolo@lemmy.today · 9 hours ago

Preach!

r00ty@kbin.life · 7 hours ago

For mbin I managed to kill the attack of the scrapers only using cloudflare managed challenge for all except to fediverse post endpoints, from fediverse ua agents on certain get endpoints. Managed challenge on everything else.

So far, they’ve not gotten past it. But, a matter of time.

PrettyFlyForAFatGuy@feddit.uk · 5 hours ago

man, you’d think they’d just use the actual activitypub protocol to inhale all that data at once and not bother with costly scraping.

This A aint very I

muusemuuse@sh.itjust.works · 5 hours ago

AI was never intelligent. It’s a marketing term, that’s all. It has absolutely no meaning.

Wispy2891@lemmy.world · 4 hours ago

Same for all the WordPress blogs, by default in all of them there’s an API without authentication that lets you download ALL the posts in an easy JSON.

Dear artificial stupidity bot… WHY THE FUCK ARE YOU FUCKING SCRAPING THE WHOLE PAGE 50 TIMES A SECOND???

r00ty@kbin.life · 3 hours ago

Well the posts to inbox are generally for incoming info. Yes, there’s endpoints for fetching objects. But, they don’t work for indexing, at least not on mbin/kbin. If you have a link, you can use activitypub to traverse upwards from that object to the root post. But you cannot iterate down to child comments from any point.

The purpose is that say I receive an “event” from your instance. You click like on a post I don’t have on my instance. Then the like event has a link to the object for that on activitypub. If I fetch that object it will have a link to the comment, if I fetch the comment it will have the comment it was in reply to, or the post. It’s not intended to be used to backfill.

So they do it the old fashioned way, traversing the human side links. Which is essentially what I lock down with the managed challenge. And this is all on the free tier too.

zifk@sh.itjust.works · 10 hours ago

Anubis isn’t supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.

randomblock1@lemmy.world · 7 hours ago

No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.

Even then AI servers have plenty of compute, it realistically doesn’t cost much. Maybe like a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.

I’ve been saying this since day 1 of Anubis but nobody wants to hear it.

T156@lemmy.world · 5 hours ago

The website would also have to display to users at the end of the day. It’s a similar problem as trying to solve media piracy. Worst comes to it, the crawlers could read the page like a person would.

sudo@programming.dev · edit-2 9 hours ago

This is what I’ve kept saying about POW being a shit bot management tactic. Its a flat tax across all users, real or fake. The fake users are making money to access your site and will just eat the added expense. You can raise the tax to cost more than what your data is worth to them, but that also affects your real users. Nothing about Anubis even attempts to differentiate between bots and real users.

If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.

black_flag@lemmy.dbzer0.com · 8 hours ago

Yeah but ai companies are losing money so in the long run Anubis seems like it should eventually return to working.

r00ty@kbin.life · 7 hours ago

It’s the usual enshittification tactic. Make AI cheap so companies fire tech workers. Keep it cheap long enough that we all have established careers as McDonald’s branch managers, then whack up the prices once they’re locked in.

sudo@programming.dev · edit-2 6 hours ago

Costs of solving PoW for Anubis is absolutely not a factor in any AI companies budget. Just the costs of answering one question is millions of times more expensive than running sha256sum for Anubis.

Just in case you’re being glib and mean the businesses will go under regardless of Anubis: most of these are coming from China. China absolutely will keep running these companies at a loss for the sake of strategic development.

black_flag@sh.itjust.works · 5 hours ago

Thanks for the info 👍 would not have thought Anubis would be so irrelevant

OpenPassageways@lemmy.zip · 7 hours ago

What the alternative?

sudo@programming.dev · 6 hours ago

Not much for open source solutions. A simple captcha however would cost scrapers more to crack than Anubis.

But when it comes to “real” bot management solutions: The least invasive solutions will try to match User-Agent and other headers against the TLS fingerprint and block if they don’t match. More invasive solutions will fingerprint your browser and even your GPU, then either block you or issue you a tracking cookie which is often pinned to your IP and user-agent. Both of those solutions require a large base of data to know what real and fake traffic actually looks like. Only large hosting providers like CloudFlare and Akamai have that data and can provide those sorts of solutions.

PhilipTheBucket@piefed.social · 11 hours ago

I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDOS in response or maybe just a drone strike and cut out the middleman. Once you’ve actively evading Anubis, fuckin’ game on.

TurboWafflz@lemmy.world · 10 hours ago

I think the best thing to do is to not block them when they’re detected but poison them instead. Feed them tons of text generated by tiny old language models, it’s harder to detect and also messes up their training and makes the models less reliable. Of course you would want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power since the scrapers probably don’t really care about the speed

xthexder@l.sw0.com · 10 hours ago

I love catching bots in tarpits, it’s actually quite fun

phx@lemmy.ca · 8 hours ago

Yeah that was my thought. Don’t reject them, that’s obvious and they’ll work around it. Feed them shit data - but not too obviously shit - and they’ll not only swallow it but eventually build up to levels where it compromises them.

I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them and cost more work to separate good from bad, and they’ll eventually either sod off or die.

A low power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AI’s. It doesn’t need to be done real-time either as datasets can be generated in advance

31ank@ani.social · edit-2 10 hours ago

Some guy also used zip bombs against AI crawlers, don’t know if it still works. Link to the lemmy post

sudo@programming.dev · edit-2 9 hours ago

The problem is primarily the resource drain on the server and tarpitting tactics usually increase that resource burden by maintaining the open connections.

The Infinite Nematode@feddit.uk · 10 hours ago

Wasn’t this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?

traches@sh.itjust.works · 10 hours ago

These crawlers come from random people’s devices via shady apps. Each request comes from a different IP

AmbitiousProcess (they/them)@piefed.social · 10 hours ago

Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why they do IP range blocks. That’s why in Codeberg’s response, they mention that after they fixed the configuration issue that only blocked those IP ranges on non-Anubis routes, the crawling stopped.

For example, OpenAI publishes a list of IP ranges that their crawlers can come from, and also displays user agents for each bot.

Perplexity also publishes IP ranges, but Cloudflare later found them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from “shady apps.” Instead, they would simply rotate ASNs, and request a new IP.

The reason they do this is because it is still legal for them to do so. Rotating ASNs and IPs within that ASN is not a crime. However, maliciously utilizing apps installed on people’s devices to route network traffic they’re unaware of is. It also carries much higher latency, and could even allow for man-in-the-middle attacks, which they clearly don’t want.

PhilipTheBucket@piefed.social · 10 hours ago

Honestly, man, I get what you’re saying, but also at some point all that stuff just becomes someone else’s problem.

This is what people forget about the social contract: It goes both ways, it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat / with some friends. That wasn’t really the way, and so we arrived at this deal where no one had to do that, but then people always start to fuck over other people involved in the system thinking that that “no one will show up at my place with a bat, whatever I do” arrangement is a law of nature. It’s not.

amelaxxx@piefed.social · 10 hours ago

sudo@programming.dev · 9 hours ago

Or your TV or IOT devices. Residential proxies are extremely shady businesses.

PhilipTheBucket@piefed.social · 10 hours ago

Is that really true? I guess I have no reason to doubt it, I just hadn’t heard it before.

sudo@programming.dev · 9 hours ago

Here’s one example of a proxy provider offering to pay developers to inject their proxies into their apps. (“100% ethical proxies” because they signed a ToS). Another is BrightData proxies traffic through users of their free HolaVPN.

IOT and smart TVs are also obvious suspects.

NuXCOM_90Percent@lemmy.zip · 10 hours ago

Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.

Its also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDOS attack is to… rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I would also rather orgs I donate to not giving that money to blackhat orgs. But that is just me.

bleistift2@sopuli.xyz · edit-2 10 hours ago

https://en.wikipedia.org/wiki/Sarcasm, or maybe https://en.wikipedia.org/wiki/Hyperbole

amelaxxx@piefed.social · 10 hours ago

Right

wetbeardhairs@lemmy.dbzer0.com · 8 hours ago

Gosh. Corporations are rampantly attempting to access resources so they can perform copyright infringement en-masse. I wonder if there is a legal mechanism to stop them? Oh, no there isn’t because our government is fully corrupted.

aquovie@lemmy.cafe · 7 hours ago

I think, in this particular case, it’s aggressive apathy/incompetence and not malice. Remember, Trump didn’t even know what Nvidia was.

AI’s don’t have a skin color or use the bathroom so you can’t whip your cult into a frenzy by Othering it. You can’t solidify your fascism by getting bogged down in the details of IP law.

Corkyskog@sh.itjust.works · 4 hours ago

Just say that the AI will be used to train the immigrants to take der jerbs.

UnderpantsWeevil@lemmy.world · 10 hours ago

I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.

willington@lemmy.dbzer0.com · edit-2 5 hours ago

I was fine before the AI.

The biggest customer of AI are the billionaires who can’t hire enough people for their technofeudalist/surveillance capitalism agenda. The billionaires (wannabe aristocrats) know that machines have no morals, no bottom lines, no scruples, don’t leak info to the press, don’t complain, don’t demand to take time off or to work from home, etc.

AI makes the perfect fascist.

They sell AI like it’s a benefit to us all, but it ain’t that. It’s a benefit to the billionaires who think they own our world.

AI is used for censorship, surveillance pricing, activism/protest analysis, making firing decisions, making kill decisions in battle, etc. It’s a nightmare fuel under our system of absurd wealth concentration.

Fuck AI.

devfuuu@lemmy.world · edit-2 10 hours ago

I think every company in the world is telling everyone for a few months now that what matter is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get out of the internet. Our ONLY purpose it to feed the machine, anything else is irrelevant. Play along or you shall be removed.

Goretantath@lemmy.world · 8 hours ago

I knew that was the worse option. Use the one that traps them in an infinite maze.

aquovie@lemmy.cafe · 8 hours ago

You need to properly detect that they’re bots first and then they’ll just figure out how to spoof that. Then you’re back to square one.

Abstractly, POW doesn’t need to determine if you’re a bot or not. To make a request, as a human or bot, you need to pay in cpu-time. The hope is that the cost is not so high that a human notices very much but for a bot trying to hoover up data as fast as possible, the aggregate cost is high.

I think the more horrifying aspect is that they’ll just build ever bigger datacenters to crunch POW tests faster and the carbon cost will skyrocket even more.

Auth@lemmy.world · 28 minutes ago

Trap users in the maze as well :)

mic_check_one_two@lemmy.dbzer0.com · 8 hours ago

Exactly. Imagine needing to pay a penny for every request. Not a huge deal for someone who only makes one or two requests per year. But if you’re running a bot farm and making tens of millions of requests per day, you’ll quickly find that your operating costs have skyrocketed. That’s basically the idea behind Anubis; Make someone pay in CPU time, so the legit users don’t really notice but bots quickly eat up all of their servers’ CPU.

nialv7@lemmy.world · 7 hours ago

Oh I haven’t even considered the carbon aspect. Anubis is an even worse idea than I previously thought…

0x0@lemmy.zip · 11 hours ago

It’s always a cat-n-mouse game.

Allero@lemmy.today · 9 hours ago

Except previously bombarding another person’s server for personal gain was illegal.

carrylex@lemmy.world · 8 hours ago

I don’t know if this is news to you, but most of the internet never cared about what’s legal or not.

0x0@lemmy.zip · 6 hours ago

Not if it’s AI.
/s aside, maybe you could call’em out on involuntary DoSing, but then slashdot and similar sites would get into trouble.

Kyrgizion@lemmy.world · 10 hours ago

Eventually we’ll have “defensive” and “offensive” llm’s managing all kinds of electronic warfare automatically, effectively nullifying each other.

ProdigalFrog@slrpnk.net · 10 hours ago

That’s actually a major plot point in Cyberpunk 2077. There’s thousands of rogue AI’s on the net that are constantly bombarding a giant firewall protecting the main net and everything connected to it from being taken over by the AI.

archchan@lemmy.ml · edit-2 7 hours ago

Not to mention the firewall is itself an AI.

Klear@quokk.au · 9 hours ago

The game is an excellent documentary.

Track_Shovel@slrpnk.net · 9 hours ago

Unrelated, but I saw this headline, and could hear both you and squidward swearing from here.

ChaoticNeutralCzech@feddit.org · 7 hours ago

Obligatory AI ≠ LLM. How would scrapers benefit from the LLMs they help train? The defense is obvious, LLM-generated slop traps against scrapers already exist.

sudo@programming.dev · 9 hours ago

Places like cloudflare and akamai are already using machine learning algorithms to detect bot traffic at a network level. You need to use similar machine learning to evade them. And since most of these scrapers are for AI companies I’d expect a lot of the scrapers to be LLM generated.

zoey@lemmy.librebun.com · 10 hours ago

I’m ashamed to say that I switched my DNS nameservers to CF just for their anti crawler service.
Knowing Cloudflare, god know how much longer it’ll be free for.

AmbiguousProps@lemmy.today · 8 hours ago

Did you enable the AI black hole/tarpit? It’s the main reason I’ve used their stuff.