• panda_abyss@lemmy.ca · 4 days ago

    I actually agree with them

    This feels like Cloudflare trying to collect rent from both sides instead of doing what’s best for the website owners.

    There is a problem with AI crawlers, but these technologies are essentially doing a search, fetching several pages, scanning/summarizing them, then presenting the findings to the user.

    I don’t really think that’s wrong; it’s just a faster version of rummaging through the SEO shit you do when you Google something.

    (I’ve never used Perplexity; I do use Kagi’s Ki assistant for similar searches. It runs three searches, scans the top results, and then provides citations.)
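
    Roughly the shape of what I mean, as a sketch (none of this is Perplexity’s or Kagi’s actual code; search() and summarize() are hypothetical stand-ins for whatever search API and model call they use):

    # One search, a handful of fetches on the user's behalf, then a summary
    # with citations. Purely illustrative; the helper functions are stand-ins.
    import requests

    def search(query: str) -> list[str]:
        """Stand-in: return a few result URLs for the query."""
        raise NotImplementedError

    def summarize(pages: list[str]) -> str:
        """Stand-in: whatever LLM call produces the summary."""
        raise NotImplementedError

    def answer(query: str, max_pages: int = 3) -> tuple[str, list[str]]:
        urls = search(query)[:max_pages]   # one search, top few results
        pages = []
        for url in urls:                   # one fetch per result, much like a user clicking through
            resp = requests.get(url, timeout=10,
                                headers={"User-Agent": "example-assistant/0.1"})
            pages.append(resp.text)
        return summarize(pages), urls      # findings plus the citation list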

    • drspod@lemmy.ml · 4 days ago

      What’s best for the website owners is to have people actually visit and interact with their website. Blocking AI tools is consistent with that.

      • panda_abyss@lemmy.ca · 4 days ago

        For a lot of AI searches I actually end up reading the pages anyway, so I don’t know how much this stops that.

        • AstralPath@lemmy.ca · 4 days ago

          You’re the outlier, I promise. People are literally forfeiting their brains in favor of an LLM transplant these days.

          • Pennomi@lemmy.world · 4 days ago

            On the flip side, most websites are so ad-ridden these days a reader mode or other summary tool is almost required for normal browsing. Not saying that AI is the right move, but I can understand not wanting to visit the actual page any more.

            • snooggums@lemmy.world · 4 days ago

              On the flip side, most websites are so ad-ridden these days a reader mode or other summary tool is almost required for normal browsing.

              Firefox with uBlock Origin works perfectly fine and pages load faster without the ads!

            • HarkMahlberg@kbin.earth · 4 days ago

              Maybe I missed something, but uBlock still works just fine for me, even on mobile. And running a Pi-hole, while not trivial, also takes care of some ad traffic. Firefox comes with a reader mode (a feature I really like even with the adblockers!).

              So why do people not want to visit pages anymore, if all these tools already existed?

    • r00ty@kbin.life · 4 days ago

      Well, try running a web server and you’ll find quite quickly that you get hit fast and hard by AI crawlers that do not respect server operators. Unlike the web crawlers of old, these will hit a site over and over, sometimes with hundreds or even thousands of requests per second, to strip-mine all the content they can find as quickly as possible.

      When you try to block them by user agent, they start faking real client user agents.

      When you block the AS numbers involved, traffic starts to go down, but there’s still a large number of non-organic requests coming from, well, frankly everywhere: cellular networks in Brazil, cable internet in the USA, other non-business subscribers in countries around the world.

      How do I know they’re not organic? Turn on Cloudflare’s managed challenge and they all go away.

      So, personally, that’s my biggest beef against them. Yes, ripping off data without permission is bad enough already, but this level of effort to bypass every clear sign that we do not want you here is far worse.
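
      Just to give a sense of what those blunt instruments look like, the two obvious checks are a user-agent denylist and a per-IP rate cap. A rough sketch: the crawler UA tokens are real published names, but the thresholds and the function itself are made up for illustration, and in practice this lives in the WAF/CDN layer rather than the app.

      # Two crude checks: a denylist of declared AI crawler user agents (works only
      # until they fake browser UAs) and a per-IP request-rate heuristic.
      import time
      from collections import defaultdict, deque

      AI_CRAWLER_UAS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")  # examples, not exhaustive
      MAX_REQUESTS = 30        # per IP, per window
      WINDOW_SECONDS = 10

      _recent: dict[str, deque] = defaultdict(deque)

      def should_challenge(ip: str, user_agent: str) -> bool:
          now = time.monotonic()
          # 1) declared crawlers are easy to match -- and easy for them to stop declaring
          if any(token.lower() in user_agent.lower() for token in AI_CRAWLER_UAS):
              return True
          # 2) hundreds of requests per second from one address is not a person
          hits = _recent[ip]
          hits.append(now)
          while hits and now - hits[0] > WINDOW_SECONDS:
              hits.popleft()
          return len(hits) > MAX_REQUESTS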

      • panda_abyss@lemmy.ca · 4 days ago

        Yeah that’s fair, and I do agree with Cloudflare stamping out that behaviour.

        What I’m trying to say is that there are cases where AI agents act on behalf of the user, in what would traditionally be the browser’s user-agent role.

        ETA: That doesn’t excuse things like not having a search index to prevent mass-scale access; this would be a near 1:1 access pattern per user, which would be infrequent/spaced out.

      • FauxLiving@lemmy.world · 4 days ago

        The point of the article is that there is a difference between a bot which is just scraping data as fast as possible and a user requesting information for their own use.

        Cloudflare doesn’t distinguish these things. It would be like Cloudflare blocking your browser because it was automatically fetching JavaScript from multiple sources in order to render the page you navigated to.

        I’m sure you can recognize how annoyed you would be with Cloudflare if you had to enter four CAPTCHAs in order to load a single web page or, as here, have your page fail to load some elements you requested because Cloudflare thinks fetching JavaScript or pre-caching links is the same as web crawler activity.

        • r00ty@kbin.life · 4 days ago

          Yes, but my point is that I cannot tell the difference. If they can convince Cloudflare that they deserve special treatment and an exemption, then they can probably get it.

          I would argue that whether there’s a difference “depends”, though. There are two problems I see, and they are only potentially not guilty of one.

          The first problem is that AI crawlers are a true DDoS, and this is, I think, the main reason most of us (including myself) do not want them. They cause performance issues by essentially speedrunning the collection of every unique piece of data from your site. If they’re dynamic, as the article says, then they are potentially not doing this. I cannot say for sure here.

          The second problem is that many sites are monetized from advert revenue or otherwise motivated by actual organic traffic. In this case, I would bet some money that this company is taking the data from these sites, providing no ad revenue or organic traffic, and serving it to the querying user with their own ads included. In which case, this is also very, very bad.

          So, their beef is only potentially partially valid. Like I say, if they can convince Cloudflare, and people like me, to add exceptions for them, then great. So far, though, I’m not convinced. AI scrapers have a bad reputation in general, and it’s deserved. They need to do a LOT to escape that stigma.

          • FauxLiving@lemmy.world · 4 days ago

            This isn’t about AI crawlers. This is about users using AI tools.

            There’s a massive difference in server load between a user summarizing one page from your site and a bot trying to hit every page simultaneously.

            The second problem is that many sites are monetized from advert revenue or otherwise motivated by actual organic traffic.

            Should Cloudflare block users who use ad block extensions in their browser now?

            The point of the article is that Cloudflare is blocking legitimate traffic, created by individual humans, by classifying that traffic as bot traffic.

            Bot traffic is blocked because it creates outsized server load; that is something user-created traffic doesn’t do.

            People use Cloudflare to protect their sites against bot traffic so that human users can access the site without it being DDoSed. By classifying user-generated traffic and scraper-generated traffic as the same thing, Cloudflare is incorrectly classifying traffic and blocking human users from accessing websites.

            Websites are not able to opt out of this classification scheme. If they want to use Cloudflare for bot protection then they have to also agree that users using AI tools cannot access their sites even if the website owner wants to allow it. Cloudflare is blocking legitimate traffic and not allowing their customers to opt out of this scheme.

            It should be pretty easy to understand how a website owner would be upset if their users couldn’t access their website.

            • r00ty@kbin.life · 4 days ago

              And their “AI tool” looks just like the hundreds of AI scraping bots. And I’ve already said the answer is easy: they need to differentiate themselves enough to convince Cloudflare to make an exception for them.

              Until then, they’re “just another AI company scraping data”.

              • FauxLiving@lemmy.world · 3 days ago

                Well, Cloudflare is adding the ability to whitelist Perplexity and other AI sources (default: on) to the control panel.

                Looks like they differentiated themselves enough.

                • r00ty@kbin.life · 3 days ago

                  That option is likely only for paid accounts. Freebie users like me have to make our own anti-bot WAF rules, or, as I do, just put every page I expect a user to be using behind a managed challenge. Adding exceptions uses up precious space in those rules, which I’ve already spent on exceptions for genuine instance-to-instance traffic.

                  But I am glad they were able to convince Cloudflare. Good for them.

        • pressanykeynow@lemmy.world · 3 days ago

          Cloudflare doesn’t distinguish these things

          It does.

          You just set a user agent like “AI bot request initiated by user” and the website owners will decide for themselves whether to allow your traffic or not.
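
          Something like this on the client side is all it takes, as a sketch (the URL and UA string are just examples, not anyone’s actual values):

          # A user-triggered fetch that declares what it is, so the site (or the CDN
          # in front of it) can allow or deny it by policy instead of guessing.
          import requests

          resp = requests.get(
              "https://example.com/article",
              headers={"User-Agent": "ExampleAssistant/1.0 (AI fetch initiated by a user; +https://example.com/bot-info)"},
              timeout=10,
          )
          resp.raise_for_status()
          print(resp.text[:500])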

          If your bot pretends to not be a bot, it should be blocked.

          Edit: BTW, OpenAI does this.

    • kopasz7@sh.itjust.works · 4 days ago

      Search engines have been getting along relatively fine with websites for decades now. But the crawlers from AI companies basically DDoS hosts in comparison, sending so many requests in such a short interval, crawling dynamic links that are expensive to render compared to a static page, ignoring robots.txt entirely, or even using it to discover unlinked pages.

      Servers have finite resources, especially self-hosted sites, while AI companies have disproportionately more at their disposal, easily grinding other systems to a halt by overwhelming them with requests.
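
      And honoring robots.txt is trivial; Python even ships a parser in the standard library, so ignoring it is a choice, not a technical hurdle. A minimal sketch (the target URLs are just examples):

      # A polite client checks robots.txt before fetching and respects any crawl delay.
      import time
      import urllib.request
      from urllib import robotparser

      UA = "ExampleFetcher/0.1"

      rp = robotparser.RobotFileParser()
      rp.set_url("https://example.com/robots.txt")
      rp.read()

      url = "https://example.com/some/page"
      if rp.can_fetch(UA, url):
          time.sleep(rp.crawl_delay(UA) or 1)   # back off between requests
          req = urllib.request.Request(url, headers={"User-Agent": UA})
          with urllib.request.urlopen(req, timeout=10) as resp:
              body = resp.read()
      else:
          print("robots.txt disallows this path; skip it")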

    • pr06lefs@lemmy.ml · 4 days ago

      If a neighborhood is beset by roving bands of thieves, sooner or later strangers will be greeted by a shotgun rather than an invitation to tea, regardless of their intentions. Them’s the breaks. Bots are going to take a hit now and their operators are just going to have to deal with it. Sucks when people don’t play nice, but this is what you get.

      • FauxLiving@lemmy.world · 4 days ago

        I’m sure people that are attempting to drive to their house in a new vehicle wouldn’t appreciate being riddled with bullets because the neighborhood watch makes no attempt to distinguish between thieves and homeowners.

          • FauxLiving@lemmy.world · 4 days ago

            It isn’t a war zone, it’s a gated community where the guards have suddenly decided that any vehicle made after 2020 is full of thieves.

            They didn’t bother to consult the residents or give them the ability to opt out of having their dinner guests murdered for driving a vehicle the security guards don’t like.

            • pr06lefs@lemmy.ml · 4 days ago

              So you’re a Cloudflare customer and you wish they would let the Perplexity traffic multiplier through to your website? You can leave Cloudflare any time you want.

              • FauxLiving@lemmy.world · 4 days ago

                🙄 You’re an Internet user and you don’t like AI, so you can leave the Internet any time you want.

                That’s not a good argument. What about the users who want to block mass scraping but still want to make their content available to users who are using these tools? Cloudflare exists because it allows the legitimate traffic that websites want and blocks the mass scraping that sites don’t want.

                If they’re not able to distinguish mass-scraping traffic from user-created traffic, then they’re blocking legitimate users that some website owners want.

                • pr06lefs@lemmy.ml · 4 days ago

                  Yes, your “leave the internet any time you want” straw man is not a good argument.

                  If allowing Perplexity while blocking the bad guys is so easy, why not find a service that does that for you?

                  • FauxLiving@lemmy.world · 4 days ago

                    The topic is that Cloudflare is classifying human-sourced traffic as bot-sourced traffic.

                    Saying “Just don’t use it” is a straw man. It doesn’t change the fact that Cloudflare, one of the largest CDNs representing a significant portion of the websites and services in the US, is misclassifying traffic.

                    I used mine intentionally, knowing it was a straw man. Did you?

                    The same goes for “if it’s so easy, just don’t use it”, hopefully for obvious reasons.

                    This affects both the customers of Cloudflare (the web service owners) and the users of the web services. A single site or user opting out doesn’t change the fact that a large portion of the Internet is classifying human-sourced traffic as bot-sourced traffic.