
Bot or Not? What Are All These AI Agents Even Doing?


The recent “AI traffic apocalypse” has sparked a lot of noise. Traffic is down, ad revenue is slipping, and everyone’s asking the same thing: Should I block AI bots?

Totally fair question. But here’s the problem: most people don’t actually know what these bots are doing in the first place.

If you run a major website, you already know about user agents — browsers, crawlers, bots, spiders. (If not, we’ve included a quick primer at the end of the post.) But do you really know what they’re doing with your content? Unlikely.

Let’s fix that.

Why Should I Care?

Let’s start with the obvious: browsers bring people to your site. People read articles, see ads, maybe subscribe. Search engines? Also useful. They help people find your stuff.

But now, two major shifts are changing everything.

  • Search engines are giving answers without clicks. Snippets became summaries. Now, full AI Overviews tell users what they need without sending them to you.
  • AI chat tools and “agents” are replacing search altogether. Tools like ChatGPT, Claude, and others pull from multiple sites at once, blend the information, and hand the user a neatly packaged answer. People love it. It’s fast, easy, and (usually) good enough.

So what’s missing?

Your content. Your link. Your brand. Your revenue. When an AI tool answers a user’s question, they don’t see you. And you don’t see them. Neither do your advertisers.

How can I figure out how bad it is for me?

That’s the fun part – you can’t (yet).

Right now, the web doesn’t give publishers much help here; it takes time for web technology to evolve. You can try to piece together the puzzle from logs and bot patterns, but it’s messy and manual.
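For the curious, the manual version looks something like this. It’s a minimal sketch in Python, assuming your server writes standard combined-format access logs; the file name and the user-agent tokens are illustrative, and a real list would be much longer.

import re
from collections import Counter

# Tally requests by rough user-agent category from a combined-format access log.
# The token lists below are illustrative, not exhaustive.
CATEGORIES = {
    "search indexing": ("Googlebot", "bingbot"),
    "diagnostics/SEO": ("AhrefsBot", "SemrushBot"),
    "AI training": ("GPTBot", "CCBot", "ClaudeBot"),
    "AI real-time use": ("ChatGPT-User", "PerplexityBot"),
}

def categorize(user_agent: str) -> str:
    ua = user_agent.lower()
    for category, tokens in CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "direct or unknown"

counts = Counter()
with open("access.log") as log:  # hypothetical path to your log file
    for line in log:
        # In combined log format, the user agent is the last quoted field.
        match = re.search(r'"([^"]*)"\s*$', line)
        counts[categorize(match.group(1) if match else "")] += 1

total = sum(counts.values()) or 1
for category, count in counts.most_common():
    print(f"{category:<20} {count:>8}  {100 * count / total:5.1f}%")

It works, but you end up maintaining token lists by hand and guessing at intent, which is exactly the problem.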

There’s no standard dashboard that says: 

Roughly 20% of traffic last week came from agents learning from your content:

  • 5% for traditional search indexing (Googlebot, Bingbot)
  • 7% for site diagnostics and analytics (Ahrefs, SimilarWeb, SEO crawlers)
  • 8.7% for AI model training, primarily from OpenAI, Anthropic, and Meta

An additional 15% (up 250 basis points month-over-month) came from AI agents actively applying your content in real time - including chatbots, summarization tools, and research assistants. Most of these requests offered limited or no attribution, and less than 10% followed declared access policies.

The remaining 65% of traffic was from direct user activity, including browsers, mobile apps, and social previews. Up to 10% of this may be from automated agents masquerading as real user activity.

It’d be amazing to have this, right? That’s part of what we’re building at paywalls.net: a way to make all of this legible.

We’re creating a shared understanding of:

  • Who’s accessing your content
  • What they’re doing with it
  • And whether there’s any value coming back to you

Because right now, publishers are playing whack-a-mole in the dark. And it’s not working.

Not All Bots Are Bad (But Some Are Sketchy)

To help publishers make informed choices, we’ve started mapping the landscape. Not all bots are equal, so understanding their intent is the first step to deciding what to allow, monitor, or block.

Let’s start at the top: why is this agent fetching your content at all?

The highest-level distinction is purpose: is the bot trying to learn from your content, or apply it in real time?

  • Learn means the agent is crawling your site to build something:
    • An index, like what search engines use
    • An insight, like broken link analysis or performance reports
    • An inference model, like training a generative AI on your content
  • Apply means the agent is using your content on the spot:
    • Loading it in a browser
    • Previewing it on social media
    • Feeding it into a chatbot or research assistant that serves answers to someone in real time

This difference matters a lot, especially when Apply use cases don’t return value to you in the form of attribution, traffic, or revenue. Not that we should have to say it, but keep your eye on the revenue part (we’re working on that too).

Another critical factor is how the access was initiated. Was it something a person explicitly asked for? Or did a system decide to fetch your content on its own?

  • Explicit: A person clicked a link or asked a direct question or query (e.g., “What’s the pricing on AWS CloudFront these days?”)
  • Semi-autonomous: The system is helping a user who provided a high-level goal (e.g., “I’m going skiing at Crystal next weekend - what conditions can I expect?”)
  • Autonomous: The system fetched your content without any specific human prompt. This includes traditional crawlers, but also personal agents running in the background - like a news assistant that checks headlines each morning, a shopping bot tracking price drops, or a home device pulling weather updates before your commute. These agents act independently, often on a schedule or trigger, and are becoming more common.

We also look at how the agent intends to transform your content when it’s presented to the end user:

  • Verbatim: Your content is quoted or reused as-is
  • Summarized: Shortened, paraphrased, or simplified
  • Synthesized: Blended with other sources into a new output

And finally, attribution - do you get credit?

  • Citation: You’re clearly listed with a source link
  • Mention: Your brand name appears, but no link
  • No attribution: Your content is used, but your name is gone

Plus there are a couple of bonus dimensions that provide extra context (we pull all of these together in a quick sketch after this list):

  • Operator: The organization (if any) running the agent (Google, OpenAI, Anthropic, etc.)
  • Agent Name: The declared user agent string, independent of version or variant (like Googlebot, ClaudeBot, ChatGPT-User)
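To make those dimensions concrete, here’s one way they could be written down in code. This is a minimal sketch in Python of the taxonomy described above, not paywalls.net’s actual schema; every name in it is simply the post’s vocabulary turned into types.

from dataclasses import dataclass
from enum import Enum

class Purpose(Enum):
    LEARN = "learn"              # building an index, an insight, or an inference model
    APPLY = "apply"              # using the content on the spot

class Initiation(Enum):
    EXPLICIT = "explicit"        # a person clicked or asked directly
    SEMI_AUTONOMOUS = "semi"     # the system is pursuing a user's high-level goal
    AUTONOMOUS = "autonomous"    # no specific human prompt (crawlers, background agents)

class Transformation(Enum):
    VERBATIM = "verbatim"        # quoted or reused as-is
    SUMMARIZED = "summarized"    # shortened, paraphrased, or simplified
    SYNTHESIZED = "synthesized"  # blended with other sources into a new output

class Attribution(Enum):
    CITATION = "citation"        # listed with a source link
    MENTION = "mention"          # brand name appears, but no link
    NONE = "none"                # your name is gone

@dataclass
class AccessEvent:
    operator: str                # e.g. "OpenAI", "Google", or unknown
    agent_name: str              # e.g. "Googlebot", "ClaudeBot", "ChatGPT-User"
    purpose: Purpose
    initiation: Initiation
    transformation: Transformation
    attribution: Attribution

# Example: a chatbot answering a user's direct question with a summarized, cited excerpt.
event = AccessEvent("OpenAI", "ChatGPT-User", Purpose.APPLY,
                    Initiation.EXPLICIT, Transformation.SUMMARIZED,
                    Attribution.CITATION)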

This may sound like a lot, but you don’t need to memorize it. The takeaway is simple: understanding the purpose, presentation, and pathway of content access helps you make smarter decisions.

Some bots give value back. Others don’t. Let’s make it easier to tell the difference.

Show Me the Money (Coming Soon)

This is where it gets real.

We’re launching Instant Insight, a feature that lets publishers upload a slice of their access logs and get answers back right away. No config, no code. Just paste in your data and see:

  • Who’s crawling your site
  • Whether it looks like indexing, summarizing, or synthesis
  • Whether they’re giving attribution or taking all the credit

If you’re a publisher and want an early peek, reach out or visit Instant Insight. We’d love to help you and we think you’ll want to share what you find.

What Happens Next

We’re not here to fearmonger. AI isn’t going away. And honestly, most of the people building these systems want to do the right thing. But we can’t have a fair system without visibility. And we can’t have visibility without structured access, attribution, and insight.

So let’s build that together.

And if this piece helped clarify the chaos, share it with someone else who’s wondering what the hell all those bots are doing on their site.


Background: What Is a User Agent?

If you're already familiar with browsers, crawlers, and bots, you can safely skip this section. But if you're looking for a quick refresher or need to explain it to colleagues, here’s a breakdown.

What is a "User Agent"?

A user agent is the technical term for any software that makes a request to a web server on behalf of a user. It’s part of how the internet communicates who's asking for what.

Every time a browser or bot makes a request to your website, it includes a user agent string — a line of text that identifies what kind of client is accessing the content. For example, when you open a site in Chrome, the request includes something like:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36

This is the user agent string, and it’s pretty ugly. It tells the server, "Hi, I’m Chrome on a Mac," so that the right version of the site can be delivered.
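If you want to see this firsthand, here’s a tiny sketch using only Python’s standard library: a throwaway local server that prints the User-Agent header of every request it receives. The address and port are arbitrary.

from http.server import BaseHTTPRequestHandler, HTTPServer

class UserAgentEchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The User-Agent header is whatever the client chose to send.
        user_agent = self.headers.get("User-Agent", "(none)")
        print(f"{self.client_address[0]} -> {user_agent}")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"You identified as: {user_agent}\n".encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), UserAgentEchoHandler).serve_forever()

Point a browser at http://127.0.0.1:8000/, or run curl -A "SomeBot/1.0" http://127.0.0.1:8000/, and notice how easily a client can claim to be whatever it wants. A user agent string is a hint, not proof.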

Types of User Agents

User agents aren’t limited to browsers. Any software that fetches web content is considered a user agent. The most common types include:

  • Web browsers: Chrome, Safari, Firefox, Edge. These are human-facing and fetch content that the user sees.
  • Search engine crawlers: Googlebot, Bingbot, and others index content for search results.
  • SEO tools and performance scanners: Ahrefs, SEMrush, GTmetrix, and others crawl your site to produce analytics and insights for your business.
  • AI bots and agents: These fetch content to summarize, analyze, or use in AI responses. They may identify as ChatGPT-User, ClaudeBot, or simply generic tools with vague labels.
  • Scrapers: Often custom-built, sometimes opaque or misleading in their identity. Used for data harvesting, sometimes legitimate, sometimes not.

Each of these shows up in your server logs with its own user agent string. Some are transparent and follow rules. Others are opaque, aggressive, or intentionally disguised.

Why It Matters

Your server logs contain a wealth of information, but only if you can interpret it. Understanding user agents helps you answer key questions:

  • Is this a real person using a browser or an automated system?
  • Is this bot indexing my site for search or scraping content for AI training?
  • Should I allow this request, monitor it, or block it? (One way to express that decision is sketched below.)
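One way to turn that last question into something actionable is a simple per-agent policy. This is a minimal sketch in Python with hypothetical policy choices, not recommendations; in practice, enforcement usually lives in robots.txt, your CDN, or a firewall rule rather than in application code.

POLICY = {
    "Googlebot": "allow",       # search indexing still sends traffic
    "bingbot": "allow",
    "GPTBot": "block",          # AI training crawler (example choice, not advice)
    "CCBot": "block",
    "ChatGPT-User": "monitor",  # real-time retrieval on behalf of a user
    "ClaudeBot": "monitor",
}

def decide(user_agent: str) -> str:
    """Return 'allow', 'monitor', or 'block' for a given User-Agent string."""
    ua = user_agent.lower()
    for token, action in POLICY.items():
        if token.lower() in ua:
            return action
    return "allow"  # default: treat unrecognized agents like ordinary visitors

print(decide("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # prints "block"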

Historically, most publishers focused on distinguishing good bots (like Googlebot) from bad actors. But now, with AI agents rapidly expanding, the gray area is getting wider. Some agents might bring value, while others might just take.

Understanding user agents is foundational to controlling who uses your content, how they use it, and whether value flows back to you.