AI Is Taking What It’s Allowed - and That’s the Problem

As AI systems move from passive indexing to active retrieval, synthesis, and task execution, the line between being crawled and being used has largely disappeared. Content is being consumed programmatically, not navigated. And we believe that programmatic consumption should have programmatic compensation.

Awareness of this shift is not evenly distributed, and neither are the control levers publishers actually use. While CDN- and server-level configurations provide the strongest and most definitive control, they are rarely the primary mechanism for expressing policy. Instead, most publishers rely on the venerable robots.txt file. It is simple, widely understood, and legible to both search engines and AI operators.

Robots.txt has taken on a role it was never designed for: it has become a visible expression of AI access policy. Unfortunately, robots.txt cannot express distinctions like usage class, scope, timing, or compensation, but it has one critical advantage: it functions as a common language. In the absence of more expressive, machine-readable alternatives (which are under development by standards bodies), it has become the place where intent is declared—imperfectly, but publicly.
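To make that gap concrete, here is a small, hypothetical robots.txt (the agents and paths are illustrative, not taken from any site in this analysis). Everything it can express is per-agent, per-path allow or deny; there is no vocabulary for usage class, licensing, timing, or payment:

```
# The entire vocabulary: name an agent, allow or deny paths.
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml

# Not expressible: "no training, but retrieval allowed", "allowed under license",
# "allowed after 30 days", or "paid access only".
```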

We performed an automated analysis of the robots.txt files of a number of sites across several verticals. The analysis finds that access policies align less with technical constraints and more with how each sector currently makes money, with inconsistencies that appear partly driven by formal partnerships and partly by simple neglect. One caveat - this analysis does not take into account access-control technologies such as CDNs or Web Application Firewalls (WAFs).
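For readers who want to reproduce a rough version of this kind of check, a minimal sketch follows. It is not the pipeline used for this analysis: the agent list is a small illustrative subset, only the site root is tested, and an agent that is never named is recorded as unspecified even though wildcard rules may still apply to it.

```python
import urllib.request
import urllib.robotparser

# Illustrative subset of AI user-agent tokens; a fuller audit tracks many more.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot", "ChatGPT-User"]

def audit(site: str) -> dict:
    """Classify a site's robots.txt posture toward each agent as allowed / blocked / unspecified."""
    raw = urllib.request.urlopen(f"https://{site}/robots.txt", timeout=10).read().decode("utf-8", errors="replace")

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(raw.splitlines())

    # Agents the file addresses by name, so explicit policy can be separated from silence.
    named = {line.split(":", 1)[1].strip().lower()
             for line in raw.splitlines()
             if line.lower().startswith("user-agent:")}

    posture = {}
    for agent in AI_AGENTS:
        if agent.lower() not in named:
            posture[agent] = "unspecified"   # silence; wildcard rules may still apply
        elif parser.can_fetch(agent, f"https://{site}/"):
            posture[agent] = "allowed"
        else:
            posture[agent] = "blocked"       # named and denied the site root
    return posture

print(audit("example.com"))
```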

Some sectors draw firm boundaries, signaling that AI systems represent uncompensated extraction. Others selectively permit access where it supports distribution or commercial partnerships. Still others remain largely silent, effectively allowing broad use while assuming the consequences can be addressed later.

None of this appears coordinated. Much of it is likely transitional. At scale, defaults become expectations. What is allowed today—explicitly or by omission—shapes who gains leverage tomorrow.

The result is that robots.txt is now encoding real positions on control, visibility, and future negotiation power, whether publishers intend it to or not.

Three Verticals, Three Distinct Postures

Looking across news, travel, and real estate, differences in AI access policy are not driven by technical constraints. They reflect how each sector currently captures value—and how clearly that intent is translated into enforceable policy. In the tables that follow, ✅ marks an explicit allow, ❌ an explicit block, ⚠️ partial or conditional rules, and 🤷 no applicable directive at all.

News industry policy norms

News publishers have the most restrictive posture, and that intent is now more clearly encoded in robots.txt. Traditional search indexing remains broadly allowed for Google and Bing, but AI crawlers and large-scale collectors (such as CommonCrawl's CCBot, PerplexityBot, and Claude-SearchBot) are often blocked. Most major publishers now explicitly deny AI training access to OpenAI (GPTBot), Anthropic (ClaudeBot), and Google (Google-Extended), with only a few remaining open, mostly by omission rather than by choice. Larger publishers are also increasingly blocking interactive agents like ChatGPT-User and Claude-User. There are still gaps—some agents are not named—but the direction is clear: publishers are trying to limit unpaid AI reuse, even if robots.txt is a blunt and imperfect tool for doing so.
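Expressed in robots.txt, the posture described above looks roughly like the following hypothetical composite (it illustrates the pattern; it is not copied from any publisher's file):

```
# Traditional search indexing stays open.
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

# AI training and bulk collection are blocked.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Interactive AI agents are increasingly blocked as well.
User-agent: ChatGPT-User
Disallow: /

User-agent: Claude-User
Disallow: /
```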

Inference - AI Model Training Access Policies

[Table: per-publisher training-access markers for Fortune, Reuters, Al Jazeera, Mother Jones, Boston Globe, New York Times, and Wall Street Journal across OpenAI, Anthropic, Meta, Google, Apple, and Amazon; the individual cell values are incomplete in this export and are not reproduced here.]

Agentic - AI Agent Access Policies

[Table: per-publisher agent-access markers for the same publishers across OpenAI, Anthropic, Perplexity, and Microsoft; the individual cell values are likewise incomplete in this export and are not reproduced here.]

Travel industry policy norms

Travel platforms show a more selective, business-driven posture, and the split in the sector is now clearer. About half the market (especially Expedia-owned brands) openly allows AI indexing, training, and agent access, treating AI traffic as another distribution channel that can drive bookings. The other half (TripAdvisor, Airbnb, Kayak) has moved toward partial or full restrictions, especially around training and agent use. Access decisions tend to follow where each company sees upside or risk in its marketplace. Robots.txt files reflect this uneven maturity: some platforms define agents carefully, while others rely on partial rules or silence that unintentionally allows access. The goal is not controlling AI for its own sake, but protecting or expanding commercial leverage.
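The ⚠️ entries in the tables below generally reflect this kind of partial rule: an agent is named, but only scoped in or out of certain paths rather than cleanly allowed or blocked. A hypothetical example of the shape (paths and agents are illustrative):

```
# Named, but only partially restricted: listing pages open, transactional flows closed.
User-agent: GPTBot
Allow: /hotels/
Disallow: /search
Disallow: /booking/

# Retrieval bot fully allowed; other agents fall through to the wildcard rules.
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /account/
```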

Inference - AI Model Training Access Policies

| Site | OpenAI | Anthropic | Meta | Google | Apple | Amazon |
| --- | --- | --- | --- | --- | --- | --- |
| Expedia | ✅ | ✅ | 🤷 | 🤷 | 🤷 | 🤷 |
| Trip.com | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Agoda | 🤷 | 🤷 | 🤷 | ⚠️ | 🤷 | 🤷 |
| TripAdvisor | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Airbnb | ⚠️ | ⚠️ | ⚠️ | ⚠️ | 🤷 | 🤷 |
| Vrbo | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Kayak | 🤷 | 🤷 | 🤷 | ⚠️ | 🤷 | 🤷 |
| Skyscanner | 🤷 | 🤷 | ❌ | 🤷 | 🤷 | 🤷 |
| Travelocity | ✅ | ✅ | 🤷 | 🤷 | 🤷 | 🤷 |
| Lonely Planet | ❌ | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |

Agentic - AI Agent Access Policies

| Site | OpenAI | Anthropic | Perplexity | Microsoft |
| --- | --- | --- | --- | --- |
| Expedia | ✅ | ✅ | ✅ | 🤷 |
| Trip.com | 🤷 | 🤷 | 🤷 | 🤷 |
| Agoda | 🤷 | 🤷 | 🤷 | 🤷 |
| TripAdvisor | ⚠️ | 🤷 | 🤷 | ⚠️ |
| Airbnb | ⚠️ | 🤷 | 🤷 | 🤷 |
| Vrbo | ✅ | ✅ | ✅ | 🤷 |
| Kayak | ⚠️ | 🤷 | 🤷 | 🤷 |
| Skyscanner | 🤷 | 🤷 | 🤷 | 🤷 |
| Travelocity | ✅ | ✅ | ✅ | 🤷 |
| Lonely Planet | 🤷 | 🤷 | 🤷 | 🤷 |

Real Estate industry policy norms

Real estate platforms remain broadly open by default, with little sign that AI access is being treated as a real policy issue. Search indexing is fully open across major players, and none of the reviewed sites define explicit rules for AI training or agent access. In practice, silence means permission for GPTBot, ClaudeBot, and similar agents. Several robots.txt files appear inconsistent or poorly maintained, reinforcing the idea that programmatic reuse has not yet registered as either a serious risk or a clear opportunity. Unlike media and parts of travel, the sector is still at a pre-policy stage when it comes to AI.
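The "silence means permission" point is mechanical rather than rhetorical: under the Robots Exclusion Protocol, an agent that matches no User-agent group inherits the wildcard rules, and if none touch it, everything is allowed. A small sketch with a hypothetical robots.txt that only addresses search crawlers:

```python
import urllib.robotparser

# Hypothetical robots.txt that only speaks to search engines, as on many of the sites reviewed.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: bingbot
Disallow: /admin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Agents the file never names are allowed everywhere by default.
for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    print(agent, parser.can_fetch(agent, "https://example-listings.com/homes/12345"))
# Prints True for each unnamed agent: silence is permission.
```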

Inference - AI Model Training Access Policies

| Site | OpenAI | Anthropic | Meta | Google | Apple | Amazon |
| --- | --- | --- | --- | --- | --- | --- |
| Zillow | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Realtor.com | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Redfin | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Trulia | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Houzeo | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| Clever | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| MLS.com | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |
| StreetEasy | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 | 🤷 |

Agentic - AI Agent Access Policies

| Site | OpenAI | Anthropic | Perplexity | Microsoft |
| --- | --- | --- | --- | --- |
| Zillow | 🤷 | 🤷 | 🤷 | 🤷 |
| Realtor.com | 🤷 | 🤷 | 🤷 | 🤷 |
| Redfin | 🤷 | 🤷 | 🤷 | 🤷 |
| Trulia | 🤷 | 🤷 | 🤷 | 🤷 |
| Houzeo | 🤷 | 🤷 | 🤷 | 🤷 |
| Clever | 🤷 | 🤷 | 🤷 | 🤷 |
| MLS.com | 🤷 | 🤷 | 🤷 | 🤷 |
| StreetEasy | 🤷 | 🤷 | 🤷 | 🤷 |

Summary

Across all three verticals the same pattern holds, and the contrast between them is sharp. Where value feels fragile and tightly tied to proprietary content (news), access is being locked down. Where value scales with distribution and transactions (travel), access is selectively opened or restricted based on business alignment. And where AI has not yet changed the economics in an obvious way (real estate), open defaults persist mostly out of inertia.

These postures directly shape who gets to consume content programmatically, on what terms, and who ends up with leverage. Decisions being made today—sometimes deliberately, sometimes by neglect—are setting expectations for what AI companies assume they can take for free, what platforms may later try to claw back, and where future negotiations will realistically start.

The takeaway - what each vertical is doing today reflects how it believes value flows now, not how it necessarily wants it to flow next.