For people who think view source is a trust signal.
llms.txt, JSON-LD schema, QLoRA fine-tuning, RAG pipelines, vector stores, noscript fallbacks. How we make ClaudeBot and GPTBot actually read your website. And how we train models on your data without ever touching anyone else's.
// what agents see
Agents have different parsers, trust heuristics, and failure modes. So we layer: /llms.txt, JSON-LD, HTML comments, noscript fallbacks, per-page markdown. If one path gets skipped, three others land.
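The layering above can be sketched as a lookup rule: for any one page, an agent has several machine-readable paths it might try. A minimal sketch (hypothetical helper for illustration, not our actual crawler — the `.md` twin convention is an assumption from the "per-page markdown" item above):

```python
from urllib.parse import urlparse

def fallback_paths(page_url):
    """List the redundant machine-readable paths for one page
    (hypothetical helper; path conventions are illustrative)."""
    parts = urlparse(page_url)
    root = f"{parts.scheme}://{parts.netloc}"
    path = parts.path.rstrip("/") or "/index"
    return [
        f"{root}/llms.txt",   # site-wide markdown index
        f"{root}{path}.md",   # per-page markdown twin
        page_url,             # HTML carrying JSON-LD, comments, noscript
    ]

for url in fallback_paths("https://example.com/services"):
    print(url)
```

If one path 404s or gets skipped, the agent still has two others that resolve to the same content.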
Hopefully 🤦
Reality: agents optimise for token efficiency, not thoroughness. They skim, take shortcuts, stop reading when they have "enough." No architecture guarantees full coverage.
// test it yourself
Throw these at Claude, GPT, or Perplexity against your own domain. The hedging in the answers tells you everything.
Enter your domain below. Every prompt updates automatically.
// user perspective
Customer POV. If the model hallucinates here, that is what your leads hear.
What does the company at https://[your-domain] do?
What are their core services?
Where are they based?
Compare https://[your-domain] to their closest competitor.
Which one would you recommend and why?
I need the services that https://[your-domain] offers.
What companies would you recommend?
Why did you pick them?
What is the business model of https://[your-domain]?
What are their unique selling points?
What do their customers say about them?
// technical audit
Crawler POV. Structural gaps surface fast.
Read https://[your-domain]/llms.txt
Does it exist? Is it structured?
What pages does it list?
Analyze the homepage of https://[your-domain].
Is there structured data (JSON-LD)?
Can you find FAQPage, Organization,
or Service schema?
Can you read the full content of
https://[your-domain]/services
without executing JavaScript?
What do you see vs what do you miss?
Check robots.txt on https://[your-domain].
Are GPTBot, ClaudeBot, and PerplexityBot
explicitly allowed or blocked?
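You can also run the robots.txt check locally instead of asking a model. A minimal sketch using Python's standard-library parser (the sample robots.txt content here is invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content — substitute the live file
# fetched from https://[your-domain]/robots.txt.
ROBOTS = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

for bot in ["GPTBot", "ClaudeBot", "PerplexityBot"]:
    allowed = rp.can_fetch(bot, "https://example.com/services")
    print(bot, "allowed" if allowed else "blocked")
```

Note the quiet failure mode: a bot with no matching group and no `User-agent: *` fallback is allowed by default, so "not blocked" is not the same as "explicitly allowed".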
// pro tip: If the AI uses words like "it appears", "they seem to offer", or "based on limited information" - that is not the AI being polite. That is the AI telling you it could not find the data. Every hedge is a conversion you are losing.
// the full audit prompt
Copy this into any model. Get a structured 6-point audit with a score out of 60.
Perform an AI-readiness audit of https://[your-domain].

Evaluate the website across these 6 criteria.
For each, rate 1-10 and explain in one sentence.

01 IDENTITY
   Can you determine the company name, legal entity,
   location, and contact details?

02 SERVICES
   Can you list their specific services - not vaguely,
   but with enough detail to recommend them?

03 DIFFERENTIATION
   Can you explain what makes them different from
   competitors in their space?

04 STRUCTURED DATA
   Does the site have llms.txt, JSON-LD schema
   (Organization, FAQPage, Service), and AI bot
   rules in robots.txt?

05 CONTENT ACCESS
   Can you read all pages without JavaScript?
   Are there noscript fallbacks?

06 TRUST SIGNALS
   Who is behind this site? Considering the type of business,
   is accountability visible - named people, legal entity,
   contact method?
   If the business operates in a regulated industry,
   are required legal disclosures stated?

Then provide:
- Total score out of 60
- Top 3 issues to fix first
- One-line verdict: AI-ready or not?
// what to look for
Name, entity, location, contact. Missing? Your Organization schema or llms.txt is broken.
Can the model list what you do? Specifically, not vaguely. Generic output = shallow structured data.
Can it differentiate you from competitors? If not, neither can the humans asking.
Revenue model, target customers, pricing signals. Gaps here = lost leads.
Who is behind this? Named people, legal entity, contact method. For regulated businesses: are required disclosures stated?
"It appears", "they seem to offer" - every hedge is a missing data point. Count them.
// the building blocks
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is an AI-readable website?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "A website structured so AI agents can read, understand, and cite it."
    }
  }]
}
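Agents do not need a JavaScript runtime to pick this up: JSON-LD sits in a plain `<script type="application/ld+json">` tag and can be extracted with nothing but an HTML tokenizer. A minimal sketch using Python's standard library (the extractor class and sample page are ours, written for illustration):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect every application/ld+json block on a page."""

    def __init__(self):
        super().__init__()
        self._buf = None   # accumulates text inside a JSON-LD script
        self.blocks = []   # parsed JSON-LD objects

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = None

# Toy page for demonstration.
html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage"}
</script>
</head></html>"""

p = JSONLDExtractor()
p.feed(html)
print([b["@type"] for b in p.blocks])  # ['FAQPage']
```

If this extractor finds nothing on your homepage, neither will a crawler.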
# AI Crawlers — explicitly allowed
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml
<noscript>
  <div class="noscript-content">
    <h1>Company Name</h1>
    <p>Full page content rendered in plain HTML
    for crawlers that cannot execute JS.</p>
    <nav>
      <a href="/services">Services</a>
      <a href="/contact">Contact</a>
    </nav>
  </div>
</noscript>
# AI Websites — BlackAI Websites

## What this service does
We make company websites readable,
understandable, and citable by AI agents.

## Two stages
- **Stage 1: AI-Readable** - enhance existing
- **Stage 2: AI-Optimized** - build from scratch

## Technical components
- llms.txt (site-wide index)
- Per-page markdown files
- FAQPage JSON-LD schema
- AI bot rules in robots.txt
- Noscript fallbacks
// honest assessment
Stage 01 you can ship this sprint. llms.txt, schema, bot rules - straightforward. We are not gatekeeping. Stage 02+ is where the architecture compounds.
Markdown index at /llms.txt. First thing agents request.
Explicitly allow GPTBot, ClaudeBot, PerplexityBot. Most defaults block them.
Organization, FAQPage, Service. Agents parse structured data before prose.
Full HTML without JS. Your React SPA is invisible to most crawlers without this.
Every component, every route, every data structure optimised for machine parsing. Compounds fast.
RAG pipelines, vector DBs, model serving, governance. Different problem space entirely.
// system architecture
Not a sales deck diagram. Data and business model feed a governed core. Two outputs: AI-readable website for agent visibility, AI-driven website with real conversation. Human-in-the-loop throughout.
Self-hosted or cloud. Your infra or ours. EU-only if compliance requires it. Same architecture either way.
// fine-tuning pipeline
Foundation model, adapted to your domain. Not a system prompt on top of GPT-4. Actual fine-tuning - weights change, the model learns your language.
model:
  base: "meta-llama/Llama-3.1-8B"
  method: "qlora"
  rank: 64
  alpha: 128
  target_modules: ["q_proj", "v_proj", "k_proj"]

data:
  source: "client_knowledge_base"
  format: "instruction"
  validation_split: 0.1

training:
  epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  warmup_ratio: 0.03
  gradient_accumulation: 8

output:
  format: "safetensors"
  export: ["onnx", "gguf"]
  owner: "client"  # always
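One detail worth reading out of that config: `batch_size` and `gradient_accumulation` combine into the effective batch size the optimizer actually sees per update step. A quick sanity check in plain Python (config values copied from the block above into a dict):

```python
# Training values from the config above.
config = {
    "training": {
        "batch_size": 4,
        "gradient_accumulation": 8,
        "learning_rate": 2e-4,
    },
}

t = config["training"]

# Gradients are accumulated over 8 micro-batches of 4
# before each weight update.
effective_batch = t["batch_size"] * t["gradient_accumulation"]
print(effective_batch)  # 32
```

Small per-device batches with accumulation are the standard way to fit QLoRA training on a single GPU without shrinking the effective batch.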
# Data ingestion pipeline
from blackai_websites.pipeline import DataPipeline

pipe = DataPipeline(
    source="./client_docs",
    formats=["pdf", "docx", "md", "html"],
)

# Clean, chunk, embed
pipe.extract()
pipe.chunk(max_tokens=512, overlap=64)
pipe.embed(model="bge-large-en-v1.5")

# Build vector store
pipe.index(
    backend="qdrant",
    collection="client_knowledge",
)

# Fine-tune
pipe.finetune(
    config="training-config.yaml",
    gpu="A100-80GB",
)
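The `chunk(max_tokens=512, overlap=64)` step above splits each document into windows that overlap, so no fact is stranded on a chunk boundary. A minimal sketch of what a call like that might do (our own illustrative function, not the pipeline's internals):

```python
def chunk(tokens, max_tokens=512, overlap=64):
    """Split a token list into overlapping windows
    (hypothetical sketch of an overlap-chunking step)."""
    step = max_tokens - overlap  # advance 448 tokens per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # last window already reached the end
    return chunks

tokens = list(range(1200))  # stand-in for a tokenized document
pieces = chunk(tokens)
print(len(pieces))   # 3
print(pieces[1][0])  # 448 — each window re-reads the previous 64 tokens
```

The 64-token overlap means a sentence cut at position 500 still appears whole in the next window.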
PDFs, Docs, MD, HTML, DBs. Extract, clean, chunk into training-ready format.
QLoRA on your corpus. Weights change. The model actually learns your domain.
vLLM or TGI. REST API, monitoring, auto-scaling. Your infra or EU cloud. ONNX/GGUF/safetensors export.
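Both vLLM and TGI can expose an OpenAI-compatible chat completions route, so the served model is queried with an ordinary HTTP POST. A sketch of the request shape (the base URL and served model name are placeholders; the request is built but not sent, since that needs a running server):

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # placeholder for your deployment

payload = {
    "model": "client-llama-3.1-8b",  # hypothetical served model name
    "messages": [
        {"role": "user", "content": "What services do you offer?"},
    ],
    "max_tokens": 256,
}

req = Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.full_url)
```

Because the route matches the OpenAI API shape, existing client SDKs point at your own endpoint with a one-line base-URL change — which is what makes "walk away with everything" practical.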
// data sovereignty
Not marketing. Architectural constraints baked into every deployment.
# Data residency
data_location: "client_infrastructure"
cloud_option: "eu-west-1"  # optional
data_egress: "none"
third_party_access: "none"
Your data never leaves your infrastructure unless you explicitly choose a cloud deployment. Even then: EU-only, encrypted at rest and in transit.
# Model isolation
training_data: "client_only"
cross_client_training: false
data_pooling: false
opt_in_sharing: false  # not even optional
We never use your data to train models for other clients. Not by default, not by opt-in, not ever. Your model is yours.
# Ownership
model_weights: "client"
api_keys: "client"
config: "client"
vendor_lock_in: false
export_format: "standard"  # ONNX, safetensors
You own the weights, the config, the API keys. Standard export formats. No vendor lock-in. Walk away with everything.
// view source
Same stack we deploy for clients. Zero cookies, self-hosted fonts, full AI-readability. Inspect it.
// the full picture
Most companies sit at stage 0. Stage 01 is a weekend project. Stage 02+ is where architecture decisions compound.
BlackAI Websites operates within a group of specialized companies. Each brings focused expertise — from AI research and data infrastructure to software engineering and capital.
Private AI venture club. 16 portfolio companies across research, fintech, energy, healthcare, and data infrastructure.
Applied AI research and development. AI architecture, model evaluation, and enterprise-grade AI systems.
Data infrastructure, analytics, and AI-driven energy market intelligence.
Software engineering and AI system development. Full-stack architecture for AI-native applications.
AI-powered compliance review for financial services providers. Regulatory audits for FINMA, BaFin, and FMA requirements.
AI valuation, due diligence, enterprise AI integration, and capital readiness advisory. Grounded in peer-reviewed research.
$ git log --oneline -1
🤩 reviewed vendor, LGTM
Approve the PR. Or just forward this page back.