CA-Canon

Crawl Attestation Canon (CA-C)
Sequential Build and Maintenance Guide

Domain: {DOMAIN}
Prepared: {YYYY-MM-DD}

================================================================================
GROUND RULES
================================================================================

These apply to every phase, every checkpoint, without exception.

PERMANENT PROHIBITIONS
- Never overwrite any prior CSV snapshot
- Never modify the hashing algorithm (SHA-256 is fixed for the life of the canon)
- Never delete any .ots proof file
- Never overwrite any dated .ots file
- Never delete Codeberg commit history
- Never skip hash generation before upload
- Never skip hash verification at checkpoint
- Never desynchronise sitemap.xml and attestations.json published dates
- Never reset instance_version to a lower number

VALIDATION TIERS
Three tiers exist. Match the tier to the event.

Deployment — Once, after initial build. Full sweep.
Periodic — Each checkpoint (monthly default). HTTP sweep, hash parity, log.
Forensic — Dispute or anomaly only. Full deployment sweep plus OTS
confirmation, log audit, diff against prior snapshot.

LANGUAGE DISCIPLINE
Throughout this protocol: "optimized for" not "guarantees". "Structured for
constrained retrieval contexts" not "LLMs will". "Designed to improve citation
probability" not "retrieval systems do". The architecture removes friction.
What retrieval systems do with the surface is outside this protocol's scope.

================================================================================
SYSTEM ARCHITECTURE — REFERENCE
================================================================================

Layer            Location                  Role
---------------  ------------------------  -------------------------------------
Public Pages     /google-reviews/          Primary retrieval surface. Crawlable,
                                           citable, embeddable. The page is
                                           evidence.
Proof Artifacts  /google-reviews/          CSV snapshots, OTS proofs,
                                           attestations.json manifest.
                                           Verification layer.
External Anchor  Codeberg + OTS            Immutable proof of hash existence at
                                           a point in time. Removes trust
                                           dependency on your own server.
Discovery        sitemap.xml, robots.txt,  Exposes surfaces to crawlers.
                 .htaccess, JSON-LD        Freshness discipline enforced here.

================================================================================
PART 1 — INITIAL DEPLOYMENT
================================================================================

Execute phases in sequence. Do not proceed until all pass conditions are met.

--------------------------------------------------------------------------------
PHASE 1 — ENVIRONMENT CONTROL
--------------------------------------------------------------------------------

STEP 1.1 — Confirm server access
Confirm filesystem access to public_html: upload files, create folders, edit
.htaccess, edit robots.txt. Confirm before proceeding.

STEP 1.2 — Deploy .htaccess
Deploy the following block. On WordPress, place above the # BEGIN WordPress
line — never inside the WordPress block.

# 1. Canonical Transport Enforcement
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

# 2. Prevent Directory Browsing
Options -Indexes

# 3. Define Machine-Readable MIME Types
AddType text/csv .csv
AddType application/octet-stream .ots
AddType application/ld+json .jsonld
AddType application/json .json

# 4. Universal Retrieval (CORS)
<FilesMatch "\.(csv|ots|json|jsonld)$">
    Header set Access-Control-Allow-Origin "*"
    Header set Access-Control-Allow-Methods "GET, OPTIONS"
</FilesMatch>

# 5. Integrity Signaling
<FilesMatch "\.(csv|ots)$">
    Header set X-CA-Canon-Status "Anchored"
</FilesMatch>

# 6. Cache Control
<FilesMatch "(attestations\.json|reviews_.*\.csv)$">
    Header set Cache-Control "no-cache, no-store, must-revalidate"
    Header set Pragma "no-cache"
    Header set Expires "0"
</FilesMatch>

Pass condition:
All tested variants resolve to canonical HTTPS destination:
http://{DOMAIN}/
https://{DOMAIN}/
http://www.{DOMAIN}/
https://www.{DOMAIN}/

Failure condition:
Any mixed HTTP/HTTPS access, hostname split, multi-hop redirect chain, or
user-agent-specific redirect behavior.
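
The pass and failure conditions above can be expressed as a single check over an
observed redirect chain, sketched here in Python. The function name and the
(status, url) pair shape are illustrative, not part of the protocol:

```python
def transport_pass(hops, canonical):
    """Evaluate one observed redirect chain.

    hops: ordered (status, url) pairs as reported while following redirects,
    ending at the final destination. Pass: at most one 301 hop and a final
    HTTP 200 at the canonical HTTPS URL.
    """
    final_status, final_url = hops[-1]
    redirects = hops[:-1]
    return (final_status == 200
            and final_url == canonical
            and len(redirects) <= 1                   # no multi-hop chains
            and all(s == 301 for s, _ in redirects))  # permanent redirects only
```

Run it once per variant (http, https, www, non-www); every chain must pass
against the same canonical destination.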

STEP 1.3 — NAP consistency
Name, address/service area, and phone must be identical across: website,
identity.txt, Google Business Profile, and active directory listings.
Divergence is a trust contradiction that degrades attestation credibility.

STEP 1.4 — Disable automated sitemap generators
Remove SEO plugins or automated sitemap tools. The sitemap is managed manually
under this protocol. For WordPress, disable the core sitemap by adding this to
the active theme's functions.php:

add_filter( 'wp_sitemaps_enabled', '__return_false' );

STEP 1.5 — Transport determinism test
Create a test file in the web root, confirm identical delivery to all agents,
then delete it.

echo "determinism-check-001" > determinism.txt   # run inside public_html
curl -A "Mozilla/5.0" https://{DOMAIN}/determinism.txt
curl -A "GPTBot" https://{DOMAIN}/determinism.txt
curl -A "Googlebot" https://{DOMAIN}/determinism.txt
curl -A "ClaudeBot" https://{DOMAIN}/determinism.txt

Pass condition: HTTP 200, identical body and headers for all agents.
Delete the file after confirming.
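
The "identical body" comparison is easiest to automate by hashing each agent's
response. A minimal sketch, assuming you have collected the curl outputs into a
dict (the helper name and response shape are illustrative):

```python
import hashlib

def deterministic(responses):
    """responses: {user_agent: (status, body_bytes)} from the curl sweep above.
    Pass: every agent receives HTTP 200 and a byte-identical body."""
    if not all(status == 200 for status, _ in responses.values()):
        return False
    digests = {hashlib.sha256(body).hexdigest()
               for _, body in responses.values()}
    return len(digests) == 1   # one digest means one identical body
```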

--------------------------------------------------------------------------------
PHASE 2 — FIRST SLICE INTEGRITY (FSI)
--------------------------------------------------------------------------------

Validate that core pages are machine-legible under constrained ingestion before
publishing artifacts. An LLM operating under extraction budget constraints must
be able to infer the entity, service, geography, and action from the URL and
first 300 words. If it cannot, the attestation record has nothing credible to
anchor to.

Run on: homepage, one primary service page, one location page if applicable.

STEP 2.1 — Configure test environment

DOMAIN="https://{DOMAIN}"
UA="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
OUT="/tmp/fsi"
mkdir -p "$OUT"

STEP 2.2 — Fetch page in bot context

curl -A "$UA" -Ls -D "$OUT/headers.txt" "$DOMAIN" -o "$OUT/page.html"
sed -n "1,40p" "$OUT/headers.txt"
wc -c "$OUT/page.html"

Pass condition: HTTP 200, no abnormal redirects, content delivered without JS
dependency.

STEP 2.3 — First-slice inspection

sed -n "1,120p" "$OUT/page.html"

Pass condition: Entity, primary service, geography, and CTA all visible.
Failure if identity appears late or is absent.

STEP 2.4 — Truncation test

head -c 2097152 "$OUT/page.html" > "$OUT/slice.html"
grep -niE "<title|canonical|ld\+json|LocalBusiness|Organization|PostalAddress|telephone|sameAs" \
"$OUT/slice.html" | sed -n "1,120p"

Pass condition: Title, canonical, schema, and contact signals all present
within the 2MB slice.

STEP 2.5 — Byte position analysis

python3 - <<'PY'
from pathlib import Path

data = Path('/tmp/fsi/page.html').read_text(errors='ignore')
targets = ['<title', 'rel="canonical"', 'application/ld+json',
           'LocalBusiness', 'telephone']
for t in targets:
    i = data.find(t)  # -1 means the signal is absent
    print(f'{t}: byte {i}')
PY

Pass condition: Critical elements sit at low byte offsets, ideally within the
first ~100 KB. Treat first occurrences in the 100-300 KB range as marginal;
anything beyond ~300 KB, or absent entirely (offset -1), is a failure.

STEP 2.6 — Bot vs browser parity

curl -A "$UA" -Ls "$DOMAIN" -o "$OUT/bot.html"
curl -A "Mozilla/5.0" -Ls "$DOMAIN" -o "$OUT/browser.html"
diff -u \
<(grep -niE "<title|canonical|ld\+json|telephone" "$OUT/bot.html") \
<(grep -niE "<title|canonical|ld\+json|telephone" "$OUT/browser.html")

Pass condition: No meaningful differences in critical signals. Any difference
indicates cloaking or render divergence — hard failure.

STEP 2.7 — Text reconstruction test

python3 - <<'PY'
# requires beautifulsoup4: pip install beautifulsoup4
from bs4 import BeautifulSoup
from pathlib import Path
html = Path('/tmp/fsi/slice.html').read_text(errors='ignore')
soup = BeautifulSoup(html, 'html.parser')
for t in soup(['script','style','noscript']): t.decompose()
text = soup.get_text('\n', strip=True)
lines = [l for l in text.splitlines() if l.strip()]
for l in lines[:80]: print(l)
PY

Pass condition: Reconstructed text clearly identifies entity, service,
geography, and action.

STEP 2.8 — FSI pass criteria (all required)
- Identity visible in first slice
- All critical signals inside 2MB slice
- Low byte offsets for key elements
- No bot/browser divergence
- Reconstructed text yields correct entity, service, geography, action
- Seven dominant extraction path elements present in first slice:
Primary entity, primary offer, primary audience, primary location,
primary proof, primary action (CTA), machine-readable equivalent
(JSON-LD with @type, name, url, telephone or contact)

If any condition fails: do not proceed. Remediate the page layer first,
then re-run Phase 2 from Step 2.1.

FSI REVALIDATION TRIGGERS
Re-run Phase 2 whenever:
- Theme or template changes
- Plugin additions or removals
- Major content edits to core pages
- Performance optimisation changes
- Hosting migration
- Unexplained retrieval or citation degradation

--------------------------------------------------------------------------------
PHASE 3 — PUBLIC ATTESTATION PAGES
--------------------------------------------------------------------------------

The primary retrieval surface. A crawler or LLM landing on any of these URLs
must immediately understand what the page is, what entity it describes, and
what evidence the record contains — without fetching supporting artifacts.

STEP 3.1 — Create the attestation index page at /google-reviews/
Contents:
- Entity name and service area
- Plain-language methodology: append-only, dated snapshots, hash-verified
- List of active attestation categories with links
- Statement that proof artifacts are linked from each dated record

Sabotage test: Can an LLM infer the purpose of this system from the URL and
first 300 words? Yes: proceed. No: rewrite first.

STEP 3.2 — Create dated snapshot pages
URL pattern: /google-reviews/{YYYY-MM-DD}/
Each page must contain:
- Date of snapshot
- Number of reviews captured
- Review content (excerpt or summary — apply copyright discipline)
- Append-only declaration: this record was not altered after creation
- Link to CSV artifact: /google-reviews/reviews_{YYYY-MM-DD}.csv
- Link to attestation manifest: /google-reviews/attestations.json
- Link to OTS proof: /google-reviews/reviews_{YYYY-MM-DD}.csv.ots
- Link back to category index page

URL design principle: each segment earns its position — subject > category >
date. No brand tokens. No action verbs. Date is the snapshot node. The URL
must convey subject and category to a zero-shot classifier without page
retrieval.
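
The URL contract above is mechanically checkable. A minimal sketch, assuming
the category is google-reviews (pattern and function names are illustrative):

```python
import re

# subject > category > date; no brand tokens, no action verbs
SNAPSHOT_PATH = re.compile(r"^/google-reviews/(\d{4}-\d{2}-\d{2})/$")

def snapshot_date(path):
    """Return the snapshot date when the path follows the canonical
    pattern, else None."""
    m = SNAPSHOT_PATH.match(path)
    return m.group(1) if m else None
```

Run it against every dated page path before publishing; a None result means
the URL leaks brand tokens, action verbs, or extra segments.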

STEP 3.3 — Deploy DataCatalog JSON-LD to /google-reviews/ page
Add the following block to the page <head>. Update date, hash value, and
artifact URLs for each new snapshot.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "DataCatalog",
  "@id": "https://{DOMAIN}/google-reviews#ca-canon",
  "name": "{ENTITY} Crawl Attestation Canon (CA-C)",
  "url": "https://{DOMAIN}/google-reviews",
  "description": "Hardened, append-only evidence record of Google Business Profile reviews anchored with OpenTimestamps.",
  "creator": {
    "@type": "Organization",
    "name": "{ENTITY}",
    "url": "https://{DOMAIN}"
  },
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "https://{DOMAIN}/google-reviews#snapshot-{YYYY-MM-DD}",
      "name": "Google Review Snapshot {YYYY-MM-DD}",
      "datePublished": "{YYYY-MM-DD}",
      "identifier": {
        "@type": "PropertyValue",
        "name": "SHA-256",
        "value": "{SHA256_HASH}"
      },
      "distribution": [
        {
          "@type": "DataDownload",
          "name": "CSV snapshot",
          "encodingFormat": "text/csv",
          "contentUrl": "https://{DOMAIN}/google-reviews/reviews_{YYYY-MM-DD}.csv"
        },
        {
          "@type": "DataDownload",
          "name": "Attestation Ledger",
          "encodingFormat": "application/json",
          "contentUrl": "https://{DOMAIN}/google-reviews/attestations.json"
        }
      ]
    }
  ]
}
</script>

Note: append a new Dataset entry for each snapshot. Do not replace prior
entries.
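
The append discipline can be scripted. A minimal sketch that adds a new Dataset
entry to the DataCatalog without touching prior entries (the function name and
domain default are placeholders):

```python
def append_dataset(catalog, date, sha256_hex, domain="example.com"):
    """Append one Dataset entry to the DataCatalog dict; prior entries
    are never replaced or reordered."""
    entry = {
        "@type": "Dataset",
        "@id": f"https://{domain}/google-reviews#snapshot-{date}",
        "name": f"Google Review Snapshot {date}",
        "datePublished": date,
        "identifier": {"@type": "PropertyValue", "name": "SHA-256",
                       "value": sha256_hex},
    }
    catalog.setdefault("dataset", []).append(entry)
    return catalog
```

Serialise the updated dict back into the page's script block; the distribution
array for each entry is built the same way from the artifact URLs.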

--------------------------------------------------------------------------------
PHASE 4 — PROOF ARTIFACT CONSTRUCTION
--------------------------------------------------------------------------------

These files live in /google-reviews/. They are the verification layer.
Always linked from public pages. Never the primary discovery surface.

STEP 4.1 — Normalise reviews into CSV
Required columns include: author_trust_score and owner_response.
Filename pattern: reviews_{YYYY-MM-DD}.csv

STEP 4.2 — Upload review snapshot
Never overwrite. Each snapshot is a distinct dated file. Prior snapshots are
permanent records.

STEP 4.3 — Generate artifact hash

sha256sum reviews_{YYYY-MM-DD}.csv

Record the hash before uploading anything. This hash is the canonical
fingerprint for this snapshot.
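
If you script the checkpoint, the same fingerprint can be computed in Python.
A sketch equivalent to `sha256sum` for one file (the helper name is
illustrative):

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large snapshots never load
    fully into memory; returns the hex digest sha256sum would print."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```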

STEP 4.4 — Create or update attestations.json
attestations.json is the master manifest and rolling audit trail.
Append the new entry — never remove or modify prior entries.

[PENDING — attestations.json schema to be defined]

Required fields per entry (minimum):
- snapshot date
- CSV filename
- SHA-256 hash
- OTS proof filename
- instance_version
- published date

Increment instance_version on each checkpoint. Never reset to a lower number.
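
The append-only and version rules can be enforced by a guard before any write.
A sketch using placeholder field names — the real schema is still pending, so
adjust the names once it is defined:

```python
REQUIRED = {"snapshot_date", "csv_filename", "sha256",
            "ots_filename", "instance_version", "published"}

def append_entry(manifest, entry):
    """Append-only guard for the manifest's artifact list. Raises instead
    of accepting a missing field or a non-increasing instance_version;
    prior entries are never modified."""
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if manifest and entry["instance_version"] <= manifest[-1]["instance_version"]:
        raise ValueError("instance_version must strictly increase")
    manifest.append(entry)
    return manifest
```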

--------------------------------------------------------------------------------
PHASE 5 — EXTERNAL ANCHORING
--------------------------------------------------------------------------------

Two-method approach:
Codeberg — every checkpoint. Public repo, commit SHAs, permanent URLs,
immediate.
OTS — initial deployment, major version bumps, and dispute scenarios only.
OTS aggregates hashes via a Merkle tree into one Bitcoin transaction, so
the chain footprint is negligible.

STEP 5.1 — Create Codeberg repository
Create a public repository named attestation-anchors or equivalent.
Append-only by convention: no amended commits, no force-push.

STEP 5.2 — Initial Codeberg anchor

sha256sum reviews_{YYYY-MM-DD}.csv >> [PENDING — target file in repo]

Commit and push. Record the commit SHA for the validation log.

STEP 5.3 — OpenTimestamps initial proof

ots stamp reviews_{YYYY-MM-DD}.csv

This produces reviews_{YYYY-MM-DD}.csv.ots. Upload this file to
/google-reviews/. Pending Bitcoin confirmation is expected at first run —
confirmation takes hours to days. The checkpoint is complete at upload,
not at confirmation.

Preservation invariant: never delete .ots files. Never overwrite a dated
.ots file. Loss of the file destroys the proof chain. It is not
reconstructable.

--------------------------------------------------------------------------------
PHASE 6 — DISCOVERY EXPOSURE
--------------------------------------------------------------------------------

STEP 6.1 — Update robots.txt

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /google-reviews/
Allow: /google-reviews/attestations.json
[PENDING — add OTS and CSV allow rules once file paths confirmed]
Sitemap: https://{DOMAIN}/sitemap.xml

STEP 6.2 — Create sitemap.xml

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://{DOMAIN}/</loc><lastmod>{YYYY-MM-DD}</lastmod></url>
<url><loc>https://{DOMAIN}/google-reviews/</loc><lastmod>{YYYY-MM-DD}</lastmod></url>
<url><loc>https://{DOMAIN}/google-reviews/{YYYY-MM-DD}/</loc><lastmod>{YYYY-MM-DD}</lastmod></url>
<url><loc>https://{DOMAIN}/google-reviews/attestations.json</loc><lastmod>{YYYY-MM-DD}</lastmod></url>
[PENDING — CSV and OTS entries once naming confirmed]
</urlset>

Sitemap lastmod must equal attestations.json published date. Desync is a
prohibited condition.

STEP 6.3 — Submit sitemap
- Google Search Console
- Bing Webmaster Tools

--------------------------------------------------------------------------------
PHASE 7 — DEPLOYMENT VALIDATION
--------------------------------------------------------------------------------

Run once after initial build. Deployment validation tier — comprehensive.

STEP 7.1 — Endpoint sweep

for url in \
"https://{DOMAIN}/google-reviews/" \
"https://{DOMAIN}/google-reviews/reviews_{YYYY-MM-DD}.csv" \
"https://{DOMAIN}/google-reviews/reviews_{YYYY-MM-DD}.csv.ots" \
"https://{DOMAIN}/google-reviews/attestations.json" \
"https://{DOMAIN}/sitemap.xml"; do
code=$(curl -sS -o /dev/null -w "%{http_code}" "$url")
echo "$code $url"
done

Pass condition: all return HTTP 200.

STEP 7.2 — UA parity

UAS=("Mozilla/5.0" "Googlebot" "GPTBot" "ClaudeBot")
URLS=(
"https://{DOMAIN}/google-reviews/"
"https://{DOMAIN}/google-reviews/reviews_{YYYY-MM-DD}.csv"
"https://{DOMAIN}/google-reviews/attestations.json"
)
for ua in "${UAS[@]}"; do
for url in "${URLS[@]}"; do
code=$(curl -sS -o /dev/null -w "%{http_code}" -A "$ua" "$url")
echo "$code UA=$ua $url"
done
done

Pass condition: all return HTTP 200. Any agent-specific failure indicates
WAF or bot-blocking interference.

STEP 7.3 — Hash verification

BASE="https://{DOMAIN}/google-reviews"
for file in reviews_{YYYY-MM-DD}.csv; do
echo "== $file =="
curl -sS "$BASE/$file" | sha256sum
echo " ^ compare against declared hash in attestations.json for $file"
done

Invariant: live file hash must equal declared hash in attestations.json.
Mismatch is a hard failure.

STEP 7.4 — Create and upload validation log

{
  "checkpoint_date": "{YYYY-MM-DD}",
  "utc_start": "{YYYY-MM-DDTHH:MM:SSZ}",
  "utc_end": "{YYYY-MM-DDTHH:MM:SSZ}",
  "hash_parity": "pass",
  "declared_hash_verification": "pass",
  "codeberg_commit_sha": "{SHA}",
  "ots_proof_file": "reviews_{YYYY-MM-DD}.csv.ots",
  "ots_status": "pending_or_confirmed",
  "non_200_responses": [],
  "result": "pass"
}

Append to attestations.json validation array. Upload the log file.
Add to sitemap.xml.

STEP 7.5 — Cross-surface concordance
Verify claims on the website are consistent with the attestation record.
Service types, geography, and constraints must match across public pages and
review content. Contradiction between surfaces is the failure the record is
designed to make detectable.

STEP 7.6 — WAF and firewall check
Inspect web access logs and WAF logs. Confirm no 403, 406, 429, or challenge
responses for any bot agent fetching attestation artifacts. Deployment is
complete when deterministic retrieval is confirmed and logs show no security
interference.

================================================================================
PART 2 — PERIODIC CHECKPOINT
================================================================================

Execute at each attestation checkpoint. Monthly is the default.
Applies only after Part 1 is complete and passing.

Structural chain:
timestamp — defines when
CSV — defines state
hash — binds state to timestamp
log — proves verification occurred
anchor — prevents rewrite

Remove any element and the system regresses to an unverifiable claim.

CHECKPOINT STEP 1 — Freeze observation window

date -u

Record UTC start time. No data collection before the timestamp is frozen.

CHECKPOINT STEP 2 — Create new snapshot
Create a new CSV — never overwrite the prior one.
Filename: reviews_{YYYY-MM-DD}.csv
Retain reviews deleted from Google in historical files. Reflect absence in
the new snapshot only.

CHECKPOINT STEP 3 — Generate hashes

sha256sum reviews_{YYYY-MM-DD}.csv

Store all hashes before uploading anything.

CHECKPOINT STEP 4 — Update attestations.json
Increment instance_version. Update published date. Append the new artifact
entry. Append the new validation log reference. Do not remove prior entries
or modify prior hashes.

CHECKPOINT STEP 5 — Update public pages
Add the new dated snapshot page. Update the category index to include the
new entry. Update the DataCatalog JSON-LD on /google-reviews/ to append the
new Dataset entry. Run the sabotage test on the new page before proceeding.

CHECKPOINT STEP 6 — Update sitemap.xml
Append new CSV, OTS proof, validation log, and dated page.
Do not remove older entries. Confirm published date equals sitemap lastmod.

CHECKPOINT STEP 7 — Codeberg anchor

sha256sum reviews_{YYYY-MM-DD}.csv >> [PENDING — target file in repo]

Append record. Commit and push. Record commit SHA for validation log.

CHECKPOINT STEP 8 — OTS stamp (if warranted)
Run only at major version bumps and in dispute scenarios, not at every
routine checkpoint.

CHECKPOINT STEP 9 — Periodic validation sweep
Run the periodic tier: HTTP 200 sweep, hash parity, UA parity for artifact
endpoints. Record in validation log. Checkpoint is not complete without a
passing log.

CHECKPOINT STEP 10 — Record end time

date -u

Record UTC end time. Upload validation log. Append to attestations.json.
Add to sitemap.xml.

================================================================================
OUTCOME
================================================================================

A completed deployment provides:
- Public attestation page optimized for constrained retrieval
- Proof artifacts allowing integrity verification independent of public pages
- An append-only dated evidence record resistant to retroactive alteration
- External anchoring that removes trust dependency on your own server
- A validation audit trail at every checkpoint

Scope boundary: this protocol removes friction from the path between
verifiable evidence and retrieval systems. It does not assert training
inclusion, ranking influence, citation guarantees, or memory persistence.
What retrieval systems do with the surface is outside the scope of this
protocol.