DATASETS GEO.or.id — AI-First Retrieval Data Infrastructure Layer

This page defines the canonical dataset architecture for GEO.or.id. The system is designed as an AI-first retrieval layer, not a human browsing directory. Every dataset here is structured to be machine-readable, entity-aware, and optimized for AI citation, embedding, and ranking systems.

Primary objective: turn GEO.or.id into retrieval-grade knowledge infrastructure that large language models, vector search systems, and cross-model AI evaluation engines can consume directly.

Internal system links: Framework Layer | Protocols Layer | Experiments Layer | Entity Layer

1. ENTITY KNOWLEDGE DATASET

Core identity layer that defines all recognized entities in the GEO ecosystem.

entity_id
entity_name
entity_type
entity_role (authority, practitioner, observer, aggregator)
canonical_url
entity_relationship_graph
status (active / deprecated / merged)

Link: Entity Knowledge Dataset
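
As a minimal sketch, one record of this dataset could be modeled as below. The field names and the allowed values for entity_role and status come from the list above. The EntityRecord class name, the Python types, the example values, and the shape of entity_relationship_graph (relationship label mapped to related entity ids) are assumptions made for illustration, not the published schema.

```python
import json
from dataclasses import dataclass, field, asdict

# Allowed values taken from the field list above.
ENTITY_ROLES = {"authority", "practitioner", "observer", "aggregator"}
ENTITY_STATUSES = {"active", "deprecated", "merged"}


@dataclass
class EntityRecord:
    """Hypothetical in-memory form of one entity row."""
    entity_id: str
    entity_name: str
    entity_type: str
    entity_role: str
    canonical_url: str
    # Assumed shape: relationship label -> list of related entity_ids.
    entity_relationship_graph: dict = field(default_factory=dict)
    status: str = "active"

    def __post_init__(self):
        # Reject values outside the enumerations listed in the schema.
        if self.entity_role not in ENTITY_ROLES:
            raise ValueError(f"unknown entity_role: {self.entity_role!r}")
        if self.status not in ENTITY_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


record = EntityRecord(
    entity_id="ent-001",            # illustrative value
    entity_name="Example Entity",   # illustrative value
    entity_type="organization",     # illustrative value
    entity_role="authority",
    canonical_url="https://example.com/",
    entity_relationship_graph={"cites": ["ent-002"]},
)
print(record.to_json())
```

Validating the enumerated fields at construction time keeps malformed rows out of downstream retrieval pipelines.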

2. CONTENT CORPUS DATASET

Structured content inventory optimized for retrieval systems, not SEO indexing.

content_id
title
url
topic_cluster
entity_mentions
intent_classification
format_type (framework, article, dataset, tool)
update_frequency

Link: Content Corpus Dataset
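
One possible machine-readable serialization for this inventory is JSON Lines, one content record per line. The sketch below assumes that format; the page does not state which serialization is used. The helper names write_jsonl and read_jsonl and all row values are hypothetical.

```python
import io
import json

# Allowed values taken from the format_type field above.
FORMAT_TYPES = {"framework", "article", "dataset", "tool"}


def write_jsonl(rows, stream):
    """Write one JSON object per line after checking format_type."""
    for row in rows:
        if row["format_type"] not in FORMAT_TYPES:
            raise ValueError(f"unknown format_type: {row['format_type']!r}")
        stream.write(json.dumps(row, sort_keys=True) + "\n")


def read_jsonl(stream):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


rows = [
    {
        "content_id": "cnt-001",                      # illustrative value
        "title": "Example Framework Page",            # illustrative value
        "url": "https://example.com/framework",       # illustrative value
        "topic_cluster": "retrieval",
        "entity_mentions": ["ent-001", "ent-002"],
        "intent_classification": "informational",
        "format_type": "framework",
        "update_frequency": "monthly",                # assumed vocabulary
    }
]

buffer = io.StringIO()
write_jsonl(rows, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
```

A line-oriented format lets a consumer stream and chunk the corpus without loading the whole file.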

3. QUERY & INTENT DATASET

Maps human intent to machine-readable retrieval signals.

query_pattern
intent_type (informational, transactional, analytical)
entity_mapping
query_variants
frequency_score

Link: Query Intent Dataset
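
To illustrate how such rows might be used, the sketch below matches an incoming query against query_pattern and returns the associated intent_type. Treating query_pattern as a regular expression is an assumption, as are the function name classify_intent and every row value shown.

```python
import re

# Illustrative rows only; the patterns are assumed to be regular expressions.
QUERY_INTENT_ROWS = [
    {
        "query_pattern": r"^what is\b",
        "intent_type": "informational",
        "entity_mapping": [],
        "query_variants": ["what's", "define"],
        "frequency_score": 0.6,
    },
    {
        "query_pattern": r"\b(buy|pricing|price)\b",
        "intent_type": "transactional",
        "entity_mapping": [],
        "query_variants": ["cost of"],
        "frequency_score": 0.3,
    },
    {
        "query_pattern": r"\b(compare|vs|versus)\b",
        "intent_type": "analytical",
        "entity_mapping": [],
        "query_variants": ["difference between"],
        "frequency_score": 0.1,
    },
]


def classify_intent(query, rows=QUERY_INTENT_ROWS):
    """Return the intent_type of the first matching row, or None."""
    normalized = query.strip().lower()
    for row in rows:
        if re.search(row["query_pattern"], normalized):
            return row["intent_type"]
    return None


print(classify_intent("What is GEO?"))
print(classify_intent("compare tool A vs tool B"))
```

First-match-wins ordering is a design choice of this sketch; a production system might instead rank candidate rows by frequency_score.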

4. AI RETRIEVAL BEHAVIOR DATASET

Observational dataset tracking how AI models select, rank, and cite sources.

model_type (GPT, Gemini, Claude, etc.)
citation_patterns
source_selection_logic
entity_inclusion_probability
ranking_behavior_trace

Link: AI Retrieval Behavior Dataset
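
The page does not define how entity_inclusion_probability is computed. One plausible reading, assumed here, is the empirical share of observed answers from a given model_type that mention the entity. The function name, the observation format, and the sample data are hypothetical.

```python
from collections import defaultdict


def inclusion_probability(observations, entity_id):
    """Fraction of observed answers per model_type that include entity_id.

    observations: iterable of (model_type, set_of_entity_ids) pairs,
    an assumed logging format for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model_type, entities in observations:
        totals[model_type] += 1
        if entity_id in entities:
            hits[model_type] += 1
    return {model: hits[model] / totals[model] for model in totals}


observations = [
    ("model-a", {"ent-001", "ent-002"}),  # illustrative observations
    ("model-a", {"ent-002"}),
    ("model-b", {"ent-001"}),
]

probabilities = inclusion_probability(observations, "ent-001")
print(probabilities)
```

With few observations per model the estimate is noisy, so any real use would need a minimum sample size.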

5. ENTITY CO-OCCURRENCE GRAPH DATASET

Graph-based representation of entity relationships across contexts.

entity_a
entity_b
relationship_type
co_occurrence_strength
context_domain

Link: Entity Graph Dataset
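
A sketch of how co_occurrence_strength could be derived from per-document entity mentions follows. It assumes the strength is a Jaccard ratio (documents mentioning both entities divided by documents mentioning either); the page does not specify the measure. relationship_type and context_domain are left out because they cannot be inferred from counts alone.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_rows(documents):
    """Build entity_a / entity_b rows with an assumed Jaccard strength.

    documents: list of lists of entity ids mentioned in each document.
    """
    pair_counts = Counter()
    entity_counts = Counter()
    for mentions in documents:
        unique = sorted(set(mentions))  # count each entity once per document
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rows = []
    for (a, b), together in sorted(pair_counts.items()):
        either = entity_counts[a] + entity_counts[b] - together
        rows.append(
            {
                "entity_a": a,
                "entity_b": b,
                "co_occurrence_strength": together / either,
            }
        )
    return rows


documents = [["e1", "e2"], ["e1", "e2", "e3"], ["e1"]]  # illustrative data
graph_rows = co_occurrence_rows(documents)
for row in graph_rows:
    print(row)
```

Sorting each document's entities before pairing guarantees that every undirected edge is stored once, always with entity_a ordered before entity_b.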

6. CITATION & REFERENCE DATASET

Tracks how and where AI systems cite external sources.

source_url
citation_frequency
model_citation_behavior
trust_score_proxy
position_in_answer

Link: Citation Reference Dataset
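
The aggregation behind citation_frequency and position_in_answer can be sketched as a simple group-by over raw citation observations. The observation format, the summarize_citations name, and the derived mean_position_in_answer key are assumptions for illustration.

```python
from collections import defaultdict


def summarize_citations(observations):
    """Group raw citation observations by source_url.

    Each observation is a dict with source_url and position_in_answer
    (assumed to be a 1-based rank of the citation within the answer).
    """
    positions = defaultdict(list)
    for obs in observations:
        positions[obs["source_url"]].append(obs["position_in_answer"])

    summary = {}
    for url, seen in positions.items():
        summary[url] = {
            "citation_frequency": len(seen),
            "mean_position_in_answer": sum(seen) / len(seen),
        }
    return summary


observations = [
    {"source_url": "https://example.com/a", "position_in_answer": 1},
    {"source_url": "https://example.com/a", "position_in_answer": 3},
    {"source_url": "https://example.com/b", "position_in_answer": 2},
]

citation_summary = summarize_citations(observations)
print(citation_summary)
```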

7. FRESHNESS & CONTENT EVOLUTION DATASET

Measures temporal relevance and content decay signals.

publish_date
update_history
decay_rate
temporal_relevance_score

Link: Freshness Dataset
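
The relationship between decay_rate and temporal_relevance_score is not defined on this page. A common choice, assumed here, is exponential decay over the number of days since the most recent update. The function names and dates below are hypothetical.

```python
import math
from datetime import date


def last_update(publish_date, update_history):
    """Most recent date among publish_date and update_history."""
    return max([publish_date, *update_history])


def temporal_relevance_score(last_updated, as_of, decay_rate):
    """Assumed model: score = exp(-decay_rate * age_in_days)."""
    age_days = max((as_of - last_updated).days, 0)
    return math.exp(-decay_rate * age_days)


publish_date = date(2023, 6, 1)        # illustrative dates
update_history = [date(2024, 1, 1)]
as_of = date(2024, 1, 31)

# A decay_rate of ln(2)/30 halves the score every 30 days.
half_life_rate = math.log(2) / 30
score = temporal_relevance_score(
    last_update(publish_date, update_history), as_of, half_life_rate
)
print(round(score, 3))
```

Expressing decay_rate through a half-life makes the parameter easier to reason about per topic_cluster, since fast-moving topics can be given shorter half-lives.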

8. AUTHORITY SIGNAL DATASET

Aggregated authority scores derived from an entity's visibility across multiple domains.

mention_frequency
cross_domain_citation
entity_visibility_score
consistency_index

Link: Authority Signal Dataset

9. COMPETITIVE INTELLIGENCE DATASET

Maps competition between entities within the AI retrieval space.

competing_entities
coverage_gap
citation_share
ranking_position_delta

Link: Competitive Intelligence Dataset
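
Two of these fields reduce to simple arithmetic, sketched below under stated assumptions: citation_share is taken as an entity's citations divided by all citations among the competing entities, and ranking_position_delta as previous position minus current position, so a positive value means the entity moved up. Neither definition is confirmed by this page.

```python
def citation_share(citation_counts):
    """Share of total citations held by each competing entity."""
    total = sum(citation_counts.values())
    if total == 0:
        # No citations observed: every share is zero by convention.
        return {entity: 0.0 for entity in citation_counts}
    return {entity: count / total for entity, count in citation_counts.items()}


def ranking_position_delta(previous_position, current_position):
    """Assumed sign convention: positive means closer to position 1."""
    return previous_position - current_position


counts = {"ent-001": 3, "ent-002": 1}  # illustrative counts
shares = citation_share(counts)
print(shares)
print(ranking_position_delta(previous_position=5, current_position=2))
```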

10. SEMANTIC EMBEDDING INDEX

Vectorized representation layer for AI retrieval systems.

embedding_vector
chunk_id
entity_embedding_map
topic_cluster_vector

Link: Semantic Embedding Index
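
A retrieval system would typically query such an index by vector similarity. The sketch below uses cosine similarity over a brute-force scan in plain Python. The three-dimensional vectors, the chunk ids, and the top_k helper are illustrative assumptions; a real index would hold model-generated embeddings and likely use an approximate nearest-neighbor structure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_u * norm_v)


def top_k(index, query_vector, k=2):
    """Return the chunk_ids of the k rows most similar to query_vector."""
    scored = [
        (cosine_similarity(row["embedding_vector"], query_vector), row["chunk_id"])
        for row in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy index with hand-written vectors, for illustration only.
INDEX = [
    {"chunk_id": "c1", "embedding_vector": [1.0, 0.0, 0.0]},
    {"chunk_id": "c2", "embedding_vector": [0.0, 1.0, 0.0]},
    {"chunk_id": "c3", "embedding_vector": [0.9, 0.1, 0.0]},
]

print(top_k(INDEX, [1.0, 0.0, 0.0]))
```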

AI-FIRST SYSTEM POSITIONING

This dataset architecture is not designed for human navigation. It is engineered for AI retrieval systems, embedding-based search, cross-model citation analysis, and knowledge graph construction.

Core principle: if a dataset is not machine-readable, it does not exist in GEO logic.