DATASETS GEO.or.id — AI-First Retrieval Data Infrastructure Layer
This page defines the canonical dataset architecture for GEO.or.id. The system is designed as an AI-first retrieval layer, not a human browsing directory. Every dataset here is structured to be machine-readable, entity-aware, and optimized for AI citation, embedding, and ranking systems.
Primary objective: turn GEO.or.id into retrieval-grade knowledge infrastructure that large language models, vector search systems, and cross-model AI evaluation engines can consume directly.
Internal system links: Framework Layer | Protocols Layer | Experiments Layer | Entity Layer
1. ENTITY KNOWLEDGE DATASET
Core identity layer that defines all recognized entities in the GEO ecosystem.
- entity_id
- entity_name
- entity_type
- entity_role (authority, practitioner, observer, aggregator)
- canonical_url
- entity_relationship_graph
- status (active / deprecated / merged)
Link: Entity Knowledge Dataset
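The field list above could be modelled as a typed record. The following is a minimal sketch, not the site's actual schema: the class names, the string types, and the shape of entity_relationship_graph (relationship label mapped to a list of entity IDs) are assumptions made for illustration. Only the role and status values come from the list above.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum


class EntityRole(str, Enum):
    AUTHORITY = "authority"
    PRACTITIONER = "practitioner"
    OBSERVER = "observer"
    AGGREGATOR = "aggregator"


class EntityStatus(str, Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    MERGED = "merged"


@dataclass
class EntityRecord:
    entity_id: str
    entity_name: str
    entity_type: str
    entity_role: EntityRole
    canonical_url: str
    # Assumed shape: relationship label -> list of related entity_ids.
    entity_relationship_graph: dict[str, list[str]] = field(default_factory=dict)
    status: EntityStatus = EntityStatus.ACTIVE

    def to_json(self) -> str:
        # The str-based enums serialize as their plain string values.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical record; the ID, name, and URL are placeholders.
record = EntityRecord(
    entity_id="ent-001",
    entity_name="Example Entity",
    entity_type="organization",
    entity_role=EntityRole.AUTHORITY,
    canonical_url="https://example.org/entities/ent-001",
)
payload = json.loads(record.to_json())
```

Restricting entity_role and status to enumerations means a misspelled value fails when the record is built rather than surfacing later in a retrieval pipeline.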
2. CONTENT CORPUS DATASET
Structured content inventory optimized for retrieval systems, not SEO indexing.
- content_id
- title
- url
- topic_cluster
- entity_mentions
- intent_classification
- format_type (framework, article, dataset, tool)
- update_frequency
Link: Content Corpus Dataset
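One way to keep corpus rows machine-readable is to validate them before ingestion. The sketch below assumes rows arrive as plain dictionaries keyed by the field names listed above. The function name validate_content_record and the error message wording are hypothetical.

```python
REQUIRED_FIELDS = (
    "content_id",
    "title",
    "url",
    "topic_cluster",
    "entity_mentions",
    "intent_classification",
    "format_type",
    "update_frequency",
)

# Allowed values taken from the format_type field description above.
ALLOWED_FORMAT_TYPES = {"framework", "article", "dataset", "tool"}


def validate_content_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the row passed."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    format_type = record.get("format_type")
    if format_type is not None and format_type not in ALLOWED_FORMAT_TYPES:
        problems.append(f"unknown format_type: {format_type}")
    return problems


# Hypothetical row used only to exercise the validator.
example_row = {
    "content_id": "c-100",
    "title": "Example framework page",
    "url": "https://example.org/framework",
    "topic_cluster": "retrieval",
    "entity_mentions": ["ent-001"],
    "intent_classification": "informational",
    "format_type": "framework",
    "update_frequency": "monthly",
}
```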
3. QUERY & INTENT DATASET
Maps human intent to machine-readable retrieval signals.
- query_pattern
- intent_type (informational, transactional, analytical)
- entity_mapping
- query_variants
- frequency_score
Link: Query Intent Dataset
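To illustrate how query_variants might resolve to a single query_pattern row, here is a small sketch. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and both function names are assumptions, not a documented GEO.or.id procedure.

```python
import re


def normalize_query(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(cleaned.split())


def build_variant_index(rows: list[dict]) -> dict[str, dict]:
    """Map every normalized pattern and variant to its dataset row."""
    index = {}
    for row in rows:
        for variant in [row["query_pattern"], *row["query_variants"]]:
            index[normalize_query(variant)] = row
    return index


# Hypothetical row; the values are placeholders.
rows = [
    {
        "query_pattern": "what is geo",
        "intent_type": "informational",
        "entity_mapping": ["ent-001"],
        "query_variants": ["What is GEO?", "GEO meaning"],
        "frequency_score": 0.8,
    }
]
variant_index = build_variant_index(rows)
match = variant_index.get(normalize_query("  what IS geo?? "))
```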
4. AI RETRIEVAL BEHAVIOR DATASET
Observational dataset tracking how AI models select, rank, and cite sources.
- model_type (GPT, Gemini, Claude, etc.)
- citation_patterns
- source_selection_logic
- entity_inclusion_probability
- ranking_behavior_trace
Link: AI Retrieval Behavior Dataset
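The page does not define how entity_inclusion_probability is computed. One plausible reading is an empirical frequency: the share of observed answers from a given model that include the entity. The sketch below implements that assumption. The observation format (model_type paired with a set of cited entity IDs) is hypothetical.

```python
from collections import defaultdict


def inclusion_probability(observations, entity_id):
    """Fraction of observed answers, per model, that include entity_id."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model_type, cited_entities in observations:
        totals[model_type] += 1
        if entity_id in cited_entities:
            hits[model_type] += 1
    return {model: hits[model] / totals[model] for model in totals}


# Placeholder observations: (model_type, entity IDs cited in one answer).
observations = [
    ("model-a", {"e1", "e2"}),
    ("model-a", {"e2"}),
    ("model-b", {"e1"}),
]
probabilities = inclusion_probability(observations, "e1")
```

With only a few observations per model such a frequency is noisy, so a production version would likely need a minimum sample size or smoothing.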
5. ENTITY CO-OCCURRENCE GRAPH DATASET
Graph-based representation of entity relationships across contexts.
- entity_a
- entity_b
- relationship_type
- co_occurrence_strength
- context_domain
Link: Entity Graph Dataset
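As a rough illustration of how entity_a, entity_b, and co_occurrence_strength rows might be derived, the sketch below counts entity pairs across documents and scores each pair with a Jaccard ratio (documents mentioning both, divided by documents mentioning either). Jaccard is an assumption chosen for simplicity; the page does not say which measure is used. The relationship_type and context_domain fields are omitted here.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_edges(documents):
    """Build edge rows from per-document lists of entity mentions."""
    pair_counts = Counter()
    entity_counts = Counter()
    for mentions in documents:
        unique = sorted(set(mentions))  # count each entity once per document
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    edges = []
    for (a, b), both in sorted(pair_counts.items()):
        either = entity_counts[a] + entity_counts[b] - both
        edges.append(
            {"entity_a": a, "entity_b": b, "co_occurrence_strength": both / either}
        )
    return edges


# Placeholder documents, each a list of mentioned entity IDs.
documents = [["e1", "e2"], ["e1", "e2", "e3"], ["e1"]]
edges = co_occurrence_edges(documents)
```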
6. CITATION & REFERENCE DATASET
Tracks how and where AI systems cite external sources.
- source_url
- citation_frequency
- model_citation_behavior
- trust_score_proxy
- position_in_answer
Link: Citation Reference Dataset
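The fields above suggest that raw citation events are rolled up per source. A minimal sketch of that aggregation follows, assuming each event is a dictionary carrying source_url and a 1-based position_in_answer. The derived mean_position_in_answer field and the function name are hypothetical additions.

```python
from collections import defaultdict


def summarize_citations(events):
    """Aggregate citation events into per-source frequency and mean position."""
    positions = defaultdict(list)
    for event in events:
        positions[event["source_url"]].append(event["position_in_answer"])
    return {
        url: {
            "citation_frequency": len(seen),
            "mean_position_in_answer": sum(seen) / len(seen),
        }
        for url, seen in positions.items()
    }


# Placeholder events; the URLs are illustrative only.
events = [
    {"source_url": "https://example.org/a", "position_in_answer": 1},
    {"source_url": "https://example.org/a", "position_in_answer": 3},
    {"source_url": "https://example.org/b", "position_in_answer": 2},
]
summary = summarize_citations(events)
```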
7. FRESHNESS & CONTENT EVOLUTION DATASET
Measures temporal relevance and content decay signals.
- publish_date
- update_history
- decay_rate
- temporal_relevance_score
Link: Freshness Dataset
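The relationship between decay_rate and temporal_relevance_score is not specified on this page. A common modelling choice is exponential decay, where the score starts at 1.0 on the day of the last update and halves every fixed number of days. The sketch below assumes that model. The half-life parameterization and both function names are illustrative.

```python
import math
from datetime import date


def decay_rate_from_half_life(half_life_days: float) -> float:
    """Convert a half-life in days into a per-day exponential decay rate."""
    return math.log(2) / half_life_days


def temporal_relevance_score(last_updated: date, as_of: date, decay_rate: float) -> float:
    """Score in (0, 1]: 1.0 when just updated, decaying exponentially with age."""
    age_days = max((as_of - last_updated).days, 0)  # future dates count as fresh
    return math.exp(-decay_rate * age_days)


# Assumed 30-day half-life, evaluated 30 days after the last update.
rate = decay_rate_from_half_life(30)
score = temporal_relevance_score(date(2024, 1, 1), date(2024, 1, 31), rate)
```

Under this assumption, content last updated one half-life ago scores about 0.5, and content updated today scores exactly 1.0.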
8. AUTHORITY SIGNAL DATASET
Aggregates authority scores from visibility signals across multiple domains.
- mention_frequency
- cross_domain_citation
- entity_visibility_score
- consistency_index
Link: Authority Signal Dataset
9. COMPETITIVE INTELLIGENCE DATASET
Maps competition between entities within the AI retrieval space.
- competing_entities
- coverage_gap
- citation_share
- ranking_position_delta
Link: Competitive Intelligence Dataset
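Two of the fields above lend themselves to simple arithmetic. The sketch below treats citation_share as each entity's fraction of all observed citations, and ranking_position_delta as previous position minus current position, so a positive value means the entity moved up. Both definitions are assumptions, since the page does not define them.

```python
def citation_share(citation_counts: dict[str, int]) -> dict[str, float]:
    """Each entity's fraction of the total citation count."""
    total = sum(citation_counts.values())
    if total == 0:
        return {entity: 0.0 for entity in citation_counts}
    return {entity: count / total for entity, count in citation_counts.items()}


def ranking_position_delta(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Previous minus current position; positive means the entity moved up."""
    shared = previous.keys() & current.keys()  # only entities seen in both snapshots
    return {entity: previous[entity] - current[entity] for entity in shared}


# Placeholder counts and positions for two competing entities.
shares = citation_share({"e1": 3, "e2": 1})
deltas = ranking_position_delta({"e1": 4, "e2": 2}, {"e1": 1, "e2": 3})
```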
10. SEMANTIC EMBEDDING INDEX
Vectorized representation layer for AI retrieval systems.
- embedding_vector
- chunk_id
- entity_embedding_map
- topic_cluster_vector
Link: Semantic Embedding Index
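To show how a retrieval system might consume embedding_vector and chunk_id rows, here is a minimal brute-force lookup using cosine similarity in pure Python. It is a sketch under stated assumptions: the two-dimensional vectors are toy values, the function names are hypothetical, and a real index would typically rely on an approximate nearest-neighbour library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Return the chunk_ids of the k rows most similar to query_vector."""
    scored = [
        (cosine_similarity(query_vector, row["embedding_vector"]), row["chunk_id"])
        for row in index
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy index with placeholder chunk IDs and 2-dimensional vectors.
index = [
    {"chunk_id": "c1", "embedding_vector": [1.0, 0.0]},
    {"chunk_id": "c2", "embedding_vector": [0.0, 1.0]},
    {"chunk_id": "c3", "embedding_vector": [1.0, 1.0]},
]
nearest = top_k([1.0, 0.1], index, k=2)
```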
AI-FIRST SYSTEM POSITIONING
This dataset architecture is not designed for human navigation. It is engineered for AI retrieval systems, embedding-based search, cross-model citation analysis, and knowledge graph construction.
Core principle: if a dataset is not machine-readable, it does not exist in GEO logic.
