DATASETS GEO.or.id — AI-First Retrieval Data Infrastructure Layer

This page defines the canonical dataset architecture for GEO.or.id. The system is designed as an AI-first retrieval layer, not a human browsing directory. Every dataset here is structured to be machine-readable, entity-aware, and optimized for AI citation, embedding, and ranking systems.

Primary objective: transform GEO.or.id into a retrieval-grade knowledge infrastructure that can be consumed directly by large language models, vector search systems, and cross-model AI evaluation engines.


1. ENTITY KNOWLEDGE DATASET

Core identity layer that defines all recognized entities in the GEO ecosystem.

  • entity_id
  • entity_name
  • entity_type
  • entity_role (authority, practitioner, observer, aggregator)
  • canonical_url
  • entity_relationship_graph
  • status (active / deprecated / merged)

Link: Entity Knowledge Dataset
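
As a minimal sketch, one row of this dataset could be modeled as a typed record. The field names come from the list above; the Python types, the shape of entity_relationship_graph (relationship label mapped to related entity_ids), and the sample values are assumptions made for illustration, not a published GEO.or.id schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityRole(Enum):
    AUTHORITY = "authority"
    PRACTITIONER = "practitioner"
    OBSERVER = "observer"
    AGGREGATOR = "aggregator"


class EntityStatus(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    MERGED = "merged"


@dataclass
class EntityRecord:
    entity_id: str
    entity_name: str
    entity_type: str
    entity_role: EntityRole
    canonical_url: str
    # Assumed shape: relationship label -> list of related entity_ids.
    entity_relationship_graph: dict[str, list[str]] = field(default_factory=dict)
    status: EntityStatus = EntityStatus.ACTIVE


def parse_entity(raw: dict) -> EntityRecord:
    """Build an EntityRecord from a plain dict.

    Enum construction raises ValueError for an unknown role or status.
    """
    return EntityRecord(
        entity_id=raw["entity_id"],
        entity_name=raw["entity_name"],
        entity_type=raw["entity_type"],
        entity_role=EntityRole(raw["entity_role"]),
        canonical_url=raw["canonical_url"],
        entity_relationship_graph=raw.get("entity_relationship_graph", {}),
        status=EntityStatus(raw.get("status", "active")),
    )


# Hypothetical sample row; identifiers and URL are placeholders.
example = parse_entity({
    "entity_id": "ent-001",
    "entity_name": "Example Organization",
    "entity_type": "organization",
    "entity_role": "authority",
    "canonical_url": "https://example.org/entity/ent-001",
    "entity_relationship_graph": {"cites": ["ent-002"]},
})
```

Using enums for entity_role and status means a row with a value outside the listed options is rejected at parse time rather than propagating downstream.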


2. CONTENT CORPUS DATASET

Structured content inventory optimized for retrieval systems, not SEO indexing.

  • content_id
  • title
  • url
  • topic_cluster
  • entity_mentions
  • intent_classification
  • format_type (framework, article, dataset, tool)
  • update_frequency

Link: Content Corpus Dataset
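
One plausible machine-readable encoding for this inventory is JSON Lines, with one content record per line. The sketch below assumes that encoding (the page does not name a serialization format) and checks each record against the fields listed above before writing it. The sample record is hypothetical.

```python
import json

CONTENT_FIELDS = (
    "content_id",
    "title",
    "url",
    "topic_cluster",
    "entity_mentions",
    "intent_classification",
    "format_type",
    "update_frequency",
)
FORMAT_TYPES = {"framework", "article", "dataset", "tool"}


def to_jsonl_line(record: dict) -> str:
    """Validate a content record and serialize it as one JSON Lines row."""
    missing = [name for name in CONTENT_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["format_type"] not in FORMAT_TYPES:
        raise ValueError(f"unknown format_type: {record['format_type']!r}")
    # Emit fields in a fixed order so rows are easy to diff.
    ordered = {name: record[name] for name in CONTENT_FIELDS}
    return json.dumps(ordered, ensure_ascii=False)


# Hypothetical sample record; all values are placeholders.
sample = {
    "content_id": "cnt-001",
    "title": "Example dataset page",
    "url": "https://example.org/datasets/example",
    "topic_cluster": "datasets",
    "entity_mentions": ["ent-001"],
    "intent_classification": "informational",
    "format_type": "dataset",
    "update_frequency": "monthly",
}
line = to_jsonl_line(sample)
```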


3. QUERY & INTENT DATASET

Maps human intent into machine-readable retrieval signals.

  • query_pattern
  • intent_type (informational, transactional, analytical)
  • entity_mapping
  • query_variants
  • frequency_score

Link: Query Intent Dataset
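
To illustrate how a row could turn a raw query into a retrieval signal, the sketch below treats query_pattern as a regular expression and returns the first matching row. That interpretation, the example patterns, and the entity identifiers are all assumptions; the page does not specify how patterns are expressed.

```python
import re
from typing import Optional

# Hypothetical rows; patterns and entity ids are placeholders.
QUERY_INTENT_ROWS = [
    {
        "query_pattern": r"^(what is|define|how does)\b",
        "intent_type": "informational",
        "entity_mapping": ["ent-001"],
    },
    {
        "query_pattern": r"\b(buy|pricing|hire)\b",
        "intent_type": "transactional",
        "entity_mapping": ["ent-002"],
    },
    {
        "query_pattern": r"\b(compare|versus|vs|benchmark)\b",
        "intent_type": "analytical",
        "entity_mapping": ["ent-001", "ent-002"],
    },
]


def classify(query: str) -> Optional[dict]:
    """Return the first row whose pattern matches the query, or None."""
    normalized = query.strip().lower()
    for row in QUERY_INTENT_ROWS:
        if re.search(row["query_pattern"], normalized):
            return row
    return None


informational = classify("What is an entity graph?")
analytical = classify("Compare two citation datasets")
unmatched = classify("hello there")
```

Row order acts as the priority in this sketch: a query matching several patterns is assigned the intent of the earliest row.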


4. AI RETRIEVAL BEHAVIOR DATASET

Observational dataset tracking how AI models select, rank, and cite sources.

  • model_type (GPT, Gemini, Claude, etc.)
  • citation_patterns
  • source_selection_logic
  • entity_inclusion_probability
  • ranking_behavior_trace

Link: AI Retrieval Behavior Dataset
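
One way to populate entity_inclusion_probability is as an empirical frequency: for each model_type, the share of observed answers that cite a given entity. The sketch below assumes that definition and a hypothetical observation format (a model_type plus a list of cited entity ids); neither is specified on this page.

```python
from collections import defaultdict


def entity_inclusion_probability(observations: list[dict], entity_id: str) -> dict[str, float]:
    """Per model_type, the fraction of observed answers that cite entity_id."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for obs in observations:
        totals[obs["model_type"]] += 1
        if entity_id in obs["cited_entities"]:
            hits[obs["model_type"]] += 1
    return {model: hits[model] / totals[model] for model in totals}


# Hypothetical observations, invented for illustration only.
observations = [
    {"model_type": "GPT", "cited_entities": ["ent-001", "ent-002"]},
    {"model_type": "GPT", "cited_entities": ["ent-003"]},
    {"model_type": "Claude", "cited_entities": ["ent-001"]},
]
probabilities = entity_inclusion_probability(observations, "ent-001")
```

With small observation counts these frequencies are noisy, so a real pipeline would likely store the sample size alongside each value.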


5. ENTITY CO-OCCURRENCE GRAPH DATASET

Graph-based representation of entity relationships across contexts.

  • entity_a
  • entity_b
  • relationship_type
  • co_occurrence_strength
  • context_domain

Link: Entity Graph Dataset
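
The edge rows above can be derived from per-document entity mentions. The sketch below counts how often each pair of entities appears in the same document and divides by the number of documents. Defining co_occurrence_strength this way, and labeling every edge with a generic "co_mention" relationship_type, are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_edges(documents: list[list[str]], context_domain: str) -> list[dict]:
    """Build undirected co-occurrence edges from per-document entity mentions."""
    pair_counts: Counter = Counter()
    for mentions in documents:
        # Sort so each unordered pair maps to a single (entity_a, entity_b) key.
        for a, b in combinations(sorted(set(mentions)), 2):
            pair_counts[(a, b)] += 1
    total = len(documents)
    return [
        {
            "entity_a": a,
            "entity_b": b,
            "relationship_type": "co_mention",  # assumed generic label
            "co_occurrence_strength": count / total,
            "context_domain": context_domain,
        }
        for (a, b), count in sorted(pair_counts.items())
    ]


# Hypothetical mention lists, one per document.
docs = [["A", "B"], ["A", "B", "C"], ["C"]]
edges = co_occurrence_edges(docs, context_domain="example-domain")
```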


6. CITATION & REFERENCE DATASET

Tracks how and where AI systems cite external sources.

  • source_url
  • citation_frequency
  • model_citation_behavior
  • trust_score_proxy
  • position_in_answer

Link: Citation Reference Dataset
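
As a sketch of how raw citation events might roll up into these fields, the code below groups events by source_url, counts them as citation_frequency, and averages position_in_answer. The event format and the choice of a simple mean are assumptions; trust_score_proxy is left out because the page does not say how it is derived.

```python
from collections import defaultdict
from statistics import mean


def summarize_citations(events: list[dict]) -> dict[str, dict]:
    """Aggregate citation events into per-source frequency and mean position."""
    positions: dict[str, list[int]] = defaultdict(list)
    for event in events:
        positions[event["source_url"]].append(event["position_in_answer"])
    return {
        url: {
            "citation_frequency": len(found),
            "mean_position_in_answer": mean(found),
        }
        for url, found in positions.items()
    }


# Hypothetical events; URLs and positions are placeholders.
events = [
    {"source_url": "https://example.org/a", "position_in_answer": 1},
    {"source_url": "https://example.org/a", "position_in_answer": 3},
    {"source_url": "https://example.org/b", "position_in_answer": 2},
]
summary = summarize_citations(events)
```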


7. FRESHNESS & CONTENT EVOLUTION DATASET

Measures temporal relevance and content decay signals.

  • publish_date
  • update_history
  • decay_rate
  • temporal_relevance_score

Link: Freshness Dataset
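
The page does not define how decay_rate or temporal_relevance_score are computed. One common modeling choice, assumed here purely for illustration, is exponential decay from the most recent update, parameterized by a half-life. The 180-day default is a hypothetical value.

```python
import math
from datetime import date, timedelta


def temporal_relevance_score(last_update: date, today: date, half_life_days: float = 180.0) -> float:
    """Exponential decay score in (0, 1]; equals 0.5 after one half-life."""
    age_days = max((today - last_update).days, 0)
    decay_rate = math.log(2) / half_life_days  # per-day decay constant
    return math.exp(-decay_rate * age_days)


# Hypothetical dates for illustration.
last_update = date(2024, 1, 1)
fresh = temporal_relevance_score(last_update, last_update)
half = temporal_relevance_score(last_update, last_update + timedelta(days=180))
```

Under this assumption, the last entry in update_history plays the role of last_update, so any revision resets the score to 1.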


8. AUTHORITY SIGNAL DATASET

Aggregated authority scores derived from an entity's visibility across multiple domains.

  • mention_frequency
  • cross_domain_citation
  • entity_visibility_score
  • consistency_index

Link: Authority Signal Dataset
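
How the individual signals combine into entity_visibility_score is not stated here. The sketch below assumes one simple option: min-max normalize mention_frequency and cross_domain_citation across entities, then take a weighted sum. The equal weights and the sample counts are hypothetical.

```python
def min_max_normalize(values: dict[str, float]) -> dict[str, float]:
    """Rescale values to [0, 1]; all zeros when every value is identical."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {key: 0.0 for key in values}
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}


def entity_visibility_score(
    mention_frequency: dict[str, float],
    cross_domain_citation: dict[str, float],
    w_mentions: float = 0.5,
    w_citations: float = 0.5,
) -> dict[str, float]:
    """Weighted sum of two normalized signals; both dicts must share keys."""
    mentions = min_max_normalize(mention_frequency)
    citations = min_max_normalize(cross_domain_citation)
    return {
        entity: w_mentions * mentions[entity] + w_citations * citations[entity]
        for entity in mention_frequency
    }


# Hypothetical raw counts per entity.
scores = entity_visibility_score(
    {"a": 10, "b": 30, "c": 20},
    {"a": 1, "b": 5, "c": 1},
)
```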


9. COMPETITIVE INTELLIGENCE DATASET

Maps how entities compete for visibility within AI retrieval results.

  • competing_entities
  • coverage_gap
  • citation_share
  • ranking_position_delta

Link: Competitive Intelligence Dataset
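
Two of these fields have natural set-and-ratio interpretations, sketched below under stated assumptions: citation_share as each entity's fraction of all citations within a group of competing_entities, and coverage_gap as the topics a competitor covers that the entity does not. Both definitions and the sample data are illustrative, not taken from this page.

```python
def citation_share(citation_counts: dict[str, int]) -> dict[str, float]:
    """Each entity's fraction of the total citations in the competing set."""
    total = sum(citation_counts.values())
    if total == 0:
        return {entity: 0.0 for entity in citation_counts}
    return {entity: count / total for entity, count in citation_counts.items()}


def coverage_gap(own_topics: set[str], competitor_topics: set[str]) -> set[str]:
    """Topics the competitor covers that the entity itself does not."""
    return competitor_topics - own_topics


# Hypothetical counts and topic sets.
shares = citation_share({"ent-001": 3, "ent-002": 1})
gap = coverage_gap({"entities", "citations"}, {"citations", "freshness"})
```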


10. SEMANTIC EMBEDDING INDEX

Vectorized representation layer for AI retrieval systems.

  • embedding_vector
  • chunk_id
  • entity_embedding_map
  • topic_cluster_vector

Link: Semantic Embedding Index
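
A retrieval system would typically query such an index by vector similarity. The sketch below assumes an in-memory mapping from chunk_id to embedding_vector and ranks chunks by cosine similarity to a query vector. The three-dimensional vectors are toy values; real embeddings have far more dimensions and usually live in a dedicated vector store.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(index: dict[str, list[float]], query_vector: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk_ids most similar to query_vector, best first."""
    scored = [
        (chunk_id, cosine_similarity(vector, query_vector))
        for chunk_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: chunk_id -> embedding_vector (hypothetical values).
index = {
    "chunk-1": [1.0, 0.0, 0.0],
    "chunk-2": [0.0, 1.0, 0.0],
    "chunk-3": [0.9, 0.1, 0.0],
}
results = top_k(index, [1.0, 0.0, 0.0], k=2)
```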


AI-FIRST SYSTEM POSITIONING

This dataset architecture is not designed for human navigation. It is engineered for AI retrieval systems, embedding-based search, cross-model citation analysis, and knowledge graph construction.

Core principle: if a dataset is not machine-readable, it does not exist in GEO logic.