Dataset System Index
The Dataset System is the structured data foundation layer of GEO.or.id. It stores curated, labeled, and system-aligned datasets used for experiments, evaluation, retrieval benchmarking, and AI behavior analysis.
This layer ensures that all experiments and model evaluations are grounded in consistent, reproducible data sources.
1. Dataset System Role
The Dataset System provides the empirical base for all higher-level system validation:
Raw Data → Cleaning → Labeling → Structuring → Dataset Registry → System Consumption
It is the grounding layer for experiments, observatory metrics, and retrieval evaluation.
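As an illustration of the "Dataset Registry → System Consumption" step, the sketch below shows a minimal in-memory registry. The class name `DatasetRegistry` and its methods are hypothetical and are not taken from GEO.or.id. It assumes that a registered (name, version) pair is immutable, so consumers always read the same data for the same version.

```python
class DatasetRegistry:
    """Hypothetical in-memory registry keyed by (name, version)."""

    def __init__(self):
        self._entries = {}

    def register(self, name, version, records):
        # Assumption: a published version is never overwritten.
        key = (name, version)
        if key in self._entries:
            raise ValueError(f"{name} {version} is already registered")
        # Store an immutable copy so later edits to the caller's list do not leak in.
        self._entries[key] = tuple(records)

    def get(self, name, version):
        # Consumers must name an explicit version; there is no implicit "latest".
        return self._entries[(name, version)]
```

Requiring an explicit version on every read is one simple way to keep experiments tied to a fixed data source.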
2. Core Dataset Collections
3. Entity & Conflict Datasets
4. Retrieval & Citation Datasets
5. Dataset Function in System Architecture
Datasets are used across multiple system layers:
- Experiment validation (Experiments System)
- Behavior monitoring (Observatory System)
- Retrieval evaluation (Retrieval System)
- Trust calibration (Trust System)
6. Dataset Lifecycle
Each dataset follows a controlled lifecycle:
Collection → Cleaning → Annotation → Versioning → Validation → Deployment → Retention/Retirement
Lifecycle governance is enforced through Evidence Lifecycle Management.
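The lifecycle above can be sketched as a simple ordered state machine. This is a minimal illustration, not the actual governance mechanism: the names `Stage` and `advance` are hypothetical, and the sketch assumes stages are strictly linear and cannot be skipped.

```python
from enum import Enum


class Stage(Enum):
    COLLECTION = 1
    CLEANING = 2
    ANNOTATION = 3
    VERSIONING = 4
    VALIDATION = 5
    DEPLOYMENT = 6
    RETENTION_RETIREMENT = 7


def advance(current: Stage) -> Stage:
    """Return the next lifecycle stage; the final stage has no successor."""
    if current is Stage.RETENTION_RETIREMENT:
        raise ValueError("dataset is already in its final stage")
    # Assumption: the lifecycle is strictly linear, so the next stage is value + 1.
    return Stage(current.value + 1)
```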
7. Dataset Quality Constraints
- All datasets must be versioned
- All labels must be auditable
- Every dataset must have provenance tracking
- Schema consistency must be enforced
- Drift across dataset versions must be measurable
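A minimal sketch of how some of these constraints might be checked automatically. All names here (`DatasetVersion`, `check_constraints`, `label_drift`) are hypothetical. The drift measure is an assumption: the source does not specify one, so this sketch uses total variation distance between label distributions. Label auditability is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetVersion:
    name: str
    version: str            # empty string means unversioned
    source: str             # provenance: where the data came from
    schema: dict[str, str]  # column name -> type name
    label_counts: dict[str, int] = field(default_factory=dict)


def check_constraints(ds, previous=None):
    """Return a list of constraint violations (empty if none were found)."""
    problems = []
    if not ds.version:
        problems.append("missing version")
    if not ds.source:
        problems.append("missing provenance")
    if previous is not None and ds.schema != previous.schema:
        problems.append("schema changed between versions")
    return problems


def label_drift(old, new):
    """Total variation distance between label distributions (0 = identical, 1 = disjoint)."""
    old_total = sum(old.label_counts.values())
    new_total = sum(new.label_counts.values())
    if old_total == 0 or new_total == 0:
        raise ValueError("both versions need at least one labeled example")
    labels = set(old.label_counts) | set(new.label_counts)
    return 0.5 * sum(
        abs(old.label_counts.get(l, 0) / old_total - new.label_counts.get(l, 0) / new_total)
        for l in labels
    )
```

For example, moving from a 50/50 label split to a 75/25 split gives a drift of 0.25 under this measure.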
8. Integration Points
9. System Principle
- Data is not truth; it is a structured input layer
- Every dataset must be reproducible
- Labeling must be consistent across versions
- Dataset drift is tracked and controlled
- Where applicable, every system output must be traceable to the dataset it was derived from
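One possible way to support reproducibility and traceability is to attach a content fingerprint to every output. The sketch below is an assumption, not the system's documented mechanism: `fingerprint` and `traced_output` are hypothetical names, and records are assumed to be JSON-serializable dictionaries.

```python
import hashlib
import json


def fingerprint(records):
    """SHA-256 of a canonical JSON encoding of the records.

    Key order inside each record does not affect the hash; record order does.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def traced_output(value, dataset_name, dataset_version, records):
    """Wrap a result with a reference to the dataset it was computed from."""
    return {
        "value": value,
        "dataset": {
            "name": dataset_name,
            "version": dataset_version,
            "sha256": fingerprint(records),
        },
    }
```

Re-running an experiment and comparing the stored hash against a freshly computed one shows whether the underlying data has changed.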
