DataFab System Architecture

Version: 5.0 Last Updated: February 2026

Platform Overview

DataFab is an AI-powered data intelligence platform built on a metadata-driven architecture. The platform provides unified access to distributed data assets, intelligent automation through AI agents, and comprehensive data governance capabilities.

Platform Components

Component	Purpose	Key Capabilities
Knowledge Fabric	Data integration and intelligence layer	Persistent Knowledge Graph, Entity Resolution, 200+ MCP Connectors, Search Sessions
Studio	Creation and execution environment	DDAs (Data-Driven Agents), Widgets, Datasets, Utilities, Chain of Agents, MCP Integrations, Operational Modes (0-4)
Exchange	Data asset marketplace	Asset Catalog, Wallet & Blockchain (DAAC), Metering & Billing, Access Control, API Gateway
Schema Management	Business domain definition and control	Domain Discovery, Schema Registry, Schema Validation
AI & LLM Layer	Intelligent processing and inference	LightLLM Gateway, Output Consistency, Model Provenance
Graph Operations	Data intelligence and analytics module	Rule Engine, Query Workflows, Analytics, Case Management
DevOps Infrastructure	Deployment and operations	CI/CD Pipelines, Monitoring, Security Operations

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                        PRESENTATION LAYER                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌────────────┐   │
│  │   Web UI     │  │  API Gateway │  │   Exchange   │  │  Messaging │   │
│  │              │  │  (REST)      │  │   Interface  │  │  Mini-App  │   │
│  └──────────────┘  └──────────────┘  └──────────────┘  └────────────┘   │
│                              │                                          │
│                   [Authentication, Rate Limiting, Input Validation]     │
├─────────────────────────────────────────────────────────────────────────┤
│                          APPLICATION LAYER                              │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                          EXCHANGE                               │    │
│  │  ┌───────────┐  ┌───────────────┐  ┌───────────┐  ┌──────────┐  │    │
│  │  │  Catalog  │  │   Wallet &    │  │ Metering  │  │  Ledger  │  │    │
│  │  │  Service  │  │  Blockchain   │  │ & Billing │  │ Service  │  │    │
│  │  └───────────┘  └───────────────┘  └───────────┘  └──────────┘  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                          STUDIO                                 │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌──────────────┐  │    │
│  │  │  DDA      │  │  Widgets  │  │ Datasets  │  │  Chain of    │  │    │
│  │  │  Builder  │  │ & Utilities│ │ & Media   │  │  DDAs        │  │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └──────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                    GRAPH OPERATIONS MODULE                      │    │
│  │  ┌───────────┐  ┌───────────────┐  ┌───────────┐  ┌──────────┐  │    │
│  │  │   Rule    │  │  Query        │  │ Analytics │  │  Case    │  │    │
│  │  │  Engine   │  │  Workflows    │  │  Service  │  │ Manager  │  │    │
│  │  └───────────┘  └───────────────┘  └───────────┘  └──────────┘  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐   │
│  │  Search  │ │ Lineage  │ │ Quality  │ │ Discovery  │ │ Governance │   │
│  │ Service  │ │ Service  │ │ Service  │ │  Service   │ │  Service   │   │
│  └──────────┘ └──────────┘ └──────────┘ └────────────┘ └────────────┘   │
│                              │                                          │
│                   [Service-to-Service AuthN/AuthZ, mTLS]                │
├─────────────────────────────────────────────────────────────────────────┤
│                      KNOWLEDGE FABRIC LAYER                             │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                    KNOWLEDGE GRAPH                              │    │
│  │  ┌──────────────┐  ┌──────────────────┐  ┌──────────────────┐   │    │
│  │  │ Entity Store │  │ Relationship     │  │  Query Engine    │   │    │
│  │  │   (Nodes)    │  │ Store (Edges)    │  │  (Traversal)     │   │    │
│  │  └──────────────┘  └──────────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                  ENTITY RESOLUTION ENGINE                       │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌─────────────┐   │    │
│  │  │ Blocking  │  │ Matching  │  │ Clustering│  │ Golden      │   │    │
│  │  │ Service   │  │ Service   │  │ Service   │  │ Record Mgmt │   │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └─────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                              │                                          │
│                   [Encryption at Rest, Access Control Lists]            │
├─────────────────────────────────────────────────────────────────────────┤
│                        AI & LLM LAYER                                   │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  ┌────────────┐  ┌──────────────┐  ┌────────────────────────┐   │    │
│  │  │  LightLLM  │  │   Provider   │  │  Prompt Management     │   │    │
│  │  │  Gateway   │  │  Abstraction │  │  & Template Engine     │   │    │
│  │  └────────────┘  └──────────────┘  └────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                              │                                          │
│                   [Input Filtering, Output Validation, PII Detection]   │
├─────────────────────────────────────────────────────────────────────────┤
│                       CONNECTIVITY LAYER                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌────────────┐   │
│  │  Database    │  │    API       │  │     MCP      │  │   Event    │   │
│  │  Connectors  │  │  Connectors  │  │  Connectors  │  │  Streams   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘  └────────────┘   │
│                              │                                          │
│               [Credential Vault, Secure Connections, Data Sampling]     │
├─────────────────────────────────────────────────────────────────────────┤
│                       SOURCE SYSTEMS                                    │
│  ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐  │
│  │Databases │ │  Document │ │  APIs    │ │  Data    │ │   External   │  │
│  │          │ │  Stores   │ │          │ │  Sources │ │   Sources    │  │
│  └──────────┘ └───────────┘ └──────────┘ └──────────┘ └──────────────┘  │
│                              │                                          │
│                [Customer-Managed, Customer Credentials]                 │
└─────────────────────────────────────────────────────────────────────────┘

Knowledge Fabric

The Knowledge Fabric serves as the foundational data integration and intelligence layer for the DataFab platform. It implements a metadata-driven architecture that provides unified access to distributed data assets while leaving source data in place.

Core Capabilities

Capability	Description
Persistent Knowledge Graph	Corporate memory with schema-bounded extraction, source provenance, and Knowledge Tree structure
Entity Resolution	Cross-source entity matching using blocking, matching, and clustering algorithms with golden record management
200+ MCP Connectors	Federated queries across databases, SaaS applications, and data systems
Search Sessions	Iterative exploration with session graphs and accumulated context
Data Observability	Quality monitoring, freshness tracking, and automated alerts
Discovery Service	Automated identification and cataloging of data assets
Active Metadata	Continuous metadata analysis, profiling, and enrichment
Data Lineage	End-to-end tracking of data flow from source to consumption

Knowledge Graph Model

The Knowledge Graph stores entities as nodes and relationships as edges, enabling complex traversal queries and network analysis.

Entity Types:

Entity	Description	Use Cases
Person	Individual entities and contacts	Identity, risk assessment, mapping
Organization	Corporate entities and relationships	Corporate structure, analysis
Asset	Data items, resources, applications	Asset management, tracking
Document	Reports, data files, records	Document management, search
Address	Physical and registered addresses	Location analysis, verification
Identifier	Tax IDs, registration numbers, reference IDs	Cross-reference, verification

Relationship Types:

Relationship	Description
OWNS	Ownership stake between entities
CONTROLS	Control relationship (voting, management)
RELATED_TO	Business or data relationship
EMPLOYS	Employment relationship
REFERENCES	Data or document reference
LOCATED_AT	Entity-address relationship

Entity Resolution

The Entity Resolution Engine identifies duplicate and related records across connected sources to build unified entity profiles (Golden Records).

Resolution Pipeline:

Source Records → Blocking → Candidate Pairs → Matching → Clusters → Golden Records
                    │             │              │           │
              (Key generation) (Comparison)  (Scoring)  (Merge rules)

Studio

The Studio (Helix Studio) provides the creation and execution environment for Data-Driven Agents (DDAs), widgets, datasets, utilities, and multi-agent workflows. The platform supports five operational modes enabling organizations to balance automation with human oversight.

Core Capabilities

Capability	Description
DDA Creation	Domain-driven flow with schema selection, natural language definition, query plan, and testing
Widgets	Visual interface components (SYSTEM and OUTPUT types) with dialog/canvas views
Datasets	Structured data collections with file uploads, schema enforcement, and draft/publish lifecycle
Utilities	Reusable components combining external APIs and DDAs with configurable placeholders
Business Domain Discovery	Extract schemas from uploaded documents to define entity structures
Chain of Agents	Orchestrate multiple DDAs with human-in-the-loop review gates
Graph of Agents	(Planned) Non-linear directed graph orchestration with branching, parallelism, and cycles
MCP Integrations	Managed MCP tool connections with types, instances, and credentials
Asset Search	Semantic search across all Studio asset types
AI Hybrid Planning	Automatic DDA creation from natural language descriptions
Operational Modes	Five modes from traditional platform (Mode 0) to fully automated with audit (Mode 4)
Text-to-Pipeline	Natural language pipeline generation with DSL output and iterative refinement

DDA Architecture

Data-Driven Agents (DDAs) are the fundamental execution unit in Studio. Each DDA has:

Component	Description
Definition	Name, description, instructions, model, prompt
Query Plan	Ordered items referencing datasets, MCPs, DDAs, and scripts
Placeholders	Configurable component slots (MCP, dataset, DDA) for runtime mapping
Runtime Config	Per-user configuration with placeholder mappings
Lifecycle	Draft/publish stages with ACTIVE/INACTIVE states
Execution	Execute with file inputs; execution history with SUCCESS/ERROR status

Chain of Agents / Graph of Agents

Chains enable complex workflows by orchestrating multiple DDAs with optional human review. The planned Graph of Agents capability will extend this to arbitrary directed graph topologies with conditional branching, parallel fan-out/fan-in, and cycles.

Item Type	Description	Use Case
DDA	Execute a Data-Driven Agent step	Data processing, analysis
HUMAN_IN_THE_LOOP	Human review/approval gate	Quality control, compliance

Operational Modes

The platform supports five operational modes enabling organizations to configure automation levels:

Mode	Name	Description
0	Traditional Platform	No AI agents; manual investigation and analysis
1	AI-Assisted Manual	Agents in suggest-only mode; all decisions require human approval
2	Routine Automation	Agents handle routine tasks; humans focus on analysis and decisions
3	Autonomous with Escalation	Full automation with escalation on exceptions
4	Fully Automated with Audit	End-to-end automation with post-investigation human audits

Exchange

The Exchange component serves as the platform’s data asset marketplace, enabling organizations to publish, discover, acquire, and monetize data assets with blockchain-backed transactions and transparent revenue sharing.

Core Capabilities

Capability	Description
Asset Catalog	Publish, search, and manage data assets (agents, widgets, datasets, models, utilities, media, chains)
User Profiles	Consumer, provider, and dual-role profiles with verification
Wallet & Blockchain	DAAC token on Ethereum for purchases, deposits, withdrawals, and transfers
Metering & Billing	Usage tracking with configurable policies and automated billing
Access Control	Resource-level permission policies (READ, WRITE, DELETE, ADMIN)
API Gateway	Managed endpoints (REST, GraphQL, Webhook, Proxy) with request logging
Ledger & Revenue	Double-entry accounting with revenue allocation, settlement, and reconciliation

Asset Types

Asset Type	Description
BEHAVIOUR_DATA	Behavioral analytics and pattern datasets
AGENT	Executable AI agents built in Studio
WIDGET	Visual interface components
UTILITY	Reusable processing functions and tools
MEDIA	Media files and content assets
MODEL	Machine learning models
DATASET	Structured data collections
CHAIN	Multi-agent workflow chains

Marketplace Model

Component	Description
Catalog Service	Asset registration, search, publish lifecycle (DRAFT → PUBLISHED → ARCHIVED)
Wallet Service	DAAC/ETH digital wallets with deposit, withdrawal, and transfer operations
Metering Service	Usage event capture, billing generation, pricing plan enforcement
Access Service	Per-asset permission policies with least-privilege enforcement
Gateway Service	Managed API endpoints with rate limiting and request logging
Ledger Service	Double-entry financial records with settlement and reconciliation

Schema Management

The Schema Management system provides a unified approach to defining, discovering, and managing business domain schemas across the platform.

Core Capabilities

Capability	Description
Business Domain Discovery	Extract domain concepts from user documents
Schema Registry	Centralized storage with versioning and access control
Schema Validation	Ensure data conformance across all platform components
Schema Binding	Link schemas to agents, extractors, and MCP connectors

Schema Usage Across Platform

Component	Schema Role
Studio Agents	Input/output validation, data transformation control
Data Extractors	Structure external data according to domain model
MCP Connectors	Ensure data consistency across tool integrations
Knowledge Graph	Entity and relationship type definitions
Pipelines	Data flow validation between processing steps

Document-to-Schema Discovery

Stage	Description
Document Analysis	Extract text, structure, and metadata from user documents
Concept Extraction	Identify entities, attributes, and relationships
Schema Generation	Create structured schema definitions
User Refinement	Interactive review and modification
Publication	Register schema in central registry

AI & LLM Layer

The AI & LLM Layer provides intelligent processing capabilities through a provider-agnostic gateway with comprehensive output consistency and quality controls.

Core Capabilities

Capability	Description
LightLLM Gateway	Unified interface for multiple LLM providers
Output Consistency	Schema-validated extraction and ontology-based execution
Model Provenance	Complete tracking of model versions and configurations
A/B Testing	Controlled model updates with performance comparison
Quality Assurance	Feedback loops, accuracy monitoring, reasoning chain transparency
Provider Abstraction	Swap providers without code changes
Prompt Management	Template library with version control

LLM Output Consistency

Control	Description
Schema-Validated Extraction	All outputs validated against user-defined schemas
Ontology-Based Execution	Responses grounded in domain ontology
Reasoning Chain Transparency	Full reasoning paths logged for audit
Deterministic Components	Separation of deterministic vs. probabilistic processing

Provider Support

Provider Type	Examples	Integration
Commercial APIs	OpenAI, Anthropic, Google	API key authentication
Self-Hosted	Llama, Mistral, custom models	Private endpoint
Enterprise	Azure OpenAI, AWS Bedrock	Cloud provider auth

Graph Operations Module

The Graph Operations module provides comprehensive capabilities for data intelligence, analytics, and case management.

Core Capabilities

Capability	Description
Rule Engine	Data scoring and decision rules
Query Workflows	Complex query orchestration and workflows
Analytics Service	Analytics and reporting capabilities
Case Management	Data case tracking and documentation
Reporting	Analytics, insights, and dashboards

Rule Engine

The Rule Engine calculates scores based on configurable factors:

Factor Category	Examples
Data Quality	Completeness, uniqueness, validity metrics
Relationship	Entity relationships, connection strength
Temporal	Data freshness, change patterns
Behavioral	Access patterns, usage anomalies

Network Architecture

Network Segmentation

Zone	Purpose	Components
Edge/DMZ	External entry point	Load balancers, WAF, API Gateway
Application	Service execution	Studio, Services, AI Layer
Data	Persistent storage	Knowledge Graph, Document Store
Management	Operations	Monitoring, Logging, Admin tools

Connectivity Patterns

Pattern	Description	Use Case
Direct TLS	Encrypted connection over internet	Cloud-hosted sources
VPN Tunnel	Site-to-site encrypted tunnel	On-premises sources
Private Link	Cloud provider private connectivity	Same-cloud sources
Agent-Based	Customer-deployed agent connects outbound	Air-gapped environments

Data Flow Model

DataFab operates on a metadata-first principle. Source data remains in place while metadata flows through the platform.

Data Type	Handling	Storage
Structural Metadata	Extracted and stored	Knowledge Graph
Statistical Profiles	Computed via aggregation	Knowledge Graph
Sample Data	Optional, user-controlled	Ephemeral cache
Query Results	Pass-through federation	Never persisted
Source Credentials	Encrypted storage	Secure Vault

Security Architecture

Defense-in-Depth Model

┌─────────────────────────────────────────────────────────────────────┐
│                    PERIMETER SECURITY                               │
│    DDoS Protection │ WAF │ Rate Limiting │ Geographic Filtering     │
├─────────────────────────────────────────────────────────────────────┤
│                    NETWORK SECURITY                                 │
│    VPC Isolation │ Network Segmentation │ Private Endpoints         │
├─────────────────────────────────────────────────────────────────────┤
│                    APPLICATION SECURITY                             │
│    Input Validation │ Output Encoding │ CSRF Protection │ CSP       │
├─────────────────────────────────────────────────────────────────────┤
│                    IDENTITY & ACCESS                                │
│    OAuth 2.0/OIDC │ RBAC │ ABAC │ MFA │ Session Management          │
├─────────────────────────────────────────────────────────────────────┤
│                    DATA SECURITY                                    │
│    Encryption │ Tokenization │ Data Masking │ Classification        │
├─────────────────────────────────────────────────────────────────────┤
│                    INFRASTRUCTURE SECURITY                          │
│    Hardened Images │ Patch Management │ Container Security          │
└─────────────────────────────────────────────────────────────────────┘

Cryptographic Standards

Purpose	Algorithm	Key Length
Data at Rest	AES-256-GCM	256-bit
Data in Transit	TLS 1.3	256-bit
Key Encryption	RSA-OAEP	2048-bit minimum
Digital Signatures	ECDSA	P-256
Hashing	SHA-256	N/A

Integration Points

External System Integration

System Type	Integration Method	Data Exchange
Data Systems	API / Database connector	Data models, metadata
CRM Systems	API / Webhook	Contacts, organizations, activities
Document Management	API / File system	Documents, metadata
Analytics Systems	API / Database	Reports, dashboards, metrics
External Data	API	Third-party data sources
Monitoring Providers	API	System health, alerts

MCP Protocol

The Model Context Protocol (MCP) provides a standardized interface for tool integration:

Component	Purpose
Tool Registry	Catalog of available tools and capabilities
Schema Definition	Input/output contracts for each tool
Authentication	Tool-specific credential management
Execution	Secure tool invocation with timeout handling

Deployment Models

DataFab is available in multiple deployment configurations to meet varying security, compliance, and operational requirements.

Deployment Options

Model	Description	Use Case
SaaS Multi-Tenant	Shared infrastructure, isolated data	Standard deployment, cost-effective
Dedicated Cloud Tenant	Dedicated infrastructure in DataFab cloud	Enterprise, enhanced isolation
Customer Cloud	Deploy in customer’s cloud tenant (AWS, Azure, GCP)	Data sovereignty, infrastructure control
On-Premises	Fully customer-managed within data centers	Maximum control, air-gapped environments
Hybrid	SaaS control plane, on-premises data plane	Balance of convenience and control

SaaS Multi-Tenant

Aspect	Details
Infrastructure	Shared compute, isolated data stores
Data Isolation	Logical tenant isolation with encryption
Compliance	SOC 2, GDPR compliant infrastructure
Updates	Automatic platform updates
Best For	Standard requirements, rapid deployment

Dedicated Cloud Tenant

Aspect	Details
Infrastructure	Dedicated resources within DataFab cloud
Data Isolation	Physical separation of compute and storage
Compliance	Enhanced compliance posture
Updates	Coordinated update windows
Best For	Enterprise customers requiring dedicated resources

Customer Cloud Deployment

Aspect	Details
Infrastructure	Customer’s own cloud tenant (AWS, Azure, GCP)
Data Control	Customer maintains full infrastructure control
Region Selection	Deploy in preferred cloud region
Management	DataFab manages application, customer manages infrastructure
Best For	Data sovereignty, existing cloud investments

On-Premises Deployment

Aspect	Details
Infrastructure	Customer data centers
Data Control	Complete data custody
Network	Air-gapped capability available
Management	Customer-managed with DataFab support
Best For	Strict regulatory requirements, maximum control

Data Residency Configuration

Region Option	Data Location	LLM Processing
UK-Only	UK data centers	UK-based providers or on-premises
EU-Only	EU data centers (Ireland common)	EU-based providers
US-Only	US data centers	US-based providers
Customer-Specified	Any supported region	Region-aligned providers
On-Premises	Customer data centers	Self-hosted models

LLM Provider Configuration by Deployment

Deployment	LLM Options
SaaS	Cloud providers (OpenAI, Anthropic, Google)
Dedicated Tenant	Cloud providers with dedicated keys
Customer Cloud	Cloud providers or self-hosted in customer cloud
On-Premises	Self-hosted models (Llama, Mistral) for complete data control
Hybrid	Cloud for non-sensitive, on-premises for sensitive

Deployment Feature Comparison

Feature	SaaS	Dedicated	Customer Cloud	On-Premises
Knowledge Fabric	✓	✓	✓	✓
Studio (Agents, Widgets, Datasets)	✓	✓	✓	✓
Exchange Interface	✓	✓	✓	✓
Multi-Tenancy	✓	✓	✓	✓
LLM Integration	✓	✓	✓	✓
Custom Region	Limited	✓	✓	N/A
Self-Hosted LLMs	✗	✗	✓	✓

For detailed information see individual component documents: Knowledge Fabric, Studio, Exchange, AI-LLM, Graph Operations, and API Security.