DataFab System Architecture
Version: 5.0
Last Updated: February 2026
DataFab is an AI-powered data intelligence platform built on a metadata-driven architecture. The platform provides unified access to distributed data assets, intelligent automation through AI agents, and comprehensive data governance capabilities.
| Component |
Purpose |
Key Capabilities |
| Knowledge Fabric |
Data integration and intelligence layer |
Persistent Knowledge Graph, Entity Resolution, 200+ MCP Connectors, Search Sessions |
| Studio |
Creation and execution environment |
DDAs (Data-Driven Agents), Widgets, Datasets, Utilities, Chain of Agents, MCP Integrations, Operational Modes (0-4) |
| Exchange |
Data asset marketplace |
Asset Catalog, Wallet & Blockchain (DAAC), Metering & Billing, Access Control, API Gateway |
| Schema Management |
Business domain definition and control |
Domain Discovery, Schema Registry, Schema Validation |
| AI & LLM Layer |
Intelligent processing and inference |
LightLLM Gateway, Output Consistency, Model Provenance |
| Graph Operations |
Data intelligence and analytics module |
Rule Engine, Query Workflows, Analytics, Case Management |
| DevOps Infrastructure |
Deployment and operations |
CI/CD Pipelines, Monitoring, Security Operations |
Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ PRESENTATION LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Web UI │ │ API Gateway │ │ Exchange │ │ Messaging │ │
│ │ │ │ (REST) │ │ Interface │ │ Mini-App │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │
│ │ │
│ [Authentication, Rate Limiting, Input Validation] │
├─────────────────────────────────────────────────────────────────────┤
│ APPLICATION LAYER │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ EXCHANGE │ │
│ │ ┌───────────┐ ┌───────────────┐ ┌───────────┐ ┌──────────┐ │ │
│ │ │ Catalog │ │ Wallet & │ │ Metering │ │ Ledger │ │ │
│ │ │ Service │ │ Blockchain │ │ & Billing │ │ Service │ │ │
│ │ └───────────┘ └───────────────┘ └───────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ STUDIO │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────────┐ │ │
│ │ │ DDA │ │ Widgets │ │ Datasets │ │ Chain of │ │ │
│ │ │ Builder │ │ & Utilities│ │ & Media │ │ DDAs │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ GRAPH OPERATIONS MODULE │ │
│ │ ┌───────────┐ ┌───────────────┐ ┌───────────┐ ┌──────────┐ │ │
│ │ │ Rule │ │ Query │ │ Analytics │ │ Case │ │ │
│ │ │ Engine │ │ Workflows │ │ Service │ │ Manager │ │ │
│ │ └───────────┘ └───────────────┘ └───────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Search │ │ Lineage │ │ Quality │ │ Discovery │ │ Governance │ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────┘ └────────────┘ │
│ │ │
│ [Service-to-Service AuthN/AuthZ, mTLS] │
├─────────────────────────────────────────────────────────────────────┤
│ KNOWLEDGE FABRIC LAYER │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ KNOWLEDGE GRAPH │ │
│ │ ┌──────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ Entity Store │ │ Relationship │ │ Query Engine │ │ │
│ │ │ (Nodes) │ │ Store (Edges) │ │ (Traversal) │ │ │
│ │ └──────────────┘ └──────────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ENTITY RESOLUTION ENGINE │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────────┐ │ │
│ │ │ Blocking │ │ Matching │ │ Clustering│ │ Golden │ │ │
│ │ │ Service │ │ Service │ │ Service │ │ Record Mgmt │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ [Encryption at Rest, Access Control Lists] │
├─────────────────────────────────────────────────────────────────────┤
│ AI & LLM LAYER │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ┌────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ │
│ │ │ LightLLM │ │ Provider │ │ Prompt Management │ │ │
│ │ │ Gateway │ │ Abstraction │ │ & Template Engine │ │ │
│ │ └────────────┘ └──────────────┘ └────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ [Input Filtering, Output Validation, PII Detection] │
├─────────────────────────────────────────────────────────────────────┤
│ CONNECTIVITY LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Database │ │ API │ │ MCP │ │ Event │ │
│ │ Connectors │ │ Connectors │ │ Connectors │ │ Streams │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │
│ │ │
│ [Credential Vault, Secure Connections, Data Sampling] │
├─────────────────────────────────────────────────────────────────────┤
│ SOURCE SYSTEMS │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │Databases │ │ Document │ │ APIs │ │ Data │ │ External │ │
│ │ │ │ Stores │ │ │ │ Sources │ │ Sources │ │
│ └──────────┘ └───────────┘ └──────────┘ └──────────┘ └──────────────┘ │
│ │ │
│ [Customer-Managed, Customer Credentials] │
└─────────────────────────────────────────────────────────────────────┘
Knowledge Fabric
The Knowledge Fabric serves as the foundational data integration and intelligence layer for the DataFab platform. It implements a metadata-driven architecture that provides unified access to distributed data assets while leaving source data in place.
Core Capabilities
| Capability |
Description |
| Persistent Knowledge Graph |
Corporate memory with schema-bounded extraction, source provenance, and Knowledge Tree structure |
| Entity Resolution |
Cross-source entity matching using blocking, matching, and clustering algorithms with golden record management |
| 200+ MCP Connectors |
Federated queries across databases, SaaS applications, and data systems |
| Search Sessions |
Iterative exploration with session graphs and accumulated context |
| Data Observability |
Quality monitoring, freshness tracking, and automated alerts |
| Discovery Service |
Automated identification and cataloging of data assets |
| Active Metadata |
Continuous metadata analysis, profiling, and enrichment |
| Data Lineage |
End-to-end tracking of data flow from source to consumption |
Knowledge Graph Model
The Knowledge Graph stores entities as nodes and relationships as edges, enabling complex traversal queries and network analysis.
Entity Types:
| Entity |
Description |
Use Cases |
| Person |
Individual entities and contacts |
Identity, risk assessment, mapping |
| Organization |
Corporate entities and relationships |
Corporate structure, analysis |
| Asset |
Data items, resources, applications |
Asset management, tracking |
| Document |
Reports, data files, records |
Document management, search |
| Address |
Physical and registered addresses |
Location analysis, verification |
| Identifier |
Tax IDs, registration numbers, reference IDs |
Cross-reference, verification |
Relationship Types:
| Relationship |
Description |
| OWNS |
Ownership stake between entities |
| CONTROLS |
Control relationship (voting, management) |
| RELATED_TO |
Business or data relationship |
| EMPLOYS |
Employment relationship |
| REFERENCES |
Data or document reference |
| LOCATED_AT |
Entity-address relationship |
Entity Resolution
The Entity Resolution Engine identifies duplicate and related records across connected sources to build unified entity profiles (Golden Records).
Resolution Pipeline:
Source Records → Blocking → Candidate Pairs → Matching → Clusters → Golden Records
│ │ │ │
(Key generation) (Comparison) (Scoring) (Merge rules)
Studio
The Studio (Helix Studio) provides the creation and execution environment for Data-Driven Agents (DDAs), widgets, datasets, utilities, and multi-agent workflows. The platform supports five operational modes enabling organizations to balance automation with human oversight.
Core Capabilities
| Capability |
Description |
| DDA Creation |
Domain-driven flow with schema selection, natural language definition, query plan, and testing |
| Widgets |
Visual interface components (SYSTEM and OUTPUT types) with dialog/canvas views |
| Datasets |
Structured data collections with file uploads, schema enforcement, and draft/publish lifecycle |
| Utilities |
Reusable components combining external APIs and DDAs with configurable placeholders |
| Business Domain Discovery |
Extract schemas from uploaded documents to define entity structures |
| Chain of Agents |
Orchestrate multiple DDAs with human-in-the-loop review gates |
| Graph of Agents |
(Planned) Non-linear directed graph orchestration with branching, parallelism, and cycles |
| MCP Integrations |
Managed MCP tool connections with types, instances, and credentials |
| Asset Search |
Semantic search across all Studio asset types |
| AI Hybrid Planning |
Automatic DDA creation from natural language descriptions |
| Operational Modes |
Five modes from traditional platform (Mode 0) to fully automated with audit (Mode 4) |
| Text-to-Pipeline |
Natural language pipeline generation with DSL output and iterative refinement |
DDA Architecture
Data-Driven Agents (DDAs) are the fundamental execution unit in Studio. Each DDA has:
| Component |
Description |
| Definition |
Name, description, instructions, model, prompt |
| Query Plan |
Ordered items referencing datasets, MCPs, DDAs, and scripts |
| Placeholders |
Configurable component slots (MCP, dataset, DDA) for runtime mapping |
| Runtime Config |
Per-user configuration with placeholder mappings |
| Lifecycle |
Draft/publish stages with ACTIVE/INACTIVE states |
| Execution |
Execute with file inputs; execution history with SUCCESS/ERROR status |
Chain of Agents / Graph of Agents
Chains enable complex workflows by orchestrating multiple DDAs with optional human review. The planned Graph of Agents capability will extend this to arbitrary directed graph topologies with conditional branching, parallel fan-out/fan-in, and cycles.
| Item Type |
Description |
Use Case |
| DDA |
Execute a Data-Driven Agent step |
Data processing, analysis |
| HUMAN_IN_THE_LOOP |
Human review/approval gate |
Quality control, compliance |
Operational Modes
The platform supports five operational modes enabling organizations to configure automation levels:
| Mode |
Name |
Description |
| 0 |
Traditional Platform |
No AI agents; manual investigation and analysis |
| 1 |
AI-Assisted Manual |
Agents in suggest-only mode; all decisions require human approval |
| 2 |
Routine Automation |
Agents handle routine tasks; humans focus on analysis and decisions |
| 3 |
Autonomous with Escalation |
Full automation with escalation on exceptions |
| 4 |
Fully Automated with Audit |
End-to-end automation with post-investigation human audits |
Exchange
The Exchange component serves as the platform’s data asset marketplace, enabling organizations to publish, discover, acquire, and monetize data assets with blockchain-backed transactions and transparent revenue sharing.
Core Capabilities
| Capability |
Description |
| Asset Catalog |
Publish, search, and manage data assets (agents, widgets, datasets, models, utilities, media, chains) |
| User Profiles |
Consumer, provider, and dual-role profiles with verification |
| Wallet & Blockchain |
DAAC token on Ethereum for purchases, deposits, withdrawals, and transfers |
| Metering & Billing |
Usage tracking with configurable policies and automated billing |
| Access Control |
Resource-level permission policies (READ, WRITE, DELETE, ADMIN) |
| API Gateway |
Managed endpoints (REST, GraphQL, Webhook, Proxy) with request logging |
| Ledger & Revenue |
Double-entry accounting with revenue allocation, settlement, and reconciliation |
Asset Types
| Asset Type |
Description |
| BEHAVIOUR_DATA |
Behavioral analytics and pattern datasets |
| AGENT |
Executable AI agents built in Studio |
| WIDGET |
Visual interface components |
| UTILITY |
Reusable processing functions and tools |
| MEDIA |
Media files and content assets |
| MODEL |
Machine learning models |
| DATASET |
Structured data collections |
| CHAIN |
Multi-agent workflow chains |
Marketplace Model
| Component |
Description |
| Catalog Service |
Asset registration, search, publish lifecycle (DRAFT → PUBLISHED → ARCHIVED) |
| Wallet Service |
DAAC/ETH digital wallets with deposit, withdrawal, and transfer operations |
| Metering Service |
Usage event capture, billing generation, pricing plan enforcement |
| Access Service |
Per-asset permission policies with least-privilege enforcement |
| Gateway Service |
Managed API endpoints with rate limiting and request logging |
| Ledger Service |
Double-entry financial records with settlement and reconciliation |
Schema Management
The Schema Management system provides a unified approach to defining, discovering, and managing business domain schemas across the platform.
Core Capabilities
| Capability |
Description |
| Business Domain Discovery |
Extract domain concepts from user documents |
| Schema Registry |
Centralized storage with versioning and access control |
| Schema Validation |
Ensure data conformance across all platform components |
| Schema Binding |
Link schemas to agents, extractors, and MCP connectors |
| Component |
Schema Role |
| Studio Agents |
Input/output validation, data transformation control |
| Data Extractors |
Structure external data according to domain model |
| MCP Connectors |
Ensure data consistency across tool integrations |
| Knowledge Graph |
Entity and relationship type definitions |
| Pipelines |
Data flow validation between processing steps |
Document-to-Schema Discovery
| Stage |
Description |
| Document Analysis |
Extract text, structure, and metadata from user documents |
| Concept Extraction |
Identify entities, attributes, and relationships |
| Schema Generation |
Create structured schema definitions |
| User Refinement |
Interactive review and modification |
| Publication |
Register schema in central registry |
AI & LLM Layer
The AI & LLM Layer provides intelligent processing capabilities through a provider-agnostic gateway with comprehensive output consistency and quality controls.
Core Capabilities
| Capability |
Description |
| LightLLM Gateway |
Unified interface for multiple LLM providers |
| Output Consistency |
Schema-validated extraction and ontology-based execution |
| Model Provenance |
Complete tracking of model versions and configurations |
| A/B Testing |
Controlled model updates with performance comparison |
| Quality Assurance |
Feedback loops, accuracy monitoring, reasoning chain transparency |
| Provider Abstraction |
Swap providers without code changes |
| Prompt Management |
Template library with version control |
LLM Output Consistency
| Control |
Description |
| Schema-Validated Extraction |
All outputs validated against user-defined schemas |
| Ontology-Based Execution |
Responses grounded in domain ontology |
| Reasoning Chain Transparency |
Full reasoning paths logged for audit |
| Deterministic Components |
Separation of deterministic vs. probabilistic processing |
Provider Support
| Provider Type |
Examples |
Integration |
| Commercial APIs |
OpenAI, Anthropic, Google |
API key authentication |
| Self-Hosted |
Llama, Mistral, custom models |
Private endpoint |
| Enterprise |
Azure OpenAI, AWS Bedrock |
Cloud provider auth |
Graph Operations Module
The Graph Operations module provides comprehensive capabilities for data intelligence, analytics, and case management.
Core Capabilities
| Capability |
Description |
| Rule Engine |
Data scoring and decision rules |
| Query Workflows |
Complex query orchestration and workflows |
| Analytics Service |
Analytics and reporting capabilities |
| Case Management |
Data case tracking and documentation |
| Reporting |
Analytics, insights, and dashboards |
Rule Engine
The Rule Engine calculates scores based on configurable factors:
| Factor Category |
Examples |
| Data Quality |
Completeness, uniqueness, validity metrics |
| Relationship |
Entity relationships, connection strength |
| Temporal |
Data freshness, change patterns |
| Behavioral |
Access patterns, usage anomalies |
Network Architecture
Network Segmentation
| Zone |
Purpose |
Components |
| Edge/DMZ |
External entry point |
Load balancers, WAF, API Gateway |
| Application |
Service execution |
Studio, Services, AI Layer |
| Data |
Persistent storage |
Knowledge Graph, Document Store |
| Management |
Operations |
Monitoring, Logging, Admin tools |
Connectivity Patterns
| Pattern |
Description |
Use Case |
| Direct TLS |
Encrypted connection over internet |
Cloud-hosted sources |
| VPN Tunnel |
Site-to-site encrypted tunnel |
On-premises sources |
| Private Link |
Cloud provider private connectivity |
Same-cloud sources |
| Agent-Based |
Customer-deployed agent connects outbound |
Air-gapped environments |
Data Flow Model
DataFab operates on a metadata-first principle. Source data remains in place while metadata flows through the platform.
| Data Type |
Handling |
Storage |
| Structural Metadata |
Extracted and stored |
Knowledge Graph |
| Statistical Profiles |
Computed via aggregation |
Knowledge Graph |
| Sample Data |
Optional, user-controlled |
Ephemeral cache |
| Query Results |
Pass-through federation |
Never persisted |
| Source Credentials |
Encrypted storage |
Secure Vault |
Security Architecture
Defense-in-Depth Model
┌─────────────────────────────────────────────────────────────────────┐
│ PERIMETER SECURITY │
│ DDoS Protection │ WAF │ Rate Limiting │ Geographic Filtering │
├─────────────────────────────────────────────────────────────────────┤
│ NETWORK SECURITY │
│ VPC Isolation │ Network Segmentation │ Private Endpoints │
├─────────────────────────────────────────────────────────────────────┤
│ APPLICATION SECURITY │
│ Input Validation │ Output Encoding │ CSRF Protection │ CSP │
├─────────────────────────────────────────────────────────────────────┤
│ IDENTITY & ACCESS │
│ OAuth 2.0/OIDC │ RBAC │ ABAC │ MFA │ Session Management │
├─────────────────────────────────────────────────────────────────────┤
│ DATA SECURITY │
│ Encryption │ Tokenization │ Data Masking │ Classification │
├─────────────────────────────────────────────────────────────────────┤
│ INFRASTRUCTURE SECURITY │
│ Hardened Images │ Patch Management │ Container Security │
└─────────────────────────────────────────────────────────────────────┘
Cryptographic Standards
| Purpose |
Algorithm |
Key Length |
| Data at Rest |
AES-256-GCM |
256-bit |
| Data in Transit |
TLS 1.3 |
256-bit |
| Key Encryption |
RSA-OAEP |
2048-bit minimum |
| Digital Signatures |
ECDSA |
P-256 |
| Hashing |
SHA-256 |
N/A |
Integration Points
External System Integration
| System Type |
Integration Method |
Data Exchange |
| Data Systems |
API / Database connector |
Data models, metadata |
| CRM Systems |
API / Webhook |
Contacts, organizations, activities |
| Document Management |
API / File system |
Documents, metadata |
| Analytics Systems |
API / Database |
Reports, dashboards, metrics |
| External Data |
API |
Third-party data sources |
| Monitoring Providers |
API |
System health, alerts |
MCP Protocol
The Model Context Protocol (MCP) provides a standardized interface for tool integration:
| Component |
Purpose |
| Tool Registry |
Catalog of available tools and capabilities |
| Schema Definition |
Input/output contracts for each tool |
| Authentication |
Tool-specific credential management |
| Execution |
Secure tool invocation with timeout handling |
Deployment Models
DataFab is available in multiple deployment configurations to meet varying security, compliance, and operational requirements.
Deployment Options
| Model |
Description |
Use Case |
| SaaS Multi-Tenant |
Shared infrastructure, isolated data |
Standard deployment, cost-effective |
| Dedicated Cloud Tenant |
Dedicated infrastructure in DataFab cloud |
Enterprise, enhanced isolation |
| Customer Cloud |
Deploy in customer’s cloud tenant (AWS, Azure, GCP) |
Data sovereignty, infrastructure control |
| On-Premises |
Fully customer-managed within data centers |
Maximum control, air-gapped environments |
| Hybrid |
SaaS control plane, on-premises data plane |
Balance of convenience and control |
SaaS Multi-Tenant
| Aspect |
Details |
| Infrastructure |
Shared compute, isolated data stores |
| Data Isolation |
Logical tenant isolation with encryption |
| Compliance |
SOC 2, GDPR compliant infrastructure |
| Updates |
Automatic platform updates |
| Best For |
Standard requirements, rapid deployment |
Dedicated Cloud Tenant
| Aspect |
Details |
| Infrastructure |
Dedicated resources within DataFab cloud |
| Data Isolation |
Physical separation of compute and storage |
| Compliance |
Enhanced compliance posture |
| Updates |
Coordinated update windows |
| Best For |
Enterprise customers requiring dedicated resources |
Customer Cloud Deployment
| Aspect |
Details |
| Infrastructure |
Customer’s own cloud tenant (AWS, Azure, GCP) |
| Data Control |
Customer maintains full infrastructure control |
| Region Selection |
Deploy in preferred cloud region |
| Management |
DataFab manages application, customer manages infrastructure |
| Best For |
Data sovereignty, existing cloud investments |
On-Premises Deployment
| Aspect |
Details |
| Infrastructure |
Customer data centers |
| Data Control |
Complete data custody |
| Network |
Air-gapped capability available |
| Management |
Customer-managed with DataFab support |
| Best For |
Strict regulatory requirements, maximum control |
Data Residency Configuration
| Region Option |
Data Location |
LLM Processing |
| UK-Only |
UK data centers |
UK-based providers or on-premises |
| EU-Only |
EU data centers (Ireland common) |
EU-based providers |
| US-Only |
US data centers |
US-based providers |
| Customer-Specified |
Any supported region |
Region-aligned providers |
| On-Premises |
Customer data centers |
Self-hosted models |
LLM Provider Configuration by Deployment
| Deployment |
LLM Options |
| SaaS |
Cloud providers (OpenAI, Anthropic, Google) |
| Dedicated Tenant |
Cloud providers with dedicated keys |
| Customer Cloud |
Cloud providers or self-hosted in customer cloud |
| On-Premises |
Self-hosted models (Llama, Mistral) for complete data control |
| Hybrid |
Cloud for non-sensitive, on-premises for sensitive |
Deployment Feature Comparison
| Feature |
SaaS |
Dedicated |
Customer Cloud |
On-Premises |
| Knowledge Fabric |
✓ |
✓ |
✓ |
✓ |
| Studio (Agents, Widgets, Datasets) |
✓ |
✓ |
✓ |
✓ |
| Exchange Interface |
✓ |
✓ |
✓ |
✓ |
| Multi-Tenancy |
✓ |
✓ |
✓ |
✓ |
| LLM Integration |
✓ |
✓ |
✓ |
✓ |
| Custom Region |
Limited |
✓ |
✓ |
N/A |
| Self-Hosted LLMs |
✗ |
✗ |
✓ |
✓ |
For detailed information see individual component documents: Knowledge Fabric, Studio, Exchange, AI-LLM, Graph Operations, and API Security.