Tabular Data Generation

The SynthAIr project developed specialized synthetic data generators to address critical challenges in Air Traffic Management: data scarcity, privacy constraints, and commercial sensitivity. Our tabular data pipeline processes mixed-type operational data including flight schedules, delays, and turnaround times using five complementary approaches, each optimized for different data characteristics and use cases. To learn more, see our research publications: “Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data” and “Synthetic Flight Data Generation Using Generative Models”. Open-source implementations are available in our tabular generators repositories. Public deliverables describing the tabular data generators are currently under evaluation by the SESAR Joint Undertaking and will be made publicly available upon approval.

Overview

Tabular synthetic data generation in aviation requires handling complex mixed-type datasets with categorical features (airlines, airports, aircraft types), continuous variables (delays, durations), and temporal information (schedules, timestamps). Our approach addresses three fundamental challenges:

Data Scarcity: Limited access to comprehensive flight operations data
Privacy Protection: Commercial sensitivity of operational records
Operational Diversity: Wide range of flight scenarios and conditions

We evaluate generators using the Train on Synthetic, Test on Real (TSTR) methodology across three critical prediction tasks: departure delays, arrival delays, and aircraft turnaround times.
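
To make the TSTR idea concrete, the following is a minimal sketch rather than the project's evaluation code. It assumes a plain least-squares regressor as a stand-in for the downstream model, and the toy data, the `toy_delays` helper, and all feature weights are hypothetical. The same real holdout set scores a model trained on synthetic data (TSTR) and a model trained on real data (the usual baseline, here called TRTR).

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept (stand-in for any regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(coef, X, y):
    """Root mean squared error of the linear model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def tstr_scores(syn, real_train, real_test):
    """Each argument is an (X, y) pair. Returns (TSTR RMSE, TRTR RMSE)."""
    tstr = rmse(fit_linear(*syn), *real_test)          # train on synthetic
    trtr = rmse(fit_linear(*real_train), *real_test)   # train on real
    return tstr, trtr

rng = np.random.default_rng(0)

def toy_delays(n, noise):
    """Hypothetical toy data: delay in minutes as a noisy linear function."""
    X = rng.normal(size=(n, 3))
    y = X @ np.array([5.0, -2.0, 1.0]) + 10.0 + rng.normal(scale=noise, size=n)
    return X, y

real_train, real_test = toy_delays(2000, 3.0), toy_delays(500, 3.0)
synthetic = toy_delays(2000, 4.0)  # pretend this came from a generator

tstr, trtr = tstr_scores(synthetic, real_train, real_test)
print(f"TSTR RMSE={tstr:.2f}  TRTR RMSE={trtr:.2f}")
```

The closer the TSTR error is to the TRTR error, the better the synthetic data substitutes for real data on that task.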

Figure 1: Tabular Data Generation Pipeline. From EU flight operations data (OAG), our AI models generate synthetic flight records validated across three dimensions: privacy protection (Distance to Closest Record), statistical fidelity (correlation preservation, distribution similarity), and utility (performance in downstream prediction tasks).

Model Architectures

REaLTabFormer (Transformer-based)

REaLTabFormer leverages transformer architectures to generate synthetic tabular data by treating each flight record as a sequence of tokens. This autoregressive approach captures complex dependencies between features through sequential generation.

Key Features:

GPT-2 based architecture with 6 transformer decoder layers
Row-wise, token-by-token generation preserving feature relationships
Target masking regularization prevents data memorization
Overfitting detection using Q_δ statistic and Distance to Closest Record (DCR)
Performance: Achieves 94-97% of real-data predictive performance

Figure 2: REaLTabFormer Architecture. Transformer-based autoregressive model treating tabular data as token sequences with specialized regularization and overfitting detection mechanisms.
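
The idea of treating a flight record as a token sequence can be illustrated with a deliberately simplified sketch. It assumes one token per column-value pair and a fixed column order; the actual REaLTabFormer tokenizer is more elaborate, and the column names and helper functions below are hypothetical.

```python
COLUMNS = ["airline", "origin", "dest", "dep_delay_min"]  # hypothetical schema

def row_to_tokens(row, columns):
    """Serialize a record in a fixed column order, one token per column."""
    return [f"{col}={row[col]}" for col in columns]

def build_vocab(rows, columns):
    """Map every observed token to an integer id, plus sequence markers."""
    vocab = {"<bos>": 0, "<eos>": 1}
    for row in rows:
        for tok in row_to_tokens(row, columns):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(row, columns, vocab):
    """Token ids an autoregressive model would be trained to predict in order."""
    body = [vocab[t] for t in row_to_tokens(row, columns)]
    return [vocab["<bos>"]] + body + [vocab["<eos>"]]

def decode(ids, vocab):
    """Turn a generated id sequence back into a record (values as strings)."""
    inverse = {i: t for t, i in vocab.items()}
    pairs = (inverse[i].split("=", 1) for i in ids[1:-1])
    return {col: val for col, val in pairs}

rows = [
    {"airline": "LH", "origin": "FRA", "dest": "MAD", "dep_delay_min": 12},
    {"airline": "FR", "origin": "DUB", "dest": "STN", "dep_delay_min": 0},
]
vocab = build_vocab(rows, COLUMNS)
ids = encode(rows[0], COLUMNS, vocab)
print(ids, decode(ids, vocab))
```

Because generation is autoregressive, each token is predicted conditioned on the tokens before it, which is how dependencies between earlier and later columns are captured.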

TabSyn (Diffusion-based)

TabSyn employs a two-stage pipeline combining Variational Autoencoders with diffusion models in latent space, enabling efficient high-quality synthesis of mixed-type tabular data.

Key Features:

“Encode → Diffuse → Decode” architecture operating in continuous latent space
VAE encoder transforms mixed-type data into continuous representations
Score-based diffusion with linear noise schedule (20-50 sampling steps)
Adaptive β-VAE training with dynamic KL-divergence scheduling
Performance: Strong fidelity with efficient sampling

Figure 3: TabSyn Architecture. Two-stage pipeline with VAE encoding to continuous latent space followed by efficient diffusion-based generation.
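
The sampling stage can be sketched on a toy problem. This is an illustration under stated assumptions, not TabSyn's implementation: it takes the linear noise schedule to mean a noise level of σ(t) = t, replaces the learned score network with the exact score of a one-dimensional Gaussian "latent" distribution, and uses plain Euler steps. The constants `MU`, `S`, and `t_max` are hypothetical.

```python
import numpy as np

MU, S = 2.0, 1.0  # toy latent distribution N(MU, S^2), standing in for VAE latents

def score(x, t):
    """Exact score of the noised toy distribution N(MU, S^2 + t^2).

    In a real pipeline this would be a trained neural network."""
    return -(x - MU) / (S**2 + t**2)

def sample(n, steps=50, t_max=10.0, seed=0):
    """Integrate the probability-flow ODE from t_max down to 0 with Euler steps."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(t_max, 0.0, steps + 1)
    x = rng.normal(scale=t_max, size=n)      # start from noise at level t_max
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = -t_cur * score(x, t_cur)     # ODE drift when sigma(t) = t
        x = x + (t_next - t_cur) * dx_dt     # Euler step toward t = 0
    return x

latents = sample(20000)
print(f"mean={latents.mean():.2f}  std={latents.std():.2f}")
```

With only 50 coarse Euler steps the samples land near, but not exactly on, the target mean and standard deviation. In the full pipeline the resulting latents would then be passed through the VAE decoder to produce mixed-type rows.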

CTGAN (Conditional Adversarial)

Conditional Tabular GAN (CTGAN) addresses aviation data’s mixed-type nature through specialized preprocessing and conditional generation capabilities, particularly effective for handling imbalanced categorical distributions.

Key Features:

Mode-specific normalization using Bayesian Gaussian Mixture Models
Conditional generation with training-by-sampling for rare categories
WGAN-GP loss with gradient penalty for stable adversarial training
PacGAN framework processing 10 samples jointly to prevent mode collapse
Performance: Effective for targeted scenario generation

Figure 4: CTGAN Architecture. Adversarial framework with specialized preprocessing for mixed-type data and conditional generation capabilities.
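
Training-by-sampling can be illustrated with a small sketch. It assumes the condition category is drawn with probability proportional to the logarithm of its frequency (here `log(1 + count)`), so that rare categories are seen far more often than their raw share would allow. The airline codes, counts, and helper names are hypothetical.

```python
import math
import random
from collections import Counter

def log_frequency_probs(values):
    """Sampling probabilities proportional to log(1 + count) per category."""
    counts = Counter(values)
    weights = {cat: math.log1p(n) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}

def sample_condition(values, rng):
    """Draw the category the generator will be conditioned on."""
    probs = log_frequency_probs(values)
    cats = list(probs)
    return rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

def rows_matching(rows, column, category):
    """Real rows that satisfy the sampled condition."""
    return [r for r in rows if r[column] == category]

# Hypothetical, heavily imbalanced airline column.
airlines = ["LH"] * 9000 + ["FR"] * 900 + ["XX"] * 100
rows = [{"airline": a} for a in airlines]

rng = random.Random(0)
probs = log_frequency_probs(airlines)
cond = sample_condition(airlines, rng)
real_row = rng.choice(rows_matching(rows, "airline", cond))  # paired real sample
print(probs, cond, real_row)
```

In this toy column the rarest airline makes up 1% of the rows but is selected as the condition far more often, which is what lets the generator learn rare categories.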

TVAE (Variational Autoencoder)

Tabular Variational Autoencoder (TVAE) provides a probabilistic approach to synthetic data generation with lightweight computational requirements and stable training characteristics.

Key Features:

Evidence Lower Bound (ELBO) optimization
Reparameterization trick for differentiable latent sampling
Batch normalization and L2 regularization
Minimal GPU requirements with fast training
Performance: Good utility-to-compute ratio
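
The reparameterization trick and the ELBO objective can be written down in a few lines. The sketch below is a simplification in NumPy, not TVAE's code: it assumes a diagonal Gaussian encoder, a standard normal prior, and a squared-error reconstruction term, whereas a tabular VAE would also need a categorical reconstruction loss. All function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps so gradients can flow through mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for each row."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

def negative_elbo(x, x_recon, mu, logvar):
    """Reconstruction error plus KL term, averaged over the batch."""
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=1)  # Gaussian likelihood up to a constant
    return float(np.mean(recon + kl_to_standard_normal(mu, logvar)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # toy batch of numeric features
mu, logvar = rng.normal(size=(4, 2)), np.zeros((4, 2))
z = reparameterize(mu, logvar, rng)              # latent codes fed to the decoder
print(z.shape, negative_elbo(x, x, mu, logvar))
```

Minimizing the negative ELBO trades reconstruction quality against keeping the latent codes close to the prior, which is what later allows new rows to be generated by decoding samples from that prior.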

Gaussian Copula (Statistical)

Gaussian Copula represents a classical statistical approach that models dependencies by separating marginal distributions from their correlation structure.

Key Features:

Kernel Density Estimation for marginal distribution modeling
Multivariate Gaussian copula for dependency structure
No neural network training required
Strongest privacy protection among all models
Performance: Best for privacy-critical applications
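
The separation of marginals from the dependency structure can be sketched for numeric columns as follows. Note one swap: for brevity this sketch uses empirical quantiles for the marginals instead of Kernel Density Estimation. The toy delay columns and function names are hypothetical.

```python
import math
from statistics import NormalDist
import numpy as np

_norm = NormalDist()

def to_normal_scores(col):
    """Map a column to standard normal scores via its ranks."""
    ranks = np.argsort(np.argsort(col)) + 1
    u = ranks / (len(col) + 1)                    # uniform values strictly in (0, 1)
    return np.array([_norm.inv_cdf(p) for p in u])

def sample_gaussian_copula(real, n, rng):
    """Fit a Gaussian copula to numeric columns and draw n synthetic rows."""
    d = real.shape[1]
    z = np.column_stack([to_normal_scores(real[:, j]) for j in range(d)])
    corr = np.corrcoef(z, rowvar=False)           # dependency structure
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n)
    u_new = 0.5 * (1.0 + np.vectorize(math.erf)(z_new / math.sqrt(2.0)))  # normal CDF
    # Map back through each column's empirical quantile function (the marginal).
    return np.column_stack([np.quantile(real[:, j], u_new[:, j]) for j in range(d)])

rng = np.random.default_rng(0)
dep = rng.gamma(shape=2.0, scale=10.0, size=2000)   # toy departure delays
arr = dep + rng.normal(scale=5.0, size=2000)        # correlated arrival delays
real = np.column_stack([dep, arr])

synthetic = sample_gaussian_copula(real, 1000, rng)
print(np.corrcoef(real, rowvar=False)[0, 1], np.corrcoef(synthetic, rowvar=False)[0, 1])
```

No iterative training is involved: fitting amounts to ranking each column and estimating one correlation matrix.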

Evaluation Framework

Our evaluation spans three dimensions ensuring synthetic data maintains essential properties for aviation operations:

Fidelity Assessment

Statistical Similarity: Kolmogorov-Smirnov tests for continuous distributions, Chi-squared for categorical
Correlation Preservation: Pearson, Spearman, Kendall correlations plus mixed-type measures
Joint Distribution Fidelity: KL-divergence for multivariate distributional alignment
Likelihood-based Assessment: Bayesian Network and Gaussian Mixture Model likelihood
Detection Difficulty: Logistic regression classifier performance (synthetic vs. real)
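
As an illustration of the first item, the two test statistics can be computed directly with NumPy. This is a sketch of the statistics only (no p-values), with hypothetical toy columns; a library routine would normally be used in practice.

```python
from collections import Counter
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def chi2_statistic(real_cats, syn_cats):
    """Pearson chi-squared statistic on the 2 x k table of category counts."""
    levels = sorted(set(real_cats) | set(syn_cats))
    c_real, c_syn = Counter(real_cats), Counter(syn_cats)
    table = np.array([[c[l] for l in levels] for c in (c_real, c_syn)], dtype=float)
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

rng = np.random.default_rng(0)
real_delay = rng.gamma(shape=2.0, scale=10.0, size=1000)   # toy continuous column
syn_delay = rng.gamma(shape=2.0, scale=11.0, size=1000)    # slightly mis-scaled copy
real_airline = ["LH"] * 150 + ["FR"] * 100 + ["XX"] * 50   # toy categorical column
syn_airline = ["LH"] * 170 + ["FR"] * 100 + ["XX"] * 30

print(ks_statistic(real_delay, syn_delay), chi2_statistic(real_airline, syn_airline))
```

Both statistics are zero when the synthetic column reproduces the real one exactly and grow as the distributions drift apart.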

Privacy Protection

Privacy evaluation uses Distance to Closest Record (DCR) metrics measuring how closely synthetic records resemble real training data:

Baseline Protection: Compares synthetic-to-real distances against a random-data baseline
Overfitting Protection: Tests memorization by comparing proximity to training versus holdout sets
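
The overfitting check can be sketched as follows, assuming all features are numeric and already scaled so that Euclidean distance is meaningful (mixed-type data would need a different distance). The function names and toy arrays are hypothetical.

```python
import numpy as np

def dcr(synthetic, reference):
    """Distance from each synthetic row to its closest row in the reference set."""
    diffs = synthetic[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def memorization_share(synthetic, train, holdout):
    """Fraction of synthetic rows that lie closer to training data than to holdout data.

    With equally sized sets, a value near 0.5 suggests no memorization;
    values near 1.0 suggest the generator copies training records."""
    return float(np.mean(dcr(synthetic, train) < dcr(synthetic, holdout)))

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
holdout = rng.normal(size=(500, 4))
fresh = rng.normal(size=(200, 4))   # generator that learned the distribution
copied = train[:200].copy()         # generator that memorized training rows

print(memorization_share(fresh, train, holdout))
print(memorization_share(copied, train, holdout))
```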

Utility Evaluation

Predictive Performance: RMSE, MAE, R² metrics across prediction tasks
Feature Importance Alignment: Cosine similarity of feature importance vectors
Utility Scores: Normalized performance ratios quantifying synthetic-to-real substitutability
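
The last two items reduce to short formulas. The sketch below assumes one plausible definition of a utility score, a ratio oriented so that 1.0 means synthetic data matches real data; the exact normalization used in the project is not given here, and the numbers are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature-importance vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def utility_score(real_metric, synthetic_metric, higher_is_better):
    """Ratio of synthetic-trained to real-trained performance (assumed definition).

    For error metrics such as RMSE or MAE the ratio is inverted so that
    a score of 1.0 always means 'as good as real data'."""
    if higher_is_better:
        return synthetic_metric / real_metric
    return real_metric / synthetic_metric

# Hypothetical feature importances from models trained on real vs. synthetic data.
importance_real = [0.50, 0.30, 0.15, 0.05]
importance_syn = [0.45, 0.33, 0.15, 0.07]

print(cosine_similarity(importance_real, importance_syn))
print(utility_score(10.0, 12.5, higher_is_better=False))   # RMSE in minutes
print(utility_score(0.50, 0.47, higher_is_better=True))    # R²
```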