Synthetic Data Quality Report

Detailed analysis of synthetic data generation results

{{ generator_info.method }} | {{ generator_info.num_samples }} samples | Random seed: {{ generator_info.random_state }}

Executive Summary

This report presents a detailed analysis of the synthetic data generated using the {{ generator_info.method }} method. The synthetic dataset contains {{ generator_info.num_samples }} samples and aims to preserve the statistical properties of the original dataset while maintaining privacy.

Generation Method: {{ generator_info.method }}

Number of Samples: {{ generator_info.num_samples }}

Random Seed: {{ generator_info.random_state }}

Overall Quality Score: {{ "%.2f"|format(quality_score) }}

Statistical Similarity: {{ "%.2f"|format(metrics.overall.statistical_similarity) }}

Privacy Score: {{ "%.2f"|format(metrics.overall.privacy_score) }}

Real Data Size: {{ metrics.overall.real_data_size }}

Synthetic Data Size: {{ metrics.overall.synthetic_data_size }}

Size Ratio: {{ "%.2f"|format(metrics.overall.size_ratio) }}
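The summary fields above read from a nested context: `generator_info`, `quality_score`, and `metrics.overall`. As a rough sketch of how a caller might assemble it, the hypothetical helper below builds that mapping from plain values. The function name `build_summary_context` and the definition of `size_ratio` as synthetic rows divided by real rows are assumptions; the template itself does not say how the ratio is computed.

```python
def build_summary_context(method, num_samples, random_state,
                          statistical_similarity, privacy_score,
                          correlation_fidelity_score,
                          real_size, synthetic_size, quality_score):
    """Assemble the nested mapping read by the summary fields (hypothetical helper)."""
    if real_size <= 0:
        raise ValueError("real_size must be positive to compute a size ratio")
    return {
        "generator_info": {
            "method": method,
            "num_samples": num_samples,
            "random_state": random_state,
        },
        # Must be numeric: the template formats it with "%.2f".
        "quality_score": float(quality_score),
        "metrics": {
            "overall": {
                "statistical_similarity": float(statistical_similarity),
                "privacy_score": float(privacy_score),
                "correlation_fidelity_score": float(correlation_fidelity_score),
                "real_data_size": real_size,
                "synthetic_data_size": synthetic_size,
                # Assumption: ratio of synthetic rows to real rows.
                "size_ratio": synthetic_size / real_size,
            }
        },
    }
```

Jinja2 resolves dotted names such as `metrics.overall.size_ratio` against dictionary keys when no attribute of that name exists, so plain dictionaries are enough here.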

Distribution Analysis

{% for col_name in numerical_cols[:6] %}

{{ col_name }}

Real Data
Synthetic Data
{% endfor %}

Categorical Analysis

{% for col_name in categorical_cols[:6] %}

{{ col_name }}

Real Data
Synthetic Data
{% endfor %}

Correlation Analysis

Real Data Correlation Matrix

Synthetic Data Correlation Matrix

Quality Metrics

Statistical Similarity
{{ "%.2f"|format(metrics.overall.statistical_similarity) }}
Based on distribution matching
Privacy Score
{{ "%.2f"|format(metrics.overall.privacy_score) }}
Higher values indicate lower re-identification risk
Correlation Preservation
{{ "%.2f"|format(metrics.overall.correlation_fidelity_score) }}
Correlation structure similarity
Overall Quality
{{ "%.2f"|format(quality_score) }}
Comprehensive quality score
Category | Metric | Value | Interpretation
{% for category, category_metrics in detailed_metrics.items() %}
{% for metric_name, metric_value in category_metrics.items() %}
{{ category|capitalize }} | {{ metric_name|replace('_', ' ')|capitalize }} | {{ "%.4f"|format(metric_value) if metric_value is number else metric_value }} | {% if category == 'statistical' and metric_name in ['avg_ks_statistic', 'avg_distribution_difference', 'correlation_mean_difference'] %}{% if metric_value < 0.1 %}Excellent{% elif metric_value < 0.2 %}Good{% elif metric_value < 0.3 %}Acceptable{% else %}Poor{% endif %}{% elif category == 'privacy' and metric_name == 'at_risk_percentage' %}{% if metric_value < 1 %}Very Low Risk{% elif metric_value < 5 %}Low Risk{% elif metric_value < 10 %}Moderate Risk{% else %}High Risk{% endif %}{% elif category == 'utility' and 'score' in metric_name %}{% if metric_value > 0.9 %}Excellent{% elif metric_value > 0.8 %}Good{% elif metric_value > 0.7 %}Acceptable{% else %}Poor{% endif %}{% else %}-{% endif %}
{% endfor %}
{% endfor %}
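The Interpretation column applies three different threshold scales depending on the metric's category and name. The sketch below restates that branching as a plain Python function, which can be easier to unit-test than inline template logic. The function name `interpret_metric` and the constant `STATISTICAL_DISTANCE_METRICS` are illustrative only; the thresholds are copied from the template.

```python
# Metric names the template treats as distances (lower is better).
STATISTICAL_DISTANCE_METRICS = {
    "avg_ks_statistic",
    "avg_distribution_difference",
    "correlation_mean_difference",
}


def interpret_metric(category, metric_name, value):
    """Return the label the Interpretation column would show for one metric."""
    if category == "statistical" and metric_name in STATISTICAL_DISTANCE_METRICS:
        # Distance-like metrics: smaller values mean closer distributions.
        if value < 0.1:
            return "Excellent"
        if value < 0.2:
            return "Good"
        if value < 0.3:
            return "Acceptable"
        return "Poor"
    if category == "privacy" and metric_name == "at_risk_percentage":
        # Percentage of records considered at risk (0 to 100 scale).
        if value < 1:
            return "Very Low Risk"
        if value < 5:
            return "Low Risk"
        if value < 10:
            return "Moderate Risk"
        return "High Risk"
    if category == "utility" and "score" in metric_name:
        # Score-like metrics: larger is better, strict comparisons.
        if value > 0.9:
            return "Excellent"
        if value > 0.8:
            return "Good"
        if value > 0.7:
            return "Acceptable"
        return "Poor"
    # No interpretation rule applies to this metric.
    return "-"
```

Note that the utility thresholds use strict comparisons, so a score of exactly 0.9 is labeled "Good" rather than "Excellent".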

Data Samples

Real Data Sample

{% for col in real_sample_columns %}{{ col }}{% if not loop.last %} | {% endif %}{% endfor %}
{% for row in real_sample_data %}
{% for value in row %}{{ value }}{% if not loop.last %} | {% endif %}{% endfor %}
{% endfor %}

Synthetic Data Sample

{% for col in synthetic_sample_columns %}{{ col }}{% if not loop.last %} | {% endif %}{% endfor %}
{% for row in synthetic_sample_data %}
{% for value in row %}{{ value }}{% if not loop.last %} | {% endif %}{% endfor %}
{% endfor %}

Conclusion and Recommendations

Key Findings

  • The {{ generator_info.method }} method generated {{ generator_info.num_samples }} synthetic samples with an overall quality score of {{ "%.2f"|format(quality_score) }}
  • Statistical similarity between real and synthetic data is {{ 'excellent' if metrics.overall.statistical_similarity > 0.9 else 'good' if metrics.overall.statistical_similarity > 0.8 else 'acceptable' if metrics.overall.statistical_similarity > 0.7 else 'poor' }}
  • Privacy risk is {{ 'very low' if metrics.overall.privacy_score > 0.9 else 'low' if metrics.overall.privacy_score > 0.8 else 'moderate' if metrics.overall.privacy_score > 0.7 else 'high' }}
  • Correlation structure is {{ 'well preserved' if metrics.overall.correlation_fidelity_score > 0.8 else 'reasonably preserved' if metrics.overall.correlation_fidelity_score > 0.6 else 'poorly preserved' }}
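The findings above map three overall scores to wording through chained conditional expressions. The following sketch restates those mappings as small Python functions so the cut-offs can be checked in isolation. The function names are hypothetical; the thresholds mirror the template, including that the correlation scale uses 0.8 and 0.6 rather than 0.9, 0.8, and 0.7.

```python
def similarity_label(score):
    """Wording for metrics.overall.statistical_similarity."""
    if score > 0.9:
        return "excellent"
    if score > 0.8:
        return "good"
    if score > 0.7:
        return "acceptable"
    return "poor"


def privacy_risk_label(score):
    """Wording for metrics.overall.privacy_score (higher score, lower risk)."""
    if score > 0.9:
        return "very low"
    if score > 0.8:
        return "low"
    if score > 0.7:
        return "moderate"
    return "high"


def correlation_label(score):
    """Wording for metrics.overall.correlation_fidelity_score."""
    if score > 0.8:
        return "well preserved"
    if score > 0.6:
        return "reasonably preserved"
    return "poorly preserved"
```

Keeping the cut-offs in one place like this is one possible design; it avoids the same numbers drifting apart between the findings and the metrics table.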

Recommendations

  1. Usage Recommendation: This synthetic dataset is {{ 'suitable' if quality_score > 0.8 else 'potentially suitable' if quality_score > 0.7 else 'not recommended' }} for {{ 'all analytical purposes' if quality_score > 0.8 else 'most analytical purposes, but caution is advised for sensitive analyses' if quality_score > 0.7 else 'analytical purposes without significant improvements' }}.
  2. Improvements: {% if metrics.overall.statistical_similarity < 0.8 %} Statistical similarity could be improved by adjusting the generation parameters or trying alternative methods. {% endif %} {% if metrics.overall.privacy_score < 0.8 %} Privacy protection could be enhanced by applying additional anonymization techniques. {% endif %} {% if metrics.overall.correlation_fidelity_score < 0.8 %} Correlation preservation could be improved by using methods that better capture relationships between variables. {% endif %} {% if metrics.overall.statistical_similarity >= 0.8 and metrics.overall.privacy_score >= 0.8 and metrics.overall.correlation_fidelity_score >= 0.8 %} The synthetic data shows good quality across all dimensions. Consider increasing the sample size for even better results. {% endif %}
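The two recommendations above combine a usage verdict based on `quality_score` with up to three improvement notes, each triggered when a score falls below 0.8. A minimal sketch of the same decision logic in Python follows. The function names and the short keys returned by `improvement_notes` are illustrative stand-ins for the full sentences in the template.

```python
def usage_recommendation(quality_score):
    """Return the usage verdict sentence fragment for a quality score."""
    if quality_score > 0.8:
        return "suitable for all analytical purposes"
    if quality_score > 0.7:
        return ("potentially suitable for most analytical purposes, "
                "but caution is advised for sensitive analyses")
    return "not recommended for analytical purposes without significant improvements"


def improvement_notes(statistical_similarity, privacy_score,
                      correlation_fidelity_score):
    """Return short keys naming which improvement sentences would be shown."""
    notes = []
    if statistical_similarity < 0.8:
        notes.append("statistical_similarity")
    if privacy_score < 0.8:
        notes.append("privacy")
    if correlation_fidelity_score < 0.8:
        notes.append("correlation")
    # Mirrors the template's final branch: all three scores at or above 0.8.
    if (statistical_similarity >= 0.8 and privacy_score >= 0.8
            and correlation_fidelity_score >= 0.8):
        notes.append("all_good")
    return notes
```

For ordinary numeric scores at least one branch always fires, so the Improvements item is never left empty.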