# Quality Scoring

Data quality analysis and scoring.
## Scoring Dimensions
Auto-Pilot evaluates data quality across four dimensions, each with a specific weight that reflects its importance in overall data quality.
### 1. Completeness (30% weight)

Measures how much data is missing, as the percentage of null values in the dataset. Null values can break queries and skew analysis results.
Calculation:

```
completeness_score = (1 - null_count / total_rows) * 100
```

Penalties:
- 0% null: Full points
- 1-5% null: -5 points
- 5-10% null: -10 points
- 10-20% null: -20 points
- >20% null: -30 points
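To make the calculation concrete, here is a minimal Python sketch of the completeness formula and penalty tiers above. The function names are illustrative, not part of the excel-to-sql API:

```python
def completeness_score(null_count: int, total_rows: int) -> float:
    """Base completeness score from the formula above (0-100)."""
    return (1 - null_count / total_rows) * 100

def completeness_penalty(null_pct: float) -> int:
    """Point deduction per the penalty tiers above (illustrative)."""
    if null_pct == 0:
        return 0
    if null_pct <= 5:
        return 5
    if null_pct <= 10:
        return 10
    if null_pct <= 20:
        return 20
    return 30
```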
### 2. Uniqueness (25% weight)

Detects duplicate rows, which can distort analysis results and violate database constraints.
Calculation:

```
uniqueness_score = (unique_rows / total_rows) * 100
```

Penalties:
- 100% unique: Full points
- 95-99% unique: -5 points
- 90-94% unique: -10 points
- 80-89% unique: -20 points
- <80% unique: -30 points
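The uniqueness formula follows the same pattern; a minimal sketch (illustrative name, not the actual API):

```python
def uniqueness_score(unique_rows: int, total_rows: int) -> float:
    """Share of distinct rows as a percentage; 100 means no duplicates."""
    return unique_rows / total_rows * 100
```

Per the tiers above, a table that is 97% unique would, for example, lose 5 points.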
### 3. Validity (25% weight)

Verifies that data values conform to expected types and formats. Invalid values cause errors in processing and analysis.
Checks:
- Type consistency (integers in integer columns)
- Format validation (emails, dates, phone numbers)
- Range violations (negative ages, future dates)
- Regex pattern matches
Penalties:
- 100% valid: Full points
- 95-99% valid: -5 points
- 90-94% valid: -10 points
- <90% valid: -20 to -30 points
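As an illustration, a format check could apply the same email regex used in the validation-rules example later on this page; `is_valid_email` and `validity_score` are hypothetical helpers, not part of the SDK:

```python
import re

# Same email pattern as the validation-rules example on this page
EMAIL_RE = re.compile(r"^[^@]+@[^@]+\.[^@]+$")

def is_valid_email(value: str) -> bool:
    return bool(EMAIL_RE.match(value))

def validity_score(values, is_valid) -> float:
    """Percentage of values passing a format check."""
    return sum(1 for v in values if is_valid(v)) / len(values) * 100
```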
### 4. Consistency (20% weight)

Analyzes value distributions to detect anomalies and statistical outliers that may indicate data entry errors.
Checks:
- Outlier detection (statistical outliers)
- Standard deviation analysis
- Value distribution patterns
- Domain knowledge validation
Penalties:
- Normal distribution: Full points
- Minor outliers: -5 points
- Moderate outliers: -10 points
- Major outliers: -20 points
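A common way to implement the outlier check is a z-score cutoff, as in the "3 standard deviations" issue shown in the example report on this page. This sketch is illustrative, not the actual implementation:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z_score) for points beyond the cutoff."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # constant column: no outliers by this definition
    return [(i, v, (v - mean) / sd)
            for i, v in enumerate(values)
            if abs(v - mean) / sd > threshold]
```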
## Quality Grades
The four dimension scores are combined using a weighted average to produce an overall quality score from 0 to 100, which is then converted to a letter grade for easy interpretation.
```
overall_score = 0.30 * completeness + 0.25 * uniqueness + 0.25 * validity + 0.20 * consistency
```
| Grade | Score Range | Meaning |
|---|---|---|
| A | 90-100 | Excellent quality |
| B | 75-89 | Good quality with minor issues |
| C | 60-74 | Acceptable with notable issues |
| D | 0-59 | Poor quality requires attention |
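The weighting and grading described above can be sketched in a few lines of Python. The weights and grade cutoffs come from this page; the functions themselves are illustrative, not the excel-to-sql API:

```python
# Dimension weights as documented above
WEIGHTS = {"completeness": 0.30, "uniqueness": 0.25,
           "validity": 0.25, "consistency": 0.20}

def overall_score(dimension_scores: dict) -> float:
    """Weighted average of the four dimension scores (each 0-100)."""
    return sum(dimension_scores[d] * w for d, w in WEIGHTS.items())

def grade(score: float) -> str:
    """Map a 0-100 score to the letter grades in the table above."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    return "D"
```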
## Example Report
Quality reports provide a detailed breakdown of scores, issues, and recommendations for improving data quality.
```json
{
  "table": "products",
  "overall_score": 87,
  "grade": "B",
  "dimensions": {
    "completeness": {"score": 95, "weight": 0.30, "issues": []},
    "uniqueness": {"score": 82, "weight": 0.25, "issues": ["3 duplicate rows found"]},
    "validity": {"score": 90, "weight": 0.25, "issues": ["5 invalid email formats"]},
    "consistency": {"score": 85, "weight": 0.20, "issues": ["Price column has outliers"]}
  },
  "issues": [
    {
      "column": "email",
      "type": "format",
      "severity": "MEDIUM",
      "message": "5 rows have invalid email format"
    },
    {
      "column": "price",
      "type": "outlier",
      "severity": "LOW",
      "message": "10 prices exceed 3 standard deviations from mean"
    }
  ]
}
```

## Issue Types
Issues are categorized by severity to help you prioritize which problems to address first.
### By Severity
- **HIGH**: Critical data integrity issues
  - Null values in NOT NULL columns
  - Type mismatches >10%
  - Duplicate primary keys
- **MEDIUM**: Quality improvements
  - Null values 5-10%
  - Some format violations
  - Outliers detected
- **LOW**: Nice-to-have optimizations
  - Minor formatting issues
  - Slight inconsistencies
  - Naming convention suggestions
### By Category
Issues are also grouped by the type of problem they represent, making it easier to identify systematic data quality issues.
| Category | Examples |
|---|---|
| Null Values | Missing required data |
| Duplicates | Duplicate rows, duplicate keys |
| Type Mismatches | Text in numeric columns |
| Format Issues | Invalid emails, dates |
| Outliers | Statistical anomalies |
| Range Issues | Negative ages, future dates |
## Column-Level Analysis
Quality analysis drills down to individual columns, providing detailed statistics for each field including null counts, unique values, and outliers.
Each column is analyzed individually:
```json
{
  "column_name": "price",
  "type": "float",
  "quality_score": 85,
  "null_count": 5,
  "null_percentage": 3.3,
  "unique_count": 140,
  "unique_percentage": 93.3,
  "min_value": 0.99,
  "max_value": 9999.99,
  "mean": 52.47,
  "std_dev": 234.56,
  "outliers": [
    {"row": 42, "value": 9999.99, "z_score": 4.2}
  ]
}
```

## Generate Quality Report
You can generate detailed HTML quality reports using either the CLI or Python SDK. Reports include dimension breakdowns, column analysis, and actionable recommendations.
### CLI

```bash
excel-to-sql profile --table products --output quality_report.html
```

### Python SDK
```python
from excel_to_sql import ExcelToSqlite

sdk = ExcelToSqlite()
sdk.generate_quality_report("products", output="quality_report.html")
```

The report opens in a browser with:
- Overall score and grade
- Dimension breakdowns
- Column-level analysis
- Issue list with severity
- Recommendations
## Interpret Scores
- **90-100 (Grade A)**: Excellent quality. Action: use as-is, no changes needed.
- **75-89 (Grade B)**: Good quality with minor issues. Action: review and fix HIGH priority issues.
- **60-74 (Grade C)**: Acceptable but needs improvement. Action: fix HIGH and MEDIUM priority issues.
- **0-59 (Grade D)**: Poor quality. Action: major data cleaning required.
## Improve Quality Score

### Fix Null Values
Set a default value in the mapping configuration:

```json
{
  "column_mappings": {
    "price": {"target": "price", "type": "float", "default": 0.0}
  }
}
```

### Remove Duplicates
Add a uniqueness validation rule:

```json
{
  "validation_rules": [
    {"column": "id", "type": "unique"}
  ]
}
```

### Fix Format Issues
Add a regex validation rule:

```json
{
  "validation_rules": [
    {
      "column": "email",
      "type": "regex",
      "pattern": "^[^@]+@[^@]+\\.[^@]+$"
    }
  ]
}
```

## See Also
- Pattern Detection - Detection algorithms
- Recommendations - Auto-fix capabilities
- Profiling - Python SDK profiling