Technical obstacles encountered while building the MDS, and how I solved them. For business impact metrics, see Case Studies.


Challenge 1: DuckDB Concurrency in Orchestration

Problem:

DuckDB is an in-process database with a single-writer concurrency model: only one process can hold the database file for writing at a time. When Dagster tried to run multiple DBT jobs concurrently, they competed for locks on the same DuckDB file, causing failures and risking data corruption.
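
The failure mode is easy to reproduce outside Dagster. A minimal sketch of two processes contending for the same file (the path is illustrative):

import duckdb
from multiprocessing import Process

DB_PATH = "warehouse.duckdb"  # illustrative path

def open_for_write():
    # Running in a second process: DuckDB refuses to open the file
    # because another process already holds the write lock.
    duckdb.connect(DB_PATH)

if __name__ == "__main__":
    conn = duckdb.connect(DB_PATH)        # first process takes the write lock
    worker = Process(target=open_for_write)
    worker.start()
    worker.join()                         # child fails with an IO error about the file lock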

Solution:

Implemented tag-based concurrency limits in Dagster’s configuration:

# dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      - key: "dbt_duckdb"
        limit: 1

Every DBT-related asset gets tagged with dbt_duckdb, ensuring only one DuckDB operation runs at a time while allowing other non-DuckDB tasks to continue in parallel.

from dagster import asset

@asset(
    name="build_daily_sales_models",
    op_tags={"dbt_duckdb": "true"},
)
def build_daily_sales_models(context):
    # DBT operations here
    ...

Lesson Learned:

Understanding your database’s concurrency model is crucial when designing orchestration. DuckDB’s single-writer limitation isn’t a bug—it’s a design choice for simplicity. The solution is to embrace it at the orchestration layer, not fight it.


Challenge 2: Cross-Market Customer Identity

Problem:

Customer IDs (member_no) overlap between markets. A UK customer with ID 12345 is a completely different person from a German customer with ID 12345. Early models that forgot this created nonsensical cross-market joins.

Solution:

Enforced a composite key pattern (market, member_no) across all models:

-- Every customer query must use both keys:
-- COUNT(DISTINCT member_no) alone would merge UK 12345 with DE 12345
SELECT
    COUNT(DISTINCT (market, member_no)) AS unique_customers
FROM orders
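
The collision is easy to demonstrate with a throwaway in-memory DuckDB connection (the rows below are made up for illustration):

import duckdb

con = duckdb.connect()  # in-memory database
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        ('UK', 12345, 'ORD-1'),
        ('DE', 12345, 'ORD-2'),
        ('DE', 67890, 'ORD-3')
    ) AS t(market, member_no, order_no)
""")

# Wrong: collapses UK 12345 and DE 12345 into one customer -> returns 2
print(con.execute("SELECT COUNT(DISTINCT member_no) FROM orders").fetchone()[0])

# Right: the composite key keeps markets apart -> returns 3
print(con.execute("SELECT COUNT(DISTINCT (market, member_no)) FROM orders").fetchone()[0])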

Added DBT tests to catch violations:

# schema.yml
models:
  - name: fct_orders
    tests:
      - unique:
          column_name: "market || '_' || member_no || '_' || order_no"

Documented the pattern prominently in CLAUDE.md so AI assistants and new developers don’t make the same mistake.

Lesson Learned:

Data modelling assumptions from source systems don’t always translate to analytical contexts. When consolidating multiple systems, identity resolution must be explicit and tested—never assumed.


Challenge 3: Corporate Network Restrictions

Problem:

Corporate firewalls and proxy settings blocked package installation: pip for Python libraries, and dbt deps for dbt-utils, dbt-expectations, and other community packages. SSL certificate errors added further complexity to package management.

Solution:

Adopted a “no external packages” policy and implemented all transformations using DuckDB’s native functions:

-- Instead of dbt_utils.surrogate_key
md5(concat_ws(chr(0), market, member_no, order_date)) AS unique_key
 
-- Instead of dbt_utils.date_spine
SELECT unnest(generate_series(
    DATE '2020-01-01', current_date, INTERVAL '1 day'
)) AS date
 
-- Instead of dbt_utils.pivot
PIVOT sales_data ON market USING SUM(revenue)

Created internal documentation mapping common dbt_utils functions to DuckDB equivalents.

Lesson Learned:

Constraints breed creativity. Without access to packages, I learned DuckDB’s native capabilities deeply. The resulting code is actually simpler and has no external dependencies to manage.


Challenge 4: European CSV Dialect Handling

Problem:

German and Italian market data exports use semicolons as delimiters (EU convention), while UK uses commas. DuckDB’s auto-detection sometimes failed, especially with edge cases like quoted fields containing the delimiter.

Solution:

Built a fallback parsing system that tries multiple dialect configurations:

fallback_configs = [
    {'delim': ';', 'quote': '"', 'escape': '"'},   # EU standard
    {'delim': ';', 'strict_mode': False},           # EU lenient
    {'delim': ',', 'quote': '"', 'escape': '"'},   # UK/US standard
]
 
for config in fallback_configs:
    try:
        conn.execute(build_query(config))
        break  # Success, stop trying
    except Exception:
        continue  # Try next config
else:
    raise ValueError("No CSV dialect configuration could parse the file")

Added market-specific encoding handling (CP1252 for Windows-generated EU files).
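
The build_query helper in the loop above is elided; a minimal sketch of one possible shape, assuming each attempt loads into a staging table via DuckDB’s read_csv (file and table names are illustrative):

def build_query(config: dict,
                csv_path: str = "exports/orders_de.csv",
                table: str = "stg_orders_raw") -> str:
    """Turn a dialect config into a CREATE TABLE ... AS read_csv(...) statement.
    Keys such as delim, quote, escape and strict_mode pass straight through
    as read_csv named parameters."""
    options = [f"'{csv_path}'"]
    for key, value in config.items():
        if isinstance(value, bool):
            options.append(f"{key}={str(value).lower()}")
        else:
            options.append(f"{key}='{value}'")
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_csv({', '.join(options)})"
    )

Each attempt in the fallback loop is then just conn.execute(build_query(config)), as in the snippet above.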

Lesson Learned:

“Works on my machine” failures often come from regional data format differences. Build defensive parsing that handles the messiness of real-world data, not just clean test data.


Challenge 5: Safe File Operations

Problem:

The pipeline overwrites Parquet files daily. If a write fails mid-operation, you’re left with a corrupted file and no way to recover. This happened twice in production before we caught it.

Solution:

Implemented backup-before-write pattern:

import shutil

def process_blob(self, output_path):
    # Relies on an open DuckDB connection `conn` created elsewhere
    backup_path = None
 
    try:
        # Create backup before overwriting
        if output_path.exists():
            backup_path = output_path.with_suffix('.parquet.backup')
            shutil.copy2(output_path, backup_path)
 
        # Write new file
        conn.execute(f"COPY (...) TO '{output_path}'")
 
        # Verify output
        rows = conn.execute(f"SELECT COUNT(*) FROM '{output_path}'").fetchone()[0]
        if rows == 0:
            raise ValueError("Output has 0 rows")
 
        # Success - remove backup
        if backup_path and backup_path.exists():
            backup_path.unlink()
 
    except Exception:
        # Restore from backup on failure
        if backup_path and backup_path.exists():
            shutil.move(backup_path, output_path)
        raise

Lesson Learned:

Production data pipelines need the same defensive coding as any critical system. Always have a rollback path. Verify outputs before declaring success.


Challenge 6: Single Point of Failure

Problem:

The entire data platform ran on one desktop machine. If that machine went offline, all data refreshes stopped. No redundancy, no monitoring, no alerts.

Solution:

Multi-pronged approach:

  1. State tracking: Pipeline state stored in JSON files that can be picked up by any machine
  2. Stateless design: Each pipeline run checks what’s missing and processes only that
  3. Backup/restore procedures: Documented recovery process tested monthly
  4. Monitoring: Push notifications via ntfy on failures (see the sketch after the state example below)

# State persisted to file, not memory
state = {
    'market': 'UK',
    'processed_files': ['file1.csv', 'file2.csv'],
    'last_sync': '2024-01-15T10:30:00',
    'total_rows_loaded': 15_000_000
}
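
A sketch of how the state file and the ntfy alerts fit together; the paths, topic URL, and process_file loader are illustrative placeholders, not the production code:

import json
from pathlib import Path
from urllib import request

STATE_FILE = Path("state/uk_sync_state.json")   # illustrative path
NTFY_TOPIC = "https://ntfy.sh/mds-pipeline"     # illustrative topic

def sync_market(source_dir: Path) -> None:
    # Stateless resume: diff what's on disk against what the state file says is done
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {"processed_files": []}
    pending = sorted(
        f.name for f in source_dir.glob("*.csv")
        if f.name not in state["processed_files"]
    )
    try:
        for name in pending:
            process_file(source_dir / name)                       # hypothetical loader
            state["processed_files"].append(name)
            STATE_FILE.write_text(json.dumps(state, indent=2))    # persist after every file
    except Exception as exc:
        # ntfy accepts a plain HTTP POST; the request body becomes the notification text
        request.urlopen(request.Request(NTFY_TOPIC, data=f"UK sync failed: {exc}".encode()))
        raise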

Lesson Learned:

Grassroots projects can’t wait for enterprise infrastructure. Build resilience into the design from day one, even with limited resources. Document everything so anyone can take over.


Challenge 7: Network vs Local Build Strategy

Problem:

DuckDB is designed as a local-first database — ideally, the database file and source data live on the same machine. But our source Parquet files lived on network drives. This created a choice: build the database on the network (so others could access it), or build locally (for speed).

Initially, I tried building directly on the network drive. Multiple analysts wanted access to the same DuckDB file, and network storage seemed like the obvious solution. It wasn’t.

What I Tried:

  1. Build on network — Slow. DuckDB explicitly recommends against this due to the heavy I/O. Builds that should take minutes took much longer.

  2. Mirror network to local, then build — Fastest option. Sync all Parquet files to local SSD, build the database there. But maintaining that mirror added operational overhead that wasn’t worth it.

  3. Reference network, build locally (final solution) — Source files stay on the network, but the DuckDB database file lives on the local SSD. This dramatically improved build times.
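
In practice that just means the DuckDB file gets a local path while read_parquet calls keep their network paths. A minimal sketch (the local path is illustrative; the network path matches the staging model below):

import duckdb

# The DuckDB database file lives on the local SSD ...
con = duckdb.connect("C:/data/warehouse.duckdb")

# ... while source Parquet is read over the network share
n = con.execute(
    "SELECT COUNT(*) FROM read_parquet('//network/path/orders/*.parquet')"
).fetchone()[0]
print(f"{n:,} order rows visible from the network share")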

The Materialisation Trick:

The key insight was reducing repeated network reads. For large tables like orders (millions of rows), I materialise them once into DuckDB as staging tables:

-- stg_orders is materialised as a table, not a view
-- This pulls from network ONCE, then lives in DuckDB
{{ config(materialized='table') }}
 
SELECT * FROM read_parquet('//network/path/orders/*.parquet')

Subsequent intermediate and mart models reference stg_orders — which now lives locally in DuckDB — rather than pulling from network every time. It’s a one-time transfer cost amortised across all downstream models.

The Numbers:

  • Network speed: ~500 Mbps
  • Local SSD: ~2000 Mbps (4x faster)
  • But with staging tables materialised locally, the difference rarely mattered — most queries hit the local DuckDB, not the network

Lesson Learned:

DuckDB’s local-first design isn’t a limitation to work around — it’s a feature to embrace. Rather than fighting the architecture (building on network), work with it: keep the database local, materialise large source tables once, and let DuckDB do what it’s good at.


Themes Across Challenges

Looking back, these challenges share common patterns:

  1. Constraints are features - Corporate restrictions forced simpler, more maintainable solutions
  2. Defensive coding matters - Production systems fail; the question is whether they recover gracefully
  3. Documentation is code - CLAUDE.md and inline comments prevented recurring mistakes
  4. Test your assumptions - Customer IDs, CSV formats, concurrency—all seemed obvious until they weren’t