Data trust — last validation run: 2026-03-23
Automated Validation Checks — 5 checks · 4 pass · 1 known issue

Script: snowflake/03_queries/validate_ml_claims.py — runs after every Monday ML refresh. Compares FCT source tables to ML layer outputs. Threshold: 5% discrepancy = FAIL.
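The pass/fail logic reduces to a relative-difference check. A minimal sketch (the function name is illustrative, not the script's actual API; it assumes the discrepancy is measured against the larger of the two figures, which reproduces the +26% reported in Check A):

```python
def check(name: str, fct_value: float, ml_value: float, threshold: float = 0.05) -> str:
    """Compare an FCT source figure to its ML-layer counterpart.

    Assumption: discrepancy is taken relative to the larger of the two
    values, so the 715-vs-969 gap in Check A reads as ~26%.
    """
    denom = max(abs(fct_value), abs(ml_value)) or 1.0
    diff = abs(ml_value - fct_value) / denom
    verdict = "FAIL" if diff > threshold else "PASS"
    return f"{name}: {verdict} (diff = {diff:.1%})"
```

Under this sketch, `check("Check A", 715, 969)` fails at 26.2% while `check("Check D", 701, 701)` passes at 0%.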

FAIL
Check A — Period Totals: FCT recent wells vs ML_MOMENTUM_INPUT
FCT = 715 wells (dismissed excluded) · ML snapshot = 969 wells · diff = +254 wells (26% of the ML figure)
Root cause: ML_MOMENTUM_INPUT has no PIPELINE_STATUS filter — it includes dismissed cases. Validation query excludes dismissed. The 254-well gap is wells from cases the commission ultimately rejected. This is a known design decision, not a pipeline error. See Known Issues below.
PASS
Check B — Top Operator: FCT vs ML_OPERATOR_MOMENTUM
FCT top operator = Phoenix Operating LLC · ML top = Phoenix Operating LLC ✓
PASS
Check C — Murex Zero: 0 recent cases in last 90 days
Murex Petroleum recent cases = 0 ✓
PASS
Check D — Pending Case Count: ML_CASE_SCORES rows vs FCT pending
ML_CASE_SCORES = 701 · FCT pending (new_pending + continued_pending) = 701 · diff = 0% ✓
PASS
Check E — Embedding Coverage: ML_CASEFILE_EMBEDDINGS vs RAW_CASEFILES_INTELLIGENCE
Embeddings = 354 · Raw casefiles = 354 · diff = 0% ✓
Known issues — open as of 2026-03-23
What We Know Is Wrong or Incomplete
🔴
ML_MOMENTUM_INPUT includes dismissed cases — the +421.9% Phoenix growth and all momentum numbers include wells from cases the commission later rejected. If you want "wells likely to be drilled," the momentum figures are inflated by ~26%. Fix: add WHERE PIPELINE_STATUS != 'dismissed' to ML_MOMENTUM_INPUT. Deferred — need to decide whether this view measures activity volume or approved pipeline.
⚠️
HEARING_DATE = most recent after continuances — a case filed in January that was continued to March shows March as its hearing date. Continued cases appear in the recent 90-day window even if the regulatory process started much earlier. The IS_CONTINUED_FROM_PRIOR flag and continuation_count feature are the reliable indicators. Raw case counts by hearing date will look higher than filing date counts — both are correct, they measure different things.
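The hearing-date vs filing-date distinction can be made concrete with a toy record (all dates and counts below are hypothetical):

```python
from datetime import date, timedelta

RUN_DATE = date(2026, 3, 23)
WINDOW_START = RUN_DATE - timedelta(days=90)   # 2025-12-23

# Hypothetical case: filed in November, continued twice, last heard in March.
case = {
    "CASE_FILING_DATE": date(2025, 11, 10),
    "HEARING_DATE": date(2026, 3, 18),     # most recent hearing after continuances
    "IS_CONTINUED_FROM_PRIOR": True,       # the reliable continuation indicator
    "continuation_count": 2,
}

def in_recent_window(d: date) -> bool:
    return d >= WINDOW_START

by_hearing = in_recent_window(case["HEARING_DATE"])      # counted as "recent"
by_filing = in_recent_window(case["CASE_FILING_DATE"])   # outside the window
```

Both counts are correct; they answer different questions, which is why the flag and `continuation_count` are the features to trust.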
⚠️
WELLS_AUTHORIZED comes from the docket hearing clerk, not the casefile PDF — operators describe requests in acres/sections ("2560-acre standup spacing unit"), not discrete well counts. The clerk records the authorized count at the hearing. The casefile wells_requested field is often null. This is expected, but it means the source of the number is a human clerk record, not machine-extracted data.
ℹ️
regulatory_decision_signals is null for all 354 casefiles — the V2 embedding_signals vector falls back to case_type + county + operator only. The signals field is populated in order files but not casefiles. The V2 improvement is real but embedding_signals is less semantically rich than intended until casefile extraction is fixed.
ℹ️
DAYS_SINCE_FILING has zero variance — all CASE_FILING_DATE values are null in the current data, so this feature defaults to 90 days for every case. It is included in the model for when the extraction pipeline is fixed, but contributes nothing now.
ℹ️
ML score spread (0.28%–99.97%) was broken until Phase 2 — in Round 1 all 701 pending cases scored ~91.2% uniformly. Root cause was a code bug: feature keys were passed as lowercase strings to PREDICT(), but Snowflake ML requires uppercase column names, so all features silently evaluated to NULL and the model returned the base rate. Fixed 2026-03-23.
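The Phase 2 fix amounts to normalizing feature keys before scoring. A sketch of the idea (helper name and sample values are illustrative): Snowflake ML matches features to training columns by exact name, and those columns are uppercase, so lowercase keys match nothing, every feature reads as NULL, and the model returns one uniform fallback score for every row.

```python
def prepare_features(raw: dict) -> dict:
    """Uppercase feature keys so they line up with the model's training columns.

    With lowercase keys, every feature silently evaluates to NULL and the
    model returns the same fallback score for all 701 pending cases.
    """
    return {k.upper(): v for k, v in raw.items()}

round1_input = {"pools": "BAKKEN", "counties": "MCKENZIE", "wells_authorized": 8}
fixed_input = prepare_features(round1_input)
# fixed_input: {"POOLS": "BAKKEN", "COUNTIES": "MCKENZIE", "WELLS_AUTHORIZED": 8}
```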
Model details — case_progression_model (Phase 2)
Classifier: How Probabilities Are Computed

Snowflake ML CLASSIFICATION model trained on historical docket outcomes. Predicts probability that a pending case will progress to a signed order (vs. dismissed).

Training Data
Total labeled examples: 391
Progressed to order (TRUE): 246
Dismissed (FALSE): 145
Base approval rate: 62.9% (246 / 391)
Window: last 12 months from FCT
Evaluation Metrics (hold-out set)
F1 — Progressed (TRUE): 0.842
F1 — Dismissed (FALSE): 0.762
Score distribution (pending): 0.28% – 99.97%
Avg score: 70.6% · Std dev: 32.7%
Distinct scores (2 d.p.): 91
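For reference, the per-class F1 above is the harmonic mean of precision and recall. A quick sketch with hypothetical confusion counts (these are not the model's actual hold-out counts, just numbers that land near the reported TRUE-class score):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical: 40 true positives, 8 false positives, 7 false negatives.
score = f1(40, 8, 7)   # ~0.842
```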
Feature Importances — Snowflake ML internal (SHAP-based)
⚠️
DAYS_SINCE_FILING (0.0%) is dead weight — all CASE_FILING_DATE values are null, so this feature defaults to 90 for every case and teaches the model nothing. It will become useful once the extraction pipeline populates filing dates.
POOLS 18.4%
COUNTIES 13.6%
WELLS_AUTHORIZED 13.1%
OPERATOR_WIN_RATE 12.6%
CONTINUATION_COUNT 10.3%
SPACING_ACRES 9.2%
APPLICATION_TYPE 8.3%
CASE_CLASSIFICATION 7.0%
COMMERCIAL_RELEVANCE 5.7%
APPLICANT 3.7%
IS_CONTINUED_FROM_PRIOR 2.5%
DAYS_SINCE_FILING 0.0%

Blue bars = Phase 2 new features. Grey = feature present but contributing nothing.

Live data — from operator_momentum_report.json
Operator Pipeline Rankings — weighted ML score

Pending cases × ML progression probability. Source: ANALYTICS.V_REGULATORY_MOMENTUM. Updated weekly (Mondays) by snowflake/03_queries/generate_applied_intelligence.py.

How weighted_well_pipeline is calculated
⚠️
Probabilities were trained on dismissed cases — V_REGULATORY_MOMENTUM joins ML_CASE_SCORES to FCT_DOCKET_CASES and filters to pending statuses, so the view itself excludes dismissed cases. The ML probability, however, was trained on a window that includes dismissed cases. The weighted score should be interpreted as "weighted by model confidence," not "guaranteed drilled wells."
-- V_REGULATORY_MOMENTUM source (simplified)
SELECT
    dc.APPLICANT                                    AS operator,
    dc.COUNTIES                                     AS county,
    dc.POOLS                                        AS formation,
    COUNT(dc.CASE_NO)                               AS pending_cases,
    SUM(dc.WELLS_AUTHORIZED)                        AS wells_in_pipeline,
    SUM(dc.WELLS_AUTHORIZED
        * s.prediction['probability']['True']::FLOAT
    )                                               AS weighted_well_pipeline,
    AVG(s.prediction['probability']['True']::FLOAT) AS avg_progression_probability
FROM ANALYTICS.FCT_DOCKET_CASES dc
JOIN ANALYTICS.ML_CASE_SCORES s
    ON dc.CASE_NO = s.CASE_NO
WHERE dc.PIPELINE_STATUS IN ('new_pending','continued_pending')
GROUP BY dc.APPLICANT, dc.COUNTIES, dc.POOLS
ORDER BY weighted_well_pipeline DESC

WELLS_AUTHORIZED = clerk-recorded count at hearing (not from casefile PDF). prediction['probability']['True'] = Snowflake ML CLASSIFICATION output.
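The aggregation restated in Python for a handful of hypothetical pending cases (operator names, well counts, and probabilities below are made up for illustration):

```python
# Each row: (operator, wells_authorized, prob_progression) for one pending case.
cases = [
    ("Operator A", 8, 0.95),
    ("Operator A", 4, 0.60),
    ("Operator B", 12, 0.30),
]

def weighted_pipeline(rows) -> dict:
    """Mirror of weighted_well_pipeline: sum(wells * P(progression)) per operator."""
    out: dict = {}
    for operator, wells, prob in rows:
        out[operator] = out.get(operator, 0.0) + wells * prob
    return out

# Operator A: 8*0.95 + 4*0.60 = 10.0 weighted wells from 12 nominal wells.
# Operator B: 12*0.30 = 3.6 weighted wells despite the larger nominal count.
```

This is why the ranking can differ from a raw `wells_in_pipeline` sort: a large but low-confidence pipeline ranks below a smaller, high-confidence one.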

Top Operators by Weighted Well Pipeline — pending cases × prob(progression) · run 2026-03-23
Live data — from county_momentum.json
County Momentum — recent vs baseline wells

Recent 90-day well authorizations vs prior 90-day baseline. Growth = (recent − baseline) / baseline. Source: ANALYTICS.FCT_DOCKET_CASES. Note: HEARING_DATE = most recent after continuances.

How county growth is calculated
🔴
Dismissed cases ARE included — ML_MOMENTUM_INPUT (which feeds TOP_INSIGHTS) uses no PIPELINE_STATUS filter. The query below excludes dismissed, matching what you'd expect for "approved pipeline." This creates the 715 vs 969 discrepancy in Check A. The county growth numbers below use the stricter (dismissed-excluded) filter.
-- County momentum source SQL
SELECT
    COUNTIES                                        AS county,
    SUM(CASE WHEN HEARING_DATE >= DATEADD(day,-90,CURRENT_DATE())
             THEN COALESCE(WELLS_AUTHORIZED,1) ELSE 0 END)
                                                    AS recent_wells,
    SUM(CASE WHEN HEARING_DATE >= DATEADD(day,-180,CURRENT_DATE())
             AND HEARING_DATE <  DATEADD(day,-90,CURRENT_DATE())
             -- half-open window: day -90 belongs to the recent bucket only;
             -- BETWEEN would double-count it against the >= -90 filter above
             THEN COALESCE(WELLS_AUTHORIZED,1) ELSE 0 END)
                                                    AS baseline_wells,
    COUNT(CASE WHEN HEARING_DATE >= DATEADD(day,-90,CURRENT_DATE())
               AND PIPELINE_STATUS NOT IN ('dismissed')
               THEN 1 END)                          AS recent_cases,
    COUNT(CASE WHEN PIPELINE_STATUS IN
               ('new_pending','continued_pending')
               THEN 1 END)                          AS pending_cases
FROM ANALYTICS.FCT_DOCKET_CASES
WHERE PIPELINE_STATUS NOT IN ('dismissed')
GROUP BY COUNTIES
ORDER BY recent_wells DESC

COALESCE(WELLS_AUTHORIZED, 1): cases with no clerk-recorded count are proxied as 1 well. Multi-county rows (e.g., "Divide, Williams") are filtered out of this display.
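The growth arithmetic and the COALESCE proxy, restated in Python on hypothetical well counts (`None` stands in for a missing clerk-recorded count):

```python
def well_count(wells_authorized):
    """Mirror of COALESCE(WELLS_AUTHORIZED, 1): missing clerk counts proxy as 1 well."""
    return 1 if wells_authorized is None else wells_authorized

def growth(recent: float, baseline: float):
    """Growth = (recent - baseline) / baseline; undefined when the baseline is zero."""
    if baseline == 0:
        return None
    return (recent - baseline) / baseline

# Hypothetical county: three recent cases (one missing a count), two baseline cases.
recent = sum(well_count(w) for w in [8, None, 4])   # 8 + 1 + 4 = 13
baseline = sum(well_count(w) for w in [None, 2])    # 1 + 2 = 3
# growth(13, 3) -> +3.33, displayed as +333%
```

Note that the `None` → 1 proxy inflates small counties the least in absolute terms but can swing their growth percentage heavily, since the baseline is tiny.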

County Activity — Recent 90d vs Baseline 90d · sorted by recent well authorizations · single-county rows only
Model details — embeddings (Phase 2 V2)
Semantic Embeddings: What We Embed and Why
Embedding Strategy: V1 vs V2
V1 — ML_CASEFILE_EMBEDDINGS
One embedding per document. Concatenates: case_type + county + operator + operator_request + technical_justification.
354 rows · 768 dims · semantic detail diluted by metadata
V2 — ML_CASEFILE_EMBEDDINGS_V2
Three embeddings per document: embedding_request (operator request text only), embedding_justification (technical justification only), embedding_signals (case_type + county + operator — falls back because regulatory_decision_signals is null in all 354 casefiles).
354 rows × 3 vectors · similarity scores 0.04–0.11 higher than V1
⚠️
V2 improvement verified on 5 test cases. Most meaningful: case 32392 — V1 returned a Denied precedent as top match (diluted by averaged metadata). V2 correctly surfaced a Granted precedent by isolating request-language similarity. Score range improved from 0.71–0.85 (V1) to 0.80–0.89 (V2).
-- ML_CASEFILE_EMBEDDINGS_V2 (three vectors per casefile)
SELECT
    raw_json:case_identification:case_number::STRING  AS case_number,
    -- Vector 1: what the operator asked for
    SNOWFLAKE.CORTEX.EMBED_TEXT_768(
        'snowflake-arctic-embed-m-v1.5',
        COALESCE(raw_json:case_purpose_and_application:operator_request::STRING,
                 'no request text available')
    ) AS embedding_request,
    -- Vector 2: their technical argument
    SNOWFLAKE.CORTEX.EMBED_TEXT_768(
        'snowflake-arctic-embed-m-v1.5',
        COALESCE(raw_json:case_purpose_and_application:technical_justification::STRING,
                 'no justification text available')
    ) AS embedding_justification,
    -- Vector 3: structural context (falls back — signals null in casefiles)
    SNOWFLAKE.CORTEX.EMBED_TEXT_768(
        'snowflake-arctic-embed-m-v1.5',
        TRIM(
            COALESCE(raw_json:case_identification:case_type::STRING, '')
            || ' ' ||
            COALESCE(raw_json:case_identification:county::STRING, '')
            || ' ' ||
            COALESCE(raw_json:operator_and_well_intelligence:applicant_operator::STRING, '')
        )
    ) AS embedding_signals
FROM RAW.RAW_CASEFILES_INTELLIGENCE
WHERE raw_json:case_identification:case_number IS NOT NULL
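Why splitting the vectors helps can be seen with plain cosine similarity. The toy 4-dimensional vectors below are illustrative stand-ins for the real 768-dim embeddings, arranged to reproduce the case 32392 pattern: under a V1-style blended vector, shared metadata boilerplate can outrank genuine request-language similarity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def blend(request, metadata):
    """V1-style single vector: request and metadata signal averaged together."""
    return [(r + m) / 2 for r, m in zip(request, metadata)]

# Query case (toy vectors, not real embeddings):
q_req, q_meta = [1, 0, 0, 0], [0, 0, 1, 0]
# Granted precedent: similar request language, different metadata.
g_req, g_meta = [0.9, 0.1, 0, 0], [0, 0, 0, 1]
# Denied precedent: unrelated request, but shared metadata boilerplate.
d_req, d_meta = [0, 1, 0, 0], [0, 0, 1, 0]

v1_granted = cosine(blend(q_req, q_meta), blend(g_req, g_meta))
v1_denied = cosine(blend(q_req, q_meta), blend(d_req, d_meta))
v2_granted = cosine(q_req, g_req)   # request-only comparison
v2_denied = cosine(q_req, d_req)
# V1 ranks the Denied precedent first; V2 correctly surfaces the Granted one.
```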