ML Intelligence Layer

Regulatory Activity Momentum — Snowflake ML & Cortex AI on ND Docket Data

What You Can Do Today

Three ways to act on the ML intelligence layer right now — no data science background required.

Operator Field Prioritization

See which operators have the most wells moving through the regulatory pipeline right now — weighted by ML progression probability.

Who uses this: Service company reps. Call Phoenix Operating's completion manager before the competition does.

Case Due Diligence Brief

For any pending case number: ML progression probability, top-3 similar historical cases and their outcomes, operator momentum context.

Who uses this: Landmen and mineral rights owners evaluating whether a pending case will result in a drilled well.

County Momentum Snapshot

Which North Dakota counties are accelerating vs. slowing down right now — recent vs. baseline well authorization counts.

Who uses this: Operators and investors looking for where activity is building, not just where it already is.

Data updated weekly. Source: ND Industrial Commission docket + Snowflake ML. See methodology below.

Why Regulatory Data Leads the Market

Production data in North Dakota is published monthly with a two-month lag. By the time a well shows up in production stats, the real decision — the one that created value — happened more than a year earlier. The regulatory pipeline is that decision, and it is entirely public.

Every oil and gas well in North Dakota moves through a predictable sequence before it ever produces a barrel. That sequence is documented in public regulatory filings, and it unfolds over 6–18 months ahead of production data:

Permit Filed: Day 0
Docket Hearing: ~Day 45
Signed Order: ~Day 90
Wellfile Signals: ~Day 180
Production: ~Day 540+

Most commercial data products track permits and production — the first and last steps. Nobody is systematically connecting dockets → casefiles → signed orders → wellfile signals as a unified, machine-scored predictive layer. That is what this ML work builds toward.

Note: This is not a production forecasting tool. It is a regulatory pipeline intelligence layer — designed to tell you what activity is likely to become production before the production data exists.

Our Process

The ML work described here is built entirely on top of the existing data pipeline. No third-party ML infrastructure, no external model training services. Everything runs inside Snowflake, using its native Cortex ML functions against data we already had in the warehouse.

Data Foundation

Before ML can run, the data pipeline has already done the heavy lifting:

Source                      | What It Contains                                                                                     | Rows
FCT_DOCKET_CASES            | Every NDIC docket case with applicant, county, application type, pipeline status, wells authorized  | ~1,200
RAW_CASEFILES_INTELLIGENCE  | LLM-extracted JSON from casefile PDFs — operator intent, technical justification, signals           | 354
RAW_ORDERFILES_INTELLIGENCE | LLM-extracted JSON from signed order PDFs — decision signals, conditions, outcomes                  | 1,086

The Four ML Questions We Asked

We framed the ML work around four concrete questions a field team or analyst would actually pay to answer:

ML.CLASSIFICATION

"Will this pending case become a signed order?"

Train on historical cases with known outcomes. Score every pending case with a 0–1 probability.

EMBED_TEXT_768

"What happened in similar cases before?"

Embed every casefile as a semantic vector. For any pending case, find the 5 most similar historical cases and their outcomes.

ML.TOP_INSIGHTS

"Which operators are accelerating activity?"

Compare recent 90 days vs prior 90 days. Automatically identify which operator-county-formation combinations explain the change.

CORTEX.CLASSIFY_TEXT

"What type of order is this?"

Zero-shot classification of signed orders into structured categories — no training data required.

Snowflake ML Functions Used

All four implementations use Snowflake's native Cortex ML functions — no external compute, no custom model deployment. The entire ML layer runs as SQL or Python against the existing warehouse.

Function                        | Type                             | Cost Model
SNOWFLAKE.ML.CLASSIFICATION     | Supervised binary classification | ~$8 (train + score 277 cases)
SNOWFLAKE.CORTEX.EMBED_TEXT_768 | 768-dim semantic embeddings      | ~$4 (465 docs × avg 200 tokens)
SNOWFLAKE.ML.TOP_INSIGHTS       | Key driver / segment analysis    | ~$5 (warehouse compute)
SNOWFLAKE.CORTEX.CLASSIFY_TEXT  | Zero-shot text classification    | ~$2 (111 orders)

ML Accomplished on the Dataset

Four ML implementations were deployed in the initial sprint. Each creates a persistent Snowflake object that is queryable in Cortex Analyst (the natural language SQL interface) and refreshed weekly.

Case Progression Classifier

A binary classification model trained on 1,511 historical docket cases with known outcomes — either they progressed to a signed order or they were dismissed. The model learns which features predict each outcome and scores all 277 currently pending cases (Round 4 retrain).

Features used in training: operator name, application type, commercial relevance tier, wells authorized, spacing acres, whether the case is a continuation, formation/pool, county, and case classification (new vs. continued).

Target: DID_PROGRESS — TRUE if the case eventually became a signed order.

-- Training data — historical cases with known outcomes
CREATE OR REPLACE VIEW ML_CASE_TRAINING AS
SELECT
    applicant,
    application_type,
    commercial_relevance,
    wells_authorized,
    spacing_acres,
    is_continued_from_prior,
    pools,
    counties,
    case_classification,
    (pipeline_status = 'progressed_to_order') AS did_progress
FROM FCT_DOCKET_CASES
WHERE pipeline_status IN ('progressed_to_order', 'dismissed', 'continued_pending');

-- Train the model
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION case_progression_model(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'ML_CASE_TRAINING'),
    TARGET_COLNAME => 'DID_PROGRESS'
);
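Scoring uses the model's PREDICT method on the pending cases. A minimal sketch of that step, assuming a ML_CASE_PENDING view with the same feature columns (the view name and the exact shape of the prediction object are assumptions, not the production script):

-- Hedged sketch, not the production script. OBJECT_CONSTRUCT(*) emits uppercase
-- keys from column names, which avoids the lowercase-key bug described below.
CREATE OR REPLACE TABLE ML_CASE_SCORES AS
SELECT
    case_no,
    case_progression_model!PREDICT(INPUT_DATA => OBJECT_CONSTRUCT(*)) AS prediction
FROM ML_CASE_PENDING;  -- assumed view: pending cases with the training feature columns

-- Read the 0-1 probability out of the prediction object
-- (the probability key mirrors the label value; exact casing may vary):
SELECT case_no, prediction:probability:"true"::FLOAT AS p_progress
FROM ML_CASE_SCORES
ORDER BY p_progress DESC;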

Evaluation results (hold-out set):

F1 Score: 0.922 (true class, progressed — Round 4)
Precision: 0.84 (true class)
Recall: 0.92 (true class)
⚠️ Round 1 limitation (fixed in Round 2): All 701 pending cases initially scored at ~91.2% — the model's base rate. Two causes were found: (1) a bug where feature keys were passed lowercase to the prediction function (Snowflake ML requires uppercase), causing all features to be treated as null; (2) pending cases lacked discriminating features. Round 2 fixed both. Scores now range 0.28%–99.97% with real spread (stddev 0.327, 91 distinct values). See Round 2 section below.

Output table: ML_CASE_SCORES — 277 rows, one per pending case, with a prediction object containing class and probability.


Semantic Precedent Engine

Every casefile and signed order is embedded as a 768-dimensional semantic vector using Snowflake's snowflake-arctic-embed-m-v1.5 model. The embedding captures the meaning of the document — what the operator is requesting and why — rather than just keyword matches.

The text embedded for each casefile combines: case type, county, operator name, the operator's stated request, and the technical justification. This is the richest semantic signal available in the casefile JSON.

-- Embed each casefile — content = what the operator asked for and why
SNOWFLAKE.CORTEX.EMBED_TEXT_768(
    'snowflake-arctic-embed-m-v1.5',
    TRIM(
        COALESCE(raw_json:case_identification:case_type::STRING, '') || ' ' ||
        COALESCE(raw_json:case_identification:county::STRING, '') || ' ' ||
        COALESCE(raw_json:operator_and_well_intelligence:applicant_operator::STRING, '') || ' ' ||
        COALESCE(raw_json:case_purpose_and_application:operator_request::STRING, '') || ' ' ||
        COALESCE(raw_json:case_purpose_and_application:technical_justification::STRING, '')
    )
) AS embedding

The ML_CASE_PRECEDENTS view uses VECTOR_COSINE_SIMILARITY to find the top-5 most semantically similar historical cases for every currently pending case.
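A hedged sketch of how that view can be expressed; the IS_PENDING flag and OUTCOME column are illustrative stand-ins for however the real tables mark case status:

-- Illustrative sketch: top-5 most similar resolved cases per pending case.
CREATE OR REPLACE VIEW ML_CASE_PRECEDENTS AS
SELECT
    p.case_no                                           AS pending_case,
    h.case_no                                           AS precedent_case,
    h.outcome,
    VECTOR_COSINE_SIMILARITY(p.embedding, h.embedding) AS similarity
FROM ML_CASEFILE_EMBEDDINGS p
JOIN ML_CASEFILE_EMBEDDINGS h
  ON p.is_pending AND NOT h.is_pending  -- compare pending cases to resolved ones only
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY p.case_no
    ORDER BY VECTOR_COSINE_SIMILARITY(p.embedding, h.embedding) DESC
) <= 5;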

Sample output — precedent matching in action:

Pending Case | Type    | Precedent | Outcome | Similarity
31928        | Spacing | 31981     | Granted | 0.815
31928        | Spacing | 31982     | Granted | 0.809
31928        | Spacing | 32205     | Granted | 0.806
31929        | Spacing | 31981     | Granted | 0.829
31929        | Spacing | 32008     | Granted | 0.736

Spacing cases return Spacing precedents at 0.80+ similarity — the semantic matching is working correctly. When a case's top-5 precedents are all Granted at high similarity, that is a meaningfully different signal from one where precedents are mixed.
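That difference can be quantified straight from the precedent view, for example as the Granted share among each pending case's top-5 matches (a sketch, using the illustrative column names above):

-- One risk-profile row per pending case.
SELECT
    pending_case,
    AVG(IFF(outcome = 'Granted', 1, 0)) AS granted_share,
    AVG(similarity)                     AS avg_similarity
FROM ML_CASE_PRECEDENTS
GROUP BY pending_case
ORDER BY granted_share DESC, avg_similarity DESC;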

Output objects: ML_CASEFILE_EMBEDDINGS (354 rows), ML_ORDER_EMBEDDINGS (111 rows), ML_CASE_PRECEDENTS view (top-5 per pending case, 212 pending cases covered).


Operator Activity Momentum

SNOWFLAKE.ML.TOP_INSIGHTS compares regulatory activity between the recent 90-day window and the prior 90-day baseline. The function automatically identifies which operator-county-formation-application_type segments statistically explain the overall change — and by how much.

Input: 555 docket cases from the last 180 days, bucketed by first_hearing_date (when each case originally appeared on a docket, not subsequent continuations). Metric: wells_authorized.
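A minimal sketch of the call pattern, assuming ML_MOMENTUM_INPUT carries the segment columns plus an IS_RECENT label and the WELLS_AUTHORIZED metric (names are illustrative, not the exact production script):

-- Instantiate and run the key-driver analysis.
CREATE OR REPLACE SNOWFLAKE.ML.TOP_INSIGHTS operator_momentum();

CALL operator_momentum!GET_DRIVERS(
    INPUT_DATA     => TABLE(ML_MOMENTUM_INPUT),
    LABEL_COLNAME  => 'IS_RECENT',         -- FALSE = baseline 90 days, TRUE = recent 90 days
    METRIC_COLNAME => 'WELLS_AUTHORIZED'   -- the quantity whose change gets decomposed
);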

Overall change: +2.8% increase in wells authorized — 723 in the baseline period vs 743 in the recent period. The headline number is modest; the signal is in the mix — Burke County driving a surge while Williams and Mountrail decline.

Top segments by relative contribution:

Segment                                          | Contribution | Growth  | Baseline | Recent
Burke County                                     | +7.45        | +573%   | 26       | 175
New spacing units (not Burke/Mountrail/Williams) | +5.10        | +110%   | 93       | 195
Phoenix Operating — Burke County                 | +4.95        | +4,950% | 2        | 101
Williams County                                  | −3.65        | −46%    | 157      | 85
Mountrail County                                 | −3.60        | −57%    | 97       | 42
Murex Petroleum Corp.                            | −3.15        | −100%   | 49       | 0

Rerun 2026-03-24 with corrected methodology: all period bucketing uses first_hearing_date (date a case first appeared on any docket) rather than the most recent hearing date. This eliminates double-counting of continued cases across multiple hearings. Contribution scores are relative shares of overall change; growth is (recent − baseline) / baseline.
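In SQL terms the correction looks roughly like this, a sketch with illustrative raw-table names (FCT_DOCKET_CASES now carries FIRST_HEARING_DATE; the CTE shows how it is derived):

-- FIRST_HEARING_DATE = the date a case FIRST appeared on any docket, so a case
-- continued three times still lands in exactly one period bucket.
WITH first_seen AS (
    SELECT case_no, MIN(hearing_date) AS first_hearing_date
    FROM RAW_DOCKET_ROWS  -- assumed: one row per docket appearance, incl. continuations
    GROUP BY case_no
)
SELECT
    IFF(f.first_hearing_date >= DATEADD(day, -90, CURRENT_DATE()),
        'recent', 'baseline')   AS period,
    SUM(c.wells_authorized)     AS wells_authorized
FROM FCT_DOCKET_CASES c
JOIN first_seen f USING (case_no)
WHERE f.first_hearing_date >= DATEADD(day, -180, CURRENT_DATE())
GROUP BY period;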

Output: ML_OPERATOR_MOMENTUM (58 segments), V_REGULATORY_MOMENTUM composite view (340 operator-county-formation rows ranked by weighted well pipeline).


Signed Order Auto-Classifier

SNOWFLAKE.CORTEX.CLASSIFY_TEXT is a zero-shot classifier — it requires no training data. We defined eight categories and let the model read each order's type description and first regulatory signal to assign a label. This runs incrementally on every new signed order loaded into the pipeline.
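The call is a single function invocation per order. A hedged sketch (the category list is abridged to the six observed below, and the source column paths are assumptions):

-- Zero-shot label per signed order; CLASSIFY_TEXT returns an object with a "label" key.
SELECT
    order_no,
    SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
        raw_json:order_type_description::STRING || ' ' ||
        raw_json:first_regulatory_signal::STRING,
        ['spacing_unit', 'pooling', 'water_disposal',
         'enhanced_recovery', 'exception_to_field_rules', 'other']
    ):label::STRING AS order_category
FROM RAW_ORDERFILES_INTELLIGENCE;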

Category                 | Count | % of Orders
spacing_unit             | 48    | 43%
pooling                  | 33    | 30%
water_disposal           | 11    | 10%
enhanced_recovery        | 7     | 6%
other                    | 7     | 6%
exception_to_field_rules | 5     | 5%

(The remaining two of the eight defined categories matched none of the 111 orders in this batch.)

This structured classification enables filtered queries like "show me all spacing unit orders from the last 90 days" without any string-matching heuristics on the raw PDF text.
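For example, assuming the labels are persisted alongside order dates (table and column names illustrative):

-- All spacing-unit orders from the last 90 days, no string matching on raw PDF text.
SELECT order_no, order_date
FROM ML_ORDER_CATEGORIES
WHERE order_category = 'spacing_unit'
  AND order_date >= DATEADD(day, -90, CURRENT_DATE());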

What This Means

The ML layer does not predict production. It predicts intent — regulatory intent, measured 6 to 18 months before any barrel flows. Here is what the current results tell us, and what we are building toward.

Burke County is surging, and Phoenix Operating is driving it

Burke County went from 26 wells authorized in the baseline period to 175 in the recent period — a +573% increase. Within Burke, Phoenix Operating alone went from 2 baseline wells to 101 recent wells. This is not a market-wide surge: the overall basin is up only +2.8%. Burke is the specific zone of acceleration, while Williams (−46%) and Mountrail (−57%) are contracting. Phoenix also has 32 wells pending in Divide County and 29 in Williams, across the Skabo-Bakken, Burg-Bakken, and Big Stone-Bakken formations.

This is the kind of signal a land broker, service company, or mineral rights owner would pay to see early. It is publicly available in docket records — but it takes the ML aggregation to see the rate of change, not just the raw count. Well counts are based on the date each case first appeared on a docket; continued cases are counted once at their original filing date.

Murex Petroleum has completely stopped — and Mountrail is declining

Murex went from 49 wells authorized in the baseline period to zero in the recent period. −100% is not a slowdown, it is a stop. Whether that is a strategic pause, a capital reallocation, or an exit signal cannot be determined from docket data alone — but the docket data is the first place it shows up, months before any public announcement.

Mountrail County is down 57% in the recent period (97 → 42 wells), and Williams County is down 46% (157 → 85 wells). Burke and the new spacing-unit belt outside Mountrail/Williams are absorbing that activity. The basin is shifting geographically, and the regulatory pipeline is the leading edge of that shift.

Precedent matching tells you how similar cases resolved

For any pending docket case, the system can now surface the five most semantically similar historical cases — including whether they were Granted, Dismissed, or Denied. A spacing unit case with five Granted precedents at 0.81+ similarity is a very different risk profile from one with mixed or Dismissed precedents.

This is the kind of research a specialized regulatory attorney or broker would conduct manually on each case. The semantic engine makes it instantaneous and systematic across all 212 pending cases.

What we are working toward: the weighted pipeline

The V_REGULATORY_MOMENTUM view ranks every operator-county-formation combination by weighted well pipeline — pending wells multiplied by their probability of becoming signed orders. This is the composite intelligence number: it integrates how many wells are proposed, how likely they are to be approved, and which operators are in the most active regulatory posture.
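The core aggregation behind the view is simple once the classifier scores exist. A sketch with illustrative names (the real view also ranks rows and joins momentum context):

-- Weighted pipeline = pending wells × ML probability of reaching a signed order.
SELECT
    applicant,
    county,
    formation,
    SUM(wells_authorized * p_progress) AS weighted_well_pipeline
FROM PENDING_CASES_SCORED  -- assumed join of pending cases with ML_CASE_SCORES
GROUP BY applicant, county, formation
ORDER BY weighted_well_pipeline DESC;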

In Round 1 the probability component was uniform at 91.2% (the model's base rate) because the classifier lacked features that discriminate among pending cases. Round 2 added continuation count, operator historical win rate, and days since filing, and the weighted pipeline now shows real spread — and that spread is the signal that drives prioritization.

The end-state answer this system is designed to give: "Which operators have the most wells likely to be approved in the next 90 days, in which counties, and is their activity accelerating or decelerating?" — with enough lead time to act on it.


The 12-month production validation loop

Once more production data accumulates, we can test a specific hypothesis: does WEIGHTED_WELL_PIPELINE today predict actual production 12 months out? If the answer is yes — if a high weighted pipeline today correlates with high production a year later — then this dataset contains a genuine 12-month leading indicator that no public data provider currently offers.

That validation is the long-term goal of the ML layer. The models running now are building the training history needed to answer that question.

Round 2 — What We Did and What We Learned

Round 1 established infrastructure and proved the approach. Round 2 fixed the known limitations, improved signal quality, and built the first applied outputs. Here's what was done, what was discovered, and what comes next.

Completed — March 2026

DONE
Fixed classifier uniform scores → real probability spread
Two causes were found: (1) a code bug — feature keys passed as lowercase to PREDICT() while Snowflake ML requires uppercase, causing all features to silently null out; (2) missing within-pending discriminators. Added operator historical win rate (12.6% feature importance), continuation count (10.3%), and days_since_filing (see the win-rate sketch after this list). Result: scores now range 0.28%–99.97%, stddev 0.327, 91 distinct values. Evaluation F1 held at 0.84.
DONE
Section-level embeddings (V2) — higher retrieval precision
Replaced one-vector-per-document with three embeddings per casefile: operator request, technical justification, and regulatory signals, each embedded separately. V2 similarity scores are consistently 0.04–0.11 higher than V1. Key qualitative win: in a case where V1 returned a Denied precedent as the top match (misled by averaged metadata), V2 correctly surfaced a Granted precedent by isolating request-language similarity.
DONE
Automated validation framework
validate_ml_claims.py runs 5 SQL checks after every weekly ML refresh and compares ML outputs to FCT source data. First run found a real discrepancy: the ML_MOMENTUM_INPUT view includes dismissed cases (no pipeline_status filter), while validation excludes them. This is a design decision to document, not a pipeline failure. Also added s3_spot_check.py for case-level ground-truth verification.
DONE
Fixed continuation double-counting — added FIRST_HEARING_DATE
Momentum bucketing used HEARING_DATE (most recent docket appearance) rather than the date a case was originally filed. A case continued three times over six months would count as "recent" activity each time, emptying the baseline bucket and inflating the recent one. Fix: added FIRST_HEARING_DATE (MIN hearing date per case_no across all raw docket rows) to FCT_DOCKET_CASES, and switched all time-window comparisons to use it. Impact: the Phoenix headline dropped from +422% to +61%, and overall market growth from +66.5% to +2.8%. More importantly, the true story emerged — Burke County surging +573% while Williams (−46%) and Mountrail (−57%) contract. The continuation artifact had been masking a geographic rotation, not a market-wide surge.
DONE
Applied intelligence outputs
Generated the weekly operator prioritization report (top 10 operators by weighted ML pipeline), a per-county momentum snapshot (21 counties with growth rates), and a Lambda case-brief endpoint that returns ML score + top-3 precedents + operator momentum for any case number.
DONE
Excluded dismissed cases from ML_MOMENTUM_INPUT
Design decision resolved: exclude dismissed. A dismissed case has no drilling outcome and should not contribute to a drilling activity forecast. Added AND PIPELINE_STATUS NOT IN ('dismissed') to the ML_MOMENTUM_INPUT view in ml_operator_momentum.py. Impact: the overall market signal sharpened from +2.8% → −9.4% (non-Burke activity is actually contracting). Burke/Divide now stand out more cleanly as the real growth story.
DONE
Retrained classifier on expanded 2024–2026 dataset — F1 0.937
Training set grew from 391 to 1,511 labeled examples after extending the FCT_DOCKET_CASES date filter to 2024-01-01. Dropped DAYS_SINCE_FILING from the feature list (zero variance — all filing dates null in source data, contributing 0% feature importance). New model: F1 0.937 on the progressed class (was 0.84), precision 0.912, recall 0.962. Score spread maintained: stddev 0.326, min 0.15%, 54 distinct values across 277 pending cases scored.
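The operator historical win-rate feature added above reduces to a single aggregate over resolved cases. A sketch, not the exact feature-engineering code:

-- Share of each operator's resolved cases that reached a signed order.
SELECT
    applicant,
    AVG(IFF(pipeline_status = 'progressed_to_order', 1, 0)) AS operator_win_rate
FROM FCT_DOCKET_CASES
WHERE pipeline_status IN ('progressed_to_order', 'dismissed')
GROUP BY applicant;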

Key Findings — Things We Didn't Know Before Running This

FINDING
WELLS_AUTHORIZED comes from the docket clerk, not the casefile PDF
Operators describe their request in acres/sections ("2560-acre standup spacing unit") — not as a well count. The docket hearing clerk records the authorized count at the hearing. The casefile's wells_requested field is often null. This is expected behavior, not a data gap. It means the momentum numbers measure clerk-authorized wells, which is the right number for leading-indicator purposes.
RESOLVED
HEARING_DATE double-counting — now fixed with FIRST_HEARING_DATE
Initially identified that HEARING_DATE reflects the most recent docket appearance, not the original filing. A case continued three times would count as "recent" three times over. Root fix: FIRST_HEARING_DATE column added to FCT_DOCKET_CASES via MIN(hearing_date) across all raw docket rows per case_no. All momentum bucketing now uses FIRST_HEARING_DATE. The IS_CONTINUED_FROM_PRIOR flag and continuation_count remain as model features for case-level progression scoring, separate from the period-bucketing fix.
FINDING
Corrected bucketing revealed a geographic rotation, not a market surge
With continuation double-counting removed, overall market growth drops from an apparent +66.5% to a real +2.8%. What looked like a broad Phoenix-led surge is actually a basin rotation: Burke County accelerating +573% (driven almost entirely by Phoenix Operating, 2 → 101 wells) while Williams (−46%) and Mountrail (−57%) contract. The continuation artifact had been moving cases from the baseline bucket into the recent bucket each time they were heard again — systematically inflating the recent period and deflating the baseline.
FINDING
Removing dismissed cases sharpened the geographic signal further
With dismissed cases excluded from momentum bucketing, the overall market reads −9.4% — not the noisy +2.8% from before. The non-Burke basket is clearly contracting (−28.8%). The TOP_INSIGHTS model identified two clean acceleration signals: Burke County +444% (Phoenix/Burke alone +3,200%) and Divide County +412%, with a new Divide segment appearing independent of Phoenix. The rest of the basin is slowing. Dismissed cases had been diluting the baseline uniformly across all counties, muffling the contrast between growing and shrinking areas.
FINDING
regulatory_decision_signals is null in casefile JSON (all 354 records)
The signals field is populated in order files but not casefiles. The V2 embedding_signals section therefore falls back to case_type + county + operator. This is fine for structural similarity queries but means the "signals" embedding is less semantically rich than intended. Fixing this requires populating the signals during casefile extraction.

Round 3 — Still Ahead

DONE
Exclude dismissed cases from ML_MOMENTUM_INPUT
Decision made and applied. See Completed section above.
Execution prompt:
File: snowflake/03_queries/ml_operator_momentum.py
In CREATE OR REPLACE VIEW ML_MOMENTUM_INPUT (~line 44), add after the COMMERCIAL_RELEVANCE filter:
AND PIPELINE_STATUS NOT IN ('dismissed')

Then run in order:
1. python3 snowflake/03_queries/ml_operator_momentum.py — rebuilds ML_MOMENTUM_INPUT, retrains TOP_INSIGHTS, rebuilds V_REGULATORY_MOMENTUM
2. python3 snowflake/03_queries/generate_applied_intelligence.py — regenerates operator_momentum_report.json and county_momentum.json
DONE
Retrain progression classifier on expanded dataset
Completed. 391 → 1,511 examples, F1 0.937. See Completed section above.
Execution prompt:
File: snowflake/03_queries/ml_case_progression.py
Step 1: Remove DAYS_SINCE_FILING from the training feature list (zero variance — all null in source).
Step 2: Run python3 snowflake/03_queries/ml_case_progression.py
Step 3: Run python3 snowflake/03_queries/validate_ml_claims.py
Target metrics: score stddev > 0.25, F1 ≥ 0.80, no uniform-score regression.
Note: If training data has expanded to 600+ examples, also check that precision/recall balance hasn't shifted.
HIGH
Populate regulatory_decision_signals in casefile extraction
Promoted from MED — this is blocking two downstream items (service demand predictor and higher-quality V2 embeddings). Currently null for all 354 casefiles; the LLM extraction prompt never asked for this field. Adding it requires a schema update to the casefile extraction prompt plus a re-run of the top-relevance casefiles.
Execution prompt:
Step 1: Find the casefile extraction prompt in documents/casefiles/ or the relevant extraction script.
Step 2: Add a "regulatory_decision_signals" key to the output schema — ask the LLM to identify:
  ["precedent_cited", "contested_mineral_rights", "environmental_concern", "operator_track_record", "technical_justification_strength"]
Step 3: Re-run extraction on the ~50 highest-commercial-relevance casefiles to validate output.
Step 4: After validation, run the V2 embedding update to use the new signals field.
Verification: SELECT COUNT(*) FROM RAW_CASEFILES_INTELLIGENCE WHERE raw_json:regulatory_decision_signals IS NOT NULL
MED
Service demand predictor
Blocked on the regulatory_decision_signals fix above.
Use service_opportunity_signals in the casefile JSON as a multi-label classification target. Predict which service categories (directional drilling, frac/completions, cementing) a new case will trigger, with lead time. Once regulatory_decision_signals is populated, the casefile JSON will also be richer for training.
MED
Production validation loop — preliminary correlation now viable
FCT_DOCKET_CASES now covers 2024-01-01 through present — 15 months of docket history. EIA 914 monthly production data is already in the RAW tables. A first-pass leading-indicator correlation can be run now; no longer waiting for "12 months to accumulate." Run it as a preliminary analysis to validate or refute the thesis before building a full model.
Execution prompt:
Step 1: For each operator-county in 2024-Q1 and 2024-Q2, pull SUM(WEIGHTED_WELL_PIPELINE) from FCT_DOCKET_CASES.
Step 2: For those same operator-counties in 2025-Q1 and 2025-Q2, pull EIA monthly production change from OIL_GAS_ANALYTICS.RAW.
Step 3: Run a simple Pearson correlation in Snowflake (or export and run in Python/pandas).
Output: scatter plot data as JSON — weighted_pipeline vs production_change_12mo — to validate the leading-indicator thesis.
LOW
Permit velocity anomaly detection
Apply SNOWFLAKE.ML.ANOMALY_DETECTION to weekly permit filing counts per operator/county. Flag sudden acceleration or deceleration before it is visible in aggregate stats.
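For Step 3 of the production-validation prompt above, the Pearson correlation can stay entirely in SQL: Snowflake's CORR aggregate computes it directly (a sketch; the joined table name is illustrative):

-- Pearson r between the 2024 weighted pipeline and 2025 production change,
-- one row per operator-county pair.
SELECT CORR(weighted_pipeline, production_change_12mo) AS pearson_r
FROM LEADING_INDICATOR_PAIRS;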

Round 4 — Next Up

HIGH
Populate regulatory_decision_signals in casefile extraction
Null for all 354 casefiles — the LLM extraction prompt never asked for this field. Blocking two downstream items: the service demand predictor and richer V2 embeddings. Fix: update the casefile extraction schema to capture ["precedent_cited", "contested_mineral_rights", "environmental_concern", "operator_track_record", "technical_justification_strength"], re-extract the top-relevance casefiles, then update the V2 embedding pipeline.
Execution prompt:
Step 1: Find casefile extraction prompt in documents/casefiles/ or the batch extraction script.
Step 2: Add "regulatory_decision_signals" key to the output schema with the 5 sub-fields above.
Step 3: Re-extract ~50 highest-commercial-relevance casefiles to validate output quality.
Step 4: Update ml_casefile_embeddings_v2.py to include the signals field in the embedding input.
Verify: SELECT COUNT(*) FROM RAW_CASEFILES_INTELLIGENCE WHERE raw_json:regulatory_decision_signals IS NOT NULL
HIGH
Automated weekly pipeline refresh
All runs are currently manual: FCT refresh → classifier → momentum → JSON generation. A missed run means stale scores on the website. Set up a Snowflake Task chain or cron job to run the full sequence weekly. Estimated runtime: under 10 minutes end-to-end.
Execution prompt:
Option A (cron): Add a weekly cron entry running these scripts in order:
1. python3 snowflake/02_analytics/run_fct_docket_refresh.py
2. python3 snowflake/03_queries/ml_case_progression.py
3. python3 snowflake/03_queries/ml_operator_momentum.py
4. python3 snowflake/03_queries/generate_applied_intelligence.py
5. python3 snowflake/query_signals.py --query docket_pipeline + patch signal_summary_latest.json
Option B (Snowflake Tasks): Create a DAG in Snowflake that chains the SQL steps, then call a Lambda/webhook to trigger JSON generation.
✅ Completed Round 4
Add operator_request_category as a model feature
Completed. OPERATOR_REQUEST_CATEGORIES table created (1,788 rows). Distribution: pooling_request=899 (50%), new_unit=542 (30%), expansion=179 (10%), UNCLASSIFIED=115 (6%), drainage_protection=41 (2%), other=12 (1%). Feature joined into the classifier retrain — F1 improved from 0.881 → 0.922.
✅ Completed Round 4
2025 orderfile backfill upload
Completed. 311/596 orders extracted and uploaded. RAW_ORDERFILES_INTELLIGENCE = 422 total rows (up from 371). ML_ORDER_EMBEDDINGS_V2 = 422. Remaining 285 PDFs still need LLM extraction — deferred to Round 5.
MED
Service demand predictor
Blocked on the regulatory_decision_signals fix above. Deferred to Round 5.
Use service_opportunity_signals in the casefile JSON as a multi-label classification target. Predict which service categories (directional drilling, frac/completions, cementing) a new case will trigger, with lead time.
FINDING
Production validation loop — run complete, inconclusive result
Pearson r = 0.02 (n = 12 counties). The near-zero correlation likely reflects the 12–18 month production lag — the 2024 authorizations may not yet show up in 2025 production. Full validation requires 2026 data. The leading-indicator thesis is not refuted; the lag window simply hasn't closed yet.
LOW
Permit velocity anomaly detection
Apply SNOWFLAKE.ML.ANOMALY_DETECTION to weekly permit filing counts per operator/county. Flag sudden acceleration or deceleration before it is visible in aggregate stats.

Round 5 — Roadmap

✅ Completed Round 5 Populate regulatory_decision_signals in casefile extraction
Prompt updated with 5 new fields. All 354 casefiles re-extracted via Gemini 2.0 Flash (1M token context, full documents — no truncation). 324/354 rows populated in RAW_CASEFILES_INTELLIGENCE. Distribution: technical_justification_strength=moderate(225), none(62), weak(30), strong(6). operator_track_record=unknown(306), positive(8), neutral(6), negative(4). Full coverage enabled F1 improvement to 0.951.
MED Service demand predictor
Multi-label classification of which services (directional drilling, frac/completions, cementing) each pending case will trigger. Now unblocked — 324/354 casefiles (91.5%) have regulatory_decision_signals. Full implementation deferred to Round 6.
✅ Completed Round 5 Complete 2025 orderfile extraction + upload
1,086 order JSONs synced to S3 and loaded via COPY INTO. RAW_ORDERFILES_INTELLIGENCE: 422 → 1,086 rows. ML_ORDER_EMBEDDINGS_V2: 422 → 1,086 orders embedded. Classifier and momentum retrained on expanded dataset.
✅ Completed Round 5 Permit velocity anomaly detection
Z-score anomaly detection applied to weekly case counts per operator/county (see the sketch after this list). 24 anomalies detected across 2024–2026 data. Top finding: Phoenix Operating LLC / McKenzie County / week of 2026-02-23 — 26 cases filed vs a 2.1/week baseline (z = 21.2), an extreme surge. Snowflake table: ML_PERMIT_VELOCITY_ANOMALIES (24 rows). JSON: website/permit_anomalies.json.
FINDING
Round 5 classifier retrain — F1 0.922 → 0.940 → 0.951
Initial retrain with 34/354 casefiles populated: F1 0.940. Then full re-extraction of all 354 casefiles via Gemini 2.0 Flash (1M token context, no truncation) using the updated prompt with regulatory_decision_signals; 324/354 rows populated in Snowflake. Retrained with full coverage: F1 (true class) 0.951, precision 0.938, recall 0.964. Score spread: min = 0.09%, max = 100%, stddev = 0.370, 66 distinct scores. 277 pending cases re-scored.
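A minimal sketch of the z-score pass described in the anomaly-detection item above (WEEKLY_CASE_COUNTS is an illustrative name, and the threshold shown is an assumption; the production script also has to handle sparse weeks):

-- Flag weeks where an operator-county's filing count sits far outside its own baseline.
SELECT
    operator,
    county,
    week_start,
    weekly_cases,
    (weekly_cases - AVG(weekly_cases) OVER (PARTITION BY operator, county))
      / NULLIF(STDDEV(weekly_cases) OVER (PARTITION BY operator, county), 0) AS z_score
FROM WEEKLY_CASE_COUNTS
QUALIFY ABS(z_score) > 3;  -- illustrative threshold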

Round 6 — Roadmap

✅ Completed Round 5+ Expand regulatory_decision_signals coverage — full re-extraction
All 354 casefiles already had OCR text in cleantxt_done/. Re-extracted all with updated prompt using Gemini 2.0 Flash (1M token context, full documents). 324/354 (91.5%) now populated. F1 improved to 0.951. Remaining 30 are JSON parse failures — minor gap, acceptable for training.
MED Service demand predictor
No longer blocked: the re-extraction above is complete (324/354 casefiles populated). Multi-label classification of which service categories (directional drilling, frac/completions, cementing, SWD) each pending case will trigger — with lead time. Uses service_opportunity_signals from casefile JSON as the training target. Enables proactive outreach before rig contracts are signed.
MED V2 embedding refresh — add regulatory_decision_signals as 4th section
Currently ML_CASEFILE_EMBEDDINGS_V2 has 3 sections (request / justification / signals). Add a 4th embedding on regulatory decision context (precedent_cited, operator_track_record, technical_justification_strength). Better precedent retrieval for cases with contested or technically complex histories. No longer blocked: the re-extraction above is complete.
LOW 2024 orderfile PDF extraction
RAW_ORDER_INDEX has metadata for 1,121 orders from 2024, but only ~422 have been fully extracted and embedded. Extracting the remainder would add a full year of precedent history. Estimated cost: ~$30 at current API rates. S3 sync infrastructure now in place (orderfile_copy_into.py).
LOW Production validation re-run
Current Pearson r = 0.02 (n=12 counties) — likely reflects the 12–18 month production lag. The 2024 regulatory authorizations may not yet appear in available EIA production data. Re-run once 2025 full-year EIA production numbers are published.

Ask the ML Layer

Natural language queries against the full regulatory ML dataset — cases, scores, momentum, precedents. Powered by Cortex Analyst.