What You Can Do Today
Three ways to act on the ML intelligence layer right now — no data science background required.
Operator Field Prioritization
See which operators have the most wells moving through the regulatory pipeline right now — weighted by ML progression probability.
Who uses this: Service company reps. Call Phoenix Operating's completion manager before the competition does.
Case Due Diligence Brief
For any pending case number: ML progression probability, top-3 similar historical cases and their outcomes, operator momentum context.
Who uses this: Landmen and mineral rights owners evaluating whether a pending case will result in a drilled well.
County Momentum Snapshot
Which North Dakota counties are accelerating vs. slowing down right now — recent vs. baseline well authorization counts.
Who uses this: Operators and investors looking for where activity is building, not just where it already is.
Data updated weekly. Source: ND Industrial Commission docket + Snowflake ML. See methodology below ↓
Why Regulatory Data Leads the Market
Production data in North Dakota is published monthly with a two-month lag. By the time a well shows up in production stats, the real decision — the one that created value — happened more than a year earlier. The regulatory pipeline is that decision, and it is entirely public.
Every oil and gas well in North Dakota moves through a predictable sequence before it ever produces a barrel. That sequence is documented in public regulatory filings, and it unfolds over 6–18 months ahead of production data:
Most commercial data products track permits and production — the first and last steps. Nobody is systematically connecting dockets → casefiles → signed orders → wellfile signals as a unified, machine-scored predictive layer. That is what this ML work builds toward.
Our Process
The ML work described here is built entirely on top of the existing data pipeline. No third-party ML infrastructure, no external model training services. Everything runs inside Snowflake using their native Cortex ML functions against data we already had in the warehouse.
Data Foundation
Before ML can run, the data pipeline has already done the heavy lifting:
| Source | What It Contains | Rows |
|---|---|---|
| FCT_DOCKET_CASES | Every NDIC docket case with applicant, county, application type, pipeline status, wells authorized | ~1,200 |
| RAW_CASEFILES_INTELLIGENCE | LLM-extracted JSON from casefile PDFs — operator intent, technical justification, signals | 354 |
| RAW_ORDERFILES_INTELLIGENCE | LLM-extracted JSON from signed order PDFs — decision signals, conditions, outcomes | 1,086 |
The Four ML Questions We Asked
We framed the ML work around four concrete questions a field team or analyst would actually pay to answer:
"Will this pending case become a signed order?"
Train on historical cases with known outcomes. Score every pending case with a 0–1 probability.
"What happened in similar cases before?"
Embed every casefile as a semantic vector. For any pending case, find the 5 most similar historical cases and their outcomes.
"Which operators are accelerating activity?"
Compare recent 90 days vs prior 90 days. Automatically identify which operator-county-formation combinations explain the change.
"What type of order is this?"
Zero-shot classification of signed orders into structured categories — no training data required.
Snowflake ML Functions Used
All four implementations use Snowflake's native Cortex ML functions — no external compute, no custom model deployment. The entire ML layer runs as SQL or Python against the existing warehouse.
| Function | Type | Cost Model |
|---|---|---|
| SNOWFLAKE.ML.CLASSIFICATION | Supervised binary classification | ~$8 (train + score 277 cases) |
| SNOWFLAKE.CORTEX.EMBED_TEXT_768 | 768-dim semantic embeddings | ~$4 (465 docs × avg 200 tokens) |
| SNOWFLAKE.ML.TOP_INSIGHTS | Key driver / segment analysis | ~$5 (warehouse compute) |
| SNOWFLAKE.CORTEX.CLASSIFY_TEXT | Zero-shot text classification | ~$2 (111 orders) |
ML Accomplished on the Dataset
Four ML implementations were deployed in the initial sprint. Each creates a persistent Snowflake object that is queryable in Cortex Analyst (the natural language SQL interface) and refreshed weekly.
Case Progression Classifier
A binary classification model trained on 1,511 historical docket cases with known outcomes — either they progressed to a signed order or they were dismissed. The model learns which features predict each outcome and scores all 277 currently pending cases (Round 4 retrain).
Features used in training: operator name, application type, commercial relevance tier, wells authorized, spacing acres, whether the case is a continuation, formation/pool, county, and case classification (new vs. continued).
Target: DID_PROGRESS — TRUE if the case eventually became a signed order.
Evaluation results (hold-out set): [Round 4 confusion matrix by true class (progressed); table figures not recoverable in this export]
Output table: ML_CASE_SCORES — 277 rows, one per pending case, with a prediction object containing class and probability.
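The per-case prediction object can be unpacked in a few lines. A minimal sketch, assuming a simplified JSON shape of `{"class": ..., "probability": {...}}`; the exact schema Snowflake emits may differ, and the row payload below is invented:

```python
import json

def parse_case_score(prediction_json: str):
    """Extract the predicted class and its probability from one
    ML_CASE_SCORES row. Assumes a simplified payload shape; the real
    Snowflake classification output schema may differ."""
    pred = json.loads(prediction_json)
    cls = pred["class"]
    return cls, pred["probability"][cls]

# Hypothetical row payload for illustration:
row = '{"class": "TRUE", "probability": {"TRUE": 0.93, "FALSE": 0.07}}'
print(parse_case_score(row))  # ('TRUE', 0.93)
```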
Semantic Precedent Engine
Every casefile and signed order is embedded as a 768-dimensional semantic vector using Snowflake's snowflake-arctic-embed-m-v1.5 model. The embedding captures the meaning of the document — what the operator is requesting and why — rather than just keyword matches.
The text embedded for each casefile combines: case type, county, operator name, the operator's stated request, and the technical justification. This is the richest semantic signal available in the casefile JSON.
The ML_CASE_PRECEDENTS view uses VECTOR_COSINE_SIMILARITY to find the top-5 most semantically similar historical cases for every currently pending case.
Sample output — precedent matching in action:
| Pending Case | Type | Precedent | Outcome | Similarity |
|---|---|---|---|---|
| 31928 | Spacing | 31981 | Granted | 0.815 |
| 31928 | Spacing | 31982 | Granted | 0.809 |
| 31928 | Spacing | 32205 | Granted | 0.806 |
| 31929 | Spacing | 31981 | Granted | 0.829 |
| 31929 | Spacing | 32008 | Granted | 0.736 |
Spacing cases return Spacing precedents at 0.80+ similarity — the semantic matching is working correctly. When a case's top-5 precedents are all Granted at high similarity, that is a meaningfully different signal from one where precedents are mixed.
Output objects: ML_CASEFILE_EMBEDDINGS (354 rows), ML_ORDER_EMBEDDINGS (111 rows), ML_CASE_PRECEDENTS view (top-5 per pending case, 212 pending cases covered).
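Mechanically, the precedent ranking is just cosine similarity plus a top-k sort. A toy sketch of the same logic, with 3-dimensional vectors standing in for the real 768-dim embeddings and case numbers reused from the sample table purely for illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def top_precedents(pending_vec, historical, k=5):
    # Rank historical cases by similarity to the pending case, mirroring
    # what VECTOR_COSINE_SIMILARITY + ORDER BY ... LIMIT k does in SQL.
    scored = [(case_no, cosine_similarity(pending_vec, vec))
              for case_no, vec in historical.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

historical = {
    "31981": [0.9, 0.1, 0.0],
    "32008": [0.5, 0.5, 0.1],
    "30007": [0.0, 1.0, 0.0],
}
print(top_precedents([1.0, 0.0, 0.0], historical, k=2))  # 31981 ranks first
```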
Operator Activity Momentum
SNOWFLAKE.ML.TOP_INSIGHTS compares regulatory activity between the recent 90-day window and the prior 90-day baseline. The function automatically identifies which operator-county-formation-application_type segments statistically explain the overall change — and by how much.
Input: 555 docket cases from the last 180 days, bucketed by first_hearing_date (when each case originally appeared on a docket, not subsequent continuations). Metric: wells_authorized.
Overall change: +2.8% increase in wells authorized — 723 in the baseline period vs 743 in the recent period. The headline number is modest; the signal is in the mix — Burke County driving a surge while Williams and Mountrail decline.
Top segments by relative contribution:
| Segment | Growth | Baseline | Recent |
|---|---|---|---|
| Burke County | +573% | 26 | 175 |
| New spacing units (not Burke/Mountrail/Williams) | +110% | 93 | 195 |
| Phoenix Operating — Burke County | +4,950% | 2 | 101 |
| Williams County | −46% | 157 | 85 |
| Mountrail County | −57% | 97 | 42 |
| Murex Petroleum Corp. | −100% | 49 | 0 |
Rerun 2026-03-24 with corrected methodology: all period bucketing uses first_hearing_date (date a case first appeared on any docket) rather than the most recent hearing date. This eliminates double-counting of continued cases across multiple hearings. Contribution scores are relative shares of overall change; growth is (recent − baseline) / baseline.
Output: ML_OPERATOR_MOMENTUM (58 segments), V_REGULATORY_MOMENTUM composite view (340 operator-county-formation rows ranked by weighted well pipeline).
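The growth formula in the methodology note can be checked directly against the table rows. A small sketch; the contribution helper uses one common definition of "share of the overall change", which may differ from TOP_INSIGHTS' internal scoring:

```python
def growth(baseline, recent):
    # (recent - baseline) / baseline, as defined in the methodology note
    return (recent - baseline) / baseline

def contribution_share(baseline, recent, total_baseline, total_recent):
    # The segment's change as a share of the overall change. Offsetting
    # declines elsewhere can push this above 100%.
    return (recent - baseline) / (total_recent - total_baseline)

# Burke County row: 26 baseline -> 175 recent
print(f"{growth(26, 175):+.0%}")   # +573%
# Phoenix Operating in Burke: 2 -> 101
print(f"{growth(2, 101):+.0%}")    # +4950%
```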
Signed Order Auto-Classifier
SNOWFLAKE.CORTEX.CLASSIFY_TEXT is a zero-shot classifier — it requires no training data. We defined eight categories and let the model read each order's type description and first regulatory signal to assign a label. This runs incrementally on every new signed order loaded into the pipeline.
| Category | Count | % of Orders |
|---|---|---|
| spacing_unit | 48 | 43% |
| pooling | 33 | 30% |
| water_disposal | 11 | 10% |
| enhanced_recovery | 7 | 6% |
| other | 7 | 6% |
| exception_to_field_rules | 5 | 5% |
This structured classification enables filtered queries like "show me all spacing unit orders from the last 90 days" without any string-matching heuristics on the raw PDF text.
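The shape of such a filtered query can be sketched in plain Python over a toy classified-orders list; the order numbers and dates below are invented, and the fixed "today" is just for determinism:

```python
from datetime import date, timedelta

def recent_orders_by_category(orders, category, days=90,
                              today=date(2026, 3, 24)):
    """Filter classified orders, e.g. 'all spacing_unit orders from the
    last 90 days'. `orders` is a list of (order_no, category, order_date)
    tuples, a stand-in for the classified-orders table."""
    cutoff = today - timedelta(days=days)
    return [o for o in orders if o[1] == category and o[2] >= cutoff]

orders = [
    ("28001", "spacing_unit", date(2026, 2, 10)),
    ("28002", "pooling",      date(2026, 2, 12)),
    ("27990", "spacing_unit", date(2025, 11, 1)),  # outside the window
]
print(recent_orders_by_category(orders, "spacing_unit"))  # only 28001
```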
What This Means
The ML layer does not predict production. It predicts intent — regulatory intent, measured 6 to 18 months before any barrel flows. Here is what the current results tell us, and what we are building toward.
Burke County is where the Bakken is accelerating — and Phoenix is driving it
Burke County went from 23 wells authorized in the baseline period to 176 in the recent period — a +665% increase. Within Burke, Phoenix Operating alone went from 0 baseline wells to 102 recent wells. This is not a market-wide surge: the overall basin is up only +2.8%. Burke is the specific zone of acceleration, while Williams (−33%) and Mountrail (−23%) are contracting. Phoenix also has 32 wells pending in Divide County and 29 in Williams across Skabo-Bakken, Burg-Bakken, and Big Stone-Bakken formations.
This is the kind of signal a land broker, service company, or mineral rights owner would pay to see early. It is publicly available in docket records — but it takes the ML aggregation to see the rate of change, not just the raw count. Well counts are based on the date each case first appeared on a docket; continued cases are counted once at their original filing date.
Murex Petroleum has completely stopped — and Mountrail is declining
Murex went from 26 wells authorized in the baseline period to zero in the recent period. −100% is not a slowdown, it is a stop. Whether that is a strategic pause, a capital reallocation, or an exit signal cannot be determined from docket data alone — but the docket data is the first place it shows up, months before any public announcement.
Mountrail County is down 23% in the recent period (56 → 43 wells), and Williams County is down 33% (154 → 103 wells). Burke and the new spacing unit belt outside Mountrail/Williams are absorbing that activity. The basin is shifting geographically, and the regulatory pipeline is the leading edge of that shift.
Precedent matching tells you how similar cases resolved
For any pending docket case, the system can now surface the five most semantically similar historical cases — including whether they were Granted, Dismissed, or Denied. A spacing unit case with five Granted precedents at 0.81+ similarity is a very different risk profile from one with mixed or Dismissed precedents.
This is the kind of research a specialized regulatory attorney or broker would manually conduct on each case. The semantic engine makes it instantaneous and systematic across all 212 pending cases simultaneously.
What we are working toward: the weighted pipeline
The V_REGULATORY_MOMENTUM view ranks every operator-county-formation combination by weighted well pipeline — pending wells multiplied by their probability of becoming signed orders. This is the composite intelligence number: it integrates how many wells are proposed, how likely they are to be approved, and which operators are in the most active regulatory posture.
Right now, the probability component is uniform at 91.2% (the model's base rate) because the classifier needs richer within-pending features to discriminate. Once Round 2 adds continuation count, operator historical win rate, and days since filing, the weighted pipeline will show real spread — and that spread is the signal that drives prioritization.
The end-state answer this system is designed to give: "Which operators have the most wells likely to be approved in the next 90 days, in which counties, and is their activity accelerating or decelerating?" — with enough lead time to act on it.
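The weighted-pipeline arithmetic is simple: pending wells times progression probability, summed per operator. A sketch with mostly invented numbers; only the Phoenix well counts (32 in Divide, 29 in Williams) come from the text, and the probabilities and comparison operator are placeholders, not model output:

```python
def weighted_pipeline(segments):
    """Sum pending wells x ML progression probability per operator, the
    core of the V_REGULATORY_MOMENTUM ranking (sketched, not the real
    view logic)."""
    totals = {}
    for operator, wells, prob in segments:
        totals[operator] = totals.get(operator, 0.0) + wells * prob
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

segments = [
    ("Phoenix Operating", 32, 0.95),  # Divide County pending wells
    ("Phoenix Operating", 29, 0.80),  # Williams County pending wells
    ("Operator B",        40, 0.30),  # hypothetical comparison operator
]
print({k: round(v, 1) for k, v in weighted_pipeline(segments).items()})
# {'Phoenix Operating': 53.6, 'Operator B': 12.0}
```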
The 12-month production validation loop
Once more production data accumulates, we can test a specific hypothesis: does WEIGHTED_WELL_PIPELINE today predict actual production 12 months out? If the answer is yes — if a high weighted pipeline today correlates with high production a year later — then this dataset contains a genuine 12-month leading indicator that no public data provider currently offers.
That validation is the long-term goal of the ML layer. The models running now are building the training history needed to answer that question.
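Statistically the validation test is small: a Pearson correlation between today's weighted pipeline and production change 12 months later, across counties. A self-contained sketch with invented county-level pairs, purely to show the shape of the test:

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson correlation; Snowflake's CORR() or pandas would
    # compute the same quantity.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented pairs: weighted pipeline today vs production change 12 mo out
pipeline_today = [10, 40, 25, 60]
production_change = [5, 30, 20, 50]
print(round(pearson_r(pipeline_today, production_change), 3))
```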
Round 2 — What We Did and What We Learned
Round 1 established infrastructure and proved the approach. Round 2 fixed the known limitations, improved signal quality, and built the first applied outputs. Here's what was done, what was discovered, and what comes next.
Completed — March 2026
Fixed the classifier's uniform scores. Two root causes: (1) feature names were passed to PREDICT() in lowercase while Snowflake ML requires uppercase, causing all features to silently null out; (2) missing within-pending discriminators. Added operator historical win rate (12.6% feature importance), continuation count (10.3%), and days_since_filing. Result: scores now range 0.28%–99.97%, stddev 0.327, 91 distinct values. Evaluation F1 maintained at 0.84.
Built an automated validation harness: validate_ml_claims.py runs 5 SQL checks after every weekly ML refresh and compares ML outputs to FCT source data. The first run found a real discrepancy: the ML_MOMENTUM_INPUT view includes dismissed cases (no pipeline_status filter), while validation excludes them. This is a design decision to document, not a pipeline failure. Also added s3_spot_check.py for case-level ground-truth verification.
Fixed momentum period bucketing: time windows had been keyed on HEARING_DATE (most recent docket appearance) rather than on when a case was originally filed. A case continued three times over six months would count as "recent" activity each time, emptying the baseline bucket and inflating the recent one. Fix: added FIRST_HEARING_DATE (MIN hearing date per case_no across all raw docket rows) to FCT_DOCKET_CASES, and switched all time-window comparisons to use it. Impact: the Phoenix headline dropped from +422% to +61%, and overall market growth from +66.5% to +2.8%. More importantly, the true story emerged: Burke County surging +573% while Williams (−46%) and Mountrail (−57%) contract. The continuation artifact had been masking a geographic rotation, not a market-wide surge.
ML_MOMENTUM_INPUT dismissed-case filter. Design decision resolved: exclude dismissed cases. A dismissed case has no drilling outcome and should not contribute to a drilling-activity forecast. Added AND PIPELINE_STATUS NOT IN ('dismissed') to the ML_MOMENTUM_INPUT view in ml_operator_momentum.py. Impact: the overall market signal sharpened from +2.8% to −9.4% (non-Burke activity is actually contracting). Burke/Divide now stand out more cleanly as the real growth story.
Removed DAYS_SINCE_FILING from the feature list (zero variance: all filing dates are null in the source data, contributing 0% feature importance) and retrained. New model: F1 0.937 on the progressed class (up from 0.84), precision 0.912, recall 0.962. Score spread maintained: stddev 0.326, min 0.15%, 54 distinct values across 277 pending cases scored.
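The spread metrics quoted throughout these rounds (stddev, min, distinct values) are cheap to recompute as a post-retrain sanity check. A sketch with toy scores; the thresholds mirror the stated targets (stddev > 0.25, many distinct values):

```python
import statistics

def score_spread(scores):
    """Check classifier score dispersion: a healthy retrain shows wide
    spread, not a uniform base rate."""
    return {
        "stddev": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
        "distinct": len({round(s, 4) for s in scores}),
    }

uniform = [0.912] * 10                       # the uniform-score failure mode
spread = [0.003, 0.15, 0.42, 0.67, 0.9997]   # healthy-looking toy scores
print(score_spread(uniform)["stddev"])   # 0.0
print(score_spread(spread)["distinct"])  # 5
```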
Key Findings — Things We Didn't Know Before Running This
The wells_requested field is often null. This is expected behavior, not a data gap: it means the momentum numbers measure clerk-authorized wells, which is the right quantity for leading-indicator purposes.
HEARING_DATE reflects the most recent docket appearance, not the original filing. A case continued three times would count as "recent" three times over. Root fix: FIRST_HEARING_DATE column added to FCT_DOCKET_CASES via MIN(hearing_date) across all raw docket rows per case_no. All momentum bucketing now uses FIRST_HEARING_DATE. The IS_CONTINUED_FROM_PRIOR flag and continuation_count remain as model features for case-level progression scoring, separate from the period bucketing fix.
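The FIRST_HEARING_DATE fix amounts to a MIN() per case_no. A minimal Python equivalent, with case numbers and dates invented for illustration:

```python
from datetime import date

def first_hearing_dates(docket_rows):
    """MIN(hearing_date) per case_no: a case continued three times
    contributes one date (its original appearance), not three."""
    first = {}
    for case_no, hearing in docket_rows:
        if case_no not in first or hearing < first[case_no]:
            first[case_no] = hearing
    return first

rows = [
    ("31928", date(2025, 9, 10)),   # original docket appearance
    ("31928", date(2025, 11, 12)),  # continuation
    ("31928", date(2026, 1, 14)),   # continuation
    ("31999", date(2026, 2, 25)),
]
print(first_hearing_dates(rows))  # 31928 maps to 2025-09-10
```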
Because regulatory_decision_signals is null in every casefile, embedding_signals falls back to case_type + county + operator. This is fine for structural similarity queries but means the "signals" embedding is less semantically rich than intended. Fixing this requires populating the signals during casefile extraction.
Round 3 — Still Ahead
ML_MOMENTUM_INPUT dismissed-case filter — decision made and applied. See the Completed section above.
▶ Execution prompt
In CREATE OR REPLACE VIEW ML_MOMENTUM_INPUT (~line 44), add after the COMMERCIAL_RELEVANCE filter:
AND PIPELINE_STATUS NOT IN ('dismissed')
Then run in order:
1. python3 snowflake/03_queries/ml_operator_momentum.py — rebuilds ML_MOMENTUM_INPUT, retrains TOP_INSIGHTS, rebuilds V_REGULATORY_MOMENTUM
2. python3 snowflake/03_queries/generate_applied_intelligence.py — regenerates operator_momentum_report.json and county_momentum.json
▶ Execution prompt
Step 1: Remove DAYS_SINCE_FILING from the training feature list (zero variance — all null in source).
Step 2: Run python3 snowflake/03_queries/ml_case_progression.py
Step 3: Run python3 snowflake/03_queries/validate_ml_claims.py
Target metrics: score stddev > 0.25, F1 ≥ 0.80, no uniform-score regression.
Note: If training data has expanded to 600+ examples, also check that precision/recall balance hasn't shifted.
regulatory_decision_signals in casefile extraction
Promoted from MED — this is blocking two downstream items (service demand predictor and higher-quality V2 embeddings). Currently null for all 354 casefiles. The LLM extraction prompt never asked for this field. Adding it requires a schema update to the casefile extraction prompt plus a re-run of the top-relevance casefiles.
▶ Execution prompt
Step 2: Add a "regulatory_decision_signals" key to the output schema — ask the LLM to identify: ["precedent_cited", "contested_mineral_rights", "environmental_concern", "operator_track_record", "technical_justification_strength"]
Step 3: Re-run extraction on the ~50 highest-commercial-relevance casefiles to validate output.
Step 4: After validation, run the V2 embedding update to use the new signals field.
Verification:
SELECT COUNT(*) FROM RAW_CASEFILES_INTELLIGENCE WHERE raw_json:regulatory_decision_signals IS NOT NULL
Use service_opportunity_signals in the casefile JSON as a multi-label classification target. Predict which service categories (directional drilling, frac/completions, cementing) a new case will trigger, with lead time. Once regulatory_decision_signals is populated, the casefile JSON will also be richer for training.
▶ Execution prompt
Step 2: For those same operator-counties in 2025-Q1 and 2025-Q2, pull EIA monthly production change from OIL_GAS_ANALYTICS.RAW.
Step 3: Run a simple Pearson correlation in Snowflake (or export and run in Python/pandas).
Output: scatter plot data as JSON — weighted_pipeline vs production_change_12mo — to validate the leading-indicator thesis.
Apply SNOWFLAKE.ML.ANOMALY_DETECTION to weekly permit filing counts per operator/county. Flag sudden acceleration or deceleration before it is visible in aggregate stats.
Round 4 — Next Up
regulatory_decision_signals in casefile extraction
Null for all 354 casefiles — the LLM extraction prompt never asked for this field. Blocking two downstream items: service demand predictor and richer V2 embeddings. Fix: update the casefile extraction schema to capture ["precedent_cited", "contested_mineral_rights", "environmental_concern", "operator_track_record", "technical_justification_strength"], re-extract the top-relevance casefiles, then update the V2 embedding pipeline.
▶ Execution prompt
Step 2: Add "regulatory_decision_signals" key to the output schema with the 5 sub-fields above.
Step 3: Re-extract ~50 highest-commercial-relevance casefiles to validate output quality.
Step 4: Update ml_casefile_embeddings_v2.py to include the signals field in the embedding input.
Verify: SELECT COUNT(*) FROM RAW_CASEFILES_INTELLIGENCE WHERE raw_json:regulatory_decision_signals IS NOT NULL
▶ Execution prompt
1. python3 snowflake/02_analytics/run_fct_docket_refresh.py
2. python3 snowflake/03_queries/ml_case_progression.py
3. python3 snowflake/03_queries/ml_operator_momentum.py
4. python3 snowflake/03_queries/generate_applied_intelligence.py
5. python3 snowflake/query_signals.py --query docket_pipeline + patch signal_summary_latest.json
Option B (Snowflake Tasks): Create a DAG in Snowflake that chains the SQL steps, then call a Lambda/webhook to trigger JSON generation.
operator_request_category as a model feature
Completed. OPERATOR_REQUEST_CATEGORIES table created (1,788 rows). Distribution: pooling_request=899 (50%), new_unit=542 (30%), expansion=179 (10%), UNCLASSIFIED=115 (6%), drainage_protection=41 (2%), other=12 (1%). Feature joined into classifier retrain — F1 improved from 0.881 → 0.922.
Use service_opportunity_signals in the casefile JSON as a multi-label classification target. Predict which service categories (directional drilling, frac/completions, cementing) a new case will trigger, with lead time.
Apply SNOWFLAKE.ML.ANOMALY_DETECTION to weekly permit filing counts per operator/county. Flag sudden acceleration or deceleration before it is visible in aggregate stats.
Round 5 — Roadmap
regulatory_decision_signals in casefile extraction — prompt updated with 5 new fields. All 354 casefiles re-extracted via Gemini 2.0 Flash (1M-token context, full documents — no truncation). 324/354 rows populated in RAW_CASEFILES_INTELLIGENCE. Distribution: technical_justification_strength=moderate(225), none(62), weak(30), strong(6). operator_track_record=unknown(306), positive(8), neutral(6), negative(4). Full coverage enabled F1 improvement to 0.951.
Multi-label classification of which services (directional drilling, frac/completions, cementing) each pending case will trigger. Now unblocked — 324/354 casefiles (91.5%) have regulatory_decision_signals. Full implementation deferred to Round 6.
1,086 order JSONs synced to S3 and loaded via COPY INTO. RAW_ORDERFILES_INTELLIGENCE: 422 → 1,086 rows. ML_ORDER_EMBEDDINGS_V2: 422 → 1,086 orders embedded. Classifier and momentum retrained on expanded dataset.
Z-score anomaly detection applied to weekly case counts per operator/county. 24 anomalies detected across 2024–2026 data. Top finding: Phoenix Operating LLC / McKenzie County / week of 2026-02-23 — 26 cases filed vs 2.1/week baseline (z=21.2), extreme surge. Snowflake table: ML_PERMIT_VELOCITY_ANOMALIES (24 rows). JSON: website/permit_anomalies.json.
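A leave-one-out z-score variant of this check can be sketched in a few lines: each week is scored against the baseline of the other weeks, so one extreme week cannot inflate its own baseline. The weekly counts below are toy numbers modeled on the roughly 2-cases/week baseline with one surge week; the deployed job runs per operator/county:

```python
import statistics

def zscore_anomalies(weekly_counts, threshold=3.0):
    """Flag weeks whose count is a z-score outlier vs the OTHER weeks.
    Toy stand-in for the deployed permit-velocity anomaly check."""
    flagged = []
    for i, c in enumerate(weekly_counts):
        rest = weekly_counts[:i] + weekly_counts[i + 1:]
        mean = statistics.mean(rest)
        sd = statistics.pstdev(rest)
        if sd > 0 and abs(c - mean) / sd > threshold:
            flagged.append((i, c, round((c - mean) / sd, 1)))
    return flagged

counts = [2, 1, 3, 2, 2, 1, 3, 2, 26]
print(zscore_anomalies(counts))  # only the 26-case week is flagged
```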
Round 6 — Roadmap
regulatory_decision_signals coverage — full re-extraction. All 354 casefiles already had OCR text in cleantxt_done/. Re-extracted all with the updated prompt using Gemini 2.0 Flash (1M-token context, full documents). 324/354 (91.5%) now populated. F1 improved to 0.951. The remaining 30 are JSON parse failures — a minor gap, acceptable for training.
Blocked on OCR expansion above. Multi-label classification of which service categories (directional drilling, frac/completions, cementing, SWD) each pending case will trigger — with lead time. Uses service_opportunity_signals from the casefile JSON as the training target. Enables proactive outreach before rig contracts are signed.
regulatory_decision_signals as 4th embedding section. Currently ML_CASEFILE_EMBEDDINGS_V2 has 3 sections (request / justification / signals). Add a 4th embedding on regulatory decision context (precedent_cited, operator_track_record, technical_justification_strength). Better precedent retrieval for cases with contested or technically complex histories. Blocked on OCR expansion.
RAW_ORDER_INDEX has metadata for 1,121 orders from 2024, but only ~422 have been fully extracted and embedded. Extracting the remainder would add a full year of precedent history. Estimated cost: ~$30 at current API rates. S3 sync infrastructure is now in place (orderfile_copy_into.py).
Current Pearson r = 0.02 (n=12 counties) — likely reflects the 12–18 month production lag. The 2024 regulatory authorizations may not yet appear in available EIA production data. Re-run once 2025 full-year EIA production numbers are published.
Ask the ML Layer
Natural language queries against the full regulatory ML dataset — cases, scores, momentum, precedents. Powered by Cortex Analyst.