Pipeline Overview

ND Oil & Gas Intelligence — Operational Reference
Production · 5 cron jobs · Weekdays 5 PM MST · Last updated: 2026-03-17
§ 1

Overview

The ND pipeline runs automatically on weekdays at 5 PM MST, ingesting North Dakota regulatory data — well permits, recently signed orders, and case filings — extracting commercial intelligence via Claude Haiku AI, and deploying enriched reports to oilgasorgrass.com.

The primary output is a set of HTML pages and JSON data files covering 2,100+ wells and hundreds of recent regulatory orders. Orders and cases from the last 14 days are processed through the full AI extraction pipeline and rendered as enriched intelligence cards. Older records remain in the dataset as historical context but are not enriched — they have no current commercial value for service companies acting on regulatory signals.

The pipeline is fully automated with no manual intervention required. A nightly OCR job at 6 PM runs separately to ensure OCR failures never block the main deployment.

Key question this document answers
Why are most records on recently_signed_orders.html unenriched — and when will they become enriched? See § 5 and § 6.
§ 2

Cron Schedule

All times are MST (America/Denver). Five active jobs:

Schedule | Job | Script | Purpose
Weekdays 5:00 PM | Daily pipeline | scripts/run_daily_pipeline.sh | Main data fetch → AI extraction → deploy
Every night 6:00 PM | OCR processing | scripts/run_ocr_processing.sh | OCR pending wellfile PDFs → Haiku → deploy
Sundays 3:00 AM | Retry missing | scripts/run_retry_missing.sh | Retry wellfiles with status='not_found'
1st of month 2:00 AM | Log cleanup | scripts/cleanup_old_logs.sh | Delete logs older than 90 days
Sundays 6:00 AM | Update detection | utilities/detect_wellfile_updates.py | Detect changed PDFs → diff pages → dashboard → deploy
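The five jobs above could be expressed as crontab entries along these lines (a sketch: CRON_TZ support and how the repo-relative paths resolve for the cron user are assumptions, the times come from the table):

```
CRON_TZ=America/Denver
0 17 * * 1-5  scripts/run_daily_pipeline.sh                        # weekdays 5:00 PM, main pipeline
0 18 * * *    scripts/run_ocr_processing.sh                        # nightly 6:00 PM, OCR
0 3  * * 0    scripts/run_retry_missing.sh                        # Sundays 3:00 AM, retry missing
0 2  1 * *    scripts/cleanup_old_logs.sh                         # 1st of month 2:00 AM, log cleanup
0 6  * * 0    python3 utilities/detect_wellfile_updates.py        # Sundays 6:00 AM, update detection
```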

Timing chain

5 PM → 6 PM → Sundays 3 AM / 6 AM. The main pipeline and OCR run back-to-back on weekday evenings; the weekly jobs handle retries and update detection on Sunday.

Why OCR is decoupled (Feb 12, 2026 fix)
Before Feb 12, OCR ran inside the 5 PM pipeline. When tesseract failed, it blocked all downstream deployments. After separating OCR to a standalone 6 PM job, OCR failures no longer prevent the main pipeline from delivering updated data to the site.
§ 3

Daily Pipeline Stages

Script: scripts/run_daily_pipeline.sh — runs weekdays at 5 PM MST.

Stage | What happens | Script | Output
0 | Fetch ND DMR daily PDFs | scripts/fetch_nd_daily_pdfs.py | ND/nddata/*.pdf
0a | Parse PDFs → structured well data | parsers/batch_ingest.py | ND/nddata/nd_daily.json
0b | Merge new wells into master dataset | parsers/merge_new_wells.py | nd_daily.json updated
0c | Backfill PLSS coordinates from scout tickets | parsers/backfill_plss_from_scout.py | Coordinates added inline
0d | Copy master data to website directory | cp | website/nd_daily.json
0e | Generate scout fetch list | parsers/generate_wells_list.py | parsers/wells_to_fetch.txt
0f | Fetch scout ticket JSON for new wells | parsers/fetch_scout_batch.py | website/scout_tickets/
0g | Generate regional context for new wells | scripts/batch_regional_fast.py | Regional metadata JSON
0h | Orders pipeline — full orchestration (see sub-steps below) | parsers/fetch_and_process_orders.py --days 14 | Enriched recently_signed_orders.json
1 | Wellfile orchestrator — 7-day lookback, OCR disabled | scripts/orchestrate_wellfile_pipeline.py --lookback 7 --no-ocr | Wellfile HTML reports
2 | Regenerate HTML report manifests | Inline Python | html_report_manifest.json, html_report_intelligence_manifest.json
2b | Alert on new Tier 1 permits (signal score ≥ 50) | Inline Python | Slack notification
2c | Generate pipeline monitor status | scripts/generate_pipeline_status.py | Pipeline status JSON
3 | S3 sync + CloudFront invalidation | website/scripts/deploy.sh | Live website updated
4 | Snowflake sync — push raw data to warehouse | scripts/run_snowflake_sync.sh | RAW tables updated
5 | dbt run + test + source freshness | dbt run --target prod | ANALYTICS views rebuilt
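Stage 2b's threshold check is simple to express. A minimal sketch, assuming a signal_score field name (the ≥ 50 threshold is from the stage table; the actual inline Python may differ):

```python
def tier1_permits(wells: list[dict], threshold: int = 50) -> list[dict]:
    """Select permits whose signal score meets the Tier 1 alert threshold (Stage 2b).

    The 'signal_score' key is a hypothetical field name; the >= 50 threshold
    comes from the stage table above.
    """
    return [w for w in wells if w.get("signal_score", 0) >= threshold]
```

Anything this returns would then be formatted into the Slack notification.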

Stage 0h — Orders Pipeline Sub-steps

fetch_and_process_orders.py is a combined orchestrator that calls individual scripts in sequence. A single entry point keeps the shell script clean while handling each concern separately:

Sub-step | Script | Output
Download Excel from NDIC DMR | parsers/fetch_recently_signed_orders.py | ND/ndic_index/data/Recently_Signed_Orders.xlsx
Parse Excel → JSON (unenriched) | parsers/parse_recently_signed_orders.py | website/recently_signed_orders.json
Fetch PDFs for last 14 days | parsers/fetch_order_case_files.py | ND/orderfiles/pdfs/, ND/casefiles/pdfs/
OCR PDFs + Haiku AI extraction + HTML reports | parsers/process_orders_cases_pipeline.py | ND/orderfiles/llmjson/or{N}.json, ND/casefiles/llmjson/C{N}.json, HTML reports
Enrich JSON in-place | parsers/enrich_recently_signed_orders.py | recently_signed_orders.json with _enriched_* fields added
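The sequencing itself can be sketched as a small runner. The step list mirrors the table above; the run_steps helper and its fail-fast behavior are assumptions about how the orchestrator works, not its actual code:

```python
import subprocess

# Sub-step scripts from the table above, in execution order.
ORDER_STEPS = [
    ["python3", "parsers/fetch_recently_signed_orders.py"],
    ["python3", "parsers/parse_recently_signed_orders.py"],
    ["python3", "parsers/fetch_order_case_files.py", "--days", "14"],
    ["python3", "parsers/process_orders_cases_pipeline.py"],
    ["python3", "parsers/enrich_recently_signed_orders.py"],
]

def run_steps(steps, runner=subprocess.run):
    """Run each step in order; return the index of the first failure, or None.

    `runner` is injectable for testing; stopping at the first failure is an
    assumption about the orchestrator's behavior.
    """
    for i, cmd in enumerate(steps):
        if runner(cmd).returncode != 0:
            return i
    return None
```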
§ 4

What Makes a Record "Enriched"

Enrichment is performed by parsers/enrich_recently_signed_orders.py. For each record in recently_signed_orders.json, the script looks for a matching Haiku intelligence JSON:

  • Order records — matched via Order No. → ND/orderfiles/llmjson/or{N}.json
  • Case records — matched via Case No. → ND/casefiles/llmjson/C{N}.json

When a match is found, _enriched_* fields are added directly to the record object and the file is saved in-place. The website detects enrichment by checking _enriched_operator — if populated, the enriched card layout is used; otherwise the plain unenriched card is shown.
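A minimal sketch of that matching step for orders, assuming the record's JSON key is "Order No." and showing only two of the enriched fields (the real script's internals may differ):

```python
import json
from pathlib import Path

def enrich_record(record: dict, llmjson_dir: Path) -> bool:
    """Attach _enriched_* fields in place if a Haiku intelligence JSON exists.

    The "Order No." key name is an assumption; the or{N}.json filename pattern
    and the _enriched_* field names come from this document.
    """
    order_no = str(record.get("Order No.", "")).strip()
    intel_path = llmjson_dir / f"or{order_no}.json"
    if not order_no or not intel_path.exists():
        return False  # no match: record stays unenriched
    intel = json.loads(intel_path.read_text())
    record["_enriched_operator"] = (
        intel.get("operator_and_well_intelligence", {}).get("operator")
    )
    # County is stripped of ", North Dakota" / " County" per the field table below.
    record["_enriched_county"] = (
        intel.get("order_identification", {})
        .get("county", "")
        .replace(", North Dakota", "")
        .replace(" County", "")
    )
    return True
```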

Order enrichment fields

Field | Source path in intelligence JSON | Description
_enriched_operator | operator_and_well_intelligence.operator | Operator company name
_enriched_county | order_identification.county | County (stripped of ", North Dakota" / " County")
_enriched_order_type | order_identification.order_type | Order classification
_enriched_primary_decision | regulatory_decision_signals[0].signal | Main regulatory decision (220 chars)
_enriched_decision_type | regulatory_decision_signals[0].decision_type | Category of decision
_enriched_why_notable | regulatory_decision_signals[0].why_notable | Why this decision matters (280 chars)
_enriched_commercial | regulatory_decision_signals[0].commercial_relevance | Commercial impact for service companies (280 chars)
_enriched_formations | operator_and_well_intelligence.target_formations | Target formations, max 2, standardized names
_enriched_service_cat | service_opportunity_signals[0].service_category | Primary service opportunity category
_enriched_service_timing | service_opportunity_signals[0].timing_indicator | Timing signal (e.g., "Immediate", "30–90 days")
_enriched_top_service | Combined category + timing | Display label for service card
_enriched_service_description | service_opportunity_signals[0].opportunity_description | Service opportunity detail (280 chars)
_enriched_service_why | service_opportunity_signals[0].why_relevant | Why relevant to service companies (280 chars)
_enriched_all_services | service_opportunity_signals[:3] | All service categories (up to 3) for tooltip
_enriched_confidence | confidence.level | Haiku confidence level
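Many fields above are the first element of a signals list, truncated to a character budget. A hypothetical helper showing that pattern (illustrative only, not the enrichment script's actual code):

```python
def first_signal_field(intel: dict, list_key: str, field: str, limit: int):
    """Return `field` from the first entry of a signals list, truncated to `limit` chars.

    Mirrors the `<list>[0].<field>` source paths in the table above.
    """
    signals = intel.get(list_key) or []
    value = signals[0].get(field) if signals else None
    return value[:limit] if isinstance(value, str) else value
```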

Case enrichment fields

Field | Source path in intelligence JSON | Description
_enriched_case_purpose | case_purpose_and_application.primary_purpose | What was applied for (220 chars)
_enriched_operator_request | case_purpose_and_application.operator_request | Specific ask (250 chars)
_enriched_filing_date | case_identification.filing_date | When the case was filed
_enriched_hearing_date | case_identification.hearing_date | Scheduled hearing date
_enriched_case_status | case_identification.case_status | Current case status
_enriched_case_service_cat | service_opportunity_signals[0].service_category | Service category
_enriched_case_service_timing | service_opportunity_signals[0].timing_indicator | Timing signal
_enriched_case_service_description | service_opportunity_signals[0].opportunity_description | Opportunity detail (280 chars)
_enriched_case_service_why | service_opportunity_signals[0].why_relevant | Why relevant (280 chars)
_enriched_case_all_services | service_opportunity_signals[:3] | All service categories for tooltip
_enriched_case_strategy | operator_and_well_intelligence.operator_strategy_signals[0].strategy | Operator strategy signal (200 chars)
§ 5

Why Most Records Are Not Enriched

Expected behavior — not a bug
A stat like "97% unenriched" is normal and intentional. The Excel source (Recently_Signed_Orders.xlsx) contains thousands of records going back years. Only the last 14 days of orders are commercially actionable — those are the only records for which PDFs are fetched and Haiku extraction is run.

The default date filter on recently_signed_orders.html focuses the view on the recent 14-day window precisely because enrichment coverage is high there. The overall "97% unenriched" figure includes years of historical records that have never been processed and never will be — they exist in the JSON for historical context only.

Three reasons a brand-new record may appear unenriched initially

# | Reason | Typical delay
1 | PDF fetch lag — NDIC sometimes posts PDFs 24–48 hours after the order is signed. No PDF = no OCR = no Haiku extraction. | 24–48 hours
2 | Pipeline runs once at 5 PM weekdays — if the PDF wasn't available at run time, enrichment waits until the next business day's run. | Next business day
3 | OCR or Haiku failure — malformed PDF, rate limit, or extraction error. The record stays unenriched until the next daily run succeeds. | 1–2 days

Records that will never be enriched (by design)

  • Orders signed more than ~30 days ago — outside the 14-day PDF fetch window
  • Historical records loaded from Excel without a corresponding PDF ever being fetched

These records remain in the JSON and are accessible via the date range filter, but are not intended to be enriched.
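The cutoff behind both lists is just the --days 14 lookback. A sketch of that window check (the date comparison itself is an assumption; the fetch script is only documented as receiving a day count):

```python
from datetime import date, timedelta

def in_fetch_window(signed: date, today: date, days: int = 14) -> bool:
    """True when an order's signed date falls inside the --days lookback window."""
    return today - timedelta(days=days) <= signed <= today
```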

§ 6

Enrichment Timeline for a New Order

From the moment an order is signed to appearing as an enriched card on the site:

Day 0
(signed today)
Order appears in Recently_Signed_Orders.xlsx. Stage 0h downloads the Excel and adds the record to recently_signed_orders.json as unenriched. NDIC may or may not have posted the PDF yet.
Day 0 → Day 1
Pipeline runs with --days 14 lookback. If the PDF is available on NDIC's servers, it is downloaded, OCR'd, and sent through Haiku extraction during Stage 0h.
Day 1
5 PM run
If the PDF was available: OCR → Haiku → enrich → deploy. The record is now an enriched card on the live site by ~6 PM.
Day 2–3
If NDIC hadn't posted the PDF yet: automatic retry happens on the next daily run. No manual action needed — the pipeline's --days 14 lookback catches it.
Check enrichment coverage manually
python3 -c "import json; d=json.load(open('website/recently_signed_orders.json')); o=d['orders']; e=sum(1 for x in o if x.get('_enriched_operator')); print(f'{e}/{len(o)} enriched ({100*e/len(o):.1f}%)')"
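The same check as a readable script, with the counting factored into a function. It uses the same file path, 'orders' key, and _enriched_operator check as the one-liner above:

```python
import json

def coverage(orders: list[dict]) -> tuple[int, int]:
    """Count enriched records using the same _enriched_operator check the website uses."""
    enriched = sum(1 for o in orders if o.get("_enriched_operator"))
    return enriched, len(orders)

if __name__ == "__main__":
    with open("website/recently_signed_orders.json") as f:
        orders = json.load(f)["orders"]
    e, n = coverage(orders)
    print(f"{e}/{n} enriched ({100 * e / n:.1f}%)")
```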
§ 7

Data Flow Diagram

End-to-end flow for the orders pipeline (Stage 0h → Stage 3):

NDIC DMR Excel (downloaded daily)
Recently_Signed_Orders.xlsx
└─ fetch_recently_signed_orders.py
   └─ parse_recently_signed_orders.py
      └─ website/recently_signed_orders.json  ← UNENRICHED (3,000+ records, years of history)
         [Last 14 days only — commercially actionable window]
         ├─ fetch_order_case_files.py --days 14
         │  └─ ND/orderfiles/pdfs/or{N}.pdf, ND/casefiles/pdfs/C{N}.pdf
         ├─ process_orders_cases_pipeline.py
         │  ├─ OCR: PDF → clean text
         │  │  └─ ND/orderfiles/cleantxt/or{N}_clean.txt
         │  ├─ Haiku: clean text → intelligence JSON
         │  │  └─ ND/orderfiles/llmjson/or{N}.json, ND/casefiles/llmjson/C{N}.json
         │  └─ HTML reports
         │     └─ website/orderhtml/, website/casehtml/
         └─ enrich_recently_signed_orders.py
            └─ website/recently_signed_orders.json  ← ENRICHED (adds _enriched_* fields in-place)
               [Stage 3: S3 sync + CloudFront]
               └─ website/recently_signed_orders.html (enriched cards sort first)
§ 8

Directory Reference

Key paths for the orders/cases pipeline:

ND/
├── ndic_index/data/
│   └── Recently_Signed_Orders.xlsx   ← source Excel, downloaded daily
├── orderfiles/
│   ├── pdfs/       ← downloaded PDFs (or{N}.pdf)
│   ├── cleantxt/   ← OCR'd clean text (or{N}_clean.txt)
│   └── llmjson/    ← Haiku intelligence JSONs (or{N}.json)
├── casefiles/
│   ├── pdfs/       ← downloaded PDFs (C{N}.pdf)
│   ├── cleantxt/   ← OCR'd clean text
│   └── llmjson/    ← Haiku intelligence JSONs (C{N}.json)
├── nddata/         ← nd_daily.json — well permit master data
└── wellfiles/
    ├── logs/       ← daily_automation_YYYYMMDD.log, ocr_processing_YYYYMMDD.log
    ├── pdfs/       ← wellfile PDFs
    └── llmjson*/   ← wellfile intelligence JSONs

website/
├── recently_signed_orders.json  ← master orders data, enriched in-place daily
├── recently_signed_orders.html  ← enriched-first sort, default 14-day filter
├── nd_daily.json                ← well permit data (2,100+ wells)
├── orderhtml/                   ← HTML reports linked from enriched order cards
├── casehtml/                    ← HTML reports linked from enriched case cards
├── html_reports/                ← wellfile standard reports
└── html_reports_i/              ← wellfile intelligence/permit reports

Log locations

Log file | Job
ND/wellfiles/logs/daily_automation_YYYYMMDD.log | Main 5 PM pipeline
ND/wellfiles/logs/ocr_processing_YYYYMMDD.log | 6 PM OCR job
ND/wellfiles/logs/retry_YYYYMMDD.log | Sunday 3 AM retry
ND/wellfiles/logs/update_detection_YYYYMMDD.log | Sunday 6 AM wellfile updates