Pipeline Overview

ND Oil & Gas Intelligence — Operational Reference
Production · 5 cron jobs · Weekdays 5 PM MST · Last updated: 2026-03-17
§ 1

Overview

The ND pipeline runs automatically on weekdays at 5 PM MST, ingesting North Dakota regulatory data — well permits, recently signed orders, and case filings — extracting commercial intelligence via Claude Haiku AI, and deploying enriched reports to oilgasorgrass.com.

The primary output is a set of HTML pages and JSON data files covering 2,100+ wells and hundreds of recent regulatory orders. Orders and cases from the last 14 days are processed through the full AI extraction pipeline and rendered as enriched intelligence cards. Older records remain in the dataset as historical context but are not enriched — they have no current commercial value for service companies acting on regulatory signals.

The pipeline is fully automated with no manual intervention required. A nightly OCR job at 6 PM runs separately to ensure OCR failures never block the main deployment.

Key question this document answers
Why are most records on recently_signed_orders.html unenriched — and when will they become enriched? See § 5 and § 6.
§ 2

Cron Schedule

All times are MST (America/Denver). Five active jobs:

Schedule | Job | Script | Purpose
Weekdays 5:00 PM | Daily pipeline | scripts/run_daily_pipeline.sh | Main data fetch → AI extraction → deploy
Every night 6:00 PM | OCR processing | scripts/run_ocr_processing.sh | OCR pending wellfile PDFs → Haiku → deploy
Sundays 3:00 AM | Retry missing | scripts/run_retry_missing.sh | Retry wellfiles with status='not_found'
1st of month 2:00 AM | Log cleanup | scripts/cleanup_old_logs.sh | Delete logs older than 90 days
Sundays 6:00 AM | Update detection | utilities/detect_wellfile_updates.py | Detect changed PDFs → diff pages → dashboard → deploy
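The five jobs above could be expressed as crontab entries along these lines (a sketch: CRON_TZ support and how the repo-relative paths resolve for the cron user are assumptions, the times come from the table):

```
CRON_TZ=America/Denver
0 17 * * 1-5  scripts/run_daily_pipeline.sh                        # weekdays 5:00 PM, main pipeline
0 18 * * *    scripts/run_ocr_processing.sh                        # nightly 6:00 PM, OCR
0 3  * * 0    scripts/run_retry_missing.sh                        # Sundays 3:00 AM, retry missing
0 2  1 * *    scripts/cleanup_old_logs.sh                         # 1st of month 2:00 AM, log cleanup
0 6  * * 0    python3 utilities/detect_wellfile_updates.py        # Sundays 6:00 AM, update detection
```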

Timing chain

5 PM → 6 PM → Sundays 3 AM / 6 AM. The main pipeline and OCR run back-to-back on weekday evenings; the weekly jobs handle retries and update detection on Sunday.

Why OCR is decoupled (Feb 12, 2026 fix)
Before Feb 12, OCR ran inside the 5 PM pipeline. When tesseract failed, it blocked all downstream deployments. After separating OCR to a standalone 6 PM job, OCR failures no longer prevent the main pipeline from delivering updated data to the site.
§ 3

Daily Pipeline Stages

Script: scripts/run_daily_pipeline.sh — runs weekdays at 5 PM MST.

Stage | What happens | Script | Output
0 | Fetch ND DMR daily PDFs | scripts/fetch_nd_daily_pdfs.py | ND/nddata/*.pdf
0a | Parse PDFs → structured well data | parsers/batch_ingest.py | ND/nddata/nd_daily.json
0b | Merge new wells into master dataset | parsers/merge_new_wells.py | nd_daily.json updated
0c | Backfill PLSS coordinates from scout tickets | parsers/backfill_plss_from_scout.py | Coordinates added inline
0d | Copy master data to website directory | cp | website/nd_daily.json
0e | Generate scout fetch list | parsers/generate_wells_list.py | parsers/wells_to_fetch.txt
0f | Fetch scout ticket JSON for new wells | parsers/fetch_scout_batch.py | website/scout_tickets/
0g | Generate regional context for new wells | scripts/batch_regional_fast.py | Regional metadata JSON
0h | Orders pipeline — full orchestration (see sub-steps below) | parsers/fetch_and_process_orders.py --days 14 | Enriched recently_signed_orders.json
1 | Wellfile orchestrator — 7-day lookback, OCR disabled | scripts/orchestrate_wellfile_pipeline.py --lookback 7 --no-ocr | Wellfile HTML reports
2 | Regenerate HTML report manifests | Inline Python | html_report_manifest.json, html_report_intelligence_manifest.json
2b | Alert on new Tier 1 permits (signal score ≥ 50) | Inline Python | Slack notification
2c | Generate pipeline monitor status | scripts/generate_pipeline_status.py | Pipeline status JSON
3 | S3 sync + CloudFront invalidation | website/scripts/deploy.sh | Live website updated
4 | Snowflake sync — push raw data to warehouse | scripts/run_snowflake_sync.sh | RAW tables updated
5 | dbt run + test + source freshness | dbt run --target prod | ANALYTICS views rebuilt
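Stage 2b's threshold check is simple to express. A minimal sketch, assuming a signal_score field name (the ≥ 50 threshold is from the stage table; the actual inline Python may differ):

```python
def tier1_permits(wells: list[dict], threshold: int = 50) -> list[dict]:
    """Select permits whose signal score meets the Tier 1 alert threshold (Stage 2b).

    The 'signal_score' key is a hypothetical field name; the >= 50 threshold
    comes from the stage table above.
    """
    return [w for w in wells if w.get("signal_score", 0) >= threshold]
```

Anything this returns would then be formatted into the Slack notification.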

Stage 0h — Orders Pipeline Sub-steps

fetch_and_process_orders.py is a combined orchestrator that calls individual scripts in sequence. A single entry point keeps the shell script clean while handling each concern separately:

Sub-step | Script | Output
Download Excel from NDIC DMR | parsers/fetch_recently_signed_orders.py | ND/ndic_index/data/Recently_Signed_Orders.xlsx
Parse Excel → JSON (unenriched) | parsers/parse_recently_signed_orders.py | website/recently_signed_orders.json
Fetch PDFs for last 14 days | parsers/fetch_order_case_files.py | ND/orderfiles/pdfs/, ND/casefiles/pdfs/
OCR PDFs + Haiku AI extraction + HTML reports | parsers/process_orders_cases_pipeline.py | ND/orderfiles/llmjson/or{N}.json, ND/casefiles/llmjson/C{N}.json, HTML reports
Enrich JSON in-place | parsers/enrich_recently_signed_orders.py | recently_signed_orders.json with _enriched_* fields added
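The sequencing itself can be sketched as a small runner. The step list mirrors the table above; the run_steps helper and its fail-fast behavior are assumptions about how the orchestrator works, not its actual code:

```python
import subprocess

# Sub-step scripts from the table above, in execution order.
ORDER_STEPS = [
    ["python3", "parsers/fetch_recently_signed_orders.py"],
    ["python3", "parsers/parse_recently_signed_orders.py"],
    ["python3", "parsers/fetch_order_case_files.py", "--days", "14"],
    ["python3", "parsers/process_orders_cases_pipeline.py"],
    ["python3", "parsers/enrich_recently_signed_orders.py"],
]

def run_steps(steps, runner=subprocess.run):
    """Run each step in order; return the index of the first failure, or None.

    `runner` is injectable for testing; stopping at the first failure is an
    assumption about the orchestrator's behavior.
    """
    for i, cmd in enumerate(steps):
        if runner(cmd).returncode != 0:
            return i
    return None
```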
§ 4

What Makes a Record "Enriched"

Enrichment is performed by parsers/enrich_recently_signed_orders.py. For each record in recently_signed_orders.json, the script looks for a matching Haiku intelligence JSON:

  • Order records — matched via Order No. → ND/orderfiles/llmjson/or{N}.json
  • Case records — matched via Case No. → ND/casefiles/llmjson/C{N}.json

When a match is found, _enriched_* fields are added directly to the record object and the file is saved in-place. The website detects enrichment by checking _enriched_operator — if populated, the enriched card layout is used; otherwise the plain unenriched card is shown.
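A minimal sketch of that matching step for orders, assuming the record's JSON key is "Order No." and showing only two of the enriched fields (the real script's internals may differ):

```python
import json
from pathlib import Path

def enrich_record(record: dict, llmjson_dir: Path) -> bool:
    """Attach _enriched_* fields in place if a Haiku intelligence JSON exists.

    The "Order No." key name is an assumption; the or{N}.json filename pattern
    and the _enriched_* field names come from this document.
    """
    order_no = str(record.get("Order No.", "")).strip()
    intel_path = llmjson_dir / f"or{order_no}.json"
    if not order_no or not intel_path.exists():
        return False  # no match: record stays unenriched
    intel = json.loads(intel_path.read_text())
    record["_enriched_operator"] = (
        intel.get("operator_and_well_intelligence", {}).get("operator")
    )
    # County is stripped of ", North Dakota" / " County" per the field table below.
    record["_enriched_county"] = (
        intel.get("order_identification", {})
        .get("county", "")
        .replace(", North Dakota", "")
        .replace(" County", "")
    )
    return True
```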

Order enrichment fields

Field | Source path in intelligence JSON | Description
_enriched_operator | operator_and_well_intelligence.operator | Operator company name
_enriched_county | order_identification.county | County (stripped of ", North Dakota" / " County")
_enriched_order_type | order_identification.order_type | Order classification
_enriched_primary_decision | regulatory_decision_signals[0].signal | Main regulatory decision (220 chars)
_enriched_decision_type | regulatory_decision_signals[0].decision_type | Category of decision
_enriched_why_notable | regulatory_decision_signals[0].why_notable | Why this decision matters (280 chars)
_enriched_commercial | regulatory_decision_signals[0].commercial_relevance | Commercial impact for service companies (280 chars)
_enriched_formations | operator_and_well_intelligence.target_formations | Target formations, max 2, standardized names
_enriched_service_cat | service_opportunity_signals[0].service_category | Primary service opportunity category
_enriched_service_timing | service_opportunity_signals[0].timing_indicator | Timing signal (e.g., "Immediate", "30–90 days")
_enriched_top_service | Combined category + timing | Display label for service card
_enriched_service_description | service_opportunity_signals[0].opportunity_description | Service opportunity detail (280 chars)
_enriched_service_why | service_opportunity_signals[0].why_relevant | Why relevant to service companies (280 chars)
_enriched_all_services | service_opportunity_signals[:3] | All service categories (up to 3) for tooltip
_enriched_confidence | confidence.level | Haiku confidence level
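Many fields above are the first element of a signals list, truncated to a character budget. A hypothetical helper showing that pattern (illustrative only, not the enrichment script's actual code):

```python
def first_signal_field(intel: dict, list_key: str, field: str, limit: int):
    """Return `field` from the first entry of a signals list, truncated to `limit` chars.

    Mirrors the `<list>[0].<field>` source paths in the table above.
    """
    signals = intel.get(list_key) or []
    value = signals[0].get(field) if signals else None
    return value[:limit] if isinstance(value, str) else value
```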

Case enrichment fields

Field | Source path in intelligence JSON | Description
_enriched_case_purpose | case_purpose_and_application.primary_purpose | What was applied for (220 chars)
_enriched_operator_request | case_purpose_and_application.operator_request | Specific ask (250 chars)
_enriched_filing_date | case_identification.filing_date | When the case was filed
_enriched_hearing_date | case_identification.hearing_date | Scheduled hearing date
_enriched_case_status | case_identification.case_status | Current case status
_enriched_case_service_cat | service_opportunity_signals[0].service_category | Service category
_enriched_case_service_timing | service_opportunity_signals[0].timing_indicator | Timing signal
_enriched_case_service_description | service_opportunity_signals[0].opportunity_description | Opportunity detail (280 chars)
_enriched_case_service_why | service_opportunity_signals[0].why_relevant | Why relevant (280 chars)
_enriched_case_all_services | service_opportunity_signals[:3] | All service categories for tooltip
_enriched_case_strategy | operator_and_well_intelligence.operator_strategy_signals[0].strategy | Operator strategy signal (200 chars)
§ 5

Why Most Records Are Not Enriched

Expected behavior — not a bug
A stat like "97% unenriched" is normal and intentional. The Excel source (Recently_Signed_Orders.xlsx) contains thousands of records going back years. Only the last 14 days of orders are commercially actionable — those are the only records for which PDFs are fetched and Haiku extraction is run.

The default date filter on recently_signed_orders.html focuses the view on the recent 14-day window precisely because enrichment coverage is high there. The overall "97% unenriched" figure includes years of historical records that have never been processed and never will be — they exist in the JSON for historical context only.

Three reasons a brand-new record may appear unenriched initially

# | Reason | Typical delay
1 | PDF fetch lag — NDIC sometimes posts PDFs 24–48 hours after the order is signed. No PDF = no OCR = no Haiku extraction. | 24–48 hours
2 | Pipeline runs once at 5 PM weekdays — if the PDF wasn't available at run time, enrichment waits until the next business day's run. | Next business day
3 | OCR or Haiku failure — malformed PDF, rate limit, or extraction error. The record stays unenriched until the next daily run succeeds. | 1–2 days

Records that will never be enriched (by design)

  • Orders signed more than ~30 days ago — outside the 14-day PDF fetch window
  • Historical records loaded from Excel without a corresponding PDF ever being fetched

These records remain in the JSON and are accessible via the date range filter, but are not intended to be enriched.
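The cutoff behind both lists is just the --days 14 lookback. A sketch of that window check (the date comparison itself is an assumption; the fetch script is only documented as receiving a day count):

```python
from datetime import date, timedelta

def in_fetch_window(signed: date, today: date, days: int = 14) -> bool:
    """True when an order's signed date falls inside the --days lookback window."""
    return today - timedelta(days=days) <= signed <= today
```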

§ 6

Enrichment Timeline for a New Order

From the moment an order is signed to appearing as an enriched card on the site:

Day 0
(signed today)
Order appears in Recently_Signed_Orders.xlsx. Stage 0h downloads the Excel and adds the record to recently_signed_orders.json as unenriched. NDIC may or may not have posted the PDF yet.
Day 0 → Day 1
Pipeline runs with --days 14 lookback. If the PDF is available on NDIC's servers, it is downloaded, OCR'd, and sent through Haiku extraction during Stage 0h.
Day 1
5 PM run
If the PDF was available: OCR → Haiku → enrich → deploy. The record is now an enriched card on the live site by ~6 PM.
Day 2–3
If NDIC hadn't posted the PDF yet: automatic retry happens on the next daily run. No manual action needed — the pipeline's --days 14 lookback catches it.
Check enrichment coverage manually
python3 -c "import json; d=json.load(open('website/recently_signed_orders.json')); o=d['orders']; e=sum(1 for x in o if x.get('_enriched_operator')); print(f'{e}/{len(o)} enriched ({100*e/len(o):.1f}%)')"
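The same check as a readable script, with the counting factored into a function. It uses the same file path, 'orders' key, and _enriched_operator check as the one-liner above:

```python
import json

def coverage(orders: list[dict]) -> tuple[int, int]:
    """Count enriched records using the same _enriched_operator check the website uses."""
    enriched = sum(1 for o in orders if o.get("_enriched_operator"))
    return enriched, len(orders)

if __name__ == "__main__":
    with open("website/recently_signed_orders.json") as f:
        orders = json.load(f)["orders"]
    e, n = coverage(orders)
    print(f"{e}/{n} enriched ({100 * e / n:.1f}%)")
```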
§ 7

Data Flow Diagram

End-to-end flow for the orders pipeline (Stage 0h → Stage 3):

NDIC DMR Excel (downloaded daily)
Recently_Signed_Orders.xlsx
└─ fetch_recently_signed_orders.py
   └─ parse_recently_signed_orders.py
      └─ website/recently_signed_orders.json  ← UNENRICHED (3,000+ records, years of history)
         [Last 14 days only — commercially actionable window]
         ├─ fetch_order_case_files.py --days 14
         │  └─ ND/orderfiles/pdfs/or{N}.pdf, ND/casefiles/pdfs/C{N}.pdf
         ├─ process_orders_cases_pipeline.py
         │  ├─ OCR: PDF → clean text
         │  │  └─ ND/orderfiles/cleantxt/or{N}_clean.txt
         │  ├─ Haiku: clean text → intelligence JSON
         │  │  └─ ND/orderfiles/llmjson/or{N}.json, ND/casefiles/llmjson/C{N}.json
         │  └─ HTML reports
         │     └─ website/orderhtml/, website/casehtml/
         └─ enrich_recently_signed_orders.py
            └─ website/recently_signed_orders.json  ← ENRICHED (adds _enriched_* fields in-place)
               [Stage 3: S3 sync + CloudFront]
               └─ website/recently_signed_orders.html (enriched cards sort first)
§ 8

Directory Reference

Key paths for the orders/cases pipeline:

ND/
├── ndic_index/data/
│   └── Recently_Signed_Orders.xlsx   ← source Excel, downloaded daily
├── orderfiles/
│   ├── pdfs/       ← downloaded PDFs (or{N}.pdf)
│   ├── cleantxt/   ← OCR'd clean text (or{N}_clean.txt)
│   └── llmjson/    ← Haiku intelligence JSONs (or{N}.json)
├── casefiles/
│   ├── pdfs/       ← downloaded PDFs (C{N}.pdf)
│   ├── cleantxt/   ← OCR'd clean text
│   └── llmjson/    ← Haiku intelligence JSONs (C{N}.json)
├── nddata/         ← nd_daily.json — well permit master data
└── wellfiles/
    ├── logs/       ← daily_automation_YYYYMMDD.log, ocr_processing_YYYYMMDD.log
    ├── pdfs/       ← wellfile PDFs
    └── llmjson*/   ← wellfile intelligence JSONs

website/
├── recently_signed_orders.json  ← master orders data, enriched in-place daily
├── recently_signed_orders.html  ← enriched-first sort, default 14-day filter
├── nd_daily.json                ← well permit data (2,100+ wells)
├── orderhtml/                   ← HTML reports linked from enriched order cards
├── casehtml/                    ← HTML reports linked from enriched case cards
├── html_reports/                ← wellfile standard reports
└── html_reports_i/              ← wellfile intelligence/permit reports

Log locations

Log file | Job
ND/wellfiles/logs/daily_automation_YYYYMMDD.log | Main 5 PM pipeline
ND/wellfiles/logs/ocr_processing_YYYYMMDD.log | 6 PM OCR job
ND/wellfiles/logs/retry_YYYYMMDD.log | Sunday 3 AM retry
ND/wellfiles/logs/update_detection_YYYYMMDD.log | Sunday 6 AM wellfile updates