Overview
The ND pipeline runs automatically on weekdays at 5 PM MST. It ingests North Dakota regulatory data (well permits, recently signed orders, and case filings), extracts commercial intelligence with Claude Haiku, and deploys enriched reports to oilgasorgrass.com.
The primary output is a set of HTML pages and JSON data files covering 2,100+ wells and hundreds of recent regulatory orders. Orders and cases from the last 14 days are processed through the full AI extraction pipeline and rendered as enriched intelligence cards. Older records remain in the dataset as historical context but are not enriched — they have no current commercial value for service companies acting on regulatory signals.
The pipeline is fully automated with no manual intervention required. A nightly OCR job at 6 PM runs separately to ensure OCR failures never block the main deployment.
Cron Schedule
All times are MST (America/Denver). Five active jobs:
| Schedule | Job | Script | Purpose |
|---|---|---|---|
| Weekdays 5:00 PM | Daily pipeline | scripts/run_daily_pipeline.sh | Main data fetch → AI extraction → deploy |
| Every night 6:00 PM | OCR processing | scripts/run_ocr_processing.sh | OCR pending wellfile PDFs → Haiku → deploy |
| Sundays 3:00 AM | Retry missing | scripts/run_retry_missing.sh | Retry wellfiles with status='not_found' |
| 1st of month 2:00 AM | Log cleanup | scripts/cleanup_old_logs.sh | Delete logs older than 90 days |
| Sundays 6:00 AM | Update detection | utilities/detect_wellfile_updates.py | Detect changed PDFs → diff pages → dashboard → deploy |
Timing chain
5 PM → 6 PM → Sundays 3 AM / 6 AM. The main pipeline and OCR run back-to-back on weekday evenings; the weekly jobs handle retries and update detection on Sunday.
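The five entries above map to a crontab of roughly the following shape. This is an illustrative sketch only: the repo root path, the use of cd, and CRON_TZ support are assumptions, not the installed crontab.

```shell
# Illustrative crontab sketch (times in America/Denver; assumes a cron with CRON_TZ support).
# /path/to/repo is a placeholder for the actual checkout location.
CRON_TZ=America/Denver

# Weekdays 5:00 PM: main pipeline (fetch -> AI extraction -> deploy)
0 17 * * 1-5  cd /path/to/repo && scripts/run_daily_pipeline.sh
# Every night 6:00 PM: OCR pending wellfile PDFs
0 18 * * *    cd /path/to/repo && scripts/run_ocr_processing.sh
# Sundays 3:00 AM: retry wellfiles with status='not_found'
0 3 * * 0     cd /path/to/repo && scripts/run_retry_missing.sh
# 1st of month 2:00 AM: delete logs older than 90 days
0 2 1 * *     cd /path/to/repo && scripts/cleanup_old_logs.sh
# Sundays 6:00 AM: detect changed wellfile PDFs and deploy the dashboard
0 6 * * 0     cd /path/to/repo && python3 utilities/detect_wellfile_updates.py
```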
Daily Pipeline Stages
Script: scripts/run_daily_pipeline.sh — runs weekdays at 5 PM MST.
| Stage | What happens | Script | Output |
|---|---|---|---|
| 0 | Fetch ND DMR daily PDFs | scripts/fetch_nd_daily_pdfs.py | ND/nddata/*.pdf |
| 0a | Parse PDFs → structured well data | parsers/batch_ingest.py | ND/nddata/nd_daily.json |
| 0b | Merge new wells into master dataset | parsers/merge_new_wells.py | nd_daily.json updated |
| 0c | Backfill PLSS coordinates from scout tickets | parsers/backfill_plss_from_scout.py | Coordinates added inline |
| 0d | Copy master data to website directory | cp | website/nd_daily.json |
| 0e | Generate scout fetch list | parsers/generate_wells_list.py | parsers/wells_to_fetch.txt |
| 0f | Fetch scout ticket JSON for new wells | parsers/fetch_scout_batch.py | website/scout_tickets/ |
| 0g | Generate regional context for new wells | scripts/batch_regional_fast.py | Regional metadata JSON |
| 0h | Orders pipeline — full orchestration (see sub-steps below) | parsers/fetch_and_process_orders.py --days 14 | Enriched recently_signed_orders.json |
| 1 | Wellfile orchestrator — 7-day lookback, OCR disabled | scripts/orchestrate_wellfile_pipeline.py --lookback 7 --no-ocr | Wellfile HTML reports |
| 2 | Regenerate HTML report manifests | Inline Python | html_report_manifest.json, html_report_intelligence_manifest.json |
| 2b | Alert on new Tier 1 permits (signal score ≥ 50) | Inline Python | Slack notification |
| 2c | Generate pipeline monitor status | scripts/generate_pipeline_status.py | Pipeline status JSON |
| 3 | S3 sync + CloudFront invalidation | website/scripts/deploy.sh | Live website updated |
| 4 | Snowflake sync — push raw data to warehouse | scripts/run_snowflake_sync.sh | RAW tables updated |
| 5 | dbt run + test + source freshness | dbt run --target prod | ANALYTICS views rebuilt |
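Stage 2's "Inline Python" is essentially a directory scan serialized to JSON. A minimal sketch of the idea, assuming a flat directory of *.html reports; the field names below are invented, not the real manifest schema:

```python
import json
import tempfile
from pathlib import Path

def build_manifest(report_dir, out_path):
    """Scan a directory of HTML reports and write a JSON manifest.

    Sketch only: field names here are assumptions, not the real schema.
    """
    reports = sorted(Path(report_dir).glob("*.html"))
    manifest = [{"file": p.name, "bytes": p.stat().st_size} for p in reports]
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest

# Tiny demo against a throwaway directory
tmp = Path(tempfile.mkdtemp())
(tmp / "well_33053_01234.html").write_text("<html></html>")
manifest = build_manifest(tmp, tmp / "html_report_manifest.json")
print([entry["file"] for entry in manifest])
```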
Stage 0h — Orders Pipeline Sub-steps
fetch_and_process_orders.py is a combined orchestrator that calls individual scripts
in sequence. A single entry point keeps the shell script clean while handling each concern
separately:
| Sub-step | Script | Output |
|---|---|---|
| Download Excel from NDIC DMR | parsers/fetch_recently_signed_orders.py | ND/ndic_index/data/Recently_Signed_Orders.xlsx |
| Parse Excel → JSON (unenriched) | parsers/parse_recently_signed_orders.py | website/recently_signed_orders.json |
| Fetch PDFs for last 14 days | parsers/fetch_order_case_files.py | ND/orderfiles/pdfs/, ND/casefiles/pdfs/ |
| OCR PDFs + Haiku AI extraction + HTML reports | parsers/process_orders_cases_pipeline.py | ND/orderfiles/llmjson/or{N}.json, ND/casefiles/llmjson/C{N}.json, HTML reports |
| Enrich JSON in-place | parsers/enrich_recently_signed_orders.py | recently_signed_orders.json with _enriched_* fields added |
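The orchestration pattern can be sketched as sequential subprocess calls that abort on the first failure. The wrapper below is an assumption about the shape of the orchestrator, not its actual code, and the per-step CLI flags are illustrative:

```python
import subprocess
import sys

# Sub-steps in the documented order; each must succeed before the next runs.
# Flag placement on the individual scripts is an assumption.
STEPS = [
    [sys.executable, "parsers/fetch_recently_signed_orders.py"],
    [sys.executable, "parsers/parse_recently_signed_orders.py"],
    [sys.executable, "parsers/fetch_order_case_files.py", "--days", "14"],
    [sys.executable, "parsers/process_orders_cases_pipeline.py"],
    [sys.executable, "parsers/enrich_recently_signed_orders.py"],
]

def run_steps(steps):
    """Run each step in order; stop and return False on the first non-zero exit."""
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"step failed: {cmd[1]}", file=sys.stderr)
            return False
    return True
```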
What Makes a Record "Enriched"
Enrichment is performed by parsers/enrich_recently_signed_orders.py. For each
record in recently_signed_orders.json, the script looks for a matching Haiku
intelligence JSON:
- Order records — matched via Order No. → ND/orderfiles/llmjson/or{N}.json
- Case records — matched via Case No. → ND/casefiles/llmjson/C{N}.json
When a match is found, _enriched_* fields are added directly to the record object
and the file is saved in-place. The website detects enrichment by checking
_enriched_operator — if populated, the enriched card layout is used; otherwise the
plain unenriched card is shown.
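The match-then-merge behavior can be sketched as follows. This follows the matching rule described above (Order No. → or{N}.json), but the function itself is a simplified stand-in: the real script also handles case records and maps many more fields.

```python
import json
from pathlib import Path

def enrich_orders_in_place(records_path, orderfiles_llmjson_dir):
    """Add _enriched_* fields to order records that have a matching
    Haiku intelligence JSON (or{N}.json), then save the file in place.

    Sketch only: maps a single field; the real script maps the full set.
    """
    records = json.loads(Path(records_path).read_text())
    enriched = 0
    for rec in records:
        order_no = rec.get("Order No.")
        intel_path = Path(orderfiles_llmjson_dir) / f"or{order_no}.json"
        if order_no and intel_path.exists():
            intel = json.loads(intel_path.read_text())
            rec["_enriched_operator"] = (
                intel.get("operator_and_well_intelligence", {}).get("operator")
            )
            enriched += 1
    # Save in place; records without a match stay unenriched.
    Path(records_path).write_text(json.dumps(records, indent=2))
    return enriched
```

The website's check for _enriched_operator then works unchanged: matched records carry the field, unmatched ones do not.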
Order enrichment fields
| Field | Source path in intelligence JSON | Description |
|---|---|---|
| _enriched_operator | operator_and_well_intelligence.operator | Operator company name |
| _enriched_county | order_identification.county | County (stripped of ", North Dakota" / " County") |
| _enriched_order_type | order_identification.order_type | Order classification |
| _enriched_primary_decision | regulatory_decision_signals[0].signal | Main regulatory decision (220 chars) |
| _enriched_decision_type | regulatory_decision_signals[0].decision_type | Category of decision |
| _enriched_why_notable | regulatory_decision_signals[0].why_notable | Why this decision matters (280 chars) |
| _enriched_commercial | regulatory_decision_signals[0].commercial_relevance | Commercial impact for service companies (280 chars) |
| _enriched_formations | operator_and_well_intelligence.target_formations | Target formations, max 2, standardized names |
| _enriched_service_cat | service_opportunity_signals[0].service_category | Primary service opportunity category |
| _enriched_service_timing | service_opportunity_signals[0].timing_indicator | Timing signal (e.g., "Immediate", "30–90 days") |
| _enriched_top_service | Combined category + timing | Display label for service card |
| _enriched_service_description | service_opportunity_signals[0].opportunity_description | Service opportunity detail (280 chars) |
| _enriched_service_why | service_opportunity_signals[0].why_relevant | Why relevant to service companies (280 chars) |
| _enriched_all_services | service_opportunity_signals[:3] | All service categories (up to 3) for tooltip |
| _enriched_confidence | confidence.level | Haiku confidence level |
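The dotted source paths and character limits in the table suggest a small extract-and-truncate helper. The sketch below is an invented illustration of that pattern, not the actual enrichment code; paths and limits are taken from the table:

```python
def pluck(intel, path, limit=None):
    """Walk a dotted path like 'regulatory_decision_signals[0].signal'
    through nested dicts/lists; optionally truncate string values.

    Hypothetical helper illustrating the table's source paths and char limits.
    """
    cur = intel
    for part in path.replace("]", "").split("."):
        if "[" in part:
            key, idx = part.split("[")
            items = cur.get(key, []) if isinstance(cur, dict) else []
            cur = items[int(idx)] if len(items) > int(idx) else None
        elif isinstance(cur, dict):
            cur = cur.get(part)
        else:
            cur = None
        if cur is None:
            return None  # missing signal -> field stays absent
    if isinstance(cur, str) and limit is not None:
        cur = cur[:limit]
    return cur
```

Used like pluck(intel, "regulatory_decision_signals[0].signal", 220), which returns None when the intelligence JSON lacks that signal rather than raising.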
Case enrichment fields
| Field | Source path in intelligence JSON | Description |
|---|---|---|
| _enriched_case_purpose | case_purpose_and_application.primary_purpose | What was applied for (220 chars) |
| _enriched_operator_request | case_purpose_and_application.operator_request | Specific ask (250 chars) |
| _enriched_filing_date | case_identification.filing_date | When the case was filed |
| _enriched_hearing_date | case_identification.hearing_date | Scheduled hearing date |
| _enriched_case_status | case_identification.case_status | Current case status |
| _enriched_case_service_cat | service_opportunity_signals[0].service_category | Service category |
| _enriched_case_service_timing | service_opportunity_signals[0].timing_indicator | Timing signal |
| _enriched_case_service_description | service_opportunity_signals[0].opportunity_description | Opportunity detail (280 chars) |
| _enriched_case_service_why | service_opportunity_signals[0].why_relevant | Why relevant (280 chars) |
| _enriched_case_all_services | service_opportunity_signals[:3] | All service categories for tooltip |
| _enriched_case_strategy | operator_and_well_intelligence.operator_strategy_signals[0].strategy | Operator strategy signal (200 chars) |
Why Most Records Are Not Enriched
The source Excel file, Recently_Signed_Orders.xlsx, contains thousands of records going back years.
Only the last 14 days of orders are commercially actionable — those are the
only records for which PDFs are fetched and Haiku extraction is run.
The default date filter on recently_signed_orders.html focuses the view on the
recent 14-day window precisely because enrichment coverage is high there. The overall "97%
unenriched" figure includes years of historical records that have never been processed and never
will be — they exist in the JSON for historical context only.
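The coverage split can be checked directly from the JSON. A sketch of computing enrichment coverage inside versus outside the 14-day window; the "signed" date field is an assumed normalization of the Excel's signed-date column, not a documented key:

```python
from datetime import date, timedelta

def enrichment_coverage(records, today=None, window_days=14):
    """Return (recent_fraction, old_fraction) of records carrying
    _enriched_operator, split at the window cutoff.

    Sketch only: assumes each record has a 'signed' date already parsed.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    recent = [r for r in records if r["signed"] >= cutoff]
    old = [r for r in records if r["signed"] < cutoff]

    def frac(rs):
        return sum(1 for r in rs if r.get("_enriched_operator")) / len(rs) if rs else 0.0

    return frac(recent), frac(old)
```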
Three reasons a brand-new record may appear unenriched initially
| # | Reason | Typical delay |
|---|---|---|
| 1 | PDF fetch lag — NDIC sometimes posts PDFs 24–48 hours after the order is signed. No PDF = no OCR = no Haiku extraction. | 24–48 hours |
| 2 | Pipeline runs once at 5 PM weekdays — If the PDF wasn't available at run time, enrichment waits until the next business day's run. | Next business day |
| 3 | OCR or Haiku failure — Malformed PDF, rate limit, or extraction error. The record stays unenriched until the next daily run succeeds. | 1–2 days |
Records that will never be enriched (by design)
- Orders signed more than 14 days ago — they have aged out of the --days 14 PDF fetch window and will not be fetched going forward
- Historical records loaded from Excel without a corresponding PDF ever being fetched
These records remain in the JSON and are accessible via the date range filter, but are not intended to be enriched.
Enrichment Timeline for a New Order
From the moment an order is signed to appearing as an enriched card on the site:
- Day 0 (signed today) — NDIC adds the order to Recently_Signed_Orders.xlsx. The next Stage 0h run downloads the Excel and adds the record to recently_signed_orders.json as unenriched. NDIC may or may not have posted the PDF yet.
- First weekday 5 PM run — Stage 0h fetches order PDFs with the --days 14 lookback. If the PDF is available on NDIC's servers, it is downloaded, OCR'd, and sent through Haiku extraction during Stage 0h, and the record deploys as an enriched card.
- Later 5 PM runs — if NDIC posted the PDF late (24–48 hours is common), the next run's --days 14 lookback catches it.
Data Flow Diagram
End-to-end flow for the orders pipeline (Stage 0h → Stage 3): NDIC Excel download → parse to unenriched recently_signed_orders.json → 14-day PDF fetch → OCR + Haiku extraction → in-place enrichment → S3 sync + CloudFront invalidation → live site.
Directory Reference
Key paths for the orders/cases pipeline:
- ND/ndic_index/data/Recently_Signed_Orders.xlsx — source Excel downloaded from NDIC DMR
- website/recently_signed_orders.json — parsed records served to the site, enriched in place
- ND/orderfiles/pdfs/ and ND/casefiles/pdfs/ — downloaded order and case PDFs
- ND/orderfiles/llmjson/or{N}.json and ND/casefiles/llmjson/C{N}.json — Haiku intelligence JSON
Log locations
| Log file | Job |
|---|---|
| ND/wellfiles/logs/daily_automation_YYYYMMDD.log | Main 5 PM pipeline |
| ND/wellfiles/logs/ocr_processing_YYYYMMDD.log | 6 PM OCR job |
| ND/wellfiles/logs/retry_YYYYMMDD.log | Sunday 3 AM retry |
| ND/wellfiles/logs/update_detection_YYYYMMDD.log | Sunday 6 AM wellfile updates |