Full Transcript
💡 Course Enrolment Link: https://www.geekcoders.co.in/courses/Practice-50-PySpark-Interview-Questions-65f317e33210a77d47c6afaf

Script: df = spark.createDataFrame([(1234,), (3245,), (2435,), (13,)], ["id"])

The project automates procurement document validation by building a pipeline in Databricks using LLMs, embeddings, Delta Lake, and Unity Catalog.

📊 Architecture Breakdown

1. Source System & Staging
PDFs for PO, GR, and Invoice are stored in Google Drive (ADLS/S3)
Pulled into a staging volume using Databricks

2. Bronze Layer – Unstructured Text
PDF documents are read and parsed (DLT)
Unstructured content is stored in Delta Lake as raw text

3. LLM Foundation Model
A Large Language Model (LLM) is applied to extract structured fields such as vendor_name, po_number, quantity, price, invoice_amount, etc.

4. Silver Layer – Structured Format
Clean, structured data is stored in Delta Lake tables
Embeddings are generated and stored for semantic search

5. Vector Search Indexing
POs, GRs, and Invoices are indexed using Vector Search (optional)
Each document type is compared using semantic matching

6. Matching Logic
The LLM layer performs 2-way or 3-way match logic: PO ↔ GR ↔ Invoice
If matched: data is pushed to the Gold Layer and shown on the dashboard
If not matched: email alerts are triggered for mismatches

7. Gold Layer + Reporting
Final matched records are stored in Delta Lake
Dashboard built for insights (e.g., discrepancy rate, vendor performance, total invoice mismatch)

8. Databricks Components Used
✅ Unity Catalog – Data governance, tables, masking
✅ Model Serving – LLM/embedding inference
✅ Pipelines – DLT and workflow automation
✅ SQL Alerts – Alert on mismatches
✅ Workflows – Schedule and manage pipeline
✅ Governance – SPN, Group-based access

✅ Key Benefits
🔍 Automated document parsing and validation
🤖 AI-powered matching using LLM and embeddings
📈 Real-time dashboards and discrepancy alerts
📧 Auto notifications for non-matching records
🔐 Scalable and governed architecture using Unity Catalog

Course link: https://www.geekcoders.co.in/courses/Databricks-End---End-Project--with-Delta-Lake-and-Gen-AI-685108965033f64f53e47035

#interview #microsoft #dataengineering
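
To make the PO ↔ GR ↔ Invoice matching step more concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the description says an LLM layer performs the match, while this sketch swaps in simple deterministic rules to show the routing idea (matched records go to Gold, mismatches raise an alert). The field names vendor_name, po_number, and quantity come from the description; the DocRecord class, the single amount field, the amount_tolerance parameter, and the function names three_way_match and route are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DocRecord:
    """Hypothetical structured record, as the Silver layer might hold it."""
    po_number: str
    vendor_name: str
    quantity: int
    amount: float  # assumed: total value (invoice_amount for invoices)


def three_way_match(
    po: DocRecord,
    gr: Optional[DocRecord],
    invoice: Optional[DocRecord],
    amount_tolerance: float = 0.01,  # assumed rounding tolerance
) -> List[str]:
    """Return a list of mismatch reasons; an empty list means a full match."""
    issues: List[str] = []

    # PO vs. goods receipt: was the ordered quantity actually received?
    if gr is None:
        issues.append("goods receipt missing")
    elif gr.quantity != po.quantity:
        issues.append(f"quantity: PO={po.quantity}, GR={gr.quantity}")

    # PO vs. invoice: same vendor, same quantity, same amount?
    if invoice is None:
        issues.append("invoice missing")
    else:
        if invoice.vendor_name.strip().lower() != po.vendor_name.strip().lower():
            issues.append("vendor_name differs between PO and invoice")
        if invoice.quantity != po.quantity:
            issues.append(f"quantity: PO={po.quantity}, Invoice={invoice.quantity}")
        if abs(invoice.amount - po.amount) > amount_tolerance:
            issues.append(f"amount: PO={po.amount}, Invoice={invoice.amount}")

    return issues


def route(
    po: DocRecord, gr: Optional[DocRecord], invoice: Optional[DocRecord]
) -> Tuple[str, List[str]]:
    """Send matched records to 'gold' and mismatches to 'alert'."""
    issues = three_way_match(po, gr, invoice)
    return ("gold", []) if not issues else ("alert", issues)


# Example records (made-up values, for illustration only).
po = DocRecord("PO-1", "Acme", 10, 100.0)
gr = DocRecord("PO-1", "Acme", 10, 100.0)
good_invoice = DocRecord("PO-1", "acme ", 10, 100.0)
bad_invoice = DocRecord("PO-1", "Acme", 8, 80.0)

matched = route(po, gr, good_invoice)
mismatched = route(po, gr, bad_invoice)
```

Passing None for the invoice or the goods receipt reports the document as missing. In the pipeline described above, the "alert" branch would correspond to the email notification and the "gold" branch to the write into the Gold Layer.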
