Speech Summary:
- AI is an umbrella term including classic machine learning and Generative AI, which generates new content.
- Large Language Models (LLMs) like ChatGPT act as advanced autocomplete systems trained on vast datasets.
- Prompt engineering is essential for obtaining accurate and relevant responses from LLMs.
- Data types: unstructured (text/documents), semi-structured (CSV, Excel), and structured (SQL databases).
- Two Gen AI approaches:
  - Retrieval Augmented Generation (RAG): supplements each query with relevant external documents at query time, giving more accurate and up-to-date answers.
  - Fine-tuning: further trains an LLM on specific data, but is costly and therefore done infrequently.
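To make the RAG idea concrete, here is a minimal sketch in Python. It is an illustration, not the speaker's implementation: the sample documents, the function names, and the bag-of-words similarity (used in place of a real embedding model) are all assumptions, and the final prompt would be sent to whichever LLM is in use.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words term-count vector (stand-in for real embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = vectorize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=2):
    """Augment the question with the retrieved documents before it goes to an LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical knowledge base, invented for this example.
DOCS = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days within the country.",
]

prompt = build_prompt("What is the refund policy for returns?", DOCS, k=1)
print(prompt)
```

The model itself is never retrained here; only the prompt changes, which is why RAG can reflect new documents immediately while fine-tuning cannot.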
Gen AI on Semi-Structured Data (ChatGPT Use Case):
- ChatGPT accepts uploaded files of up to 50MB, such as CSVs, and can interpret them for viewing, querying, and cleaning data.
- Demonstrations include filtering movies by rating, transforming data categories, and enriching data by adding actor information via web lookups.
- Accuracy has improved significantly with version 4.0, enabling handling of hundreds of rows.
- Deep research mode offers enhanced accuracy but takes longer (5-10 minutes).
- Alternatives like AskOnData.com provide advanced data querying but may require payment.
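For comparison, the filtering and category clean-up described above can be sketched by hand in Python. This is only an assumed equivalent of what such a request does: the sample rows, column names, and the 7.5 threshold are invented for illustration and are not the data from the demonstration.

```python
import csv
import io

# Hypothetical sample data standing in for an uploaded movies CSV.
MOVIES_CSV = """title,genre,rating
Alpha,Drama,8.4
Bravo,Comedy,6.1
Charlie,Sci-Fi,
Delta,drama,7.9
"""

def load_movies(text):
    """Parse the CSV, convert ratings to floats, and normalize genre capitalization."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["rating"].strip()
        row["rating"] = float(raw) if raw else None  # keep missing ratings as None
        row["genre"] = row["genre"].strip().title()  # "drama" -> "Drama"
        rows.append(row)
    return rows

def filter_by_rating(rows, minimum):
    """Keep only movies that have a rating at or above the minimum."""
    return [r for r in rows if r["rating"] is not None and r["rating"] >= minimum]

movies = load_movies(MOVIES_CSV)
top = filter_by_rating(movies, 7.5)
print([m["title"] for m in top])
```

Checking a generated result against a small hand-written filter like this is one way to verify accuracy on a sample before trusting it on hundreds of rows.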
Microsoft Fabric on Structured Data:
- Microsoft Fabric is a comprehensive platform combining ETL, data warehousing, streaming, Spark notebooks, and Power BI.
- Introduces "Data Agents" (formerly AI Skills) to enable natural language querying of structured data.
- Users create and configure data agents by selecting relevant tables and providing hints for accuracy.
- Data agents translate users' plain-English questions into SQL queries that run against the selected data, while supporting role-based security.
- Users can refine queries by providing corrected SQL or hints to improve response accuracy.
- Published data agents can be accessed via URLs in notebooks or other applications, including Microsoft Teams.
- Fabric’s Copilot feature allows interactive chat with data models and is designed for non-technical end users.
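The question-to-SQL flow can be illustrated with a small Python sketch. It does not use the Fabric API: SQLite stands in for the warehouse, the `sales` table is invented, and a hard-coded dictionary (`GENERATED_SQL`, a hypothetical name) stands in for the LLM step that a real data agent performs.

```python
import sqlite3

# Hypothetical stand-in for the LLM step: a real data agent generates this SQL
# from the question, the selected tables, and any hints the author provided.
GENERATED_SQL = {
    "What were total sales per region?":
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region",
}

def ask(conn, question):
    """Look up the SQL for a question and run it against the selected data."""
    sql = GENERATED_SQL[question]
    return sql, conn.execute(sql).fetchall()

# Invented sample table playing the role of the data selected for the agent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0)],
)

sql, rows = ask(conn, "What were total sales per region?")
print(sql)   # the SQL a user could inspect and correct
print(rows)  # the answer returned to the user
```

Returning the SQL alongside the rows mirrors the refinement step in the summary: a user who sees a wrong query can supply the corrected SQL or a hint.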
Integration and Future Developments:
- Unstructured, semi-structured, and structured data can be combined through vector databases, which retrieve the relevant documents first so that only those are passed to the LLM.
- Pairing that relevance filtering with data agents, which handle the structured data, leads to more comprehensive answers.
- Upcoming enhancements in Fabric aim to incorporate semi-structured and unstructured data for consolidated querying.
- Competitors like Amazon and Google are expected to release similar solutions.
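As a rough sketch of that combination in Python: documents are filtered by vector similarity, structured rows come from SQL, and both are gathered into one context for the LLM. Everything here is assumed for illustration, including the hand-made 3-dimensional vectors (a real system would obtain embeddings from a model and store them in a vector database), the 0.8 threshold, and the `orders` table.

```python
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hand-made toy "embeddings"; invented values, not output from a real model.
NOTES = [
    ("Customer 7 complained about a late delivery.", [0.9, 0.1, 0.0]),
    ("Quarterly all-hands meeting moved to Friday.", [0.0, 0.2, 0.9]),
    ("Customer 7 asked for a refund on order 42.", [0.8, 0.3, 0.1]),
]

def relevant_notes(query_vec, threshold=0.8):
    """Keep only the documents whose vectors are close to the query vector."""
    return [text for text, vec in NOTES if cosine(query_vec, vec) >= threshold]

# Invented structured data, queried the way a data agent's SQL would be.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(42, 7, 19.99), (43, 8, 5.00)])

query_vec = [1.0, 0.2, 0.0]  # assumed embedding of "What happened with customer 7?"
rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE customer_id = ?", (7,)
).fetchall()
notes = relevant_notes(query_vec)

# Both sources end up in a single context handed to the LLM.
context = {"structured_rows": rows, "documents": notes}
print(context)
```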
Industry Use Cases:
- Healthcare: Integrating patient records, clinical notes, and transcription to enhance diagnosis and treatment with LLMs.
- Personalized Marketing: Combining structured purchase history with social media and product reviews for targeted campaigns.
- Fraud Detection: Leveraging transaction data, call transcripts, legal documents, and emails to identify fraudulent activities.
- Sports Analytics: Analyzing player stats, coaching notes, and game logs to optimize strategy in real time or before a match.
Actionable Items:
- Attendees can request a free PDF copy of James Serra’s book by emailing him and mentioning attendance.
- Explore creating data agents in Microsoft Fabric to enable natural language querying on selected structured data.
- Consider integrating vector databases with LLMs for multi-format data querying.
- Experiment with ChatGPT for semi-structured data cleaning and enrichment tasks.
- Keep watch for evolving capabilities in Fabric and competing AI data platforms.
Using Generative AI on Structured Data
11:00 - 11:30, 28th of May (Wednesday) 2025 / DEV TRENDS STAGE
Generative AI, traditionally used for processing unstructured text, is rapidly advancing to handle structured data like relational databases, spreadsheets, and CSV files. New tools now enable AI to extract meaningful insights, identify patterns, and generate predictions from structured datasets. This presentation will explore how AI transforms our interaction with structured data, providing practical applications for enhanced automation, decision-making, and efficiency in data analysis. I will discuss ChatGPT, Copilot, and Microsoft Fabric AI Skill, provide a level-set on GenAI definitions, RAG, and fine-tuning, and cover industry use cases that combine unstructured and structured data to make better business decisions.