library / lib4437afc86b2cacdf
Data Analysis with LLMs
Immanuel Trummer · 2025
In a sentence
A hands-on guide showing developers and data scientists how to use large language models—across text, tables, images, audio, and graphs—to build effective, cost-efficient data analysis pipelines in Python.
Data Analysis with LLMs by Cornell professor Immanuel Trummer is the practical field manual every data practitioner needs to exploit the transformative capabilities of modern language models. Starting from first principles—what a prompt is, how tokenization works, why few-shot examples help—the book walks readers step by step through real Python mini-projects that classify text, extract structured information, cluster documents, translate natural language into SQL and Cypher queries, answer questions about images and videos, transcribe and translate audio, and build voice-driven database interfaces. It then tackles the hard economic problem every production team faces: how to get high-quality results without overpaying. Chapters on model selection, parameter tuning, prompt engineering, and fine-tuning demonstrate concrete cost-quality tradeoffs on a running sentiment-classification scenario. The final section broadens the toolkit to GPT alternatives (Anthropic, Cohere, Google, Hugging Face), the LangChain agent framework, and LlamaIndex for multimodal retrieval—giving readers everything they need to design sophisticated, maintainable AI pipelines. Whether you are a software developer, data scientist, or curious hobbyist, this book turns the magic of LLMs into systematic, replicable engineering practice.
The four lenses
- Science
- Statistics
- Systems
- Strategy
Tags
The model
A causal model describing how design levers applied by a practitioner—prompt design, model selection, parameter configuration, fine-tuning, and architectural choices—drive psychological and behavioral outcomes (trust, adoption, iteration speed) and ultimately determine the cost, quality, and scalability of LLM-powered data analysis pipelines.
Prompt Qualitydesign lever
The degree to which a prompt or prompt template clearly specifies the task description, output format, relevant context, and (optionally) few-shot examples, enabling the language model to produce accurate and parseable output without ambiguity.
Model Selectiondesign lever
The practitioner's choice of which language model to use for a given task, including provider (OpenAI, Anthropic, Cohere, Google, Hugging Face), model size, and specialization level (general-purpose versus task-specific), which jointly determine capability and per-token cost.
Parameter Configurationdesign lever
The set of API parameters used when invoking a language model, including temperature, max_tokens, logit_bias, stop sequences, presence_penalty, frequency_penalty, and n, which collectively shape output determinism, length, token distribution, and cost per call.
Fine-Tuningdesign lever
The process of continuing to train an existing base language model on a small set of task-specific labeled examples, producing a specialized model variant that can solve the target task with shorter prompts and potentially higher accuracy than the unmodified base model.
Architectural Choicedesign lever
The high-level design decision about how the language model is integrated into the data analysis pipeline—direct prompting on raw data, query translation to a specialized tool, agent-based orchestration, or index-based retrieval (RAG)—which determines scalability, cost structure, and capability boundaries.
Data Type and Complexitycontextual condition
The nature and structural complexity of the data being analyzed, ranging from unstructured text through images, audio, and video to structured tables and graphs, which constrains which architectural patterns are feasible and determines baseline difficulty for the language model.
Token Consumptionoutcome metric
The total number of tokens read and generated by language model API calls during a data analysis task, which is the primary driver of monetary processing cost and also reflects prompt efficiency and output verbosity.
Output Format Compliancebehavioral pattern
The proportion of language model responses that conform to the format specified in the prompt—e.g., returning exactly 'pos' or 'neg' rather than verbose alternatives—enabling reliable downstream parsing and aggregation without additional postprocessing failures.
Classification / Extraction Accuracyoutcome metric
The proportion of data items for which the language model pipeline produces the correct label, extracted value, or query result, measured against a ground-truth annotation, serving as the primary quality metric for supervised text analysis and query translation tasks.
Hallucination Riskoutcome metric
The probability that the language model generates factually incorrect, fabricated, or contextually unsupported content in its output, which undermines reliability and necessitates human verification before results are acted upon.
Pipeline Scalabilityoutcome metric
The ability of the data analysis pipeline to process large volumes of data items efficiently and cost-effectively, determined by automation level, token efficiency, use of batch processing, and reliance on external specialized tools rather than per-item large-model calls.
Developer Trust in LLM Outputpsychological state
The practitioner's calibrated confidence in the correctness and reliability of language model outputs, reflecting their understanding of hallucination risks, output variability, and the necessity of validation steps before acting on model-generated content.
Few-Shot Example Integrationdesign lever
The inclusion of one or more correctly solved task examples within the prompt to demonstrate the expected input-output mapping to the language model, reducing ambiguity about task semantics and output format and often improving accuracy compared to zero-shot prompting.
Embedding Qualitypsychological state
The degree to which the vector representations produced by an embedding model capture the semantic similarity between text documents, enabling effective clustering, retrieval, and outlier detection when comparing vectors via distance metrics.
Tool Description Qualitydesign lever
The precision, completeness, and clarity of the natural-language documentation and typed signatures provided for agent tools, which determines how accurately the language model agent can select appropriate tools and supply correct input parameter values during autonomous orchestration.
Agent Task Successoutcome metric
The proportion of complex multi-step data analysis tasks that an LLM-based agent completes correctly by autonomously selecting and sequencing tool invocations, reflecting the combined effectiveness of the underlying language model, tool descriptions, and agent prompt design.
How they connect
- prompt quality → predicts output format compliance
- prompt quality → predicts classification accuracy
- few shot examples → influences prompt quality
- few shot examples → predicts token consumption
- model selection → predicts classification accuracy
- model selection → influences token consumption
- parameter configuration → predicts output format compliance
- parameter configuration − predicts token consumption
- fine tuning − predicts token consumption
- fine tuning → predicts classification accuracy
- architectural choice → predicts pipeline scalability
- architectural choice − predicts token consumption
- data type complexity → moderates architectural choice
- token consumption − predicts pipeline scalability
- output format compliance → predicts classification accuracy
- embedding quality → predicts pipeline scalability
- tool description quality → predicts agent task success
- model selection → predicts agent task success
- developer trust → influences pipeline scalability
- hallucination risk − predicts developer trust
- prompt quality − predicts hallucination risk
The story
The reader A software developer, data scientist, or technically minded analyst who wants to harness LLMs to automate and scale data analysis across diverse data types—text, tables, images, audio, and graphs—but lacks a structured, end-to-end guide to do so in production-quality Python.
External problem
They have heterogeneous data in multiple formats that is too large and varied to analyze manually, and they lack a unified, cost-controlled programmatic approach to applying LLMs to it.
Internal problem
They feel overwhelmed by the pace of LLM development, unsure which models and frameworks to trust, and anxious about incurring runaway API costs without commensurate quality improvements.
Philosophical problem
It is wrong that powerful AI capabilities remain siloed behind fragmented tutorials and expensive trial-and-error, when a systematic engineering approach could put them within reach of any competent developer.
The plan
- Understand what LLMs can do and how prompting works (Chapters 1–2).
- Install and configure the OpenAI Python library and learn its core parameters (Chapter 3).
- Build mini-projects for text classification, extraction, and clustering (Chapter 4).
- Build natural-language query interfaces for relational and graph databases (Chapter 5).
- Analyze images and videos with multimodal GPT-4o (Chapter 6).
- Transcribe, translate, and generate audio with Whisper and TTS models (Chapter 7).
- Evaluate GPT alternatives (Anthropic, Cohere, Google, Hugging Face) for your task (Chapter 8).
- Optimize cost and quality through model selection, parameter tuning, prompt engineering, and fine-tuning (Chapter 9).
- Build complex agent-based pipelines and multimodal retrieval systems with LangChain and LlamaIndex (Chapter 10).
Success
- The reader can rapidly build end-to-end data analysis pipelines for text, images, audio, and structured data using LLMs with just a few dozen lines of Python.
- The reader consistently selects the right model and configuration for each task, controlling costs to a fraction of naive implementations.
- The reader creates agents that autonomously orchestrate multiple data sources to answer complex analytical questions no single tool could resolve.
- The reader confidently evaluates output quality and catches hallucinations before they propagate into decisions.
At stake
- Without this knowledge, practitioners overpay for large models on tasks small models could handle, burning budget before reaching production scale.
- They miss the enormous productivity gains LLMs offer for unstructured data, continuing to process text, images, and audio manually or with brittle rule-based systems.
- They build fragile, hard-coded pipelines that break when data formats change, instead of flexible agent-based systems that adapt dynamically.
- They expose sensitive data unnecessarily or execute unvalidated model-generated queries, creating security and data-integrity risks.
Related in the library