Databricks Dolly

Databricks

An instruction-tuned large language model (LLM) developed by Databricks, known for its commercial-use license and open model weights.

Code & Software Generation
Foundation & Enterprise LLM
Content Generation
Research & Analysis

What is Databricks Dolly?

Dolly (Dolly 2.0 and later) is an instruction-following large language model developed by Databricks, built on the EleutherAI Pythia model family. It is distinguished by its **truly open-source nature**: the model weights and the high-quality, human-generated instruction dataset (databricks-dolly-15k) are all licensed for commercial use. This allows organizations to own, customize, and deploy the model entirely within their private infrastructure, removing reliance on vendor APIs and reducing the risk of data leakage.

Key Features & Capabilities

  • Open & Commercial Use: No licensing restrictions for commercial applications, enabling complete model ownership.
  • Instruction Tuning: Fine-tuned on a high-quality, human-generated instruction set for reliable instruction following.
  • Data Centric: Designed to be fine-tuned and specialized on private, proprietary datasets within the Databricks Lakehouse Platform.
  • Code Generation: Excels at generating and analyzing code, particularly PySpark and SQL for data engineering tasks.

How to Deploy and Use Dolly

Dolly is not accessed through a hosted public API; it is a model you deploy on your own private infrastructure:

  1. Deployment: Download the Dolly model weights from Hugging Face Hub or access them directly via the Databricks Lakehouse Platform.
  2. Compute Setup: Deploy the model onto a dedicated compute cluster (e.g., a GPU-enabled Databricks cluster).
  3. RAG Implementation: For Q&A use cases, ingest and clean proprietary Q&A data, transform it into embeddings, and index it in a vector database.
  4. Inference: Use **LangChain** or custom Python code within a Databricks notebook to fetch context from the vector database and craft a prompt for Dolly (Retrieval-Augmented Generation); a minimal sketch follows this list.
  5. Fine-Tuning (Optional): If required, fine-tune the base Dolly model on specialized private datasets to customize its output style and knowledge domain.
  5. Fine-Tuning (Optional): If required, fine-tune the base Dolly model on specialized private datasets to customize its output style and knowledge domain.
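
The snippet below is a minimal sketch of steps 1, 2, and 4, assuming the publicly released `databricks/dolly-v2-3b` checkpoint on the Hugging Face Hub and the `transformers` pipeline interface documented on its model card; the `retrieve_context` helper is a hypothetical placeholder for whatever vector database lookup you actually use.

```python
# Minimal sketch: load a Dolly checkpoint and run retrieval-augmented inference
# in a GPU-backed Databricks notebook.
# Assumptions: the databricks/dolly-v2-3b weights on the Hugging Face Hub, the
# transformers pipeline interface described on the model card, and a
# hypothetical retrieve_context() placeholder standing in for a vector store.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-3b",  # or dolly-v2-7b / dolly-v2-12b with more GPU memory
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,          # Dolly ships a custom instruction-following pipeline
    device_map="auto",
)

def retrieve_context(question: str, k: int = 3) -> str:
    """Hypothetical placeholder: embed the question, query the vector database,
    and return the top-k document chunks joined into one string."""
    return "(retrieved context goes here)"  # replace with a real similarity search

def answer(question: str) -> str:
    # Retrieval-Augmented Generation: ground the prompt in retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{retrieve_context(question)}\n\n"
        f"Question: {question}"
    )
    return generate_text(prompt)[0]["generated_text"]
```

Calling `answer(...)` returns Dolly's grounded completion; the same `generate_text` pipeline is reused in the sketches under Use Cases below.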
Use Cases
Generate high-quality, production-ready PySpark and SQL code for ETL pipelines on private data.

Databricks Dolly specializes in generating high-quality, idiomatic code for data engineering tasks. Instead of manually writing boilerplate PySpark for complex joins and aggregations, users describe the desired ETL logic in natural language. This accelerates pipeline development significantly, allowing engineering teams to focus on architecture while the AI produces optimized, production-ready code.
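
As a sketch of this workflow, the prompt below asks Dolly for a PySpark transform described in plain language. It assumes the same pipeline setup as the deployment sketch above; the table and column names are purely illustrative.

```python
# Sketch: natural-language spec in, PySpark suggestion out.
# Assumes the databricks/dolly-v2-3b checkpoint as in the deployment sketch;
# the table and column names below are illustrative, not real datasets.
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")

prompt = (
    "Write PySpark code that reads the Delta table 'sales.orders', drops rows "
    "where 'order_status' is 'CANCELLED', joins to 'sales.customers' on "
    "'customer_id', aggregates daily revenue per region, and writes the result "
    "to the Delta table 'analytics.daily_revenue'."
)
suggestion = generate_text(prompt)[0]["generated_text"]
print(suggestion)  # review and test the generated code before promoting it to a pipeline
```

Generated code should still pass through normal code review and testing before it runs against production tables.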

Automatically summarize and categorize unstructured clinical notes for electronic health records (EHR) population analysis.

Dolly is capable of advanced text processing for healthcare data. It can be fine-tuned on private clinical datasets to recognize key findings and treatment plans within unstructured clinician notes. By converting messy text into structured data points like "primary diagnosis" and "treatment prescribed," it improves the speed and accuracy of population health reporting and automated billing code assignment, enhancing resource allocation in hospital networks.
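
A hedged sketch of this extraction flow is shown below. It assumes the same pipeline setup as the deployment sketch (ideally pointing at a Dolly variant fine-tuned on private clinical data); the field names and the sample note are invented for illustration.

```python
# Sketch: convert an unstructured clinical note into structured fields.
# Assumptions: the databricks/dolly-v2-3b checkpoint as in the deployment
# sketch (a clinically fine-tuned variant would normally be used); the field
# names and the sample note are illustrative only.
import json
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")

EXTRACTION_PROMPT = (
    "Read the clinical note below and respond with a JSON object containing "
    'exactly two keys: "primary_diagnosis" and "treatment_prescribed".\n\n'
    "Note:\n{note}"
)

def structure_note(note: str) -> dict:
    raw = generate_text(EXTRACTION_PROMPT.format(note=note))[0]["generated_text"]
    try:
        return json.loads(raw)           # model returned well-formed JSON
    except json.JSONDecodeError:
        return {"raw_output": raw}       # route malformed output to manual review

record = structure_note(
    "Patient presents with persistent cough and fever. Imaging consistent with "
    "community-acquired pneumonia. Started on amoxicillin 500 mg three times daily."
)
```

In a Lakehouse setting, a function like this could be applied in batches over a table of notes to populate structured EHR columns for downstream reporting.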

Highlights
  • Full Control & Ownership: Organizations own the model and data, eliminating third-party API dependencies and the associated data exposure risk.
  • Customization: Optimized for straightforward fine-tuning and specialization on unique, proprietary data.
  • Cost-Effective at Scale: Eliminates per-token API costs for large-scale, internal deployment.
Things to know
  • Infrastructure Overhead: Requires managing dedicated compute resources (GPUs/TPUs) for hosting and inference.
  • Ongoing Maintenance: Requires internal MLOps teams to monitor, update, and govern the model's performance and data lineage.