Databricks Dolly

Databricks

An instruction-tuned large language model (LLM) developed by Databricks, known for its commercial-use license and open model weights.

What is Databricks Dolly?

Dolly (Dolly 2.0 and later) is an instruction-following large language model developed by Databricks, built on the EleutherAI Pythia model family. It is distinguished by its **truly open-source nature**: the model weights and the high-quality, human-generated instruction dataset (databricks-dolly-15k) are both licensed for commercial use. This allows organizations to own, customize, and deploy the model entirely within their private infrastructure, eliminating reliance on vendor APIs and the data-leakage risks that come with them.

Key Features & Capabilities

  • Open & Commercial Use: No licensing restrictions for commercial applications, enabling complete model ownership.
  • Instruction Tuning: Fine-tuned on a high-quality, human-generated instruction set for superior command following.
  • Data Centric: Designed to be fine-tuned and specialized on private, proprietary datasets within the Databricks Lakehouse Platform.
  • Code Generation: Excels at generating and analyzing code, particularly PySpark and SQL for data engineering tasks.

How to Deploy and Use Dolly

Dolly is not offered as a hosted public API; instead, it is deployed as a model on your own private infrastructure:

  1. Deployment: Download the Dolly model weights from Hugging Face Hub or access them directly via the Databricks Lakehouse Platform.
  2. Compute Setup: Deploy the model onto a dedicated compute cluster (e.g., a GPU-enabled Databricks cluster).
  3. RAG Implementation: For Q&A use cases, ingest and clean proprietary Q&A data, transform it into embeddings, and index it in a vector database.
  4. Inference: Use **LangChain** or custom Python code within a Databricks notebook to fetch context from the vector database and construct a prompt for Dolly (Retrieval-Augmented Generation).
  5. Fine-Tuning (Optional): If required, fine-tune the base Dolly model on specialized private datasets to customize its output style and knowledge domain.
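Steps 3 and 4 above can be sketched in plain Python. This is a minimal, illustrative example only: the `embed`, `cosine`, and `retrieve` helpers are toy stand-ins (a real deployment would use a sentence-embedding model and a vector database), and the instruction/response prompt layout follows the style the Dolly models were trained on.

```python
from collections import Counter
import math

# Toy "indexed" documents standing in for proprietary Q&A data.
documents = [
    "Dolly 2.0 is licensed for commercial use.",
    "The Lakehouse Platform unifies data warehousing and ML.",
]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; illustrative only.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str) -> str:
    # Fetch the most similar document as context (step 3).
    return max(documents, key=lambda d: cosine(embed(question), embed(d)))

def build_prompt(question: str) -> str:
    # Assemble the retrieved context into Dolly's instruction-style
    # prompt (step 4); the exact wording here is an assumption.
    context = retrieve(question)
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\nUse this context: {context}\n"
        f"Question: {question}\n\n### Response:\n"
    )

prompt = build_prompt("Can Dolly be used commercially?")
```

In a production notebook, `prompt` would then be passed to the deployed Dolly model (for example, through a LangChain LLM wrapper) rather than printed.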
Use Cases
Generate high-quality, production-ready PySpark and SQL code for ETL pipelines on private data.

Databricks Dolly specializes in generating high-quality, idiomatic code for data engineering tasks. Instead of manually writing boilerplate PySpark for complex joins and aggregations, users describe the desired ETL logic in natural language. This accelerates pipeline development significantly, allowing engineering teams to focus on architecture while the AI produces optimized, production-ready code.
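Generated code should still be reviewed before it reaches a pipeline. A cheap first gate, sketched below, is checking that the model's output at least parses as valid Python (PySpark code is plain Python syntax); the `generated` snippet is a hypothetical example of the kind of join-and-aggregate code Dolly might produce, not actual model output.

```python
import ast

# Hypothetical PySpark-style snippet of the sort Dolly might generate
# for "join orders to customers and compute revenue per region".
generated = '''
from pyspark.sql import functions as F

result = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_revenue"))
)
'''

def is_valid_python(src: str) -> bool:
    """Cheap sanity check: generated code must at least parse."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False
```

A syntax check like this catches truncated or malformed generations early; semantic correctness still requires tests and human review.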

Automatically summarize and categorize unstructured clinical notes for electronic health records (EHR) population analysis.

Dolly is capable of advanced text processing for healthcare data. It can be fine-tuned on private clinical datasets to recognize key findings and treatment plans within unstructured clinician notes. By converting messy text into structured data points such as "primary diagnosis" and "treatment prescribed," it drastically improves the speed and accuracy of population health reporting and automated billing-code assignment, enhancing resource allocation in hospital networks.
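The "messy text to structured data" step implies validating whatever the model returns before it enters a reporting pipeline. The sketch below assumes a fine-tuned model that has been prompted to answer in JSON; the response text and field names are hypothetical illustrations.

```python
import json

# Hypothetical output from a fine-tuned Dolly asked to return
# structured JSON for a clinical note (illustrative, not real data).
model_response = '''
{"primary_diagnosis": "type 2 diabetes",
 "treatment_prescribed": "metformin 500mg"}
'''

def parse_structured_note(response: str) -> dict:
    """Parse the model's JSON and verify the expected fields exist."""
    record = json.loads(response)
    required = {"primary_diagnosis", "treatment_prescribed"}
    missing = required - record.keys()
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return record

record = parse_structured_note(model_response)
```

Rejecting incomplete records at this boundary keeps malformed generations out of downstream population-health reports.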

Highlights
  • Full Control & Ownership: Organizations own the model and data, eliminating third-party API dependencies and data risk.
  • Customization: Optimized for easy fine-tuning and specializing on unique, proprietary data.
  • Cost-Effective at Scale: Eliminates per-token API costs for large-scale, internal deployment.
Things to know
  • Infrastructure Overhead: Requires managing dedicated compute resources (GPUs/TPUs) for hosting and inference.
  • Ongoing Maintenance: Requires internal MLOps teams to monitor, update, and govern the model's performance and data lineage.
AiGanak Analysis

This tool is specifically for data-sensitive organizations that require an open-source, instruction-tuned LLM they can own and host privately. It provides a more cost-effective and secure alternative for large-scale internal deployments compared to proprietary models like GPT-4.

Databricks Dolly Alternatives & Competitors

Databricks Dolly

Description: An instruction-tuned large language model (LLM) developed by Databricks, known for its commercial-use license and open model weights.

Pros:
  • Full Control & Ownership: Organizations own the model and data, eliminating third-party API dependencies and data risk.
  • Customization: Optimized for easy fine-tuning and specializing on unique, proprietary data.
  • Cost-Effective at Scale: Eliminates per-token API costs for large-scale, internal deployment.

Things to Know:
  • Infrastructure Overhead: Requires managing dedicated compute resources (GPUs/TPUs) for hosting and inference.
  • Ongoing Maintenance: Requires internal MLOps teams to monitor, update, and govern the model's performance and data lineage.

C3.ai

Description: A suite of Generative AI applications and tools designed to solve high-value, complex, data-rich enterprise problems in industrial sectors.

Pros:
  • High Value-Add: Focused on complex industrial and business challenges with high economic impact (e.g., reduced unplanned downtime).
  • Traceability & Security: Built-in RAG architecture ensures deterministic, traceable, and secure responses against enterprise data.
  • Application Suite: Provides off-the-shelf, proven AI applications for rapid deployment across various industries.

Things to Know:
  • Enterprise Only: Pricing model and complexity are tailored exclusively for large enterprise customers, making it cost-prohibitive for SMBs.
  • Longer Implementation: Initial deployment and data integration can be a significant project requiring specialized expertise.

Google Gemini

Description: A family of powerful, multimodal foundation models that handles text, image, video, and audio to build advanced applications.

Pros:
  • Truly Multimodal: Native handling of text, code, image, audio, and video inputs in a single model.
  • Enterprise Governance: Strong security, privacy, and control when deployed through Google Cloud's Vertex AI.
  • Powerful Integrations: Deeply integrated with Google Workspace and Cloud ecosystem tools.

Things to Know:
  • Token Costs: Pricing can be complex and expensive for high-volume or extremely long context window usage.
  • API Latency: The largest, most capable models (Ultra/Pro) may introduce higher latency for real-time applications.
