Inference Providers documentation
NeMo Data Designer
NeMo Data Designer
DataDesigner is NVIDIA NeMo’s framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets while maintaining control over field relationships and data quality.
Overview
DataDesigner supports OpenAI-compatible endpoints, making it easy to use any model available through Hugging Face Inference Providers for synthetic data generation.
Prerequisites
- DataDesigner installed (
pip install data-designer) - A Hugging Face account with API token (needs “Make calls to Inference Providers” permission)
Configuration
1. Set your HF token
export HF_TOKEN="hf_your_token_here"2. Configure HF as a provider
from data_designer.essentials import (
CategorySamplerParams,
DataDesigner,
DataDesignerConfigBuilder,
LLMTextColumnConfig,
ModelConfig,
ModelProvider,
SamplerColumnConfig,
SamplerType,
)
# Define HF Inference Provider (OpenAI-compatible)
hf_provider = ModelProvider(
name="huggingface",
endpoint="https://router.huggingface.co/v1",
provider_type="openai",
api_key="HF_TOKEN", # Reads from environment variable
)
# Define a model available via HF Inference Providers
hf_model = ModelConfig(
alias="hf-gpt-oss",
model="openai/gpt-oss-120b",
provider="huggingface",
)
# Create DataDesigner with HF provider
data_designer = DataDesigner(model_providers=[hf_provider])
config_builder = DataDesignerConfigBuilder(model_configs=[hf_model])3. Generate synthetic data
# Add a sampler column
config_builder.add_column(
SamplerColumnConfig(
name="category",
sampler_type=SamplerType.CATEGORY,
params=CategorySamplerParams(
values=["Electronics", "Books", "Clothing"],
),
)
)
# Add an LLM-generated column
config_builder.add_column(
LLMTextColumnConfig(
name="product_name",
model_alias="hf-gpt-oss",
prompt="Generate a creative product name for a {{ category }} item.",
)
)
# Preview the generated data
preview = data_designer.preview(config_builder=config_builder, num_records=5)
preview.display_sample_record()
# Access the DataFrame
df = preview.dataset
print(df)Using Different Models
You can use any model available through Inference Providers. Simply update the model field:
# Use a different model
hf_model = ModelConfig(
alias="hf-olmo",
model="allenai/OLMo-3-7B-Instruct",
provider="huggingface",
)Push to Hub
Once you’ve generated your dataset, push it directly to the Hugging Face Hub:
# Generate full dataset
results = data_designer.create(config_builder, num_records=1000, dataset_name="my-dataset")
# Push to Hub
url = results.push_to_hub(
repo_id="username/my-synthetic-dataset",
private=False,
)For loading Hub datasets as seed data and more details on the Hub integration, see the Data Designer Hub datasets guide.
Resources
- DataDesigner Documentation
- GitHub Repository
- Available Models on Inference Providers
- Using Data Designer with Hub Datasets