AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

pretokenizer Regex issues?

#278 opened over 1 year ago by hpcpony

Test PR

#286 opened 6 days ago by FIRSTACCOUNT69

Test discussion

#287 opened 6 days ago by FIRSTACCOUNT69

Test discussion

#288 opened 6 days ago by FIRSTACCOUNT69
albertvillanova posted an update 8 days ago
🚀 TRL v0.29.0 introduces trl-training: an agent-native training skill.

This makes the TRL CLI a structured, agent-readable capability, allowing AI agents to reliably execute training workflows such as:
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- Group Relative Policy Optimization (GRPO)

We’re excited to see what the community builds on top of this.

If you’re working on AI agents, alignment research, or scalable RL training infrastructure: give TRL v0.29.0 a try! 🤗

The future of ML tooling is agent-native.
🔗 https://github.com/huggingface/trl/releases/tag/v0.29.0
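As a rough sketch of the workflows listed above: the TRL CLI exposes each training method as a subcommand. A minimal supervised fine-tuning invocation might look like the following — the model and dataset names here are illustrative placeholders, not part of the release notes:

```shell
# Hypothetical SFT run via the TRL CLI (model/dataset names are placeholders).
# DPO and GRPO follow the same pattern with the `trl dpo` / `trl grpo` subcommands.
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/Capybara \
  --output_dir ./sft-output
```

Because each workflow is a single structured command with named flags, an agent can compose and execute these runs programmatically rather than driving an interactive UI.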
albertvillanova posted an update 23 days ago
5 years already working on democratizing AI 🤗
Grateful to be part of such an awesome team making it happen every day.
monsoon-nlp posted an update 3 months ago

Bloom

#2 opened 4 months ago by Raz-Test
BramVanroy posted an update 5 months ago
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
  • 1 reply
giadap posted an update 5 months ago
🌎 AI ethics and sustainability are two sides of the same coin.

In our new blog post with Dr. Sasha Luccioni, we argue that separating them (as is too often the case) means missing the bigger picture of how AI systems impact both people and the planet.

Ethical and sustainable AI development can’t be pursued in isolation. The same choices that affect who benefits or is harmed by AI systems also determine how much energy and resources they consume.

We explore how two key concepts, evaluation and transparency, can serve as bridges between these domains:

📊 Evaluation, by moving beyond accuracy or performance metrics to include environmental and social costs, as we’ve done with tools like the AI Energy Score.

🔍 Transparency, by enabling reproducibility, accountability, and environmental reporting through open tools like the Environmental Transparency Space.

AI systems mirror our priorities. If we separate ethics from sustainability, we risk building technologies that are efficient but unjust, or fair but unsustainable.

Read our blog post here: https://huggingface.co/blog/sasha/ethics-sustainability

AIEnergyScore/Leaderboard
sasha/environmental-transparency
  • 1 reply
giadap posted an update 5 months ago
One of the hardest challenges in AI safety is finding the right balance: how do we protect people from harm without undermining their agency? This tension is especially visible in conversational systems, where safeguards can sometimes feel more paternalistic than supportive.

In my latest piece for Hugging Face, I argue that open source and community-driven approaches offer a promising (though not exclusive) way forward.

✨ Transparency can make safety mechanisms into learning opportunities.
✨ Collaboration with diverse communities makes safeguards more relevant across contexts.
✨ Iteration in the open lets protections evolve rather than freeze into rigid, one-size-fits-all rules.

Of course, this isn’t a silver bullet. Top-down safety measures will still be necessary in some cases. But if we only rely on corporate control, we risk building systems that are safe at the expense of trust and autonomy.

Read the blog post here: https://huggingface.co/blog/giadap/preserving-agency
monsoon-nlp posted an update 5 months ago
Bio LLMs train on many genomes, but can we encode differences within a species? TomatoTomato adds pangenome tokens to represent a domestic tomato and a wild tomato in one sequence 🍅 🧬
monsoon-nlp/tomatotomato-gLM2-150M-v0.1
giadap posted an update 6 months ago
I've noticed something. While we're careful about what we post on social media, we're sharing our deepest and most intimate thoughts with AI chatbots -- health concerns, financial worries, relationship issues, business ideas...

With OpenAI hinting at ChatGPT advertising, this matters more than ever. Unlike banner ads, AI advertising happens within the conversation itself. Sponsors could subtly influence that relationship advice or financial guidance.

The good news? We have options.
🤝 Open source AI models let us keep conversations private, avoid surveillance-based business models, and build systems that actually serve users first.

Read more about it in our latest blog post, co-written with @frimelle:
https://huggingface.co/blog/giadap/privacy-conversational-ai