Nov 5, 2025

Weekly Data News - 45

This Week in Data Tools: Open Source Revolution and Web Data Evolution

Welcome back to another week of insights from the world of data infrastructure and developer tools. This week, we're covering two major announcements that could reshape how teams work with notebooks and web data. Let's dive in.

Deepnote Goes Open Source: Rethinking the Notebook Experience

After seven years of development and serving over 500,000 data professionals (including 21% of Fortune 500 companies), Deepnote announced they're open-sourcing their platform. This is a significant moment for the data science community.

Why This Matters Now

The traditional Jupyter notebook has been the standard for years, but the cracks are showing. Deepnote's announcement comes at a time when job postings requiring Jupyter knowledge are declining, and the core Jupyter repositories are seeing reduced activity. The market seems ready for something new. What makes this announcement interesting is the timing. We're entering an era where notebooks need to do more than just run code in isolation. They need to support collaboration, integrate with AI agents, and work seamlessly across technical and non-technical team members.

The Hidden Cost Problem

One of the most compelling parts of Deepnote's announcement is their breakdown of total cost of ownership (TCO) for self-hosted Jupyter deployments. Many teams built internal data platforms on JupyterHub between 2019 and 2023, and those decisions are now creating unexpected overhead.

The costs show up in multiple ways:

Platform maintenance: Engineering teams spend significant cycles maintaining JupyterHub infrastructure, managing kernels, dealing with authentication issues, and handling extensions. For a 50-person data organization, this can mean roughly 3 full-time employees focused solely on platform work.

Productivity bottlenecks: Teams using pre-AI tooling can't easily add modern capabilities. The collaboration infrastructure wasn't built with AI agents in mind, so retrofitting becomes complex and fragile.

Compute waste: Without reactive execution, every notebook change often means re-running entire pipelines. Idle kernels continue consuming resources.

These inefficiencies compound over time. According to Deepnote's real-world example from a fintech company, switching to their reactive execution model resulted in 35-45% fewer re-runs and around $180,000 per year in infrastructure savings.
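
The cost categories above can be combined into a back-of-envelope TCO estimate. The sketch below is a hypothetical calculation, not Deepnote's methodology: the fully loaded FTE cost and annual compute spend are placeholder figures you should replace with your own numbers; the FTE count and re-run waste fraction echo the 3-FTE and 35-45% figures cited in the article.

```python
# Back-of-envelope TCO for a self-hosted JupyterHub deployment.
# All dollar figures are hypothetical placeholders -- substitute your own.

def jupyterhub_tco(platform_ftes: int = 3,
                   fully_loaded_fte_cost: float = 180_000,
                   annual_compute_spend: float = 400_000,
                   rerun_waste_fraction: float = 0.40) -> dict:
    """Estimate annual ownership cost beyond licensing.

    rerun_waste_fraction reflects the 35-45% of runs that reactive
    execution could reportedly avoid in the fintech example.
    """
    maintenance = platform_ftes * fully_loaded_fte_cost
    compute_waste = annual_compute_spend * rerun_waste_fraction
    return {
        "maintenance": maintenance,
        "compute_waste": compute_waste,
        "total": maintenance + compute_waste,
    }

costs = jupyterhub_tco()
print(costs)  # maintenance alone dominates at these placeholder rates
```

Even with conservative placeholder inputs, the platform-maintenance line item tends to dwarf licensing, which is the article's central point.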


What's Actually Different

Deepnote's open source release centers on a new notebook format designed for modern workflows. Instead of Jupyter's JSON structure, they're using human-readable text files that work better with version control and code review processes.
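To see why a line-oriented text format diffs better than Jupyter's JSON, consider the sketch below. It flattens .ipynb cells into fenced plain-text blocks; this is a generic illustration of the idea, not Deepnote's actual format specification, and the cell-marker syntax is invented for the example.

```python
# Why plain-text notebook formats diff better: flatten Jupyter's JSON
# into line-oriented text so Git diffs show changed lines, not JSON
# punctuation churn. Generic illustration, not Deepnote's real format.
import json

def ipynb_to_text(ipynb_json: str) -> str:
    """Render each cell as a marker line followed by its raw source."""
    nb = json.loads(ipynb_json)
    parts = []
    for cell in nb.get("cells", []):
        source = "".join(cell.get("source", []))
        parts.append(f"# %% [{cell['cell_type']}]\n{source}")
    return "\n\n".join(parts)

notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis\n"]},
        {"cell_type": "code", "source": ["x = 1\n", "print(x)\n"]},
    ]
})
print(ipynb_to_text(notebook))
```

Editing one line of code in the text form produces a one-line diff; the same edit in raw .ipynb JSON touches quoted, comma-separated source arrays and often execution counts and outputs as well.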

Key features include:

Reactive execution: Change a parameter and downstream blocks update automatically, similar to how spreadsheets work. No more "Run All" and hoping everything works.

Native collaboration: Built-in versioning, comments, and review capabilities that actually produce readable diffs in Git.

AI-ready structure: The format is designed so AI agents can understand dependencies and safely make changes without breaking existing work.

Block diversity: Beyond code and markdown, the platform supports SQL queries, charts, interactive inputs, and app layouts as first-class citizens.

No lock-in: Projects can convert between Deepnote format and standard .ipynb files. You can also use the format in VS Code, Cursor, or JupyterLab through their extensions.

The project is released under the Apache 2.0 license. You can check out the main repository, the VS Code extension, or the JupyterLab integration.
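
The reactive execution model described above can be sketched in a few lines: blocks declare their inputs, and changing a value re-runs only the blocks that transitively depend on it. This is a minimal illustration of the spreadsheet-style idea, not Deepnote's implementation.

```python
# Minimal sketch of reactive execution: blocks declare their inputs,
# and changing one value re-runs only the downstream blocks.
# Illustration of the concept, not Deepnote's actual engine.

class ReactiveNotebook:
    def __init__(self):
        self.values = {}
        self.blocks = {}  # name -> (fn, input names), in definition order

    def block(self, name, fn, inputs=()):
        self.blocks[name] = (fn, tuple(inputs))
        self._run(name)

    def set(self, name, value):
        self.values[name] = value
        # Re-run every block that (transitively) depends on this value.
        for blk in self.blocks:
            if self._depends_on(blk, name):
                self._run(blk)

    def _depends_on(self, blk, name):
        _, inputs = self.blocks[blk]
        return name in inputs or any(
            i in self.blocks and self._depends_on(i, name) for i in inputs)

    def _run(self, name):
        fn, inputs = self.blocks[name]
        self.values[name] = fn(*(self.values[i] for i in inputs))

nb = ReactiveNotebook()
nb.set("threshold", 10)
nb.block("filtered", lambda t: [x for x in range(20) if x > t], ["threshold"])
nb.block("count", len, ["filtered"])
print(nb.values["count"])  # 9 -- values 11..19
nb.set("threshold", 15)    # downstream blocks update automatically
print(nb.values["count"])  # 4 -- values 16..19
```

Note that nothing downstream needs a manual "Run All": updating the parameter triggers exactly the recomputation the dependency graph requires, which is where the 35-45% re-run savings cited earlier come from.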


Firecrawl v2.5: Reimagining Web Data Extraction

On the web scraping front, Firecrawl released version 2.5 with two major infrastructure improvements that address longstanding data quality challenges.

Building a Custom Browser Stack

Firecrawl took the bold approach of building their entire browser stack from scratch. This wasn't just an optimization; it was a fundamental reimagining of how web data extraction should work.

Their custom browser automatically detects how each page renders and adapts accordingly. It handles PDFs, paginated tables, and dynamic JavaScript applications, then converts everything into clean, structured formats that work well with AI systems.

The key advantage is consistency. By controlling the entire stack, Firecrawl can index complete pages rather than partial content. Their quality benchmarks show substantial improvements over competitors in extracting accurate, comprehensive data.


The Semantic Index Advantage

Perhaps the most innovative feature in v2.5 is the semantic index. This system already handles 40% of all API calls and fundamentally changes how web data retrieval works.

The index stores full page snapshots, embeddings, and structural metadata. This creates an interesting capability: you can request data "as of now" or "as of last known good copy." Essentially, you get access to both current and historical states of web pages.

Developers control this through the maxAge parameter, letting them choose between speed (using cached data) and freshness (requesting new captures). This flexibility is particularly valuable for applications that need reliable data but don't always need real-time information. The semantic index also improves coverage significantly, as shown in their benchmarks comparing retrieval success rates across different websites.
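
The speed-versus-freshness trade-off might look like the sketch below. The endpoint path, field names, and maxAge units are assumptions based on Firecrawl's v2 API docs at the time of writing; verify them against the current documentation before use (only the request body is built here, so no API key or network call is needed).

```python
# Sketch of trading freshness for speed via Firecrawl's maxAge parameter.
# Endpoint, field names, and millisecond units are assumptions -- check
# Firecrawl's current documentation before relying on them.
import json

API_URL = "https://api.firecrawl.dev/v2/scrape"  # assumed endpoint

def build_scrape_request(url: str, max_age_ms: int) -> dict:
    """Build the request body: max_age_ms=0 forces a fresh capture,
    while a large value accepts any cached snapshot up to that age."""
    return {"url": url, "maxAge": max_age_ms}

# Fast path: accept cached data up to one hour old.
cached = build_scrape_request("https://example.com", 3_600_000)

# Fresh path: force a brand-new capture.
fresh = build_scrape_request("https://example.com", 0)

print(json.dumps(cached))
```

The useful pattern is making the staleness budget an explicit, per-request decision rather than a global cache policy.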


The Bigger Vision

Firecrawl positions v2.5 as part of a larger mission: building a new programmatic layer for the internet. They want web data access to be as simple and reliable as calling any standard API, especially for AI agents and modern applications.

They're also committing to transparency. In the coming weeks, they plan to open source their web data retrieval benchmarks, allowing the developer community to validate their claims and build on their work.

The best part? v2.5 is available now for all users with no code changes required. You can experiment in their playground or check the documentation to get started.


What This Means for Your Team

Both announcements share a common theme: infrastructure decisions made 3-5 years ago need reevaluation. The requirements have changed. AI integration isn't optional anymore, collaboration is critical, and the total cost of ownership extends far beyond licensing fees.

If your team is running self-hosted Jupyter deployments, it's worth calculating the real TCO and comparing it to modern alternatives. The engineering time alone might justify a change.

If you're building applications that depend on web data, the improvements in extraction quality and the semantic index approach could simplify your architecture significantly.

The shift toward open source in both cases reduces lock-in risk while providing transparency into how these tools actually work. That's valuable whether you adopt them directly or just learn from their approaches.


Looking Ahead

We're in a transition period for data infrastructure. The tools that served us well for the past decade weren't designed for AI agents, real-time collaboration, or the scale many teams now need. These announcements represent serious attempts to solve those problems, not just incremental updates.

Worth watching: how the community responds to Deepnote's format specification and whether it gains traction as a standard. Also interesting will be Firecrawl's open source benchmarks and what they reveal about web data extraction quality across the industry.

Have thoughts on these announcements? Working on similar problems at your company? I'd love to hear about it.

Until next week,

Your friendly neighborhood data tools observer

ready to build with data?

Partner with AEDI to turn information into impact. Whether you're designing new systems, solving complex challenges, or shaping the next frontier of human potential—our team is here to help you move from insight to execution.

From idea to impact.

Consulting that translates innovation into outcomes.
