Gluedly

Product updates · Donatas

Dev Blog: Introducing Multi-Stage Workflow Scraping—Deep Data Extraction, Fully Automated

Stop stitching datasets together by hand. Our new Multi-Stage Workflow Scraping feature automatically connects product lists to detail pages, runs deep crawls in parallel, and merges the results into a single dataset once every detail job has settled.

Hey builders,

If you’ve ever had to run one scrape job for a catalog list, a second job for the individual product details, and then spend your afternoon playing spreadsheet gymnastics to merge them together—this major product release is for you.

Today, we are officially launching a foundational new capability on our platform: Multi-Stage Workflow Scraping (List + Detail).

You are no longer restricted to scraping single, isolated pages. The platform can now treat your extraction targets as a multi-stage workflow: it crawls a list view, follows each item link, extracts detail data in parallel, and merges everything into a single dataset.

Here is an inside look at how this new feature works under the hood and what it adds to your dashboard.

The Core Concept: Seamless List-to-Detail Automation

Previously, deep crawling required manual, multi-step configuration. With Workflows, the platform automatically connects high-level item listings to the detail pages beneath them.

[1. List Page Scraped] 
         │
         ▼ (Link XPath Isolation)
[2. Parallel Child Jobs Queued] ───► [Detail Job 1] [Detail Job 2] [Detail Job 3] ...
         │
         ▼ (Automatic Settlement)
[3. Single Data Snapshot Generated]

Here is exactly how a Workflow Scrape operates from start to finish:

  • Automatic Link Traversal: You can now define a Link XPath directly on your list page configuration. The scraper uses this to automatically isolate the unique URL for each row item.
  • Parallel Child Jobs: As soon as a list page finishes processing, the system queues one child scrape job per detail URL discovered. These run concurrently, using as many lanes as your account's concurrent lane limit allows.
  • The Unified Snapshot: You no longer have to spend hours merging external files. When all parallel detail jobs have settled, the system automatically merges your list fields and detail fields into a single, comprehensive data snapshot on the parent list page.
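The three stages above can be sketched in a few lines of Python. This is only an illustration of the list-to-detail pattern, not the platform's actual code: the markup, the `DETAIL_PAGES` lookup, and the function names are all hypothetical, and the standard library's limited XPath support stands in for a real scraper.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

# Hypothetical list page markup; a real page would be fetched over HTTP.
LIST_PAGE = """
<table>
  <tr><td class="name">Widget</td><td><a href="/p/1">view</a></td></tr>
  <tr><td class="name">Gadget</td><td><a href="/p/2">view</a></td></tr>
</table>
"""

# Hypothetical detail data keyed by URL, standing in for real detail pages.
DETAIL_PAGES = {"/p/1": {"price": "9.99"}, "/p/2": {"price": "19.50"}}


def scrape_list(markup, link_xpath):
    """Stage 1: extract the list fields and the detail URL for each row."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.findall(".//tr"):
        link = tr.find(link_xpath)  # the Link XPath isolates the row's URL
        name = tr.find("td[@class='name']").text
        rows.append({"name": name, "detail_url": link.get("href")})
    return rows


def scrape_detail(url):
    """Stage 2: one child job per detail URL (simulated lookup)."""
    return DETAIL_PAGES[url]


def run_workflow(markup, link_xpath, max_lanes=2):
    rows = scrape_list(markup, link_xpath)
    # Child jobs run concurrently, capped at max_lanes workers.
    with ThreadPoolExecutor(max_workers=max_lanes) as pool:
        details = list(pool.map(scrape_detail, [r["detail_url"] for r in rows]))
    # Stage 3: merge list fields and detail fields row by row.
    return [{**row, **detail} for row, detail in zip(rows, details)]


snapshot = run_workflow(LIST_PAGE, ".//a")
```

Each entry in `snapshot` carries both the list field (`name`) and the detail field (`price`), which is the shape of the unified snapshot described above.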

1. Complete Visibility: The New Workflow Tab & Lane Monitor

Because multi-stage scraping introduces new background behavior, we have added dashboard views that show what each run is doing and let you control it.

Elements Mapper Upgrade

You will find a brand new Workflow Tab inside the elements mapper settings. When creating or editing a page, you can toggle workflow functionality on, select your Link XPath visually, and map out your target detail fields ahead of time.

Real-Time Lane Throttling

To keep large workflows from exceeding your account limits, we’ve upgraded our Lane Monitor to enforce plan caps strictly in real time. The pages list now displays a workflow-aware status, including how many detail pages remain during an active run, while the traffic snapshot shows your current workflow backlog.
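If you want a mental model for lane capping, a semaphore is a reasonable one. The sketch below is an assumption about the general technique, not a description of how the Lane Monitor is built; `run_detail_jobs`, `lane_cap`, and the `state` counters are hypothetical names.

```python
import asyncio


async def run_detail_jobs(urls, lane_cap):
    """Run one child job per URL without ever exceeding lane_cap at once."""
    lanes = asyncio.Semaphore(lane_cap)
    # "remaining" mirrors the detail-pages-remaining figure on the pages list.
    state = {"active": 0, "peak": 0, "remaining": len(urls)}

    async def child_job(url):
        async with lanes:  # wait here until a lane is free
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
            await asyncio.sleep(0.01)  # stand-in for the real detail scrape
            state["active"] -= 1
            state["remaining"] -= 1

    await asyncio.gather(*(child_job(u) for u in urls))
    return state


state = asyncio.run(run_detail_jobs([f"/p/{i}" for i in range(10)], lane_cap=3))
```

With ten URLs and a cap of three, `state["peak"]` never rises above three, and `state["remaining"]` counts down to zero as jobs settle.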

2. Bulletproof Fault Tolerance: Single-Row Retries

Web scraping can be messy. A temporary network hiccup or a 504 gateway error on a single product page shouldn't ruin a 500-item crawl.

Because workflows break deep scraping down into individual child jobs, we are able to introduce incredibly granular error handling:

  • Isolated Failures: If a specific detail page fails to scrape, it won't crash your workflow. The platform simply flags that specific row with a workflow_detail_failed warning on your data view.
  • Targeted Retries: Right next to that warning, you will find a Retry button. Clicking it re-queues only that specific row’s detail job.
  • Credit & Data Preservation: You don't have to re-run the full workflow, meaning you don't burn extra credits on dispatching jobs that already succeeded. Once the retry completes successfully, the stored data is updated cleanly in place.
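Here is a minimal sketch of the single-row retry idea. Only the `workflow_detail_failed` warning name comes from the feature itself; the row layout, the `fetch` callable, and the helper names are assumptions made for illustration.

```python
def run_detail(row, fetch):
    """Attempt one detail job; record either the data or a warning on the row."""
    try:
        row.update(fetch(row["detail_url"]))
        row.pop("warning", None)  # clear any earlier failure flag
    except Exception:
        row["warning"] = "workflow_detail_failed"
    return row


def retry_failed_row(snapshot, index, fetch):
    """Re-run only the detail job for one flagged row, updating it in place."""
    if snapshot[index].get("warning") != "workflow_detail_failed":
        return False  # rows that already succeeded are never re-dispatched
    run_detail(snapshot[index], fetch)
    return "warning" not in snapshot[index]


calls = []
outage = {"/p/2"}  # hypothetical URLs that currently fail


def fetch(url):
    calls.append(url)
    if url in outage:
        raise TimeoutError("504 from upstream")
    return {"price": "1.00"}


snapshot = [{"detail_url": "/p/1"}, {"detail_url": "/p/2"}]
for row in snapshot:
    run_detail(row, fetch)  # first pass: /p/2 fails and gets flagged

outage.clear()  # the upstream hiccup is over
retried = retry_failed_row(snapshot, 1, fetch)
```

The `calls` list ends up with three entries rather than four: the row that succeeded the first time is never fetched again, which is where the credit saving comes from.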

3. Intelligent Integration Dispatching

We've made sure that workflows play nice with your existing tech stack. If you use webhooks or email alerts via our integrations, you don't have to worry about your systems being flooded with hundreds of pings for every individual child page.

Integrations now fire once per workflow run, after the entire workflow settles and the final merged snapshot is saved. If you later trigger a manual retry on a failed row, the integrations fire one additional time so your external systems receive the corrected data.
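The dispatch rule can be pictured as a pending counter that triggers a notification only when it reaches zero. This is a sketch under assumptions: the `WorkflowRun` class, its method names, and the event strings are hypothetical and do not reflect the real webhook payloads.

```python
class WorkflowRun:
    """Tracks pending child jobs and notifies integrations once on settlement."""

    def __init__(self, total_rows, notify):
        self.pending = total_rows
        self.notify = notify  # callable standing in for webhook/email dispatch

    def child_settled(self):
        """Called when a child job ends, whether it succeeded or failed."""
        self.pending -= 1
        if self.pending == 0:
            self.notify("workflow_settled")  # one ping for the whole run

    def retry_settled(self):
        """Called after a manual single-row retry completes."""
        self.notify("row_retried")  # one extra ping with the corrected data


events = []
run = WorkflowRun(total_rows=3, notify=events.append)
for _ in range(3):
    run.child_settled()
run.retry_settled()
```

Three child jobs produce a single notification, and the later retry adds exactly one more, so a 500-row workflow would still send one ping on settlement rather than 500.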

Under the Hood: Built for Scale

To support this architecture, our engineering team updated our core database schema, introducing a new workflow_scrape_runs table that tracks run states, items, pending counts, and billing metrics.
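To make that concrete, here is a rough guess at what such a table could look like, using an in-memory SQLite database. Only the table name `workflow_scrape_runs` comes from the release; every column name, the status values, and the one-credit-per-job billing rule are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical columns: the real schema is not published here.
conn.execute(
    """
    CREATE TABLE workflow_scrape_runs (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        total_items INTEGER NOT NULL,
        pending_count INTEGER NOT NULL,
        credits_billed INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.execute(
    "INSERT INTO workflow_scrape_runs (status, total_items, pending_count) "
    "VALUES ('running', 3, 3)"
)


def settle_one(conn, run_id):
    """Record one finished child job; mark the run settled when none remain."""
    conn.execute(
        "UPDATE workflow_scrape_runs "
        "SET pending_count = pending_count - 1, "
        "    credits_billed = credits_billed + 1 "  # assumed: one credit per job
        "WHERE id = ?",
        (run_id,),
    )
    conn.execute(
        "UPDATE workflow_scrape_runs SET status = 'settled' "
        "WHERE id = ? AND pending_count = 0",
        (run_id,),
    )


for _ in range(3):
    settle_one(conn, 1)

row = conn.execute(
    "SELECT status, pending_count, credits_billed "
    "FROM workflow_scrape_runs WHERE id = 1"
).fetchone()
```

Keeping the pending count on the run row means "has this workflow settled?" is a single-row lookup rather than a scan over every child job.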

This release also ships with a suite of 525 unit and feature tests covering job serialization, lane traffic distribution, and integration dispatches, to keep the feature stable from day one.

Ready to build your first deep crawl? Head over to your dashboard, open up the elements mapper, and activate your very first Workflow Scrape today!