Crawling

Extract data from multiple pages at scale

The crawl endpoint extracts structured data from multiple pages by following links from a starting URL, applying the same schema to each page it visits.

Basic Crawl

curl -X POST https://api.refyne.uk/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "schema": {
      "name": "string",
      "price": "number"
    },
    "options": {
      "follow_selector": "a.product-link",
      "max_pages": 10
    }
  }'
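The same request can be issued from code. Here is a minimal Python sketch using only the standard library; the endpoint, schema, and options are taken from the curl example above, and the API key is a placeholder:

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder; substitute your real key

# Same payload as the curl example above
payload = {
    "url": "https://example.com/products",
    "schema": {"name": "string", "price": "number"},
    "options": {"follow_selector": "a.product-link", "max_pages": 10},
}

req = urllib.request.Request(
    "https://api.refyne.uk/api/v1/crawl",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment to send the request
```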

Crawl Options

Option            Type     Description
follow_selector   string   CSS selector for links to follow
follow_pattern    string   Regex pattern for URLs to follow
next_selector     string   CSS selector for the "next page" link (see Pagination below)
max_pages         number   Maximum pages to extract (default: 50)
max_depth         number   Maximum link depth to follow (default: 2)
same_domain_only  boolean  Only follow links on the same domain (default: true)
delay             string   Delay between requests (e.g., "1s")
concurrency       number   Parallel requests (default: 3)
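These options can be combined in a single request. For example, an illustrative payload (not taken verbatim from the API) for a polite, same-domain crawl:

```json
{
  "options": {
    "follow_pattern": "/products/[0-9]+",
    "max_depth": 3,
    "same_domain_only": true,
    "delay": "1s",
    "concurrency": 2
  }
}
```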

By CSS Selector

Follow specific links on the page:

{
  "options": {
    "follow_selector": "a.product-card",
    "max_pages": 20
  }
}

By URL Pattern

Follow links matching a pattern:

{
  "options": {
    "follow_pattern": "/products/[0-9]+",
    "max_pages": 20
  }
}
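Because follow_pattern is a regular expression, you can check it locally before launching a crawl. A quick sketch in Python, using the pattern from the example above:

```python
import re

# The follow_pattern from the example above
pattern = re.compile(r"/products/[0-9]+")

urls = [
    "https://example.com/products/42",   # matches: numeric product id
    "https://example.com/products/",     # no id, not followed
    "https://example.com/blog/post",     # different path, not followed
]

followed = [u for u in urls if pattern.search(u)]
print(followed)  # only the first URL matches
```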

Pagination

Walk through paginated listings by following the "next page" link:

{
  "options": {
    "next_selector": "a.next-page",
    "max_pages": 100
  }
}

Job Status

Crawl jobs run asynchronously. Use the job ID returned by the crawl request to check status:

curl https://api.refyne.uk/api/v1/jobs/JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
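A client typically polls this endpoint until the job leaves its running state. A minimal sketch; the get_job_status helper and the status strings ("pending", "running", "completed") are illustrative assumptions, not documented API behavior:

```python
import time

def poll_job(get_job_status, job_id, interval=2.0, timeout=300.0):
    """Poll until the job leaves a running state or the timeout expires.

    get_job_status is a caller-supplied function (e.g. wrapping the
    curl call above) that returns the job's status string.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_job_status(job_id)
        if status not in ("pending", "running"):  # assumed status values
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")

# Example with a stubbed status function that completes on the third poll
responses = iter(["pending", "running", "completed"])
final = poll_job(lambda _jid: next(responses), "JOB_ID", interval=0.0)
print(final)
```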

Getting Results

Once the job has completed, retrieve the crawl results either per page or merged:

# Individual results
curl https://api.refyne.uk/api/v1/jobs/JOB_ID/results \
  -H "Authorization: Bearer YOUR_API_KEY"

# Merged results
curl "https://api.refyne.uk/api/v1/jobs/JOB_ID/results?merge=true" \
  -H "Authorization: Bearer YOUR_API_KEY"
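The response shapes are not shown here; assuming the individual-results endpoint returns one list of records per crawled page, merge=true is effectively a server-side flatten. A client-side equivalent might look like this (illustrative only, not the API's documented response format):

```python
# Illustrative only: assumes individual results arrive as a list of
# per-page record lists, which merge=true would concatenate.
per_page_results = [
    [{"name": "Widget", "price": 9.99}],
    [{"name": "Gadget", "price": 19.99}, {"name": "Gizmo", "price": 4.99}],
]

merged = [record for page in per_page_results for record in page]
print(len(merged))  # one flat list of all extracted records
```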