This page covers how to phrase an extract query, design a row schema, poll for completion, and handle failures on the Extract endpoint.
Extract is in closed beta. Access is limited to a small group of users while we refine it — request access. Parameters, behavior, and response shape may change.

Choosing a seed URL

Extract starts from the single url you provide. Point it at the page that already lists the records you want — a team page, a product catalog, a directory, a search results page — rather than a site’s homepage. The more directly the seed page contains the rows, the more reliable the extraction.

Query phrasing

q describes both which rows to extract and what each row should contain. Name the unit of repetition and the fields explicitly:
Each engineering team member, with their full name, job title, and the
URL of their individual profile page.
Useful dimensions to specify include:
  • the entity that defines one row (a person, a product, a job posting),
  • the fields each row must contain,
  • any filter that limits which records qualify, and
  • how to handle missing fields.
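As an illustration of the dimensions above, the following sketch assembles a q string from its parts. The helper `build_query` and its parameter names are hypothetical and not part of the Extract API; the only thing sent to the endpoint is the resulting plain-text string.

```python
def build_query(entity, fields, row_filter=None, missing=None):
    """Compose a q string that names the row entity and its fields explicitly.

    Hypothetical helper for illustration; Extract only sees the returned text.
    """
    if not fields:
        raise ValueError("name at least one field")
    if len(fields) <= 2:
        field_list = " and ".join(fields)
    else:
        field_list = ", ".join(fields[:-1]) + ", and " + fields[-1]

    sentences = [f"Each {entity}, with {field_list}."]
    if row_filter:
        # Optional filter that limits which records qualify.
        sentences.append(f"Only include {row_filter}.")
    if missing:
        # Optional instruction for fields the page does not provide.
        sentences.append(f"If a field is missing, {missing}.")
    return " ".join(sentences)


q = build_query(
    entity="engineering team member",
    fields=["full name", "job title", "the URL of their individual profile page"],
    row_filter="people currently listed on the page",
    missing="leave it empty rather than guessing",
)
print(q)
```

Writing the query by hand works just as well; the point is that each of the four dimensions appears as its own explicit clause.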

Designing a row schema

When you need a guaranteed shape, pass an explicit schema. It describes one row, and every returned row must match it.
  • Keep the schema flat. Primitive fields (string, number, boolean) are most reliable.
  • Mark only genuinely mandatory fields as required. Over-constraining drops otherwise-valid rows.
  • Use "format": "uri" for link fields so they can be validated with verifyUrls.
  • Reshape into nested structures client-side after extraction rather than requesting deep nesting.
If schema is omitted, the agent infers the row shape from q. Provide a schema whenever downstream code depends on stable field names and types.
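A flat row schema following these guidelines might look like the sketch below. It assumes the schema is written in JSON Schema style, which the `"format": "uri"` hint suggests but this page does not spell out. The field names (`full_name`, `job_title`, `profile_url`) and the two check functions are hypothetical.

```python
# One row = one team member. Field names are illustrative, not prescribed.
ROW_SCHEMA = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "job_title": {"type": "string"},
        # Link field marked as a URI so it can be checked with verifyUrls.
        "profile_url": {"type": "string", "format": "uri"},
    },
    # Only the genuinely mandatory field is required.
    "required": ["full_name"],
}

PRIMITIVE_TYPES = {"string", "number", "boolean"}


def non_primitive_fields(schema):
    """Return the names of properties whose type is not a primitive."""
    return [
        name
        for name, spec in schema.get("properties", {}).items()
        if spec.get("type") not in PRIMITIVE_TYPES
    ]


def unknown_required_fields(schema):
    """Return required names that are not declared under properties."""
    declared = schema.get("properties", {})
    return [name for name in schema.get("required", []) if name not in declared]


print(non_primitive_fields(ROW_SCHEMA))     # empty list means the schema is flat
print(unknown_required_fields(ROW_SCHEMA))  # empty list means required is consistent
```

Running checks like these before submitting a task is a cheap way to catch accidental nesting or a misspelled required field.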

Verifying URLs

Set verifyUrls to true when extracted links must resolve — for example, when building a dataset of working profile or product pages. Verification checks each URL for reachability after extraction, which adds latency. Leave it at the default false when you only need the raw values.
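The sketch below shows one way to assemble the request parameters so that verifyUrls is only sent when it is needed. It assumes that url, q, schema, and verifyUrls are top-level fields of a JSON request body, which this page does not confirm. The helper `build_extract_body` and the example.com URL are hypothetical.

```python
import json


def build_extract_body(url, q, schema=None, verify_urls=False):
    """Build a request body for an extract task (assumed field layout)."""
    body = {"url": url, "q": q}
    if schema is not None:
        # Omitting schema lets the agent infer the row shape from q.
        body["schema"] = schema
    if verify_urls:
        # Only opt in when links must resolve; verification adds latency.
        body["verifyUrls"] = True
    return body


body = build_extract_body(
    url="https://example.com/team",
    q="Each engineering team member, with their full name, job title, "
      "and the URL of their individual profile page.",
    verify_urls=True,
)
print(json.dumps(body, indent=2))
```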

Reading the result

The result is delivered as a downloadable NDJSON file, not inline:
  • output.resultUrl is valid for 24 hours. Download and persist the rows promptly.
  • The file has one JSON object per line; parse it line by line rather than as a single JSON document.
  • output.rowsReturned tells you how many rows to expect.
  • output.creditsUsed gives the credits used by the task.
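A minimal sketch of line-by-line parsing, using only the Python standard library. The function names and the sample rows are hypothetical; the comparison against output.rowsReturned is a suggested sanity check, not documented behavior.

```python
import json


def parse_ndjson(lines):
    """Parse an iterable of NDJSON lines into a list of row objects."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return rows


def row_count_matches(rows, rows_returned):
    """Compare the parsed row count with output.rowsReturned."""
    return len(rows) == rows_returned


# Made-up sample standing in for the downloaded file contents.
sample = (
    '{"full_name": "Ada Example", "job_title": "Engineer"}\n'
    '{"full_name": "Lin Example", "job_title": "Manager"}\n'
)
rows = parse_ndjson(sample.splitlines())
print(len(rows), row_count_matches(rows, 2))
```

If the file is fetched with the standard library, iterating over the response from `urllib.request.urlopen(result_url)` yields one bytes line at a time, which `parse_ndjson` accepts unchanged because `json.loads` also accepts bytes.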

Polling

Polling loops are a common source of integration errors. Recommended defaults:
  • Interval: poll roughly every 30 seconds for long-running tasks.
  • Maximum poll rate: 1 request per second. Higher rates trigger rate limits without reducing time-to-completion.
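The defaults above can be sketched as a small polling loop. The status-fetching call is injected as `fetch_status` because this page does not describe the status endpoint. The `status` field and the terminal values `"completed"` and `"failed"` are assumptions, as is the one-hour client-side timeout.

```python
import time

POLL_INTERVAL_SECONDS = 30   # recommended interval for long-running tasks
MIN_INTERVAL_SECONDS = 1     # never poll faster than 1 request per second
TERMINAL_STATUSES = {"completed", "failed"}  # assumed values, not documented here


def poll_until_done(fetch_status, interval=POLL_INTERVAL_SECONDS, timeout=3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the task reaches a terminal status.

    fetch_status is a caller-supplied function returning the task as a dict.
    Raises TimeoutError if the task is still running when the timeout elapses.
    """
    interval = max(interval, MIN_INTERVAL_SECONDS)  # clamp to the rate limit
    deadline = clock() + timeout
    while True:
        task = fetch_status()
        if task.get("status") in TERMINAL_STATUSES:
            return task
        if clock() + interval > deadline:
            raise TimeoutError("task did not finish before the client-side timeout")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in production code the defaults can be left as they are.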

Failure handling

No credits are deducted for failed tasks or for tasks that return no result, and retries are unrestricted. This policy is consistent with the other endpoints. See the errors page for details.
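Because failed tasks cost no credits, a simple retry wrapper is a reasonable client-side pattern. The sketch below is one possible approach under stated assumptions: `run_task` and `is_failure` are caller-supplied hypothetical functions, and the attempt cap and backoff delays are arbitrary choices rather than documented limits.

```python
import time


def run_with_retries(run_task, is_failure, max_attempts=3, base_delay=5.0,
                     sleep=time.sleep):
    """Run a task, retrying with exponential backoff while it fails.

    run_task submits the task and returns its final state; is_failure decides
    whether that state warrants another attempt. Returns the last outcome.
    """
    outcome = None
    for attempt in range(1, max_attempts + 1):
        outcome = run_task()
        if not is_failure(outcome):
            return outcome
        if attempt < max_attempts:
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return outcome
```

A task that completes but returns no rows is usually better handled by revising q or the seed URL than by resubmitting the same request.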
