Extract is in closed beta. Access is limited to a small group of users while we refine it — request access. Parameters, behavior, and response shape may change.
Choosing a seed URL
Extract starts from the single url you provide. Point it at the page that
already lists the records you want — a team page, a product catalog, a
directory, a search results page — rather than a site’s homepage. The more
directly the seed page contains the rows, the more reliable the extraction.
Query phrasing
q describes both which rows to extract and what each row should
contain. Name the unit of repetition and the fields explicitly:
- the entity that defines one row (a person, a product, a job posting),
- the fields each row must contain,
- any filter that limits which records qualify, and
- how to handle missing fields.
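The four elements above can be assembled mechanically. The following is a minimal sketch, not part of the Extract API: `build_query` is a hypothetical helper, and the sentence templates are only one possible phrasing.

```python
def build_query(entity, fields, row_filter=None, missing="leave the field null"):
    """Compose a q string naming the row unit, fields, filter, and missing-field rule.

    Hypothetical helper for illustration; Extract only receives the final string.
    """
    parts = [f"Extract one row per {entity}."]
    parts.append("Each row must contain: " + ", ".join(fields) + ".")
    if row_filter:
        # The filter is optional; omit the sentence when every record qualifies.
        parts.append(f"Only include records where {row_filter}.")
    parts.append(f"If a field is missing, {missing}.")
    return " ".join(parts)


q = build_query(
    entity="job posting",
    fields=["title", "location", "posting URL"],
    row_filter="the location is remote",
)
print(q)
```

Building q from named parts makes it harder to forget the missing-field rule, which is the element most often left out.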
Designing a row schema
When you need a guaranteed shape, pass an explicit schema. It describes
one row, and every returned row must match it.
- Keep the schema flat. Primitive fields (string, number, boolean) are most reliable.
- Mark only genuinely mandatory fields as required. Over-constraining drops otherwise-valid rows.
- Use "format": "uri" for link fields so they can be validated with verifyUrls.
- Reshape into nested structures client-side after extraction rather than requesting deep nesting.
When schema is omitted, the agent infers the row shape from q. Provide a
schema whenever downstream code depends on stable field names and types.
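As an illustration of these guidelines, here is a sketch of a flat row schema and a client-side reshape step. It assumes schema accepts a JSON Schema style object, which the "format": "uri" and required keywords suggest but this page does not spell out. The field names and the `reshape` helper are hypothetical.

```python
# Hypothetical flat row schema for a team page; field names are illustrative.
row_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "title": {"type": "string"},
        "is_manager": {"type": "boolean"},
        # "format": "uri" marks this as a link field.
        "profile_url": {"type": "string", "format": "uri"},
    },
    # Only the field a row cannot do without is required.
    "required": ["name"],
}


def reshape(row):
    """Nest a flat extracted row client-side instead of requesting nesting."""
    return {
        "person": {"name": row["name"], "title": row.get("title")},
        "links": {"profile": row.get("profile_url")},
        "is_manager": row.get("is_manager"),
    }
```

Optional fields are read with `dict.get` so that rows missing a non-required field still reshape cleanly.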
Verifying URLs
Set verifyUrls to true when extracted links must resolve, for example,
when building a dataset of working profile or product pages. Verification
checks each URL for reachability after extraction, which adds latency. Leave
it at the default false when you only need the raw values.
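One way to keep verification opt-in is to add the flag only when it is needed. The sketch below assumes the parameters travel together in one JSON-style body and that the seed parameter is named url. How the body is actually submitted is not covered here, and `build_request` is a hypothetical helper.

```python
def build_request(url, q, schema=None, verify_urls=False):
    """Assemble Extract parameters as a dict (hypothetical body layout)."""
    body = {"url": url, "q": q}
    if schema is not None:
        body["schema"] = schema
    if verify_urls:
        # Only sent when links must resolve; omitting it keeps the default false.
        body["verifyUrls"] = True
    return body


raw_only = build_request("https://example.com/team", "One row per person: name.")
verified = build_request(
    "https://example.com/team",
    "One row per person: name, profile URL.",
    verify_urls=True,
)
```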
Reading the result
The result is delivered as a downloadable NDJSON file, not inline:
- output.resultUrl is valid for 24 hours. Download and persist the rows promptly.
- The file has one JSON object per line; parse it line by line rather than as a single JSON document.
- output.rowsReturned tells you how many rows to expect.
- output.creditsUsed gives the credits used by the task.
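A minimal sketch of line-by-line NDJSON parsing with a row-count check follows. `parse_ndjson` and `load_rows` are hypothetical helpers; they accept any iterable of lines, as text or bytes, so they work on an open file or an HTTP response body.

```python
import json


def parse_ndjson(lines):
    """Yield one parsed JSON object per non-empty line."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


def load_rows(lines, rows_expected=None):
    """Collect all rows and optionally compare against output.rowsReturned."""
    rows = list(parse_ndjson(lines))
    if rows_expected is not None and len(rows) != rows_expected:
        raise ValueError(f"expected {rows_expected} rows, got {len(rows)}")
    return rows


sample = ['{"name": "Ada"}\n', '{"name": "Grace"}\n']
rows = load_rows(sample, rows_expected=2)
```

If output.resultUrl can be fetched with a plain GET (an assumption), the object returned by `urllib.request.urlopen` can be passed to `load_rows` directly, because it iterates over lines of bytes. Write the rows to durable storage in the same step, since the URL expires after 24 hours.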
Polling
Polling loops are a common source of integration errors. Recommended defaults:
- Interval: poll roughly every 30 seconds for long-running tasks.
- Maximum poll rate: 1 request per second. Higher rates trigger rate limits without reducing time-to-completion.