Extract API

POST
/v1/extract

The Extract API extracts clean, readable text content from web pages, PDFs, and documents. It removes ads, navigation, and clutter to provide AI-ready content perfect for RAG systems and content analysis.

🎯 Overview

Extract any web page or document into clean, structured text:

  • Web pages - Remove ads, navigation, popups
  • PDF documents - Extract text and metadata
  • Articles - Get main content only
  • Social media - Clean posts and comments

📝 Request Format

Basic Request

```bash
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
```

Complete Request

```json
{
  "url": "https://example.com/long-article",
  "format": "markdown",
  "include_metadata": true,
  "clean_html": true,
  "extract_images": false,
  "options": {
    "timeout": 15,
    "user_agent": "cortex-bot",
    "follow_redirects": true,
    "extract_tables": true
  }
}
```

📋 Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | string | Target URL to extract content from |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | string | `"text"` | Output format: `text`, `markdown`, `html`, `json` |
| `include_metadata` | boolean | `true` | Include page metadata |
| `clean_html` | boolean | `true` | Remove ads, navigation, clutter |
| `extract_images` | boolean | `false` | Include image URLs and alt text |
| `extract_links` | boolean | `false` | Include internal/external links |
| `extract_tables` | boolean | `false` | Convert tables to structured data |
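
As a sketch of how these defaults might combine with caller overrides (the field names and defaults come from the tables above; the `build_extract_payload` helper itself is hypothetical, not part of the SDK):

```python
# Defaults mirroring the optional-parameter table above.
DEFAULTS = {
    "format": "text",
    "include_metadata": True,
    "clean_html": True,
    "extract_images": False,
    "extract_links": False,
    "extract_tables": False,
}

def build_extract_payload(url, **overrides):
    """Merge caller overrides onto the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"url": url, **DEFAULTS, **overrides}

payload = build_extract_payload("https://example.com/article", format="markdown")
print(payload["format"])  # markdown
```

Validating parameter names client-side like this catches typos before they reach the API.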

📊 Response Format

Success Response

```json
{
  "success": true,
  "data": {
    "content": "This is the clean extracted text content...",
    "metadata": {
      "title": "Article Title",
      "description": "Article description from meta tags",
      "author": "John Smith",
      "published_date": "2025-01-08T10:00:00Z",
      "language": "en",
      "word_count": 1234,
      "reading_time": "5 min read",
      "canonical_url": "https://example.com/article"
    },
    "structure": {
      "headings": [
        {"level": 1, "text": "Main Title"},
        {"level": 2, "text": "Subtitle"}
      ],
      "paragraphs": 15,
      "lists": 3,
      "tables": 1
    },
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "alt": "Image description",
        "caption": "Image caption"
      }
    ],
    "quality_score": 0.89
  },
  "metadata": {
    "request_id": "req_extract_123",
    "processing_time": 2.1,
    "extraction_method": "readability",
    "content_size": 12450
  }
}
```
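
A minimal sketch of unpacking this payload on the client side, using a trimmed copy of the response above (the `unpack_extraction` helper is an illustration, not an SDK function):

```python
# A trimmed copy of the success response shown above.
response = {
    "success": True,
    "data": {
        "content": "This is the clean extracted text content...",
        "metadata": {"title": "Article Title", "word_count": 1234},
        "quality_score": 0.89,
    },
    "metadata": {"request_id": "req_extract_123"},
}

def unpack_extraction(resp):
    """Return (content, page_metadata, quality_score) from a success payload."""
    if not resp.get("success"):
        raise RuntimeError("extraction failed")
    data = resp["data"]
    return data["content"], data["metadata"], data["quality_score"]

content, meta, score = unpack_extraction(response)
print(meta["title"], score)  # Article Title 0.89
```

Note that `data.metadata` describes the extracted page, while the top-level `metadata` describes the request itself.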

🎨 Output Formats

Text Format (Default)

```json
{
  "url": "https://example.com/article",
  "format": "text"
}
```

Response:

```
This is the clean extracted text content without any HTML formatting. Perfect for AI processing and analysis.

The content maintains paragraph structure and readability while removing all navigation elements, advertisements, and other clutter.
```

Markdown Format

```json
{
  "url": "https://example.com/article",
  "format": "markdown"
}
```

Response:

````markdown
# Article Title

This is the **clean extracted content** formatted as Markdown. Great for documentation and structured processing.

## Subheading

- Bullet points are preserved
- Lists maintain structure
- Links are converted to [markdown format](https://example.com)

### Code blocks are maintained

```python
def example():
    return "formatted code"
```
````

HTML Format

```json
{
  "url": "https://example.com/article",
  "format": "html",
  "clean_html": true
}
```

Response:

```html
<article>
  <h1>Article Title</h1>
  <p>This is clean HTML with only content-relevant tags preserved.</p>
  <ul>
    <li>Navigation removed</li>
    <li>Ads removed</li>
    <li>Clutter removed</li>
  </ul>
</article>
```

JSON Format

```json
{
  "url": "https://example.com/article",
  "format": "json"
}
```

Response:

```json
{
  "title": "Article Title",
  "content": [
    {"type": "heading", "level": 1, "text": "Article Title"},
    {"type": "paragraph", "text": "First paragraph content..."},
    {"type": "list", "items": ["Item 1", "Item 2"]},
    {"type": "heading", "level": 2, "text": "Subtitle"},
    {"type": "paragraph", "text": "Second paragraph content..."}
  ]
}
```

📄 Supported Content Types

Web Pages

  • News articles - Clean content extraction
  • Blog posts - Remove sidebars and ads
  • Documentation - Preserve code blocks and structure
  • E-commerce - Product descriptions and specs

Documents

  • PDF files - Text extraction with layout preservation
  • Word documents - Convert to clean text/markdown
  • Google Docs - Public document extraction
  • Notion pages - Public page content

Social Media

  • Twitter threads - Clean tweet compilation
  • LinkedIn posts - Professional content extraction
  • Reddit posts - Thread and comment extraction
  • Medium articles - Clean article content

⚙️ Advanced Options

Custom Extraction Rules

```json
{
  "url": "https://example.com/article",
  "options": {
    "custom_selectors": {
      "content": "article.main-content",
      "title": "h1.article-title",
      "exclude": [".advertisement", ".sidebar"]
    },
    "preserve_formatting": true,
    "min_content_length": 500
  }
}
```

Batch Processing

```json
{
  "urls": [
    "https://example1.com/article",
    "https://example2.com/article",
    "https://example3.com/article"
  ],
  "format": "text",
  "parallel": true
}
```
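
When extracting many URLs, a client can split them into batch requests shaped like the payload above. A small sketch (the batch size of 10 is an assumption tied to the documented concurrency limit, and the helpers are hypothetical):

```python
def chunk(urls, size):
    """Split a URL list into batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def batch_payloads(urls, size=10, fmt="text"):
    """Build one batch-extract payload per chunk of URLs."""
    return [{"urls": batch, "format": fmt, "parallel": True}
            for batch in chunk(urls, size)]

urls = [f"https://example.com/article-{i}" for i in range(25)]
payloads = batch_payloads(urls, size=10)
print(len(payloads))  # 3
```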

PDF Processing

```json
{
  "url": "https://example.com/document.pdf",
  "options": {
    "extract_pages": [1, 2, 3],
    "preserve_layout": false,
    "extract_images": true,
    "ocr_enabled": true
  }
}
```

🚨 Error Handling

Common Errors

| Error Code | Description | Solution |
| --- | --- | --- |
| `URL_INVALID` | Malformed URL | Check URL format |
| `URL_UNREACHABLE` | Cannot access URL | Verify URL accessibility |
| `CONTENT_TOO_LARGE` | Content exceeds size limit | Use pagination or filters |
| `UNSUPPORTED_FORMAT` | File type not supported | Check supported formats |
| `EXTRACTION_FAILED` | Cannot extract content | Try a different extraction method |

Error Response

```json
{
  "success": false,
  "error": {
    "code": "URL_UNREACHABLE",
    "message": "Unable to access the provided URL",
    "details": {
      "url": "https://example.com/missing",
      "http_status": 404,
      "suggestion": "Verify the URL is correct and accessible"
    }
  },
  "request_id": "req_error_extract_456"
}
```
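
One way a client might turn error payloads like this into typed exceptions. This is a sketch, not SDK behavior; in particular, which codes count as transient (and thus retryable) is an assumption:

```python
# Assumption: these codes describe transient conditions worth retrying.
RETRYABLE = {"URL_UNREACHABLE", "EXTRACTION_FAILED"}

class ExtractError(Exception):
    """Typed wrapper around an error payload (hypothetical)."""
    def __init__(self, code, message, retryable):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.retryable = retryable

def raise_for_error(resp):
    """Pass success payloads through; raise ExtractError for failures."""
    if resp.get("success"):
        return resp
    err = resp["error"]
    raise ExtractError(err["code"], err["message"], err["code"] in RETRYABLE)

try:
    raise_for_error({
        "success": False,
        "error": {"code": "URL_UNREACHABLE",
                  "message": "Unable to access the provided URL"},
    })
except ExtractError as e:
    print(e.code, e.retryable)  # URL_UNREACHABLE True
```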

💡 Use Cases

RAG System Integration

```python
from cortex import CortexClient

client = CortexClient(api_key="your_key")

# Extract content for RAG
urls = ["https://docs.example.com/api", "https://blog.example.com/tutorial"]

for url in urls:
    result = client.extract(url, format="text")
    # Add to vector database
    vector_db.add_document(result.content, metadata=result.metadata)
```

Content Analysis

```python
# Analyze content quality
result = client.extract(
    url="https://news.example.com/article",
    include_metadata=True,
    extract_links=True
)

print(f"Quality Score: {result.quality_score}")
print(f"Reading Time: {result.metadata.reading_time}")
print(f"External Links: {len(result.links.external)}")
```

Document Processing

```python
# Process PDF documents
result = client.extract(
    url="https://example.com/research.pdf",
    format="markdown",
    options={
        "extract_pages": [1, 2, 3],
        "preserve_layout": True
    }
)

print(result.content)
```

📊 Quality Scoring

Content quality is scored based on:

| Factor | Weight | Description |
| --- | --- | --- |
| Content Length | 25% | Adequate content volume |
| Structure | 20% | Proper heading hierarchy |
| Readability | 20% | Clean, readable text |
| Metadata | 15% | Complete title, author, date |
| Media | 10% | Relevant images and media |
| Links | 10% | Internal/external link quality |

Quality Thresholds

  • 0.9-1.0 - Excellent (news articles, documentation)
  • 0.7-0.9 - Good (blog posts, tutorials)
  • 0.5-0.7 - Fair (forums, social media)
  • 0.0-0.5 - Poor (spam, low-content pages)
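
The weights and thresholds above can be sketched as a small scoring helper. This only illustrates the published table; the per-factor scores in [0, 1] and the function names are assumptions, not the API's actual internals:

```python
# Weights from the scoring table above (they sum to 1.0).
WEIGHTS = {
    "content_length": 0.25, "structure": 0.20, "readability": 0.20,
    "metadata": 0.15, "media": 0.10, "links": 0.10,
}

def quality_score(factors):
    """Weighted sum of per-factor scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[k] * factors.get(k, 0.0) for k in WEIGHTS)

def quality_tier(score):
    """Map a score onto the documented thresholds."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "fair"
    return "poor"

score = quality_score({"content_length": 1.0, "structure": 0.9, "readability": 0.8,
                       "metadata": 1.0, "media": 0.5, "links": 0.5})
print(round(score, 2), quality_tier(score))  # 0.84 good
```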

🔧 Integration Examples

Python SDK

```python
from cortex import CortexClient

client = CortexClient(api_key="your_key")

# Basic extraction
result = client.extract("https://example.com/article")
print(result.content)

# Advanced extraction
result = client.extract(
    url="https://example.com/article",
    format="markdown",
    include_metadata=True,
    extract_images=True
)

# Process result
if result.quality_score > 0.7:
    print("High quality content extracted")
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
```

JavaScript SDK

```javascript
import Cortex from '@cortex/sdk';

const cortex = new Cortex({ apiKey: 'your_key' });

// Extract and process
const result = await cortex.extract({
  url: 'https://example.com/article',
  format: 'markdown',
  includeMetadata: true
});

console.log(`Extracted ${result.metadata.wordCount} words`);
console.log(result.content);
```

cURL Examples

```bash
# Basic extraction
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

# Markdown format with metadata
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "format": "markdown",
    "include_metadata": true,
    "extract_images": true
  }'
```

📈 Performance & Limits

Processing Limits

| Limit | Value | Description |
| --- | --- | --- |
| Max file size | 50MB | Per document/URL |
| Max content length | 1M chars | Extracted text limit |
| Timeout | 30 seconds | Processing timeout |
| Concurrent requests | 10 | Per API key |

Best Practices

  1. Use appropriate format for your use case
  2. Enable caching for frequently accessed URLs
  3. Process in batches for multiple URLs
  4. Check quality score before using content
  5. Handle errors gracefully with retry logic
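
Point 5 can be sketched as a generic retry wrapper with exponential backoff. This is an illustration, not SDK behavior; the attempt count and delays are assumptions (and shortened here for the demo):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call `fn`, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Simulated extract call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"success": True}

print(with_retries(flaky_extract))  # {'success': True}
```

In practice you would retry only transient failures (e.g. `URL_UNREACHABLE`) and let permanent ones such as `URL_INVALID` fail fast.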

Next: Validate API → Source verification and fact-checking