Extract API

POST
/v1/extract

The Extract API extracts clean, readable text content from web pages, PDFs, and documents. It removes ads, navigation, and clutter to provide AI-ready content perfect for RAG systems and content analysis.

🎯 Overview

Extract any web page or document into clean, structured text:

  • Web pages - Remove ads, navigation, popups
  • PDF documents - Extract text and metadata
  • Articles - Get main content only
  • Social media - Clean posts and comments

📝 Request Format

Basic Request

```bash
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
```

Complete Request

```json
{
  "url": "https://example.com/long-article",
  "format": "markdown",
  "include_metadata": true,
  "clean_html": true,
  "extract_images": false,
  "options": {
    "timeout": 15,
    "user_agent": "cortex-bot",
    "follow_redirects": true,
    "extract_tables": true
  }
}
```

📋 Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `url` | string | Target URL to extract content from |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | string | `"text"` | Output format: `text`, `markdown`, `html`, `json` |
| `include_metadata` | boolean | `true` | Include page metadata |
| `clean_html` | boolean | `true` | Remove ads, navigation, clutter |
| `extract_images` | boolean | `false` | Include image URLs and alt text |
| `extract_links` | boolean | `false` | Include internal/external links |
| `extract_tables` | boolean | `false` | Convert tables to structured data |
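
As a sketch of how these defaults might combine with caller overrides (the field names and defaults come from the tables above; the `build_extract_payload` helper itself is hypothetical, not part of the SDK):

```python
# Defaults mirroring the optional-parameter table above.
DEFAULTS = {
    "format": "text",
    "include_metadata": True,
    "clean_html": True,
    "extract_images": False,
    "extract_links": False,
    "extract_tables": False,
}

def build_extract_payload(url, **overrides):
    """Merge caller overrides onto the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"url": url, **DEFAULTS, **overrides}

payload = build_extract_payload("https://example.com/article", format="markdown")
print(payload["format"])  # markdown
```

Validating parameter names client-side like this catches typos before they reach the API.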

📊 Response Format

Success Response

```json
{
  "success": true,
  "data": {
    "content": "This is the clean extracted text content...",
    "metadata": {
      "title": "Article Title",
      "description": "Article description from meta tags",
      "author": "John Smith",
      "published_date": "2025-01-08T10:00:00Z",
      "language": "en",
      "word_count": 1234,
      "reading_time": "5 min read",
      "canonical_url": "https://example.com/article"
    },
    "structure": {
      "headings": [
        {"level": 1, "text": "Main Title"},
        {"level": 2, "text": "Subtitle"}
      ],
      "paragraphs": 15,
      "lists": 3,
      "tables": 1
    },
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "alt": "Image description",
        "caption": "Image caption"
      }
    ],
    "quality_score": 0.89
  },
  "metadata": {
    "request_id": "req_extract_123",
    "processing_time": 2.1,
    "extraction_method": "readability",
    "content_size": 12450
  }
}
```
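
A minimal sketch of unpacking this payload on the client side, using a trimmed copy of the response above (the `unpack_extraction` helper is an illustration, not an SDK function):

```python
# A trimmed copy of the success response shown above.
response = {
    "success": True,
    "data": {
        "content": "This is the clean extracted text content...",
        "metadata": {"title": "Article Title", "word_count": 1234},
        "quality_score": 0.89,
    },
    "metadata": {"request_id": "req_extract_123"},
}

def unpack_extraction(resp):
    """Return (content, page_metadata, quality_score) from a success payload."""
    if not resp.get("success"):
        raise RuntimeError("extraction failed")
    data = resp["data"]
    return data["content"], data["metadata"], data["quality_score"]

content, meta, score = unpack_extraction(response)
print(meta["title"], score)  # Article Title 0.89
```

Note that `data.metadata` describes the extracted page, while the top-level `metadata` describes the request itself.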

🎨 Output Formats

Text Format (Default)

```json
{
  "url": "https://example.com/article",
  "format": "text"
}
```

Response:

```
This is the clean extracted text content without any HTML formatting. Perfect for AI processing and analysis.

The content maintains paragraph structure and readability while removing all navigation elements, advertisements, and other clutter.
```

Markdown Format

```json
{
  "url": "https://example.com/article",
  "format": "markdown"
}
```

Response:

````markdown
# Article Title

This is the **clean extracted content** formatted as Markdown. Great for documentation and structured processing.

## Subheading

- Bullet points are preserved
- Lists maintain structure
- Links are converted to [markdown format](https://example.com)

### Code blocks are maintained

```python
def example():
    return "formatted code"
```
````

HTML Format

```json
{
  "url": "https://example.com/article",
  "format": "html",
  "clean_html": true
}
```

Response:

```html
<article>
  <h1>Article Title</h1>
  <p>This is clean HTML with only content-relevant tags preserved.</p>
  <ul>
    <li>Navigation removed</li>
    <li>Ads removed</li>
    <li>Clutter removed</li>
  </ul>
</article>
```

JSON Format

```json
{
  "url": "https://example.com/article",
  "format": "json"
}
```

Response:

```json
{
  "title": "Article Title",
  "content": [
    {"type": "heading", "level": 1, "text": "Article Title"},
    {"type": "paragraph", "text": "First paragraph content..."},
    {"type": "list", "items": ["Item 1", "Item 2"]},
    {"type": "heading", "level": 2, "text": "Subtitle"},
    {"type": "paragraph", "text": "Second paragraph content..."}
  ]
}
```

📄 Supported Content Types

Web Pages

  • News articles - Clean content extraction
  • Blog posts - Remove sidebars and ads
  • Documentation - Preserve code blocks and structure
  • E-commerce - Product descriptions and specs

Documents

  • PDF files - Text extraction with layout preservation
  • Word documents - Convert to clean text/markdown
  • Google Docs - Public document extraction
  • Notion pages - Public page content

Social Media

  • Twitter threads - Clean tweet compilation
  • LinkedIn posts - Professional content extraction
  • Reddit posts - Thread and comment extraction
  • Medium articles - Clean article content

⚙️ Advanced Options

Custom Extraction Rules

```json
{
  "url": "https://example.com/article",
  "options": {
    "custom_selectors": {
      "content": "article.main-content",
      "title": "h1.article-title",
      "exclude": [".advertisement", ".sidebar"]
    },
    "preserve_formatting": true,
    "min_content_length": 500
  }
}
```

Batch Processing

```json
{
  "urls": [
    "https://example1.com/article",
    "https://example2.com/article",
    "https://example3.com/article"
  ],
  "format": "text",
  "parallel": true
}
```
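
When extracting many URLs, a client can split them into batch requests shaped like the payload above. A small sketch (the batch size of 10 is an assumption tied to the documented concurrency limit, and the helpers are hypothetical):

```python
def chunk(urls, size):
    """Split a URL list into batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def batch_payloads(urls, size=10, fmt="text"):
    """Build one batch-extract payload per chunk of URLs."""
    return [{"urls": batch, "format": fmt, "parallel": True}
            for batch in chunk(urls, size)]

urls = [f"https://example.com/article-{i}" for i in range(25)]
payloads = batch_payloads(urls, size=10)
print(len(payloads))  # 3
```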

PDF Processing

```json
{
  "url": "https://example.com/document.pdf",
  "options": {
    "extract_pages": [1, 2, 3],
    "preserve_layout": false,
    "extract_images": true,
    "ocr_enabled": true
  }
}
```

🚨 Error Handling

Common Errors

| Error Code | Description | Solution |
| --- | --- | --- |
| `URL_INVALID` | Malformed URL | Check URL format |
| `URL_UNREACHABLE` | Cannot access URL | Verify URL accessibility |
| `CONTENT_TOO_LARGE` | Content exceeds size limit | Use pagination or filters |
| `UNSUPPORTED_FORMAT` | File type not supported | Check supported formats |
| `EXTRACTION_FAILED` | Cannot extract content | Try a different extraction method |

Error Response

```json
{
  "success": false,
  "error": {
    "code": "URL_UNREACHABLE",
    "message": "Unable to access the provided URL",
    "details": {
      "url": "https://example.com/missing",
      "http_status": 404,
      "suggestion": "Verify the URL is correct and accessible"
    }
  },
  "request_id": "req_error_extract_456"
}
```
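
One way a client might turn error payloads like this into typed exceptions. This is a sketch, not SDK behavior; in particular, which codes count as transient (and thus retryable) is an assumption:

```python
# Assumption: these codes describe transient conditions worth retrying.
RETRYABLE = {"URL_UNREACHABLE", "EXTRACTION_FAILED"}

class ExtractError(Exception):
    """Typed wrapper around an error payload (hypothetical)."""
    def __init__(self, code, message, retryable):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.retryable = retryable

def raise_for_error(resp):
    """Pass success payloads through; raise ExtractError for failures."""
    if resp.get("success"):
        return resp
    err = resp["error"]
    raise ExtractError(err["code"], err["message"], err["code"] in RETRYABLE)

try:
    raise_for_error({
        "success": False,
        "error": {"code": "URL_UNREACHABLE",
                  "message": "Unable to access the provided URL"},
    })
except ExtractError as e:
    print(e.code, e.retryable)  # URL_UNREACHABLE True
```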

💡 Use Cases

RAG System Integration

```python
from cortex import CortexClient

client = CortexClient(api_key="your_key")

# Extract content for RAG
urls = ["https://docs.example.com/api", "https://blog.example.com/tutorial"]

for url in urls:
    result = client.extract(url, format="text")
    # Add to vector database
    vector_db.add_document(result.content, metadata=result.metadata)
```

Content Analysis

```python
# Analyze content quality
result = client.extract(
    url="https://news.example.com/article",
    include_metadata=True,
    extract_links=True
)

print(f"Quality Score: {result.quality_score}")
print(f"Reading Time: {result.metadata.reading_time}")
print(f"External Links: {len(result.links.external)}")
```

Document Processing

```python
# Process PDF documents
result = client.extract(
    url="https://example.com/research.pdf",
    format="markdown",
    options={
        "extract_pages": [1, 2, 3],
        "preserve_layout": True
    }
)

print(result.content)
```

📊 Quality Scoring

Content quality is scored based on:

| Factor | Weight | Description |
| --- | --- | --- |
| Content Length | 25% | Adequate content volume |
| Structure | 20% | Proper heading hierarchy |
| Readability | 20% | Clean, readable text |
| Metadata | 15% | Complete title, author, date |
| Media | 10% | Relevant images and media |
| Links | 10% | Internal/external link quality |

Quality Thresholds

  • 0.9-1.0 - Excellent (news articles, documentation)
  • 0.7-0.9 - Good (blog posts, tutorials)
  • 0.5-0.7 - Fair (forums, social media)
  • 0.0-0.5 - Poor (spam, low-content pages)
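
The weights and thresholds above can be sketched as a small scoring helper. This only illustrates the published table; the per-factor scores in [0, 1] and the function names are assumptions, not the API's actual internals:

```python
# Weights from the scoring table above (they sum to 1.0).
WEIGHTS = {
    "content_length": 0.25, "structure": 0.20, "readability": 0.20,
    "metadata": 0.15, "media": 0.10, "links": 0.10,
}

def quality_score(factors):
    """Weighted sum of per-factor scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[k] * factors.get(k, 0.0) for k in WEIGHTS)

def quality_tier(score):
    """Map a score onto the documented thresholds."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "fair"
    return "poor"

score = quality_score({"content_length": 1.0, "structure": 0.9, "readability": 0.8,
                       "metadata": 1.0, "media": 0.5, "links": 0.5})
print(round(score, 2), quality_tier(score))  # 0.84 good
```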

🔧 Integration Examples

Python SDK

```python
from cortex import CortexClient

client = CortexClient(api_key="your_key")

# Basic extraction
result = client.extract("https://example.com/article")
print(result.content)

# Advanced extraction
result = client.extract(
    url="https://example.com/article",
    format="markdown",
    include_metadata=True,
    extract_images=True
)

# Process result
if result.quality_score > 0.7:
    print("High quality content extracted")
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
```

JavaScript SDK

```javascript
import Cortex from '@cortex/sdk';

const cortex = new Cortex({ apiKey: 'your_key' });

// Extract and process
const result = await cortex.extract({
  url: 'https://example.com/article',
  format: 'markdown',
  includeMetadata: true
});

console.log(`Extracted ${result.metadata.wordCount} words`);
console.log(result.content);
```

cURL Examples

```bash
# Basic extraction
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

# Markdown format with metadata
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "format": "markdown",
    "include_metadata": true,
    "extract_images": true
  }'
```

📈 Performance & Limits

Processing Limits

| Limit | Value | Description |
| --- | --- | --- |
| Max file size | 50MB | Per document/URL |
| Max content length | 1M chars | Extracted text limit |
| Timeout | 30 seconds | Processing timeout |
| Concurrent requests | 10 | Per API key |

Best Practices

  1. Use appropriate format for your use case
  2. Enable caching for frequently accessed URLs
  3. Process in batches for multiple URLs
  4. Check quality score before using content
  5. Handle errors gracefully with retry logic
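
Point 5 can be sketched as a generic retry wrapper with exponential backoff. This is an illustration, not SDK behavior; the attempt count and delays are assumptions (and shortened here for the demo):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call `fn`, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Simulated extract call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"success": True}

print(with_retries(flaky_extract))  # {'success': True}
```

In practice you would retry only transient failures (e.g. `URL_UNREACHABLE`) and let permanent ones such as `URL_INVALID` fail fast.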

Next: Validate API → Source verification and fact-checking