Extract API
POST /v1/extract
The Extract API returns clean, readable text content from web pages, PDFs, and documents. It strips ads, navigation, and other clutter to produce AI-ready content for RAG systems and content analysis.
🎯 Overview
Extract any web page or document into clean, structured text:
- Web pages - Remove ads, navigation, popups
- PDF documents - Extract text and metadata
- Articles - Get main content only
- Social media - Clean posts and comments
📝 Request Format
Basic Request
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
Complete Request
{
  "url": "https://example.com/long-article",
  "format": "markdown",
  "include_metadata": true,
  "clean_html": true,
  "extract_images": false,
  "options": {
    "timeout": 15,
    "user_agent": "cortex-bot",
    "follow_redirects": true,
    "extract_tables": true
  }
}
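If you are calling the endpoint without an SDK, the request body can be assembled programmatically. A minimal sketch — the helper names are illustrative, but the field names mirror the parameter tables below:

```python
API_URL = "https://api.usecortex.co/v1/extract"

def build_extract_request(url, format="text", include_metadata=True,
                          clean_html=True, extract_images=False, options=None):
    """Assemble the JSON body for POST /v1/extract."""
    body = {
        "url": url,
        "format": format,
        "include_metadata": include_metadata,
        "clean_html": clean_html,
        "extract_images": extract_images,
    }
    if options:  # nested options block, e.g. {"timeout": 15}
        body["options"] = options
    return body

def auth_headers(api_key):
    """Headers required by the endpoint."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

With any HTTP client, the call is then a POST of `build_extract_request(...)` as JSON with `auth_headers(key)`.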
📋 Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | Target URL to extract content from |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | string | "text" | Output format: text, markdown, html, json |
| include_metadata | boolean | true | Include page metadata |
| clean_html | boolean | true | Remove ads, navigation, clutter |
| extract_images | boolean | false | Include image URLs and alt text |
| extract_links | boolean | false | Include internal/external links |
| extract_tables | boolean | false | Convert tables to structured data |
📊 Response Format
Success Response
{
  "success": true,
  "data": {
    "content": "This is the clean extracted text content...",
    "metadata": {
      "title": "Article Title",
      "description": "Article description from meta tags",
      "author": "John Smith",
      "published_date": "2025-01-08T10:00:00Z",
      "language": "en",
      "word_count": 1234,
      "reading_time": "5 min read",
      "canonical_url": "https://example.com/article"
    },
    "structure": {
      "headings": [
        {"level": 1, "text": "Main Title"},
        {"level": 2, "text": "Subtitle"}
      ],
      "paragraphs": 15,
      "lists": 3,
      "tables": 1
    },
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "alt": "Image description",
        "caption": "Image caption"
      }
    ],
    "quality_score": 0.89
  },
  "metadata": {
    "request_id": "req_extract_123",
    "processing_time": 2.1,
    "extraction_method": "readability",
    "content_size": 12450
  }
}
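A small helper can pull the commonly used fields out of a successful response. This is a sketch using the field paths shown in the example above (`summarize_extraction` is an illustrative name, not part of any SDK):

```python
def summarize_extraction(response):
    """Return the fields most pipelines need from a /v1/extract
    response body, raising if the call was not successful."""
    if not response.get("success"):
        err = response.get("error", {})
        raise RuntimeError(err.get("message", "extraction failed"))
    data = response["data"]
    meta = data.get("metadata", {})
    return {
        "content": data["content"],
        "title": meta.get("title"),
        "word_count": meta.get("word_count"),
        "quality_score": data.get("quality_score"),
    }
```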
🎨 Output Formats
Text Format (Default)
{
  "url": "https://example.com/article",
  "format": "text"
}
Response:
This is the clean extracted text content without any HTML formatting. Perfect for AI processing and analysis.
The content maintains paragraph structure and readability while removing all navigation elements, advertisements, and other clutter.
Markdown Format
{
  "url": "https://example.com/article",
  "format": "markdown"
}
Response:
# Article Title
This is the **clean extracted content** formatted as Markdown. Great for documentation and structured processing.
## Subheading
- Bullet points are preserved
- Lists maintain structure
- Links are converted to [markdown format](https://example.com)
### Code blocks are maintained
def example():
    return "formatted code"
HTML Format
{
  "url": "https://example.com/article",
  "format": "html",
  "clean_html": true
}
Response:
<article>
  <h1>Article Title</h1>
  <p>This is clean HTML with only content-relevant tags preserved.</p>
  <ul>
    <li>Navigation removed</li>
    <li>Ads removed</li>
    <li>Clutter removed</li>
  </ul>
</article>
JSON Format
{
  "url": "https://example.com/article",
  "format": "json"
}
Response:
{
  "title": "Article Title",
  "content": [
    {"type": "heading", "level": 1, "text": "Article Title"},
    {"type": "paragraph", "text": "First paragraph content..."},
    {"type": "list", "items": ["Item 1", "Item 2"]},
    {"type": "heading", "level": 2, "text": "Subtitle"},
    {"type": "paragraph", "text": "Second paragraph content..."}
  ]
}
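The block list returned by the json format is straightforward to walk programmatically. For example, this sketch rebuilds Markdown from it, covering the three block types shown above:

```python
def blocks_to_markdown(blocks):
    """Render heading/paragraph/list blocks from the JSON format
    as a Markdown string."""
    parts = []
    for block in blocks:
        if block["type"] == "heading":
            # level 1 -> "#", level 2 -> "##", ...
            parts.append("#" * block["level"] + " " + block["text"])
        elif block["type"] == "paragraph":
            parts.append(block["text"])
        elif block["type"] == "list":
            parts.append("\n".join(f"- {item}" for item in block["items"]))
    return "\n\n".join(parts)
```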
📄 Supported Content Types
Web Pages
- News articles - Clean content extraction
- Blog posts - Remove sidebars and ads
- Documentation - Preserve code blocks and structure
- E-commerce - Product descriptions and specs
Documents
- PDF files - Text extraction with layout preservation
- Word documents - Convert to clean text/markdown
- Google Docs - Public document extraction
- Notion pages - Public page content
Social Media
- Twitter threads - Clean tweet compilation
- LinkedIn posts - Professional content extraction
- Reddit posts - Thread and comment extraction
- Medium articles - Clean article content
⚙️ Advanced Options
Custom Extraction Rules
{
  "url": "https://example.com/article",
  "options": {
    "custom_selectors": {
      "content": "article.main-content",
      "title": "h1.article-title",
      "exclude": [".advertisement", ".sidebar"]
    },
    "preserve_formatting": true,
    "min_content_length": 500
  }
}
Batch Processing
{
  "urls": [
    "https://example1.com/article",
    "https://example2.com/article",
    "https://example3.com/article"
  ],
  "format": "text",
  "parallel": true
}
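For large URL lists, batch request bodies can be generated client-side in chunks. A sketch — the chunk size here is an assumption for illustration, not a documented batch limit:

```python
def batch_requests(urls, format="text", parallel=True, batch_size=10):
    """Yield one batch-request body per chunk of URLs."""
    for i in range(0, len(urls), batch_size):
        yield {
            "urls": urls[i:i + batch_size],
            "format": format,
            "parallel": parallel,
        }
```

Each yielded dict is one POST body in the batch shape shown above.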
PDF Processing
{
  "url": "https://example.com/document.pdf",
  "options": {
    "extract_pages": [1, 2, 3],
    "preserve_layout": false,
    "extract_images": true,
    "ocr_enabled": true
  }
}
🚨 Error Handling
Common Errors
| Error Code | Description | Solution |
|---|---|---|
| URL_INVALID | Malformed URL | Check URL format |
| URL_UNREACHABLE | Cannot access URL | Verify URL accessibility |
| CONTENT_TOO_LARGE | Content exceeds size limit | Use pagination or filters |
| UNSUPPORTED_FORMAT | File type not supported | Check supported formats |
| EXTRACTION_FAILED | Cannot extract content | Try different extraction method |
Error Response
{
  "success": false,
  "error": {
    "code": "URL_UNREACHABLE",
    "message": "Unable to access the provided URL",
    "details": {
      "url": "https://example.com/missing",
      "http_status": 404,
      "suggestion": "Verify the URL is correct and accessible"
    }
  },
  "request_id": "req_error_extract_456"
}
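In client code it is worth distinguishing errors worth retrying from permanent ones. A sketch based on the codes in the table above — which codes count as retryable is a judgment call, not part of the API contract:

```python
# Errors that may succeed on retry (transient) vs. ones that will not.
RETRYABLE_CODES = {"URL_UNREACHABLE", "EXTRACTION_FAILED"}
PERMANENT_CODES = {"URL_INVALID", "CONTENT_TOO_LARGE", "UNSUPPORTED_FORMAT"}

def classify_error(response):
    """Return 'retry', 'fail', or 'unknown' for an error response body."""
    code = response.get("error", {}).get("code")
    if code in RETRYABLE_CODES:
        return "retry"
    if code in PERMANENT_CODES:
        return "fail"
    return "unknown"
```

A retry loop would re-issue the request (ideally with backoff) only when `classify_error` returns `"retry"`.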
💡 Use Cases
RAG System Integration
from cortex import CortexClient

client = CortexClient(api_key="your_key")

# Extract content for RAG
urls = ["https://docs.example.com/api", "https://blog.example.com/tutorial"]
for url in urls:
    result = client.extract(url, format="text")
    # Add to vector database
    vector_db.add_document(result.content, metadata=result.metadata)
Content Analysis
# Analyze content quality
result = client.extract(
    url="https://news.example.com/article",
    include_metadata=True,
    extract_links=True
)

print(f"Quality Score: {result.quality_score}")
print(f"Reading Time: {result.metadata.reading_time}")
print(f"External Links: {len(result.links.external)}")
Document Processing
# Process PDF documents
result = client.extract(
    url="https://example.com/research.pdf",
    format="markdown",
    options={
        "extract_pages": [1, 2, 3],
        "preserve_layout": True
    }
)
print(result.content)
📊 Quality Scoring
Content quality is scored based on:
| Factor | Weight | Description |
|---|---|---|
| Content Length | 25% | Adequate content volume |
| Structure | 20% | Proper heading hierarchy |
| Readability | 20% | Clean, readable text |
| Metadata | 15% | Complete title, author, date |
| Media | 10% | Relevant images and media |
| Links | 10% | Internal/external link quality |
Quality Thresholds
- 0.9-1.0 - Excellent (news articles, documentation)
- 0.7-0.9 - Good (blog posts, tutorials)
- 0.5-0.7 - Fair (forums, social media)
- 0.0-0.5 - Poor (spam, low-content pages)
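The weighted score and the threshold bands above can be reproduced directly. In this sketch only the weights and thresholds come from the tables; the per-factor score keys are illustrative:

```python
# Factor weights from the quality-scoring table (sum to 1.0).
WEIGHTS = {
    "content_length": 0.25,
    "structure": 0.20,
    "readability": 0.20,
    "metadata": 0.15,
    "media": 0.10,
    "links": 0.10,
}

def overall_quality(factor_scores):
    """Weighted sum of per-factor scores, each in [0, 1]."""
    return sum(w * factor_scores.get(k, 0.0) for k, w in WEIGHTS.items())

def quality_label(score):
    """Map a score onto the threshold bands above."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "fair"
    return "poor"
```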
🔧 Integration Examples
Python SDK
from cortex import CortexClient

client = CortexClient(api_key="your_key")

# Basic extraction
result = client.extract("https://example.com/article")
print(result.content)

# Advanced extraction
result = client.extract(
    url="https://example.com/article",
    format="markdown",
    include_metadata=True,
    extract_images=True
)

# Process result
if result.quality_score > 0.7:
    print("High quality content extracted")
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
JavaScript SDK
import Cortex from '@cortex/sdk';

const cortex = new Cortex({ apiKey: 'your_key' });

// Extract and process
const result = await cortex.extract({
  url: 'https://example.com/article',
  format: 'markdown',
  includeMetadata: true
});

console.log(`Extracted ${result.metadata.wordCount} words`);
console.log(result.content);
cURL Examples
# Basic extraction
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

# Markdown format with metadata
curl -X POST https://api.usecortex.co/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "format": "markdown",
    "include_metadata": true,
    "extract_images": true
  }'
📈 Performance & Limits
Processing Limits
| Limit | Value | Description |
|---|---|---|
| Max file size | 50MB | Per document/URL |
| Max content length | 1M chars | Extracted text limit |
| Timeout | 30 seconds | Processing timeout |
| Concurrent requests | 10 | Per API key |
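Given the 10-concurrent-requests-per-key limit, a thread pool capped at that size keeps batch jobs inside the limit. A sketch where `extract_fn` stands in for whatever client call you use:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 10  # per-API-key limit from the table above

def extract_all(urls, extract_fn, max_workers=MAX_CONCURRENT):
    """Apply extract_fn to every URL with bounded concurrency,
    returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_fn, urls))
```

For example, `extract_all(urls, lambda u: client.extract(u, format="text"))` would never have more than 10 requests in flight.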
Best Practices
- Use appropriate format for your use case
- Enable caching for frequently accessed URLs
- Process in batches for multiple URLs
- Check quality score before using content
- Handle errors gracefully with retry logic
Next: Validate API → Source verification and fact-checking