Documentation Scraping & Archives

Overview

The Architect API scrapes and archives the external documentation you list in the docURLs parameter while it generates your spec. The resulting archives give AI agents and development teams full context on the external APIs, frameworks, and services referenced in your specifications.

How It Works

Parallel Processing

Documentation scraping runs in parallel with spec generation, not sequentially. This means:

✅ No additional wait time for your specs
✅ Fast Spec: Still ~30-40 seconds
✅ Deep Spec: Still ~2-3 minutes
✅ Both processes complete simultaneously

Graceful Degradation

If documentation scraping fails:

✅ Spec generation completes normally
✅ zippedDocsUrls is returned as an empty array []
✅ No errors thrown
✅ Your workflow continues uninterrupted
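
Because a failed scrape produces an empty array rather than an error, client code can treat the archives as optional. The following is a minimal sketch, assuming the response shape shown elsewhere in this guide; `SpecResponse` and `docsStatus` are illustrative names, not part of the API.

```typescript
// Illustrative shape: only the fields this sketch reads.
interface SpecResponse {
  codingAgentSpecUrl?: string;
  zippedDocsUrls?: { platform: string; masterZipShortUrl: string }[];
}

// Summarize the documentation outcome without ever throwing.
function docsStatus(response: SpecResponse): string {
  // Treat a missing field the same as an empty array.
  const archives = response.zippedDocsUrls ?? [];
  if (archives.length === 0) {
    return "no documentation archives returned";
  }
  const platforms = archives.map((a) => a.platform).join(", ");
  return `${archives.length} archive(s): ${platforms}`;
}
```

The spec URLs in the response can be used either way; only the documentation step needs the empty-array check.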

Using the Feature

Basic Usage - Single Documentation Source

curl -X POST https://api.pre.dev/fast-spec \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "input": "Build a payment processing system with subscription management",
    "docURLs": ["https://stripe.com/docs/api"]
  }'

Advanced Usage - Multiple Documentation Sources

curl -X POST https://api.pre.dev/deep-spec \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "input": "Build enterprise healthcare platform with HL7 FHIR integration, HIPAA compliance, and payment processing",
    "docURLs": [
      "https://docs.hl7.org/fhir",
      "https://www.hhs.gov/hipaa/for-professionals/security/guidance",
      "https://stripe.com/docs/api"
    ],
    "async": true
  }'
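
When the request body is assembled in code rather than typed by hand, serializing an object avoids JSON syntax slips such as trailing commas. This is a sketch only: the field names come from the curl examples above, while `SpecRequest` and `buildSpecBody` are hypothetical helper names.

```typescript
// Field names follow the curl examples; the helper itself is hypothetical.
interface SpecRequest {
  input: string;
  docURLs?: string[];
  async?: boolean;
}

// Build the JSON body, omitting optional fields that are not needed.
function buildSpecBody(input: string, docURLs: string[] = [], runAsync = false): string {
  const body: SpecRequest = { input };
  if (docURLs.length > 0) body.docURLs = docURLs;
  if (runAsync) body.async = true;
  return JSON.stringify(body);
}
```

The returned string can be passed as the `body` of a `fetch` call to either endpoint.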

MCP Integration

When using the MCP server, simply reference documentation URLs naturally:

Use pre.dev deep_spec to generate a spec for a healthcare platform 
with HL7 FHIR integration. Reference the FHIR documentation at docs.hl7.org

The MCP server automatically handles the docURLs parameter.

Response Format

The `zippedDocsUrls` Field

interface ZippedDocsUrl {
  platform: string;              // Hostname from the documentation URL
  masterZipShortUrl: string;     // Download link to ZIP archive
  masterMarkdownShortUrl?: string; // Optional: consolidated markdown file
}
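
Since the interface exists only at compile time, a runtime check can be useful before trusting a parsed response. The sketch below assumes nothing beyond the three fields listed above; `isZippedDocsUrl` and `parseZippedDocsUrls` are illustrative names.

```typescript
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
  masterMarkdownShortUrl?: string;
}

// Runtime check that an unknown value matches the ZippedDocsUrl shape.
function isZippedDocsUrl(value: unknown): value is ZippedDocsUrl {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const markdown = v["masterMarkdownShortUrl"];
  return (
    typeof v["platform"] === "string" &&
    typeof v["masterZipShortUrl"] === "string" &&
    (markdown === undefined || typeof markdown === "string")
  );
}

// Keep only well-formed entries; anything that is not an array yields [].
function parseZippedDocsUrls(value: unknown): ZippedDocsUrl[] {
  return Array.isArray(value) ? value.filter(isZippedDocsUrl) : [];
}
```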

Example Response

{
  "endpoint": "deep_spec",
  "input": "Build enterprise healthcare platform...",
  "status": "completed",
  "success": true,
  "humanSpecUrl": "https://api.pre.dev/s/a6hFJRV6",
  "totalHumanHours": 240,
  "codingAgentSpecUrl": "https://api.pre.dev/s/a6hFJRV7",
  "executionTime": 185000,
  "predevUrl": "https://pre.dev/projects/abc123",
  "zippedDocsUrls": [
    {
      "platform": "docs.hl7.org",
      "masterZipShortUrl": "https://api.pre.dev/s/xyz789"
    },
    {
      "platform": "hhs.gov",
      "masterZipShortUrl": "https://api.pre.dev/s/abc456"
    },
    {
      "platform": "stripe.com",
      "masterZipShortUrl": "https://api.pre.dev/s/def123"
    }
  ]
}

ZIP Archive Structure

Organization

Each ZIP archive contains:

stripe.com/
├── api/
│   ├── authentication.md
│   ├── customers.md
│   ├── subscriptions.md
│   └── webhooks.md
├── payments/
│   ├── payment-intents.md
│   ├── payment-methods.md
│   └── checkout.md
└── index.md

Features

Hierarchical Structure: Mirrors the documentation site’s organization
Individual Files: Each documentation page is a separate markdown file
Easy Navigation: Folder structure makes finding specific topics simple
Complete Coverage: Includes all pages discovered during scraping
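
After extracting an archive, the folder hierarchy can be turned into a simple section index. This sketch assumes entry paths look like the tree above (a platform folder, then optional section folders, then a markdown file); `groupBySection` is an illustrative helper, and real archives may be laid out differently.

```typescript
// Group archive entry paths by their top-level section folder.
// Assumes paths such as "stripe.com/api/authentication.md".
function groupBySection(paths: string[]): Record<string, string[]> {
  const sections: Record<string, string[]> = {};
  for (const path of paths) {
    // Drop the leading platform folder, e.g. "stripe.com".
    const parts = path.split("/").slice(1);
    // Files directly under the platform folder go into "(root)".
    const section = parts.length > 1 ? parts[0] : "(root)";
    const file = parts[parts.length - 1];
    (sections[section] = sections[section] || []).push(file);
  }
  return sections;
}
```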

Platform Identification

The platform field is extracted from the hostname of the documentation URL. As the examples show, a leading www. is dropped and other subdomains are kept:

Documentation URL	Extracted Platform
`https://stripe.com/docs/api`	`stripe.com`
`https://docs.github.com/en/rest`	`docs.github.com`
`https://docs.hl7.org/fhir`	`docs.hl7.org`
`https://www.hhs.gov/hipaa/guidance`	`hhs.gov`
`https://developer.mozilla.org/docs`	`developer.mozilla.org`
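
The mapping in the table can be approximated on the client, for example to predict which platform value to look for in a response. This is a sketch under one assumption inferred from the table rows: the platform is the hostname with a single leading `www.` removed. The service's actual extraction logic is not documented here, and `platformFromUrl` is an illustrative name.

```typescript
// Assumption: platform = URL hostname minus one leading "www.",
// which reproduces every row of the table above.
function platformFromUrl(docUrl: string): string {
  const hostname = new URL(docUrl).hostname;
  return hostname.replace(/^www\./, "");
}

// platformFromUrl("https://www.hhs.gov/hipaa/guidance") returns "hhs.gov"
// platformFromUrl("https://docs.github.com/en/rest") returns "docs.github.com"
```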

Supported Domain Formats

The system handles various TLDs and domain formats:

Common TLDs: .com, .org, .net, .io, .dev, .ai
Country TLDs: .co.uk, .com.au, .de, .fr, etc.
New TLDs: .cloud, .tech, .app, .digital, etc.
Subdomains: docs.example.com, api.example.com, etc.

Best Practices

What to Include

API Documentation

Official API references for integrations:

Stripe, Square, PayPal (payments)
Auth0, Okta, Firebase (authentication)
AWS, GCP, Azure (infrastructure)
Twilio, SendGrid (communications)

Framework Documentation

Official framework and library docs:

React, Vue, Angular
Next.js, Nuxt, SvelteKit
Express, Fastify, NestJS
TailwindCSS, Bootstrap

Compliance Documentation

Regulatory and compliance guides:

HIPAA guidelines
GDPR compliance docs
PCI DSS requirements
SOC 2 standards

Internal Documentation

Your organization’s documentation:

Internal API specifications
Architecture decision records
Design systems and style guides
Security policies

What to Avoid

❌ Marketing Pages: Sales content doesn’t help with implementation
❌ Blog Posts: Use official documentation instead
❌ Deprecated Docs: Ensure documentation is current
❌ Duplicate Domains: The system handles these automatically, but avoid adding them yourself
❌ Non-Documentation URLs: Social media, forums, etc.

Use Cases

Healthcare Platform Development

{
  "input": "Build HIPAA-compliant telemedicine platform with HL7 FHIR integration, appointment scheduling, and secure video calls",
  "docURLs": [
    "https://docs.hl7.org/fhir",
    "https://www.hhs.gov/hipaa/for-professionals/security/guidance",
    "https://www.twilio.com/docs/video",
    "https://stripe.com/docs/api"
  ],
  "async": true
}

Result: Four separate documentation archives:

HL7 FHIR implementation guides
HIPAA security and compliance guidance
Twilio video API documentation
Stripe payment processing docs

E-Commerce Platform

{
  "input": "Build multi-vendor e-commerce marketplace with Stripe Connect, real-time inventory, and automated shipping",
  "docURLs": [
    "https://stripe.com/docs/connect",
    "https://stripe.com/docs/api",
    "https://www.shippo.com/docs/reference",
    "https://docs.sentry.io"
  ]
}

SaaS Application

{
  "input": "Build SaaS analytics platform with SSO, real-time dashboards, and data exports",
  "docURLs": [
    "https://auth0.com/docs/api",
    "https://docs.aws.amazon.com/s3",
    "https://docs.stripe.com/billing",
    "https://www.chartjs.org/docs"
  ]
}

Microservices Architecture

{
  "input": "Build microservices platform with Kubernetes, service mesh, and observability",
  "docURLs": [
    "https://kubernetes.io/docs",
    "https://istio.io/docs",
    "https://prometheus.io/docs",
    "https://grafana.com/docs"
  ],
  "async": true
}

Accessing Documentation Archives

For Enterprise Users

Navigate to Dashboard: Visit https://pre.dev/enterprise/dashboard?page=api
Open API Usage Logs: Click on the “API Usage Logs” tab
Select Request: Click any API call to open the log details modal
View Archives: Scroll to “Documentation Archives” section
Download: Click download links for each platform’s ZIP archive

For Solo Users

Documentation archives are included in the API response:

const response = await fetch('https://api.pre.dev/fast-spec', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: JSON.stringify({
    input: "Build payment system",
    docURLs: ["https://stripe.com/docs/api"]
  })
});

const result = await response.json();

// Print the download link for each archive
result.zippedDocsUrls.forEach(archive => {
  console.log(`${archive.platform}: ${archive.masterZipShortUrl}`);
});

Integration Workflows

AI Agent Implementation

Spec Generation: Request spec with relevant documentation URLs
Archive Download: AI agent downloads documentation archives
Context Loading: Agent loads documentation into context
Implementation: Agent builds features with complete documentation reference
Validation: Agent verifies implementation against documentation

Team Collaboration

Spec Generation: Tech lead generates spec with documentation
Archive Distribution: Share documentation archives with team
Offline Reference: Team members download for offline access
Consistent Context: Everyone works from the same documentation version
Implementation: Team builds with aligned understanding

CI/CD Pipeline

# Example GitHub Actions workflow
- name: Generate Spec with Documentation
  env:
    ISSUE_BODY: ${{ github.event.issue.body }}
  run: |
    # Build the JSON with jq so quotes and newlines in the issue body are escaped
    jq -n --arg input "$ISSUE_BODY" \
      '{input: $input, docURLs: ["https://stripe.com/docs/api"]}' |
    curl -X POST https://api.pre.dev/deep-spec \
      -H "Authorization: Bearer ${{ secrets.PREDEV_API_KEY }}" \
      -H "Content-Type: application/json" \
      --data-binary @- > spec-response.json

- name: Download Documentation Archives
  run: |
    jq -r '.zippedDocsUrls[].masterZipShortUrl' spec-response.json | \
      xargs -I {} wget {} -P ./docs-archives/

Error Handling

Partial Failures

If some documentation URLs fail to scrape:

{
  "codingAgentSpecUrl": "https://api.pre.dev/s/a6hFJRV6",
  "zippedDocsUrls": [
    {
      "platform": "stripe.com",
      "masterZipShortUrl": "https://api.pre.dev/s/xyz789"
    }
  ]
}

Only successfully scraped platforms appear in the response. Failed platforms are silently omitted.
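
Because failed platforms are omitted without an error, the only way to notice them is to compare what was requested with what came back. A minimal sketch, assuming the platform value is the hostname with a leading `www.` removed (as in the examples in this guide); `missingPlatforms` is an illustrative helper.

```typescript
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
}

// Assumption: platform = hostname minus a leading "www.".
function platformFromUrl(docUrl: string): string {
  return new URL(docUrl).hostname.replace(/^www\./, "");
}

// List requested platforms that have no archive in the response.
function missingPlatforms(requested: string[], returned: ZippedDocsUrl[]): string[] {
  const got = returned.map((a) => a.platform);
  const missing: string[] = [];
  for (const url of requested) {
    const platform = platformFromUrl(url);
    // Report each missing platform once, even if several URLs share it.
    if (got.indexOf(platform) === -1 && missing.indexOf(platform) === -1) {
      missing.push(platform);
    }
  }
  return missing;
}
```

A non-empty result is a signal to retry with a different entry URL for those platforms.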

Complete Failure

If all documentation scraping fails:

{
  "codingAgentSpecUrl": "https://api.pre.dev/s/a6hFJRV6",
  "zippedDocsUrls": []
}

Spec generation completes normally with an empty zippedDocsUrls array.

Invalid URLs

Invalid or inaccessible URLs are logged but don’t prevent processing:

URLs are validated before scraping
Invalid formats are skipped
Inaccessible sites are skipped
Spec generation continues with available documentation
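
Since skipped URLs are not reported in the response, a format check before sending the request can surface obvious mistakes early. This sketch only tests that each entry parses as an http or https URL; the service's own validation rules are not documented here and may differ. `partitionDocUrls` is an illustrative name.

```typescript
// Split candidate URLs into well-formed http(s) URLs and everything else.
function partitionDocUrls(urls: string[]): { valid: string[]; invalid: string[] } {
  const valid: string[] = [];
  const invalid: string[] = [];
  for (const candidate of urls) {
    try {
      // The URL constructor throws on strings it cannot parse.
      const parsed = new URL(candidate);
      if (parsed.protocol === "https:" || parsed.protocol === "http:") {
        valid.push(candidate);
      } else {
        invalid.push(candidate);
      }
    } catch {
      invalid.push(candidate);
    }
  }
  return { valid, invalid };
}
```

A URL that passes this check can still be skipped by the scraper if the site is unreachable or blocks automated access.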

Performance Characteristics

Processing Time

Documentation scraping is parallelized:

Spec Type	With Documentation	Without Documentation
Fast Spec	~30-40 seconds	~30-40 seconds
Deep Spec	~2-3 minutes	~2-3 minutes

No additional wait time because scraping runs in parallel.

Archive Sizes

Typical documentation archive sizes:

Documentation Source	Approximate Size
Stripe API Docs	5-15 MB
React Documentation	10-20 MB
AWS Service Docs	20-50 MB
HIPAA Guidelines	2-5 MB

Archives are compressed ZIP files for efficient download.

Rate Limits

Documentation scraping respects the same rate limits as spec generation:

No additional API calls consumed
Same concurrent request limits apply
Fair use policies remain unchanged

Advanced Features

Consolidated Markdown

Some archives include a masterMarkdownShortUrl:

{
  "platform": "stripe.com",
  "masterZipShortUrl": "https://api.pre.dev/s/xyz789",
  "masterMarkdownShortUrl": "https://api.pre.dev/s/master-abc123"
}

This provides a single markdown file containing all documentation from that platform, useful for:

Quick searching
AI context loading
Documentation review
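
Because masterMarkdownShortUrl is optional, a consumer needs a fallback to the ZIP archive. One possible selection rule is sketched below; `contextSource` is an illustrative helper, and preferring the markdown file is a design choice rather than a documented recommendation.

```typescript
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
  masterMarkdownShortUrl?: string;
}

// Prefer the consolidated markdown file when present, else use the ZIP.
function contextSource(archive: ZippedDocsUrl): { url: string; kind: "markdown" | "zip" } {
  return archive.masterMarkdownShortUrl
    ? { url: archive.masterMarkdownShortUrl, kind: "markdown" }
    : { url: archive.masterZipShortUrl, kind: "zip" };
}
```

The single markdown file avoids an unzip step when loading documentation into an agent's context; the ZIP remains the choice when per-page files are needed.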

Version Tracking

Archives capture documentation at the time of spec generation:

Consistent Context: Same documentation version for entire team
Historical Reference: Documentation state preserved
Regression Prevention: Changes in docs don’t affect implementation

Custom Documentation

The system works with any accessible documentation:

Public APIs: Any publicly accessible documentation
Internal APIs: Internal documentation, provided the scraper can reach it (see the FAQ for authentication limits)
Custom Formats: Works with various documentation site structures

Troubleshooting

Empty zippedDocsUrls Array

Possible causes:

No docURLs parameter provided
All documentation URLs failed to scrape
URLs were inaccessible or invalid

Solution:

Verify URLs are accessible in a browser
Check URL format is correct
Ensure sites don’t block automated access
Try alternative documentation URLs

Missing Expected Platform

Possible causes:

URL failed to scrape
URL was invalid
Site blocked access

Solution:

Check the URL manually
Try a different page from the same documentation
Use the site’s main documentation URL

Download Links Not Working

Possible causes:

Links may have expired (rare)
Network connectivity issues

Solution:

Copy the URL and try in a different browser
Check network connectivity
Re-generate spec if links are old

Archive Contents Incomplete

Possible causes:

Documentation site structure limited scraping
Some pages were inaccessible
Rate limiting from documentation site

Solution:

Provide multiple specific URLs from the documentation
Try different entry points into the documentation
Use the site’s documentation index or API reference page

FAQ

Does documentation scraping cost extra credits?

No. Documentation scraping is included at no additional cost; you pay only the usual spec credits:

Fast Spec: ~5-10 credits (variable based on complexity)
Deep Spec: ~10-50 credits (variable based on complexity)

How many documentation URLs can I provide?

Recommended: 3-5 URLs per request
Technical limit: No hard limit, but we recommend keeping it focused
Best practice: Provide specific, relevant documentation pages rather than entire documentation sites

Can I use internal/private documentation?

Yes, with limitations:

Public documentation: ✅ Always works
Password-protected: ❌ Not supported currently
API key authenticated: ⚠️ May work depending on implementation
VPN-required: ❌ Not supported

For internal docs: Consider providing documentation as file uploads instead.

How long are download links valid?

Download links are long-lived (typically 1+ year) and designed to remain accessible.

If you need permanent storage:

Download archives immediately after generation
Store in your own infrastructure
Re-generate specs if links expire (rare)

Can I get raw HTML instead of markdown?

Currently: Archives contain markdown files only
Future: Raw HTML and other formats may be added based on user feedback

Markdown benefits:

Cleaner, more readable format
Better for AI agent consumption
Smaller file sizes
Easy to parse and search

Feedback & Feature Requests

We’re continuously improving documentation scraping. Share your feedback:

Feature requests: Contact support through your dashboard
Bug reports: Report issues via enterprise support
Use cases: Share how you’re using this feature to help us improve

Need Help?

Visit your dashboard for support and additional resources
