Documentation Scraping & Archives

Overview

The Architect API now automatically scrapes and archives external documentation alongside spec generation. The archives give AI agents and development teams the context they need about the external APIs, frameworks, and services referenced in your specifications.

How It Works

Parallel Processing

Documentation scraping runs in parallel with spec generation, not sequentially. This means:
  • ✅ No additional wait time for your specs
  • ✅ Fast Spec: Still ~30-40 seconds
  • ✅ Deep Spec: Still ~2-3 minutes
  • ✅ Both processes complete simultaneously

Graceful Degradation

If documentation scraping fails:
  • ✅ Spec generation completes normally
  • zippedDocsUrls is returned as an empty array []
  • ✅ No errors thrown
  • ✅ Your workflow continues uninterrupted
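
The two behaviors described above (parallel execution and graceful degradation) can be modeled in a few lines. This is only an illustrative sketch, not the Architect API's actual implementation: generateSpec and scrapeDocs are hypothetical stand-ins supplied by the caller. Both tasks are started before either is awaited, and a scraping failure is turned into an empty array instead of an error.

```typescript
// Illustrative model only; generateSpec and scrapeDocs are hypothetical stand-ins.
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
  masterMarkdownShortUrl?: string;
}

interface SpecResult {
  codingAgentSpecUrl: string;
  zippedDocsUrls: ZippedDocsUrl[];
}

async function runSpecWithDocs(
  generateSpec: () => Promise<string>,
  scrapeDocs: () => Promise<ZippedDocsUrl[]>
): Promise<SpecResult> {
  // Start both tasks before awaiting either, so they run concurrently.
  const specPromise = generateSpec();
  // A scraping failure degrades to an empty array rather than rejecting.
  const docsPromise = scrapeDocs().catch((): ZippedDocsUrl[] => []);

  const [codingAgentSpecUrl, zippedDocsUrls] = await Promise.all([specPromise, docsPromise]);
  return { codingAgentSpecUrl, zippedDocsUrls };
}
```

Only the scraping promise gets a fallback. A failure in spec generation still rejects, since the spec itself is the primary result.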

Using the Feature

Basic Usage - Single Documentation Source

curl -X POST https://api.pre.dev/fast-spec \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "input": "Build a payment processing system with subscription management",
    "docURLs": ["https://stripe.com/docs/api"]
  }'

Advanced Usage - Multiple Documentation Sources

curl -X POST https://api.pre.dev/deep-spec \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "input": "Build enterprise healthcare platform with HL7 FHIR integration, HIPAA compliance, and payment processing",
    "docURLs": [
      "https://docs.hl7.org/fhir",
      "https://www.hhs.gov/hipaa/for-professionals/security/guidance",
      "https://stripe.com/docs/api"
    ],
    "async": true
  }'
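
For callers who prefer typed code over hand-written JSON, the two curl requests above could be assembled as follows. This is a minimal sketch: the field names (input, docURLs, async) and endpoint paths are taken from the examples above, while buildSpecRequest, SpecRequestBody, and PreparedRequest are hypothetical names introduced here for illustration.

```typescript
type SpecEndpoint = "fast-spec" | "deep-spec";

// Request fields as they appear in the curl examples.
interface SpecRequestBody {
  input: string;
  docURLs?: string[];
  async?: boolean;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Hypothetical helper: assembles the URL, headers, and serialized body.
function buildSpecRequest(
  endpoint: SpecEndpoint,
  apiKey: string,
  body: SpecRequestBody
): PreparedRequest {
  return {
    url: "https://api.pre.dev/" + endpoint,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + apiKey,
    },
    // JSON.stringify always emits syntactically valid JSON.
    body: JSON.stringify(body),
  };
}

const deepSpecRequest = buildSpecRequest("deep-spec", "YOUR_API_KEY", {
  input: "Build enterprise healthcare platform with HL7 FHIR integration",
  docURLs: ["https://docs.hl7.org/fhir", "https://stripe.com/docs/api"],
  async: true,
});
```

The result can be passed to fetch as fetch(deepSpecRequest.url, { method: deepSpecRequest.method, headers: deepSpecRequest.headers, body: deepSpecRequest.body }).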

MCP Integration

When using the MCP server, simply reference documentation URLs naturally:
Use pre.dev deep_spec to generate a spec for a healthcare platform 
with HL7 FHIR integration. Reference the FHIR documentation at docs.hl7.org
The MCP server automatically handles the docURLs parameter.

Response Format

The zippedDocsUrls Field

interface ZippedDocsUrl {
  platform: string;              // Hostname from the documentation URL
  masterZipShortUrl: string;     // Download link to ZIP archive
  masterMarkdownShortUrl?: string; // Optional: consolidated markdown file
}

Example Response

{
  "endpoint": "deep_spec",
  "input": "Build enterprise healthcare platform...",
  "status": "completed",
  "success": true,
  "humanSpecUrl": "https://api.pre.dev/s/a6hFJRV6",
  "totalHumanHours": 240,
  "codingAgentSpecUrl": "https://api.pre.dev/s/a6hFJRV7",
  "executionTime": 185000,
  "predevUrl": "https://pre.dev/projects/abc123",
  "zippedDocsUrls": [
    {
      "platform": "docs.hl7.org",
      "masterZipShortUrl": "https://api.pre.dev/s/xyz789"
    },
    {
      "platform": "hhs.gov",
      "masterZipShortUrl": "https://api.pre.dev/s/abc456"
    },
    {
      "platform": "stripe.com",
      "masterZipShortUrl": "https://api.pre.dev/s/def123"
    }
  ]
}
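
A client may want to read zippedDocsUrls defensively rather than trust the response shape. The sketch below validates each entry against the ZippedDocsUrl interface shown earlier; extractArchives and isZippedDocsUrl are hypothetical helper names, not part of any pre.dev SDK.

```typescript
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
  masterMarkdownShortUrl?: string;
}

// Type guard: checks the two required fields and the optional one.
function isZippedDocsUrl(value: unknown): value is ZippedDocsUrl {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { [key: string]: unknown };
  const markdown = v["masterMarkdownShortUrl"];
  return (
    typeof v["platform"] === "string" &&
    typeof v["masterZipShortUrl"] === "string" &&
    (markdown === undefined || typeof markdown === "string")
  );
}

// Returns the well-formed archive entries, or [] when the field is
// missing, not an array, or the response is not an object.
function extractArchives(response: unknown): ZippedDocsUrl[] {
  if (typeof response !== "object" || response === null) return [];
  const docs = (response as { zippedDocsUrls?: unknown }).zippedDocsUrls;
  if (!Array.isArray(docs)) return [];
  return docs.filter(isZippedDocsUrl);
}
```

Called with the parsed example response above, extractArchives returns the three entries for docs.hl7.org, hhs.gov, and stripe.com.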

ZIP Archive Structure

Organization

Each ZIP archive contains:
stripe.com/
├── api/
│   ├── authentication.md
│   ├── customers.md
│   ├── subscriptions.md
│   └── webhooks.md
├── payments/
│   ├── payment-intents.md
│   ├── payment-methods.md
│   └── checkout.md
└── index.md

Features

  • Hierarchical Structure: Mirrors the documentation site’s organization
  • Individual Files: Each documentation page is a separate markdown file
  • Easy Navigation: Folder structure makes finding specific topics simple
  • Complete Coverage: Includes all pages discovered during scraping
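
Because the archive mirrors the site's folder layout, a flat list of entry paths can be turned into a per-section index. The sketch below assumes paths shaped like the tree above ("<platform>/<section>/<page>.md") and that you already have the entry list from your unzip tool of choice; groupBySection is a hypothetical helper.

```typescript
// Groups archive entry paths by their first folder below the platform root.
function groupBySection(paths: string[]): { [section: string]: string[] } {
  const sections: { [section: string]: string[] } = {};
  for (const path of paths) {
    const parts = path.split("/").filter((part) => part.length > 0);
    if (parts.length < 2) continue; // only the platform folder itself, no file
    // Files directly under the platform folder go into a "(root)" bucket.
    const isRootFile = parts.length === 2;
    const section = isRootFile ? "(root)" : parts[1];
    const file = parts.slice(isRootFile ? 1 : 2).join("/");
    if (!sections[section]) sections[section] = [];
    sections[section].push(file);
  }
  return sections;
}

const archiveIndex = groupBySection([
  "stripe.com/api/authentication.md",
  "stripe.com/api/webhooks.md",
  "stripe.com/payments/checkout.md",
  "stripe.com/index.md",
]);
// archiveIndex["api"] is ["authentication.md", "webhooks.md"]
```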

Platform Identification

The platform field is extracted from the hostname of the documentation URL. As the examples show, a leading www. is dropped while other subdomains are kept:
Documentation URL                       Extracted Platform
https://stripe.com/docs/api             stripe.com
https://docs.github.com/en/rest         docs.github.com
https://docs.hl7.org/fhir               docs.hl7.org
https://www.hhs.gov/hipaa/guidance      hhs.gov
https://developer.mozilla.org/docs      developer.mozilla.org
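
The mapping in the table can be reproduced on the client, for example to predict which platform value an archive will carry. The rule below (take the hostname, drop a leading "www.") is inferred from the examples in the table and is an assumption about the service's behavior; platformFromUrl is a hypothetical helper.

```typescript
// Inferred rule: hostname of the URL, minus a leading "www.".
function platformFromUrl(docUrl: string): string {
  // The URL constructor throws a TypeError on strings it cannot parse.
  const hostname = new URL(docUrl).hostname;
  return hostname.indexOf("www.") === 0 ? hostname.slice(4) : hostname;
}

platformFromUrl("https://stripe.com/docs/api");        // "stripe.com"
platformFromUrl("https://www.hhs.gov/hipaa/guidance"); // "hhs.gov"
platformFromUrl("https://docs.github.com/en/rest");    // "docs.github.com"
```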

Supported Domain Formats

The system handles various TLDs and domain formats:
  • Common TLDs: .com, .org, .net, .io, .dev, .ai
  • Country TLDs: .co.uk, .com.au, .de, .fr, etc.
  • New TLDs: .cloud, .tech, .app, .digital, etc.
  • Subdomains: docs.example.com, api.example.com, etc.

Best Practices

What to Include

API Documentation

Official API references for integrations:
  • Stripe, Square, PayPal (payments)
  • Auth0, Okta, Firebase (authentication)
  • AWS, GCP, Azure (infrastructure)
  • Twilio, SendGrid (communications)

Framework Documentation

Official framework and library docs:
  • React, Vue, Angular
  • Next.js, Nuxt, SvelteKit
  • Express, Fastify, NestJS
  • TailwindCSS, Bootstrap

Compliance Documentation

Regulatory and compliance guides:
  • HIPAA guidelines
  • GDPR compliance docs
  • PCI DSS requirements
  • SOC 2 standards

Internal Documentation

Your organization’s documentation:
  • Internal API specifications
  • Architecture decision records
  • Design systems and style guides
  • Security policies

What to Avoid

  • ❌ Marketing Pages: Sales content doesn’t help with implementation
  • ❌ Blog Posts: Use official documentation instead
  • ❌ Deprecated Docs: Ensure documentation is current
  • ❌ Duplicate Domains: The system handles duplicates automatically, but avoid submitting them manually
  • ❌ Non-Documentation URLs: Social media, forums, etc.

Use Cases

Healthcare Platform Development

{
  "input": "Build HIPAA-compliant telemedicine platform with HL7 FHIR integration, appointment scheduling, and secure video calls",
  "docURLs": [
    "https://docs.hl7.org/fhir",
    "https://www.hhs.gov/hipaa/for-professionals/security/guidance",
    "https://www.twilio.com/docs/video",
    "https://stripe.com/docs/api"
  ],
  "async": true
}
Result: Four separate documentation archives:
  • HL7 FHIR implementation guides
  • HIPAA security and compliance guidance
  • Twilio video API documentation
  • Stripe payment processing docs

E-Commerce Platform

{
  "input": "Build multi-vendor e-commerce marketplace with Stripe Connect, real-time inventory, and automated shipping",
  "docURLs": [
    "https://stripe.com/docs/connect",
    "https://stripe.com/docs/api",
    "https://www.shippo.com/docs/reference",
    "https://docs.sentry.io"
  ]
}

SaaS Application

{
  "input": "Build SaaS analytics platform with SSO, real-time dashboards, and data exports",
  "docURLs": [
    "https://auth0.com/docs/api",
    "https://docs.aws.amazon.com/s3",
    "https://docs.stripe.com/billing",
    "https://www.chartjs.org/docs"
  ]
}

Microservices Architecture

{
  "input": "Build microservices platform with Kubernetes, service mesh, and observability",
  "docURLs": [
    "https://kubernetes.io/docs",
    "https://istio.io/docs",
    "https://prometheus.io/docs",
    "https://grafana.com/docs"
  ],
  "async": true
}

Accessing Documentation Archives

For Enterprise Users

  1. Navigate to Dashboard: Visit https://pre.dev/enterprise/dashboard?page=api
  2. Open API Usage Logs: Click on the “API Usage Logs” tab
  3. Select Request: Click any API call to open the log details modal
  4. View Archives: Scroll to “Documentation Archives” section
  5. Download: Click download links for each platform’s ZIP archive

For Solo Users

Documentation archives are included in the API response:
const response = await fetch('https://api.pre.dev/fast-spec', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: JSON.stringify({
    input: "Build payment system",
    docURLs: ["https://stripe.com/docs/api"]
  })
});

const result = await response.json();

// Print the download link for each archive
result.zippedDocsUrls.forEach(archive => {
  console.log(`${archive.platform}: ${archive.masterZipShortUrl}`);
});

Integration Workflows

AI Agent Implementation

  1. Spec Generation: Request spec with relevant documentation URLs
  2. Archive Download: AI agent downloads documentation archives
  3. Context Loading: Agent loads documentation into context
  4. Implementation: Agent builds features with complete documentation reference
  5. Validation: Agent verifies implementation against documentation

Team Collaboration

  1. Spec Generation: Tech lead generates spec with documentation
  2. Archive Distribution: Share documentation archives with team
  3. Offline Reference: Team members download for offline access
  4. Consistent Context: Everyone works from the same documentation version
  5. Implementation: Team builds with aligned understanding

CI/CD Pipeline

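# Example GitHub Actions workflow
- name: Generate Spec with Documentation
  env:
    # Pass the issue body through the environment so quotes and newlines
    # in it cannot break the JSON payload or the shell command
    ISSUE_BODY: ${{ github.event.issue.body }}
  run: |
    # jq builds valid JSON and escapes the issue text
    jq -n --arg input "$ISSUE_BODY" \
      '{input: $input, docURLs: ["https://stripe.com/docs/api"]}' | \
    curl -X POST https://api.pre.dev/deep-spec \
      -H "Authorization: Bearer ${{ secrets.PREDEV_API_KEY }}" \
      -H "Content-Type: application/json" \
      --data-binary @- > spec-response.json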

- name: Download Documentation Archives
  run: |
    jq -r '.zippedDocsUrls[].masterZipShortUrl' spec-response.json | \
      xargs -I {} wget {} -P ./docs-archives/

Error Handling

Partial Failures

If some documentation URLs fail to scrape:
{
  "codingAgentSpecUrl": "https://api.pre.dev/s/a6hFJRV6",
  "zippedDocsUrls": [
    {
      "platform": "stripe.com",
      "masterZipShortUrl": "https://api.pre.dev/s/xyz789"
    }
  ]
}
Only successfully scraped platforms appear in the response. Failed platforms are silently omitted.
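
Since failed platforms are omitted without an error, a client that cares about coverage has to compare what it requested with what came back. A minimal sketch, assuming the platform value is the URL's hostname with a leading "www." removed (as in the Platform Identification examples); findMissingPlatforms and platformOf are hypothetical helpers.

```typescript
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
  masterMarkdownShortUrl?: string;
}

// Assumed rule: hostname minus a leading "www."; falls back to the raw
// string when the URL cannot be parsed.
function platformOf(docUrl: string): string {
  try {
    const hostname = new URL(docUrl).hostname;
    return hostname.indexOf("www.") === 0 ? hostname.slice(4) : hostname;
  } catch {
    return docUrl;
  }
}

// Returns the requested platforms that have no archive in the response.
function findMissingPlatforms(docURLs: string[], archives: ZippedDocsUrl[]): string[] {
  const returned = archives.map((archive) => archive.platform);
  const missing: string[] = [];
  for (const url of docURLs) {
    const platform = platformOf(url);
    if (returned.indexOf(platform) === -1 && missing.indexOf(platform) === -1) {
      missing.push(platform);
    }
  }
  return missing;
}
```

For a request listing the FHIR and Stripe URLs and the partial response above, findMissingPlatforms returns ["docs.hl7.org"], which a caller could log or retry.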

Complete Failure

If all documentation scraping fails:
{
  "codingAgentSpecUrl": "https://api.pre.dev/s/a6hFJRV6",
  "zippedDocsUrls": []
}
Spec generation completes normally with an empty zippedDocsUrls array.

Invalid URLs

Invalid or inaccessible URLs are logged but don’t prevent processing:
  • URLs are validated before scraping
  • Invalid formats are skipped
  • Inaccessible sites are skipped
  • Spec generation continues with available documentation
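
Because invalid URLs are skipped rather than reported in the response, it can help to check the list on the client before sending it. The sketch below is a local pre-check only: the service's own validation rules are not documented here, and treating only http and https URLs as valid is an assumption. partitionDocUrls is a hypothetical helper.

```typescript
interface UrlCheck {
  valid: string[];
  invalid: string[];
}

// Splits candidate docURLs into parseable http(s) URLs and everything else.
function partitionDocUrls(docURLs: string[]): UrlCheck {
  const result: UrlCheck = { valid: [], invalid: [] };
  for (const raw of docURLs) {
    let ok = false;
    try {
      const protocol = new URL(raw).protocol;
      ok = protocol === "https:" || protocol === "http:";
    } catch {
      ok = false; // new URL throws on strings it cannot parse
    }
    if (ok) {
      result.valid.push(raw);
    } else {
      result.invalid.push(raw);
    }
  }
  return result;
}

const checked = partitionDocUrls([
  "https://stripe.com/docs/api",
  "stripe.com/docs/api", // missing scheme, so it cannot be parsed
]);
// checked.valid is ["https://stripe.com/docs/api"]
// checked.invalid is ["stripe.com/docs/api"]
```

This catches format problems only. Whether a site is reachable or blocks automated access can still only be observed from the response.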

Performance Characteristics

Processing Time

Documentation scraping is parallelized:
Spec Type    With Documentation    Without Documentation
Fast Spec    ~30-40 seconds        ~30-40 seconds
Deep Spec    ~2-3 minutes          ~2-3 minutes
No additional wait time because scraping runs in parallel.

Archive Sizes

Typical documentation archive sizes:
Documentation Source    Approximate Size
Stripe API Docs         5-15 MB
React Documentation     10-20 MB
AWS Service Docs        20-50 MB
HIPAA Guidelines        2-5 MB
Archives are compressed ZIP files for efficient download.

Rate Limits

Documentation scraping respects the same rate limits as spec generation:
  • No additional API calls consumed
  • Same concurrent request limits apply
  • Fair use policies remain unchanged

Advanced Features

Consolidated Markdown

Some archives include a masterMarkdownShortUrl:
{
  "platform": "stripe.com",
  "masterZipShortUrl": "https://api.pre.dev/s/xyz789",
  "masterMarkdownShortUrl": "https://api.pre.dev/s/master-abc123"
}
This provides a single markdown file containing all documentation from that platform, useful for:
  • Quick searching
  • AI context loading
  • Documentation review
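
Since masterMarkdownShortUrl is optional, code that loads documentation into an AI agent's context needs a fallback. One possible approach, sketched below, prefers the consolidated markdown file and falls back to the ZIP archive; pickContextSources and ContextSource are hypothetical names.

```typescript
interface ZippedDocsUrl {
  platform: string;
  masterZipShortUrl: string;
  masterMarkdownShortUrl?: string;
}

interface ContextSource {
  platform: string;
  url: string;
  kind: "markdown" | "zip";
}

// Prefer the single consolidated markdown file when the archive has one;
// otherwise fall back to the ZIP, which must be unpacked before use.
function pickContextSources(archives: ZippedDocsUrl[]): ContextSource[] {
  return archives.map((archive): ContextSource =>
    archive.masterMarkdownShortUrl
      ? { platform: archive.platform, url: archive.masterMarkdownShortUrl, kind: "markdown" }
      : { platform: archive.platform, url: archive.masterZipShortUrl, kind: "zip" }
  );
}
```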

Version Tracking

Archives capture documentation at the time of spec generation:
  • Consistent Context: Same documentation version for entire team
  • Historical Reference: Documentation state preserved
  • Regression Prevention: Changes in docs don’t affect implementation

Custom Documentation

The system works with any accessible documentation:
  • Public APIs: Any publicly accessible documentation
  • Internal APIs: Documentation behind authentication (when accessible)
  • Custom Formats: Works with various documentation site structures

Troubleshooting

zippedDocsUrls is empty

Possible causes:
  1. No docURLs parameter provided
  2. All documentation URLs failed to scrape
  3. URLs were inaccessible or invalid
Solution:
  • Verify URLs are accessible in a browser
  • Check URL format is correct
  • Ensure sites don’t block automated access
  • Try alternative documentation URLs

A platform is missing from zippedDocsUrls

Possible causes:
  1. URL failed to scrape
  2. URL was invalid
  3. Site blocked access
Solution:
  • Check the URL manually
  • Try a different page from the same documentation
  • Use the site’s main documentation URL

An archive is incomplete

Possible causes:
  1. Documentation site structure limited scraping
  2. Some pages were inaccessible
  3. Rate limiting from documentation site
Solution:
  • Provide multiple specific URLs from the documentation
  • Try different entry points into the documentation
  • Use the site’s documentation index or API reference page

FAQ

Does documentation scraping cost extra credits?

No. Documentation scraping is included at no additional cost:
  • Fast Spec: 10 credits (same as before)
  • Deep Spec: 50 credits (same as before)

How many documentation URLs can I provide?

  • Recommended: 3-5 URLs per request
  • Technical limit: No hard limit, but we recommend keeping the list focused
  • Best practice: Provide specific, relevant documentation pages rather than entire documentation sites

Can I use private or internal documentation?

Yes, with limitations:
  • Public documentation: ✅ Always works
  • Password-protected: ❌ Not supported currently
  • API key authenticated: ⚠️ May work depending on implementation
  • VPN-required: ❌ Not supported
For internal docs: Consider providing documentation as file uploads instead.

What formats do the archives contain?

  • Currently: Archives contain markdown files only
  • Future: Raw HTML and other formats may be added based on user feedback

Markdown benefits:
  • Cleaner, more readable format
  • Better for AI agent consumption
  • Smaller file sizes
  • Easy to parse and search

Feedback & Feature Requests

We’re continuously improving documentation scraping. Share your feedback:
  • Feature requests: Contact support through your dashboard
  • Bug reports: Report issues via enterprise support
  • Use cases: Share how you’re using this feature to help us improve

Need Help?

Visit your dashboard for support and additional resources.