Documentation Scraping & Archives
Overview
The Architect API now automatically scrapes and archives external documentation alongside spec generation. This powerful feature provides AI agents and development teams with complete context about external APIs, frameworks, and services referenced in your specifications.
How It Works
Parallel Processing
Documentation scraping runs in parallel with spec generation, not sequentially. This means:
- ✅ No additional wait time for your specs
- ✅ Fast Spec: Still ~30-40 seconds
- ✅ Deep Spec: Still ~2-3 minutes
- ✅ Both processes complete simultaneously
Graceful Degradation
If documentation scraping fails:
- ✅ Spec generation completes normally
- ✅ zippedDocsUrls returns an empty array []
- ✅ No errors thrown
- ✅ Your workflow continues uninterrupted
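Because a failed scrape produces an empty array rather than an error, a client can treat "no archives" as a normal outcome. The following is a minimal sketch in Python, assuming the API response has already been parsed into a dictionary and that each archive entry is an object; the helper name get_doc_archives is hypothetical and not part of the API.

```python
from typing import Any


def get_doc_archives(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the archive entries from a parsed response, or [] if none."""
    archives = response.get("zippedDocsUrls")
    # A missing field, null, or a non-list value is handled like an empty array
    if not isinstance(archives, list):
        return []
    return archives


# Shape of a parsed response when scraping failed (assumed for illustration)
response = {"zippedDocsUrls": []}
archives = get_doc_archives(response)
if not archives:
    print("No documentation archives; continuing with the spec only")
```

Checking for an empty list instead of catching an exception keeps the rest of the workflow identical whether or not documentation was scraped.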
Using the Feature
Basic Usage - Single Documentation Source
Advanced Usage - Multiple Documentation Sources
MCP Integration
When using the MCP server, simply reference documentation URLs naturally in your request; they are supplied through the docURLs parameter.
Response Format
The zippedDocsUrls Field
Example Response
ZIP Archive Structure
Organization
Each ZIP archive contains:
Features
- Hierarchical Structure: Mirrors the documentation site’s organization
- Individual Files: Each documentation page is a separate markdown file
- Easy Navigation: Folder structure makes finding specific topics simple
- Complete Coverage: Includes all pages discovered during scraping
Platform Identification
The platform field is extracted from the hostname of the documentation URL:
| Documentation URL | Extracted Platform |
|---|---|
| https://stripe.com/docs/api | stripe.com |
| https://docs.github.com/en/rest | docs.github.com |
| https://docs.hl7.org/fhir | docs.hl7.org |
| https://www.hhs.gov/hipaa/guidance | hhs.gov |
| https://developer.mozilla.org/docs | developer.mozilla.org |
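The mapping in the table can be approximated on the client side, for example to predict which platform label an archive will carry. This is a sketch in Python using the standard library, not the service's actual implementation; the function name extract_platform is hypothetical, and the removal of a leading "www." is inferred from the hhs.gov row.

```python
from urllib.parse import urlparse


def extract_platform(doc_url: str) -> str:
    """Derive a platform label from a documentation URL's hostname."""
    hostname = urlparse(doc_url).hostname or ""
    # The table maps www.hhs.gov to hhs.gov, so drop a leading "www."
    if hostname.startswith("www."):
        hostname = hostname[len("www."):]
    return hostname


for url in ("https://stripe.com/docs/api", "https://www.hhs.gov/hipaa/guidance"):
    print(url, "->", extract_platform(url))
```

Other subdomains such as docs. are kept, which matches the docs.github.com and docs.hl7.org rows.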
Supported Domain Formats
The system handles various TLDs and domain formats:
- Common TLDs: .com, .org, .net, .io, .dev, .ai
- Country TLDs: .co.uk, .com.au, .de, .fr, etc.
- New TLDs: .cloud, .tech, .app, .digital, etc.
- Subdomains: docs.example.com, api.example.com, etc.
Best Practices
What to Include
API Documentation
Official API references for integrations:
- Stripe, Square, PayPal (payments)
- Auth0, Okta, Firebase (authentication)
- AWS, GCP, Azure (infrastructure)
- Twilio, SendGrid (communications)
Framework Documentation
Official framework and library docs:
- React, Vue, Angular
- Next.js, Nuxt, SvelteKit
- Express, Fastify, NestJS
- TailwindCSS, Bootstrap
Compliance Documentation
Regulatory and compliance guides:
- HIPAA guidelines
- GDPR compliance docs
- PCI DSS requirements
- SOC 2 standards
Internal Documentation
Your organization’s documentation:
- Internal API specifications
- Architecture decision records
- Design systems and style guides
- Security policies
What to Avoid
- ❌ Marketing Pages: Sales content doesn’t help with implementation
- ❌ Blog Posts: Use official documentation instead
- ❌ Deprecated Docs: Ensure documentation is current
- ❌ Duplicate Domains: The system deduplicates these automatically, but avoid listing them manually
- ❌ Non-Documentation URLs: Social media, forums, etc.
Use Cases
Healthcare Platform Development
- HL7 FHIR implementation guides
- HIPAA security and compliance guidance
- Twilio video API documentation
- Stripe payment processing docs
E-Commerce Platform
SaaS Application
Microservices Architecture
Accessing Documentation Archives
For Enterprise Users
- Navigate to Dashboard: Visit https://pre.dev/enterprise/dashboard?page=api
- Open API Usage Logs: Click on the “API Usage Logs” tab
- Select Request: Click any API call to open the log details modal
- View Archives: Scroll to “Documentation Archives” section
- Download: Click download links for each platform’s ZIP archive
For Solo Users
Documentation archives are included in the API response:
Integration Workflows
AI Agent Implementation
- Spec Generation: Request spec with relevant documentation URLs
- Archive Download: AI agent downloads documentation archives
- Context Loading: Agent loads documentation into context
- Implementation: Agent builds features with complete documentation reference
- Validation: Agent verifies implementation against documentation
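The context-loading step above can be sketched as follows. This Python example assumes the archive has already been downloaded as bytes and that its pages use a .md extension; the function names are hypothetical, and the tiny in-memory archive only stands in for a real download.

```python
import io
import zipfile


def load_markdown_from_zip(zip_bytes: bytes) -> dict[str, str]:
    """Map each markdown file path in the archive to its text content."""
    pages: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: documentation pages are stored as .md files
            if name.lower().endswith(".md"):
                pages[name] = archive.read(name).decode("utf-8", errors="replace")
    return pages


def build_context(pages: dict[str, str]) -> str:
    """Concatenate pages into one string, labelling each with its path."""
    return "\n\n".join(f"# File: {path}\n{text}" for path, text in pages.items())


# Build a small in-memory archive to stand in for a downloaded one
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("index.md", "Home page")
    demo.writestr("api/charges.md", "Charges page")
print(build_context(load_markdown_from_zip(buffer.getvalue())))
```

Keeping the file path next to each page preserves the archive's folder hierarchy, so an agent can still tell which part of the documentation a passage came from.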
Team Collaboration
- Spec Generation: Tech lead generates spec with documentation
- Archive Distribution: Share documentation archives with team
- Offline Reference: Team members download for offline access
- Consistent Context: Everyone works from the same documentation version
- Implementation: Team builds with aligned understanding
CI/CD Pipeline
Error Handling
Partial Failures
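If some documentation URLs fail to scrape, spec generation continues with the documentation that was scraped successfully.
Complete Failure
If all documentation scraping fails, spec generation still completes normally and the response contains an empty zippedDocsUrls array.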
Invalid URLs
Invalid or inaccessible URLs are logged but don’t prevent processing:
- URLs are validated before scraping
- Invalid formats are skipped
- Inaccessible sites are skipped
- Spec generation continues with available documentation
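A similar format check can be run on the client before sending a request, so that obviously malformed URLs are caught early. The sketch below is an assumption about what "valid" means (an http or https scheme plus a hostname), not the service's own validator, and it does not test whether a site is reachable; both function names are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_doc_url(url: str) -> bool:
    """Check that a URL has an http(s) scheme and a hostname."""
    try:
        parsed = urlparse(url.strip())
        return parsed.scheme in ("http", "https") and bool(parsed.hostname)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. bad IPv6)
        return False


def split_doc_urls(urls: list[str]) -> tuple[list[str], list[str]]:
    """Partition URLs into (valid, skipped) while preserving order."""
    valid = [u for u in urls if is_valid_doc_url(u)]
    skipped = [u for u in urls if not is_valid_doc_url(u)]
    return valid, skipped


valid, skipped = split_doc_urls(["https://stripe.com/docs/api", "stripe docs"])
print("valid:", valid)
print("skipped:", skipped)
```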
Performance Characteristics
Processing Time
Documentation scraping is parallelized:
| Spec Type | With Documentation | Without Documentation |
|---|---|---|
| Fast Spec | ~30-40 seconds | ~30-40 seconds |
| Deep Spec | ~2-3 minutes | ~2-3 minutes |
Archive Sizes
Typical documentation archive sizes:
| Documentation Source | Approximate Size |
|---|---|
| Stripe API Docs | 5-15 MB |
| React Documentation | 10-20 MB |
| AWS Service Docs | 20-50 MB |
| HIPAA Guidelines | 2-5 MB |
Rate Limits
Documentation scraping respects the same rate limits as spec generation:
- No additional API calls consumed
- Same concurrent request limits apply
- Fair use policies remain unchanged
Advanced Features
Consolidated Markdown
Some archives include a masterMarkdownShortUrl, a link to a consolidated markdown version of the documentation, which is useful for:
- Quick searching
- AI context loading
- Documentation review
Version Tracking
Archives capture documentation at the time of spec generation:
- Consistent Context: Same documentation version for entire team
- Historical Reference: Documentation state preserved
- Regression Prevention: Changes in docs don’t affect implementation
Custom Documentation
The system works with any accessible documentation:
- Public APIs: Any publicly accessible documentation
- Internal APIs: Documentation behind authentication (when accessible)
- Custom Formats: Works with various documentation site structures
Troubleshooting
Empty zippedDocsUrls Array
Possible causes:
- No docURLs parameter provided
- All documentation URLs failed to scrape
- URLs were inaccessible or invalid
Solutions:
- Verify URLs are accessible in a browser
- Check URL format is correct
- Ensure sites don’t block automated access
- Try alternative documentation URLs
Missing Expected Platform
Possible causes:
- URL failed to scrape
- URL was invalid
- Site blocked access
Solutions:
- Check the URL manually
- Try a different page from the same documentation
- Use the site’s main documentation URL
Download Links Not Working
Possible causes:
- Links may have expired (rare)
- Network connectivity issues
Solutions:
- Copy the URL and try in a different browser
- Check network connectivity
- Re-generate spec if links are old
Archive Contents Incomplete
Possible causes:
- Documentation site structure limited scraping
- Some pages were inaccessible
- Rate limiting from documentation site
Solutions:
- Provide multiple specific URLs from the documentation
- Try different entry points into the documentation
- Use the site’s documentation index or API reference page
FAQ
Does documentation scraping cost extra credits?
No. Documentation scraping is included at no additional cost:
- Fast Spec: 10 credits (same as before)
- Deep Spec: 50 credits (same as before)
How many documentation URLs can I provide?
- Recommended: 3-5 URLs per request
- Technical limit: No hard limit, but we recommend keeping it focused
- Best practice: Provide specific, relevant documentation pages rather than entire documentation sites
Can I use internal/private documentation?
Yes, with limitations:
- Public documentation: ✅ Always works
- Password-protected: ❌ Not supported currently
- API key authenticated: ⚠️ May work depending on implementation
- VPN-required: ❌ Not supported
How long are download links valid?
Download links are long-lived (typically 1+ year) and designed to remain accessible.
If you need permanent storage:
- Download archives immediately after generation
- Store in your own infrastructure
- Re-generate specs if links expire (rare)
Can I get raw HTML instead of markdown?
- Currently: Archives contain markdown files only
- Future: Raw HTML and other formats may be added based on user feedback
Markdown benefits:
- Cleaner, more readable format
- Better for AI agent consumption
- Smaller file sizes
- Easy to parse and search
Feedback & Feature Requests
We’re continuously improving documentation scraping. Share your feedback:
- Feature requests: Contact support through your dashboard
- Bug reports: Report issues via enterprise support
- Use cases: Share how you’re using this feature to help us improve
Need Help?
Visit your dashboard for support and additional resources

