node-website-scraper: The Essential Tool Every Developer Needs
Struggling to create offline copies of websites with all their assets intact? You're not alone. Developers, researchers, and archivists constantly battle with tools that either miss critical CSS files, break JavaScript references, or fail to download images properly. The frustration peaks when you need a complete, browsable local mirror of a site for documentation, testing, or backup purposes.
Enter node-website-scraper – a revolutionary Node.js library that transforms the complex task of website archiving into a single, elegant command. This powerful tool doesn't just grab HTML; it intelligently crawls and downloads every single asset, recreating a fully functional local copy of any website. From CSS stylesheets and JavaScript files to images, fonts, and nested resources, nothing gets left behind.
In this deep-dive guide, you'll discover why developers are abandoning clunky command-line tools for this sleek JavaScript solution. We'll explore real-world use cases, walk through actual code examples extracted from the repository, and reveal advanced techniques that turn you into a web scraping pro. Whether you're building offline documentation, creating test environments, or archiving competitive intelligence, this comprehensive tutorial delivers everything you need to master node-website-scraper today.
What is node-website-scraper?
node-website-scraper is a modern, pure ESM Node.js library engineered to download complete websites to local directories with surgical precision. Created and maintained by the website-scraper organization, this tool represents a paradigm shift from traditional web archiving utilities by offering a programmable, JavaScript-native solution that integrates seamlessly into modern development workflows.
Unlike basic wget mirrors or browser-based extensions, node-website-scraper operates as a sophisticated crawler that parses HTTP responses for HTML and CSS files, extracts resource references, and systematically downloads every dependency while preserving directory structures and link relationships. The library leverages the powerful got HTTP client under the hood, providing enterprise-grade request handling with retry logic, cookie support, and custom header injection.
Why it's trending now: The recent v5.0.0 release marked a bold transition to pure ESM (ECMAScript Modules), aligning with Node.js's modern direction and requiring version 20.18.1 or higher. This architectural decision eliminates CommonJS compatibility issues and unlocks tree-shaking capabilities for smaller bundle sizes. The project's GitHub repository shows impressive activity with comprehensive CI/CD pipelines, code coverage reporting exceeding industry standards, and a vibrant sponsor community including GitHub itself backing the project.
The library's plugin architecture sets it apart, allowing developers to customize every stage of the scraping pipeline – from filename generation and request modification to response processing and storage handling. This extensibility makes it equally valuable for simple one-off downloads and complex, production-grade scraping operations.
Key Features That Make It Revolutionary
Complete Asset Preservation – node-website-scraper doesn't miss a beat. It automatically detects and downloads CSS files, JavaScript modules, images (PNG, JPG, SVG, WebP), fonts, favicons, and even manifest files. The intelligent parsing engine uses Cheerio to analyze HTML and CSS, identifying resources referenced in src, href, url(), and @import statements.
Pure ESM Architecture – Since v5, the library embraces modern JavaScript modules exclusively. This isn't just a technical detail; it's a performance revolution. ESM enables static analysis, better bundling, and native async/await support throughout the codebase. No more require() statements or module compatibility headaches.
Advanced Recursive Downloading – Control crawl depth with surgical precision using maxRecursiveDepth for HTML links and maxDepth for all resources. This dual-layer approach prevents runaway downloads while ensuring deep asset dependencies get captured. Set recursive: true and watch it intelligently follow internal links without ever getting stuck in infinite loops.
Enterprise-Grade Request Management – Built on the battle-tested got library, it offers configurable request concurrency, automatic retries, timeout handling, and custom header injection. Need to scrape behind authentication? Simply pass cookies and authorization tokens through the request options.
Flexible Filename Generation – The filenameGenerator plugin system lets you customize output paths dynamically. Preserve original structures, flatten everything to a single directory, or implement intelligent naming schemes based on content type, source domain, or custom logic.
Error Resilience – With ignoreErrors: true, the scraper continues operation even when individual resources fail. This fault-tolerant design ensures partial downloads complete successfully, perfect for archiving fragile or partially accessible websites.
Plugin Ecosystem – Extend functionality through a composable plugin architecture. Modify requests before sending, transform responses after receiving, or completely customize storage mechanisms. The official puppeteer plugin solves dynamic JavaScript rendering limitations.
Request Concurrency Control – The requestConcurrency option prevents server overload and IP blocking by limiting simultaneous requests. Scale from polite single-request crawling to aggressive parallel downloading based on target server capabilities.
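As a quick illustration of how these options combine, here is a minimal sketch (the URL, directory, and depth values are placeholders, not recommendations):

import scrape from 'website-scraper';

// Illustrative configuration only; adjust the placeholder URL and paths.
const result = await scrape({
  urls: ['https://example.com/'],
  directory: './example-mirror',   // must not already exist
  recursive: true,                 // follow internal HTML links
  maxRecursiveDepth: 2,            // at most two link hops from the start page
  maxDepth: 4,                     // limit dependency depth for all resources
  requestConcurrency: 3,           // keep parallel requests modest
  ignoreErrors: true               // continue past individual failed resources
});

console.log(`Downloaded ${result.length} resources`);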
Real-World Use Cases Where It Shines
1. Offline Documentation Portals
Development teams frequently need local copies of API documentation, SDK guides, or third-party references for air-gapped environments. node-website-scraper excels here by downloading entire documentation sites with working navigation, search functionality (if statically implemented), and all assets. Simply point it at https://docs.example.com, set recursive: true with appropriate depth limits, and deploy the resulting directory to your internal servers. The tool preserves relative links perfectly, ensuring the offline version behaves identically to the live site.
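A rough sketch of such a documentation mirror (docs.example.com and the output path are placeholders; urlFilter keeps the crawl on the documentation host):

import scrape from 'website-scraper';

// Hypothetical documentation mirror; replace the URL and directory with real values.
await scrape({
  urls: ['https://docs.example.com/'],
  directory: './docs-mirror',
  recursive: true,
  maxRecursiveDepth: 3,
  // Only follow URLs that stay on the documentation host.
  urlFilter: (url) => url.startsWith('https://docs.example.com/')
});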
2. Website Backup and Disaster Recovery
Digital agencies and website owners use node-website-scraper as part of their backup strategy. Schedule nightly scrapes of critical landing pages, marketing sites, or content repositories. The tool's ability to capture complete asset snapshots means you can restore a fully functional static version of any site within minutes. Combine it with Git for version-controlled website history, enabling rollbacks to any point in time.
3. Development and Testing Environments
Frontend developers often need to work against production data without hitting live servers. Scrape the target website to a local directory, then serve it with a simple HTTP server. You now have a realistic test environment that loads instantly, costs nothing in bandwidth, and never triggers rate limiting. This approach is invaluable for debugging layout issues, testing performance optimizations, or prototyping redesigns against real content structures.
4. Competitive Intelligence and Market Research
Marketing teams and business analysts leverage node-website-scraper to archive competitor websites, pricing pages, and marketing campaigns. The tool's request customization allows setting specific user agents and headers to mimic different devices or geographic locations. Store periodic snapshots to analyze how competitors evolve their messaging, design, and feature sets over time.
5. Content Migration Projects
When migrating between CMS platforms or redesigning websites, node-website-scraper creates perfect static snapshots of legacy content. These archives serve as reference material during content modeling, ensure no information gets lost during transition, and provide rollback options if migration issues arise. The preserved asset structure makes it easy to identify which media files need transferring to new systems.
Step-by-Step Installation & Setup Guide
Step 1: Verify Node.js Version
First, ensure you're running Node.js version 20.18.1 or higher. This requirement is non-negotiable due to the library's pure ESM architecture.
node --version
# Should show v20.18.1 or higher
If your version is outdated, download the latest LTS from nodejs.org or use a version manager like nvm:
nvm install 20.18.1
nvm use 20.18.1
Step 2: Initialize Your Project
Create a new directory for your scraping project and initialize it as an ESM module:
mkdir website-scraper-project && cd website-scraper-project
npm init -y
Edit your package.json to enable ESM by adding "type": "module":
{
  "name": "website-scraper-project",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "scrape": "node scrape.js"
  }
}
Step 3: Install the Package
Run the npm installation command exactly as specified in the repository:
npm install website-scraper
This installs the latest stable version with all dependencies, including got for HTTP requests and cheerio for HTML parsing.
Step 4: Create Your First Scraper Script
Create a file named scrape.js in your project root:
// scrape.js
import scrape from 'website-scraper';

const options = {
  urls: ['https://example.com'],
  directory: './downloaded-website',
  requestConcurrency: 5,
  ignoreErrors: true
};

try {
  const result = await scrape(options);
  console.log(`Successfully downloaded ${result.length} resources`);
} catch (error) {
  console.error('Scraping failed:', error.message);
}
Step 5: Execute and Verify
Run your scraper using the npm script:
npm run scrape
The tool creates the ./downloaded-website directory (which must not already exist – see the FAQ below) and populates it with the complete website structure. Check the output directory to verify all assets downloaded correctly.
REAL Code Examples from the Repository
Let's examine actual code snippets from the node-website-scraper README, breaking down each component for maximum understanding.
Example 1: Basic Usage with Async/Await
This fundamental example demonstrates the simplest possible implementation:
import scrape from 'website-scraper'; // only as ESM, no CommonJS

const options = {
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save/'
};

// with async/await
const result = await scrape(options);

// with promise
scrape(options).then((result) => {});
Explanation: The ESM import statement is mandatory – CommonJS require() will throw errors. The options object requires two parameters: urls (an array of strings or objects) and directory (absolute path where files save). The function returns a promise resolving to an array of downloaded resource objects. The async/await pattern is recommended for cleaner error handling and sequential logic. Each resource object contains metadata including filename, URL, and depth information.
Example 2: Advanced URL Configuration with Custom Filenames
This snippet shows how to control output filenames explicitly:
scrape({
  urls: [
    'http://nodejs.org/', // Will be saved with default filename 'index.html'
    {url: 'http://nodejs.org/about', filename: 'about.html'},
    {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
  ],
  directory: '/path/to/save'
});
Explanation: The urls array accepts mixed types – simple strings use default filename generation (typically 'index.html'), while objects with url and filename properties provide granular control. This is crucial when scraping multiple pages from the same domain where you want predictable output names. The scraper automatically resolves relative URLs within each page, ensuring internal links point to your custom filenames correctly. Use this pattern to create human-readable directory structures or to avoid filename collisions.
Example 3: Selective Resource Downloading
Control exactly which assets get downloaded using the sources option:
// Downloading images, css files and scripts
scrape({
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save',
  sources: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'}
  ]
});
Explanation: The sources array contains objects with Cheerio-compatible selectors and attribute names. This example targets three critical resource types: images (<img src>), stylesheets (<link href>), and scripts (<script src>). The scraper uses these selectors to parse HTML, extract attribute values, and queue matching URLs for download. You can extend this with custom selectors for fonts, videos, or data attributes. The default configuration includes comprehensive selectors, but narrowing focus reduces download time and storage requirements for specific use cases.
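As an illustration of extending the list, the sketch below adds two assumption-based selectors for media assets; adjust them to match the actual markup of your target site:

import scrape from 'website-scraper';

await scrape({
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save',
  sources: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'},
    // Assumption-based additions for media elements:
    {selector: 'video', attr: 'poster'},  // poster images on <video> tags
    {selector: 'source', attr: 'src'}     // <source> elements inside <video>/<audio>
  ]
});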
Example 4: Custom Request Configuration
This powerful example demonstrates how to customize HTTP requests:
// use same request options for all resources
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us...'
    },
    timeout: {
      request: 10000
    },
    retry: {
      limit: 3
    }
  }
});
Explanation: The request object passes directly to the underlying got library, unlocking enterprise features. Set custom User-Agent strings to avoid bot detection, inject authentication cookies, or specify Accept-Language headers for region-specific content. The timeout and retry options ensure robust operation against flaky servers. You can also configure https options, proxy settings, and response encoding. This level of control makes node-website-scraper suitable for scraping protected APIs, authenticated portals, and CDN-heavy sites.
Advanced Usage & Best Practices
Implement Intelligent Rate Limiting: While requestConcurrency controls parallel requests, add external delays for polite scraping. Keep concurrency low and wrap successive scraper calls in a sleep function, as sketched below, to respect robots.txt and avoid IP bans.
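For instance, a minimal sketch of the sleep-wrapper approach, scraping a site section by section with a pause between runs (the section URLs and delay are placeholders):

import scrape from 'website-scraper';

// Hypothetical list of site sections; pause between runs to stay polite.
const sections = ['https://example.com/blog/', 'https://example.com/docs/'];
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const [index, url] of sections.entries()) {
  await scrape({
    urls: [url],
    directory: `./mirror-section-${index}`,
    requestConcurrency: 1
  });
  await sleep(5000); // wait 5 seconds before starting the next section
}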
Leverage the Plugin System: Create custom plugins for filename sanitization, content transformation, or metadata extraction. The plugin API exposes hooks for beforeRequest, afterResponse, and saveResource, enabling you to modify behavior without forking the library.
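A minimal plugin sketch, assuming the registerAction interface described in the project documentation; the injected header is purely illustrative:

import scrape from 'website-scraper';

// Sketch of a plugin that adds an extra header to every outgoing request.
class AddHeaderPlugin {
  apply(registerAction) {
    registerAction('beforeRequest', async ({ resource, requestOptions }) => {
      // Merge a purely illustrative header into the request options.
      return {
        requestOptions: {
          ...requestOptions,
          headers: { ...requestOptions.headers, 'X-Scrape-Job': 'docs-mirror' }
        }
      };
    });
  }
}

await scrape({
  urls: ['https://example.com/'],
  directory: './with-plugin',
  plugins: [new AddHeaderPlugin()]
});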
Handle Dynamic Websites: Remember that node-website-scraper doesn't execute JavaScript. For React, Vue, or Angular sites, pair it with website-scraper-puppeteer. This plugin renders pages in a headless browser, captures the post-execution DOM, then passes it to the scraper for asset downloading.
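A hedged usage sketch based on the plugin's documented pattern (consult the website-scraper-puppeteer README for its current options):

import scrape from 'website-scraper';
import PuppeteerPlugin from 'website-scraper-puppeteer';

// Renders each page in a headless browser before the scraper parses it.
await scrape({
  urls: ['https://example.com/'],
  directory: './rendered-site',
  plugins: [new PuppeteerPlugin()]
});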
Optimize for Large Sites: Set maxRecursiveDepth and maxDepth conservatively. Use urlFilter functions to exclude external domains, admin pages, or resource-heavy sections. Enable ignoreErrors to prevent one broken asset from halting entire operations.
Monitor with Debug Logging: The library logs through the debug module. Run your script with a DEBUG environment variable scoped to website-scraper to trace request flows, identify bottlenecks, and debug failed downloads. This is invaluable when scraping complex sites with hundreds of dependencies.
Store Configuration Externally: For recurring scrapes, maintain options in JSON or YAML files. This enables version-controlled scraping configurations, easy environment switching, and collaborative maintenance of complex projects.
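For example, a small sketch that loads options from a JSON file (the filename is arbitrary; function-valued options such as urlFilter cannot be stored in JSON):

import { readFile } from 'node:fs/promises';
import scrape from 'website-scraper';

// scrape.config.json holds urls, directory, depth limits, and other plain values.
const config = JSON.parse(await readFile('./scrape.config.json', 'utf8'));
const result = await scrape(config);
console.log(`Downloaded ${result.length} resources`);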
Comparison with Alternatives
| Feature | node-website-scraper | wget | HTTrack | Puppeteer Direct |
|---|---|---|---|---|
| Requires headless browser | No (plain Node.js) | No | No | Yes |
| Asset Preservation | Excellent | Good | Excellent | Manual |
| Programmatic API | ✅ Full JavaScript | ❌ CLI only | ❌ GUI/CLI | ✅ Full JavaScript |
| ESM Support | ✅ Pure ESM | ❌ N/A | ❌ N/A | ✅ Yes |
| Plugin Extensibility | ✅ Rich ecosystem | ❌ Limited | ❌ Limited | ✅ Via ecosystem |
| Request Concurrency | ✅ Configurable | ❌ Single | ⚠️ Basic | ✅ Via third-party |
| Dynamic JS Rendering | ⚠️ Via plugin | ❌ No | ❌ No | ✅ Native |
| Learning Curve | Moderate | Low | Low | Steep |
| Modern Auth Support | ✅ Cookies/Headers | ⚠️ Basic | ⚠️ Basic | ✅ Full browser |
Why Choose node-website-scraper? Unlike wget's simplistic recursive downloading, this tool offers granular control over every aspect of the scraping process. HTTrack provides GUI convenience but lacks programmatic flexibility. Direct Puppeteer usage requires manual asset handling, while node-website-scraper automates the entire pipeline. The sweet spot lies in its balance of power, simplicity, and modern JavaScript ecosystem integration.
Frequently Asked Questions
Q: Does node-website-scraper execute JavaScript on pages?
A: No, it only parses static HTML and CSS responses. For dynamic content loaded via JavaScript, use the official website-scraper-puppeteer plugin which renders pages in a headless browser before scraping.
Q: Can I save websites to an existing directory?
A: By default, no – the target directory must not exist for safety. The repository's FAQ explains workarounds involving custom filename generators or pre-creating empty directories. This prevents accidental overwrites of important data.
Q: How do I handle websites requiring authentication?
A: Pass cookies and authorization headers through the request option. For session-based auth, extract cookies from your browser and include them in the headers object. For OAuth, obtain a token and use Authorization: Bearer <token>.
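A brief sketch of both approaches, with placeholder credential values:

import scrape from 'website-scraper';

await scrape({
  urls: ['https://portal.example.com/'],
  directory: './portal-mirror',
  request: {
    headers: {
      // Session cookie copied from an authenticated browser session (placeholder value).
      Cookie: 'sessionid=PLACEHOLDER',
      // Or a bearer token obtained through your OAuth flow (placeholder value).
      Authorization: 'Bearer PLACEHOLDER_TOKEN'
    }
  }
});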
Q: What's the difference between maxRecursiveDepth and maxDepth?
A: maxRecursiveDepth limits how many HTML link hops the scraper follows (e.g., homepage → article → related article). maxDepth limits dependency depth for all resources (e.g., HTML → CSS → font → font-variant). Use maxRecursiveDepth to control site breadth, maxDepth for resource depth.
Q: Is it legal to scrape any website?
A: Legality depends on jurisdiction, website terms of service, and usage. Always check robots.txt, respect rate limits, and obtain permission for commercial use. The tool is designed for legitimate archiving, testing, and analysis – not content theft.
Q: How can I scrape extremely large websites without running out of memory?
A: Process sites in sections using urlFilter to target specific subdirectories. Set low requestConcurrency values and conservative depth limits, and persist each section's results before starting the next. For massive enterprise sites, consider distributed scraping across multiple instances.
Q: Why does my scrape fail with "Directory exists" error?
A: This safety feature prevents overwriting existing data. Either delete the directory before scraping, use a timestamped directory name, or implement a custom filename generator that handles existing paths gracefully.
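A small sketch of the timestamped-directory workaround:

import scrape from 'website-scraper';

// Build a fresh directory name per run, e.g. ./snapshot-2024-01-01T00-00-00-000Z
const stamp = new Date().toISOString().replace(/[:.]/g, '-');

await scrape({
  urls: ['https://example.com/'],
  directory: `./snapshot-${stamp}`
});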
Conclusion
node-website-scraper represents the gold standard for modern website archiving. Its pure ESM architecture, intelligent asset detection, and extensible plugin system make it far more flexible than traditional CLI tools while remaining remarkably approachable for JavaScript developers.
The library shines brightest when you need programmatic control, reliable asset preservation, and integration into larger automation pipelines. Whether you're backing up critical documentation, creating offline development environments, or conducting competitive research, it delivers consistent, predictable results that just work.
Ready to revolutionize your web scraping workflow? Head to the official GitHub repository at github.com/website-scraper/node-website-scraper to star the project, explore the source code, and join the growing community of developers who've made this their go-to solution. Install it today with npm install website-scraper and experience the future of website archiving firsthand.