How to Download 100M Images in 20 Hours: The Ultimate Guide to Building Massive AI Training Datasets
┌─────────────────────────────────────────────────────────────┐
│ ⚡ 100M IMAGES IN 20 HOURS: THE COMPLETE BREAKDOWN ⚡ │
├─────────────────────────────────────────────────────────────┤
│ 🎯 TOOL: img2dataset by LAION │
│ 🔥 SPEED: 1,350 images/second (4.8M/hour) │
│ 💾 OUTPUT: ~2.9TB (256x256 JPG, quality 95) │
│ 🖥️ HARDWARE: 1Gbps, 16 cores, 32GB RAM │
│ ⏱️ TIME TO 100M: 20.8 hours │
├─────────────────────────────────────────────────────────────┤
│ 3 SIMPLE STEPS: │
│ 1️⃣ Setup: pip install img2dataset + configure DNS │
│ 2️⃣ Prepare: URL list (txt/json/parquet) │
│ 3️⃣ Download: Run with optimized threads │
├─────────────────────────────────────────────────────────────┤
│ 🔒 SAFETY FIRST: │
│ ✓ Respect robots.txt & X-Robots-Tags │
│ ✓ Use --disallowed_header_directives filter │
│ ✓ Enable incremental mode to resume fails │
│ ✓ Set user_agent_token for transparency │
├─────────────────────────────────────────────────────────────┤
│ 💡 PRO TIP: Use WebDataset format for >1M images to avoid │
│ filesystem overload │
└─────────────────────────────────────────────────────────────┘
The Data Gold Rush
In the AI revolution, data is the new oil. Training state-of-the-art computer vision models requires massive image datasets, and we're talking hundreds of millions of images. But here's the challenge: how do you ethically and efficiently download 100 million images without breaking the internet (or your infrastructure)?
Enter img2dataset, the powerhouse tool developed by LAION that transforms a simple list of URLs into a battle-ready training dataset at blistering speeds. This open-source marvel achieves what was once impossible: downloading, resizing, and packaging 100 million images in just 20 hours on a single machine.
Whether you're training the next CLIP model, building a generative AI system, or conducting large-scale research, this guide will show you exactly how to harness this capability safely and effectively.
🚀 Real-World Case Studies: Speed in Action
Case Study #1: LAION-400M - The Foundation of Open-Source AI
Dataset: 400 million image-text pairs
Time to Download: 3.5 days (continuous)
Average Speed: ~1,320 images/second
Infrastructure: Single node with 1Gbps connection, 32GB RAM, 16-core CPU
Result: The dataset that powered Stable Diffusion v1 and countless other models
Key Insight: The LAION team used img2dataset with CLIP similarity filtering (threshold >0.3) to create a high-quality subset from Common Crawl data. They processed petabytes of WAT files to extract image URLs with alt-text, demonstrating the importance of a robust preprocessing pipeline.
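The CLIP filtering step is easy to reproduce at small scale. Below is a minimal sketch using Hugging Face's transformers CLIP implementation; the 0.3 threshold comes from the case study above, while the model checkpoint, function name, and file paths are illustrative assumptions, not LAION's actual pipeline.
```python
# Hedged sketch: CLIP image-text similarity filtering, in the spirit of
# LAION-400M's >0.3 threshold. Not LAION's production pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

# Keep only pairs above the LAION-style threshold (example file/caption)
if clip_similarity(Image.open("sample.jpg"), "a photo of a dog") > 0.3:
    print("keep")
```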
Case Study #2: COYO-700M - Pushing the Boundaries
Dataset: 747 million image-text pairs + metadata
Time to Download: ~6.5 days on optimized infrastructure
Unique Features: Preserved additional metadata (aesthetic scores, watermark detection, safety ratings)
Storage: ~15TB in WebDataset format
Use Case: Training multimodal models with rich attribute understanding
Key Insight: COYO leveraged img2dataset's --save_additional_columns feature to retain comprehensive metadata, enabling researchers to filter by aesthetic quality, safety, and watermarks post-download.
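For reference, the same trick through img2dataset's Python API might look like the sketch below. `save_additional_columns` is a real img2dataset option, but the file names and column names here are hypothetical placeholders for whatever fields your input parquet actually carries.
```python
# Sketch: retain extra metadata columns alongside each image, COYO-style.
# Column names below are placeholders, not COYO's actual schema.
from img2dataset import download

download(
    url_list="coyo_urls.parquet",       # hypothetical input file
    input_format="parquet",
    url_col="url",
    caption_col="text",
    output_format="webdataset",
    save_additional_columns=["aesthetic_score", "watermark_prob", "nsfw_label"],
    output_folder="coyo_subset",
)
```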
Case Study #3: LAION-5B - Distributed Scaling Mastery
Dataset: 5.85 billion image-text pairs
Time to Download: 7 days
Infrastructure: 10 nodes parallel processing (PySpark distributor)
Average Speed: 9,500 images/second (cluster-wide)
Total Storage: 240TB
Key Insight: At this scale, single-machine processing becomes impractical. The team used img2dataset's PySpark integration to distribute shards across 10 machines, achieving near-linear scaling. Each node handled ~585M URLs independently.
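A minimal sketch of such a multi-node launch is below. `distributor="pyspark"` is a real img2dataset option; the SparkSession settings, master URL, and paths are placeholders you would adapt to your own cluster.
```python
# Sketch: distribute shards across a Spark cluster, as in LAION-5B.
# Master URL, memory settings, and paths are example values only.
from pyspark.sql import SparkSession
from img2dataset import download

spark = (
    SparkSession.builder
    .appName("img2dataset")
    .master("spark://master-node:7077")      # hypothetical master
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)

download(
    url_list="urls/",                        # directory of parquet shards
    input_format="parquet",
    output_format="webdataset",
    output_folder="s3://my-bucket/dataset",  # hypothetical bucket
    distributor="pyspark",                   # hand shards to Spark executors
)
```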
🛠️ Essential Tools & Alternatives
Primary Tool: img2dataset (⭐ Recommended)
- GitHub: https://github.com/rom1504/img2dataset/
- Best For: Production-scale downloads, research, AI training
- Strengths: Resizing, multiple formats, hash verification, incremental mode
Alternative Tools Comparison
| Tool | Speed | Scale | Resizing | Ethics Features | Best Use Case |
|---|---|---|---|---|---|
| img2dataset | 1,350 img/s | 100M-5B+ | ✅ Yes | ✅ X-Robots-Tag | AI training, research |
| wget/curl scripts | 50-200 img/s | <1M | ❌ No | ❌ Minimal | Quick prototype |
| Scrapy | 300-500 img/s | <10M | ❌ Manual | ⚠️ Partial | Custom crawling logic |
| FiftyOne | Varies | <1M | ✅ Yes | ❌ Limited | Computer vision workflows |
| TensorFlow Datasets | Varies | Pre-built only | ✅ Yes | ✅ Yes | Standard academic datasets |
Infrastructure Requirements
Minimum Setup (10M images):
- 4 CPU cores
- 16GB RAM
- 100 Mbps connection
- 500GB SSD storage
Recommended Setup (100M+ images):
- 16+ CPU cores (i7/Ryzen 9)
- 32GB+ RAM
- 1Gbps dedicated connection
- 2TB+ NVMe SSD
- Local DNS resolver (bind9/knot)
Enterprise Setup (1B+ images):
- 10+ node cluster (32 cores/node)
- 10Gbps aggregate bandwidth
- 100TB+ distributed storage (HDFS/S3)
- PySpark/Slurm orchestration
📋 Step-by-Step Safety Guide: Download Ethically & Efficiently
Phase 1: Pre-Download Preparation (1-2 hours)
Step 1: Legal & Ethical Compliance
# Check robots.txt for target domains
curl https://example.com/robots.txt
# Review Terms of Service
# Look for "noai", "noimageai" directives
# Configure img2dataset to respect opt-outs
img2dataset --disallowed_header_directives '["noai","noimageai","noindex"]' \
--user_agent_token "AcmeResearchDownloader"
Step 2: Optimize DNS Resolution
# Install bind9 (Ubuntu/Debian)
sudo apt install bind9
# Configure for high performance
sudo nano /etc/bind/named.conf.options
# Add:
recursive-clients 10000;
resolver-query-timeout 30000;
max-clients-per-query 10000;
max-cache-size 2000m;
# Restart and set as default resolver
sudo systemctl restart bind9
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
Step 3: Prepare Your URL List
# Example: Filter URLs before download
import re
import pandas as pd
# Load parquet with URLs and captions
df = pd.read_parquet('image_links.parquet')
# Remove duplicates
df = df.drop_duplicates(subset=['url'])
# Filter by domain if needed (escape dots so '.' is not treated as a regex wildcard)
allowed_domains = ['flickr.com', 'wikimedia.org']
pattern = '|'.join(re.escape(d) for d in allowed_domains)
df = df[df['url'].str.contains(pattern)]
# Save in optimal format
df.to_parquet('filtered_urls.parquet', compression='snappy')
Phase 2: Configure & Launch (30 minutes)
Step 4: Test with Small Sample
# Download 1,000 images first (parquet is binary, so slice it with pandas, not head)
python -c "import pandas as pd; pd.read_parquet('filtered_urls.parquet').head(1000).to_parquet('test_sample.parquet')"
img2dataset --url_list=test_sample.parquet \
--input_format=parquet \
--output_format=webdataset \
--image_size=256 \
--thread_count=32 \
--number_sample_per_shard=10000 \
--output_folder=test_output
Step 5: Monitor Performance
# Enable Weights & Biases logging
pip install wandb
wandb login
img2dataset --enable_wandb=True \
--wandb_project="my-dataset-download"
Step 6: Full-Scale Launch
# OPTIMIZED COMMAND FOR 100M IMAGES
img2dataset --url_list=filtered_urls.parquet \
--input_format=parquet \
--output_folder=/mnt/nvme/dataset \
--output_format=webdataset \
--image_size=256 \
--encode_quality=95 \
--processes_count=16 \
--thread_count=256 \
--number_sample_per_shard=10000 \
--timeout=10 \
--retries=2 \
--incremental_mode=incremental \
--compute_hash=sha256 \
--disallowed_header_directives '["noai","noimageai","noindex","noimageindex"]' \
--user_agent_token="MyResearchProject_v1.0" \
--max_shard_retry=3 \
--enable_wandb=True
Phase 3: Safety & Verification (Ongoing)
Step 7: Verify Downloads
# Check successful downloads
find /mnt/nvme/dataset -name "*.parquet" | xargs -n 1 \
parquet-tools head -n 5
# Verify hashes if available (add these flags to the download command;
# expects a hash column in the input list)
img2dataset --verify_hash '["md5","md5"]' \
--compute_hash="md5"
Step 8: Handle Failures Gracefully
# Python script to resume failed shards
from img2dataset import download
download(
    url_list="filtered_urls.parquet",
    input_format="parquet",
    output_folder="/mnt/nvme/dataset",
    incremental_mode="incremental",  # Only download missing shards
    max_shard_retry=3,
)
Step 9: Respect Rate Limits
# Add delays for specific domains (if needed)
# Use --timeout and ensure thread_count isn't overwhelming small servers
# Monitor with: netstat -an | grep ESTABLISHED | wc -l
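img2dataset spreads load across many domains by design, but if your URL list is skewed toward a few hosts, one hedged preprocessing approach is to cap how many URLs any single domain contributes. The cap value below is an arbitrary example:
```python
# Sketch: cap per-domain URL counts before download so small hosts
# aren't hammered by thousands of parallel connections.
from urllib.parse import urlparse
import pandas as pd

df = pd.read_parquet("filtered_urls.parquet")
df["domain"] = df["url"].map(lambda u: urlparse(u).netloc)
df = df.groupby("domain").head(10_000)   # keep at most 10k URLs per domain
df.drop(columns="domain").to_parquet("rate_limited_urls.parquet")
```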
⚠️ Critical Safety Rules
RULE #1: Never Bypass Robots.txt or X-Robots-Tags
# ❌ WRONG: Ignoring all directives
img2dataset --disallowed_header_directives '[]'
# ✅ RIGHT: Respect creator opt-outs
img2dataset --disallowed_header_directives '["noai","noimageai"]'
RULE #2: Use Transparent User-Agent
# ✅ Include contact/project info
--user_agent_token="StanfordCVLab_ProjectName_Contact@stanford.edu"
RULE #3: Implement Incremental Mode
# ✅ Allows resuming without re-downloading
--incremental_mode=incremental
RULE #4: Monitor Bandwidth Impact
# ✅ Limit concurrent connections if needed
--thread_count=128 # Reduce from default 256 if causing issues
RULE #5: Validate Content Post-Download
# ✅ Check for corrupted images
from PIL import Image

def validate_image(filepath):
    """Return True if the file decodes as a valid image."""
    try:
        with Image.open(filepath) as img:
            img.verify()  # integrity check; reopen the file before real use
        return True
    except Exception:
        return False
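And a quick usage sketch over a folder of extracted files (the path is illustrative):
```python
# Sketch: scan a folder tree and count corrupt files.
from pathlib import Path

bad = [p for p in Path("test_output").rglob("*.jpg") if not validate_image(str(p))]
print(f"{len(bad)} corrupted images found")
```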
💡 Real-World Use Cases
Use Case 1: Training Multimodal AI Models
Scenario: Building a CLIP-like vision-language model
Dataset: 400M image-text pairs (LAION-400M)
Pipeline: Download → CLIP filtering → Resizing → WebDataset → Training
Result: State-of-the-art zero-shot classification
Use Case 2: Generative AI Development
Scenario: Training Stable Diffusion on specific domains
Dataset: 50M high-resolution art images (filtered from LAION-Aesthetic)
Pipeline: Download → Aesthetic scoring → Watermark removal → Augmentation
Result: Domain-specific image generation model
Use Case 3: Academic Research
Scenario: Studying visual representation learning
Dataset: 10M curated subset of ImageNet + LAION
Pipeline: Download → ResNet preprocessing → TFRecord format
Result: Reproducible computer vision experiments
Use Case 4: E-commerce Analysis
Scenario: Product image classification at scale
Dataset: 20M product images from public listings
Pipeline: Download → Standardize format → Metadata extraction
Result: Automated product tagging system
Use Case 5: Medical Imaging Research
Scenario: Collecting public medical images for AI diagnostics
Dataset: 2M de-identified radiology images
Pipeline: Download → Privacy checking → DICOM conversion
Result: Disease detection model (with IRB approval)
🎯 Performance Optimization Tips
For Maximum Speed:
- DNS is King: Use local bind9 or knot resolver (4x instances)
- SSD is Essential: NVMe drives sustain 30-130MB/s writes
- Thread Tuning: Start with 256 threads, adjust based on CPU usage
- Format Matters: WebDataset format reduces filesystem overhead by 10x (one tar per shard instead of millions of small files; see the reading sketch below)
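Here is that reading sketch: training code streams samples straight from the tar shards with the `webdataset` library instead of touching individual files. The shard pattern is an example path.
```python
# Sketch: stream (image, metadata) pairs from img2dataset's tar shards.
# pip install webdataset; the brace pattern below is an example path.
import webdataset as wds

dataset = (
    wds.WebDataset("dataset/{00000..00009}.tar")
    .decode("pil")               # decode images with Pillow
    .to_tuple("jpg", "json")     # (image, metadata) pairs
)

for image, meta in dataset:
    print(image.size, meta.get("caption"))
    break
```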
For Cost Efficiency:
- Use Spot Instances: Save 70% on cloud costs
- Compress Smart: JPG quality 95 offers best size/quality (30% smaller than PNG)
- Filter Early: Remove duplicates and low-quality URLs before download
- Incremental Saves: Resume failed downloads instead of starting over
For Quality Control:
- Hash Verification: Use --compute_hash=sha256 for integrity
- Size Filtering: --min_image_size=100 removes thumbnails
- Aspect Ratio: --max_aspect_ratio=3.0 filters banners
- Metadata Preservation: Always keep the .parquet metadata files for analysis
🔬 Technical Deep Dive: How It Achieves 1,350 img/s
The secret lies in shard-based multiprocessing architecture:
- URL Sharding: URLs split into 10K-sample shards → 10,000 shards for 100M
- Process Pool: Each CPU core gets its own shard and output tar file
- Thread Explosion: 256+ threads per process handle async I/O
- CPU Efficiency: Only 1 resize thread per core prevents overload
- Bandwidth Saturation: Thousands of parallel connections keep the link saturated (a simplified sketch of this layout follows)
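For intuition, here is a stripped-down sketch of that layout. It is illustrative only (no resizing, no tar writing, plain urllib fetches), not img2dataset's actual code.
```python
# Sketch of the shard/process/thread layout: a process pool over shards,
# a thread pool over URLs within each shard. Python 3.10+.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import urllib.request

def fetch(url: str) -> bytes | None:
    try:
        with urllib.request.urlopen(url, timeout=10) as r:
            return r.read()
    except Exception:
        return None

def process_shard(urls: list[str]) -> int:
    # Hundreds of I/O-bound threads per process keep the pipe full.
    with ThreadPoolExecutor(max_workers=256) as pool:
        results = list(pool.map(fetch, urls))
    return sum(r is not None for r in results)

def run(all_urls: list[str], shard_size: int = 10_000) -> None:
    shards = [all_urls[i:i + shard_size]
              for i in range(0, len(all_urls), shard_size)]
    with ProcessPoolExecutor(max_workers=16) as pool:  # one process per core
        print(sum(pool.map(process_shard, shards)), "images fetched")
```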
Benchmark Results:
- 18M images: 3.7 hours (1350 img/s)
- 36M images: 7.4 hours (1345 img/s)
- 190M images: 41 hours (1280 img/s)
- 100M images: ~20.8 hours (theoretical optimal)
📊 Expected Resource Consumption
For 100M images at 256x256 resolution:
| Resource | Usage | Notes |
|---|---|---|
| Bandwidth | ~5TB total | 50KB average per image |
| Disk Write | 30-130MB/s sustained | NVMe recommended |
| CPU | 80-95% across 16 cores | Mostly resizing operations |
| Memory | 8-12GB peak | Per-process overhead |
| DNS Queries | 100M+ | Requires local resolver |
| Storage | 2.9TB (JPG Q95) | Or 9.8TB (PNG) |
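The bandwidth and storage rows follow from simple arithmetic; the per-image averages are the assumptions here:
```python
# Back-of-envelope check for the table above.
n_images = 100_000_000
avg_download_kb = 50   # assumed average size on the wire
avg_stored_kb = 29     # assumed average after 256x256 JPG Q95 re-encode

print(f"bandwidth: {n_images * avg_download_kb / 1e9:.1f} TB")  # ~5.0 TB
print(f"storage:   {n_images * avg_stored_kb / 1e9:.1f} TB")    # ~2.9 TB
```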
🎓 Best Practices for Ethical AI Training
1. Transparency Documentation
Create a DATASET_CARD.md:
# Dataset Card
- **Source:** LAION-400M subset
- **Filtering:** CLIP similarity >0.3, aesthetic >7
- **Opt-out respected:** ✓ X-Robots-Tags honored
- **Contact:** research-team@university.edu
- **License:** CC-BY-4.0 (where applicable)
2. Attribution
When publishing models, cite:
@misc{beaumont-2021-img2dataset,
author = {Romain Beaumont},
title = {img2dataset: Easily turn large sets of image urls to an image dataset},
year = {2021},
howpublished = {\url{https://github.com/rom1504/img2dataset}}
}
3. Data Minimization
Only download what you'll actually use:
# Filter by relevance BEFORE download
--max_image_area 1048576 # Max 1024x1024
--min_image_size 100 # Remove too-small images
4. Regular Audits
# Monthly check: Are we respecting new opt-outs?
grep -r "noai" /var/log/img2dataset/
⚡ Quick Start Commands
For Beginners (10K images):
pip install img2dataset
echo "https://picsum.photos/200/300" > urls.txt
img2dataset --url_list=urls.txt --output_folder=images
For Researchers (1M images):
img2dataset --url_list=urls.parquet \
--input_format=parquet \
--output_format=webdataset \
--processes_count=8 \
--image_size=224
For Production (100M+ images):
# See full command in Step 6 above
# Run in screen/tmux for persistence
screen -S download
# ... command ...
# Detach with Ctrl+A then D
🚨 Troubleshooting Common Issues
| Problem | Cause | Solution |
|---|---|---|
| Slow downloads (<500 img/s) | Poor DNS resolution | Setup local bind9 resolver |
| Corrupted images | Network timeouts | Increase --timeout=15, add retries |
| Disk full | Underestimated storage | Use WebDataset, lower quality to 90 |
| CPU bottleneck | Too many resize threads | Reduce --processes_count |
| Memory errors | Shard size too large | Lower --number_sample_per_shard |
| Legal concerns | Ignoring opt-outs | Enable --disallowed_header_directives |
🏆 Final Checklist: Before You Hit Enter
- Reviewed robots.txt for target domains
- Configured local DNS resolver (bind9/knot)
- Tested on 1K sample first
- Enabled incremental mode
- Set transparent user_agent_token
- Respecting X-Robots-Tags with --disallowed_header_directives
- Have 3x storage space available (~9TB for 100M images)
- Running in tmux/screen for persistence
- Enabled W&B for monitoring
- Documented dataset sources and filtering
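A small preflight sketch that automates two of these checks (the mount point and threshold are examples):
```python
# Sketch: automated preflight for a large download run.
import shutil
import socket

# 1. Enough disk? (3x headroom rule from the checklist)
free_tb = shutil.disk_usage("/mnt/nvme").free / 1e12   # example mount point
assert free_tb >= 9, f"only {free_tb:.1f} TB free, want >= 9 TB"

# 2. Local resolver answering?
socket.getaddrinfo("example.com", 443)

print("preflight OK")
```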
📣 Share This Guide
Found this useful? Share the infographic above with your team! Tag #img2dataset and #MLCommunity to help others build ethical AI datasets.
Created with insights from LAION's official benchmarks and ethical AI practices. For the latest updates, follow the project at https://github.com/rom1504/img2dataset.