
How to Download 100M Images in 20 Hours: The Ultimate Guide to Building Massive AI Training Datasets


┌─────────────────────────────────────────────────────────────┐
│  ⚡ 100M IMAGES IN 20 HOURS: THE COMPLETE BREAKDOWN ⚡      │
├─────────────────────────────────────────────────────────────┤
│  🎯 TOOL: img2dataset by LAION                              │
│  🔥 SPEED: 1,350 images/second (4.8M/hour)                  │
│  💾 OUTPUT: ~2.9TB (256x256 JPG, quality 95)                │
│  🖥️  HARDWARE: 1Gbps, 16 cores, 32GB RAM                   │
│  ⏱️  TIME TO 100M: 20.8 hours                               │
├─────────────────────────────────────────────────────────────┤
│  3 SIMPLE STEPS:                                            │
│  1️⃣  Setup: pip install img2dataset + configure DNS        │
│  2️⃣  Prepare: URL list (txt/json/parquet)                  │
│  3️⃣  Download: Run with optimized threads                  │
├─────────────────────────────────────────────────────────────┤
│  🔒 SAFETY FIRST:                                           │
│  ✓ Respect robots.txt & X-Robots-Tags                       │
│  ✓ Use --disallowed_header_directives filter                │
│  ✓ Enable incremental mode to resume fails                  │
│  ✓ Set user_agent_token for transparency                    │
├─────────────────────────────────────────────────────────────┤
│  💡 PRO TIP: WebDataset format >1M images avoids filesystem │
│     overload                                                │
└─────────────────────────────────────────────────────────────┘

The Data Gold Rush

In the AI revolution, data is the new oil. Training state-of-the-art computer vision models requires massive image datasets: we're talking hundreds of millions of images. But here's the challenge: how do you ethically and efficiently download 100 million images without breaking the internet (or your infrastructure)?

Enter img2dataset, the powerhouse tool developed by LAION that transforms a simple list of URLs into a battle-ready training dataset at blistering speeds. This open-source marvel achieves what was once impossible: downloading, resizing, and packaging 100 million images in just 20 hours on a single machine.

Whether you're training the next CLIP model, building a generative AI system, or conducting large-scale research, this guide will show you exactly how to harness this capability safely and effectively.


🚀 Real-World Case Studies: Speed in Action

Case Study #1: LAION-400M - The Foundation of Open-Source AI

Dataset: 400 million image-text pairs
Time to Download: 3.5 days (continuous)
Average Speed: ~1,320 images/second
Infrastructure: Single node with 1Gbps connection, 32GB RAM, 16-core CPU
Result: The dataset that powered Stable Diffusion v1 and countless other models

Key Insight: The LAION team used img2dataset with CLIP similarity filtering (threshold >0.3) to create a high-quality subset from Common Crawl data. They processed petabytes of WAT files to extract image URLs with alt-text, demonstrating the importance of a robust preprocessing pipeline.
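
If you want to reproduce this style of filtering on your own data, here is a minimal sketch using Hugging Face's CLIP implementation; the checkpoint and the 0.3 threshold mirror LAION's published setup, but this is an illustration, not LAION's exact pipeline:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between an image and its alt-text."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float(image_emb @ text_emb.T)

# Keep only pairs above the LAION-style threshold
keep = clip_similarity(Image.open("sample.jpg"), "a photo of a dog") > 0.3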

Case Study #2: COYO-700M - Pushing the Boundaries

Dataset: 747 million image-text pairs + metadata
Time to Download: ~6.5 days on optimized infrastructure
Unique Features: Preserved additional metadata (aesthetic scores, watermark detection, safety ratings)
Storage: ~15TB in WebDataset format
Use Case: Training multimodal models with rich attribute understanding

Key Insight: COYO leveraged img2dataset's --save_additional_columns feature to retain comprehensive metadata, enabling researchers to filter by aesthetic quality, safety, and watermarks post-download.
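
A minimal sketch of the same idea via img2dataset's Python API; the input file and column names below are hypothetical placeholders for whatever metadata your parquet actually carries:

from img2dataset import download

download(
    url_list="coyo_style_urls.parquet",  # hypothetical input file
    input_format="parquet",
    url_col="url",
    caption_col="text",
    # carry extra metadata columns through to the output shards
    save_additional_columns=["aesthetic_score", "watermark_score", "nsfw_score"],
    output_format="webdataset",
    output_folder="dataset_with_metadata",
)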

Case Study #3: LAION-5B - Distributed Scaling Mastery

Dataset: 5.85 billion image-text pairs
Time to Download: 7 days
Infrastructure: 10 nodes processing in parallel (PySpark distributor)
Average Speed: 9,500 images/second (cluster-wide)
Total Storage: 240TB

Key Insight: At this scale, single-machine processing becomes impractical. The team used img2dataset's PySpark integration to distribute shards across 10 machines, achieving near-linear scaling. Each node handled ~585M URLs independently.
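
A minimal sketch of this distributed setup, assuming an existing Spark cluster; the master URL, memory setting, and storage path are placeholders (see img2dataset's distributed tutorial for real-world tuning):

from pyspark.sql import SparkSession
from img2dataset import download

# img2dataset reuses the active Spark session when distributor="pyspark"
spark = (
    SparkSession.builder
    .master("spark://master-node:7077")      # hypothetical cluster URL
    .config("spark.executor.memory", "16G")  # placeholder sizing
    .appName("img2dataset-distributed")
    .getOrCreate()
)

download(
    url_list="urls.parquet",
    input_format="parquet",
    output_format="webdataset",
    output_folder="hdfs://cluster/dataset",  # distributed storage target
    distributor="pyspark",
)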


🛠️ Essential Tools & Alternatives

Primary Tool: img2dataset (⭐ Recommended)

Alternative Tools Comparison

| Tool | Speed | Scale | Resizing | Ethics Features | Best Use Case |
|---|---|---|---|---|---|
| img2dataset | 1,350 img/s | 100M-5B+ | ✅ Yes | ✅ X-Robots-Tag | AI training, research |
| wget/curl scripts | 50-200 img/s | <1M | ❌ No | ❌ Minimal | Quick prototypes |
| Scrapy | 300-500 img/s | <10M | ❌ Manual | ⚠️ Partial | Custom crawling logic |
| FiftyOne | Varies | <1M | ✅ Yes | ❌ Limited | Computer vision workflows |
| TensorFlow Datasets | Varies | Pre-built only | ✅ Yes | ✅ Yes | Standard academic datasets |

Infrastructure Requirements

Minimum Setup (10M images):

  • 4 CPU cores
  • 16GB RAM
  • 100 Mbps connection
  • 500GB SSD storage

Recommended Setup (100M+ images):

  • 16+ CPU cores (i7/Ryzen 9)
  • 32GB+ RAM
  • 1Gbps dedicated connection
  • 2TB+ NVMe SSD
  • Local DNS resolver (bind9/knot)

Enterprise Setup (1B+ images):

  • 10+ node cluster (32 cores/node)
  • 10Gbps aggregate bandwidth
  • 100TB+ distributed storage (HDFS/S3)
  • PySpark/Slurm orchestration

📋 Step-by-Step Safety Guide: Download Ethically & Efficiently

Phase 1: Pre-Download Preparation (1-2 hours)

Step 1: Legal & Ethical Compliance

# Check robots.txt for target domains
curl https://example.com/robots.txt

# Review Terms of Service
# Look for "noai", "noimageai" directives

# Configure img2dataset to respect opt-outs
img2dataset --disallowed_header_directives '["noai","noimageai","noindex"]' \
            --user_agent_token "AcmeResearchDownloader"

Step 2: Optimize DNS Resolution

# Install bind9 (Ubuntu/Debian)
sudo apt install bind9

# Configure for high performance
sudo nano /etc/bind/named.conf.options
# Add:
recursive-clients 10000;
resolver-query-timeout 30000;
max-clients-per-query 10000;
max-cache-size 2000m;

# Restart and set as default resolver
sudo systemctl restart bind9
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf

Step 3: Prepare Your URL List

# Example: Filter URLs before download
import re

import pandas as pd

# Load parquet with URLs and captions
df = pd.read_parquet('image_links.parquet')

# Remove duplicates
df = df.drop_duplicates(subset=['url'])

# Filter by domain if needed (escape dots so they aren't regex wildcards)
allowed_domains = ['flickr.com', 'wikimedia.org']
pattern = '|'.join(re.escape(d) for d in allowed_domains)
df = df[df['url'].str.contains(pattern)]

# Save in optimal format
df.to_parquet('filtered_urls.parquet', compression='snappy')

Phase 2: Configure & Launch (30 minutes)

Step 4: Test with Small Sample

# Download 1,000 images first (parquet is binary, so sample with pandas rather than head)
python -c "import pandas as pd; pd.read_parquet('filtered_urls.parquet').head(1000).to_parquet('test_sample.parquet')"

img2dataset --url_list=test_sample.parquet \
            --input_format=parquet \
            --output_format=webdataset \
            --image_size=256 \
            --thread_count=32 \
            --number_sample_per_shard=10000 \
            --output_folder=test_output

Step 5: Monitor Performance

# Enable Weights & Biases logging
pip install wandb
wandb login

img2dataset --enable_wandb=True \
            --wandb_project="my-dataset-download"

Step 6: Full-Scale Launch

# OPTIMIZED COMMAND FOR 100M IMAGES
img2dataset --url_list=filtered_urls.parquet \
            --input_format=parquet \
            --output_folder=/mnt/nvme/dataset \
            --output_format=webdataset \
            --image_size=256 \
            --encode_quality=95 \
            --processes_count=16 \
            --thread_count=256 \
            --number_sample_per_shard=10000 \
            --timeout=10 \
            --retries=2 \
            --incremental_mode=incremental \
            --compute_hash=sha256 \
            --disallowed_header_directives '["noai","noimageai","noindex","noimageindex"]' \
            --user_agent_token="MyResearchProject_v1.0" \
            --max_shard_retry=3 \
            --enable_wandb=True

Phase 3: Safety & Verification (Ongoing)

Step 7: Verify Downloads

# Check successful downloads (each shard writes a metadata parquet)
find /mnt/nvme/dataset -name "*.parquet" | xargs -n 1 \
  parquet-tools head -n 5

# Verify hashes if available
img2dataset --verify_hash '["md5","md5"]' \
            --compute_hash="md5"

Step 8: Handle Failures Gracefully

# Python script to resume failed shards
from img2dataset import download

download(
    url_list="filtered_urls.parquet",
    input_format="parquet",
    output_folder="/mnt/nvme/dataset",
    output_format="webdataset",
    incremental_mode="incremental",  # only download missing shards
    max_shard_retry=3,
)

Step 9: Respect Rate Limits

# Add delays for specific domains (if needed)
# Use --timeout and ensure thread_count isn't overwhelming small servers
# Monitor with: netstat -an | grep ESTABLISHED | wc -l
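
Before launching at full thread count, it helps to know whether a few domains dominate your URL list, since those hosts will absorb most of your concurrent connections. A minimal pandas sketch (assuming the filtered_urls.parquet file and url column from Step 3):

from urllib.parse import urlparse
import pandas as pd

df = pd.read_parquet("filtered_urls.parquet")
df["domain"] = df["url"].map(lambda u: urlparse(u).netloc)

# Heavily-represented domains deserve lower effective concurrency
# (or a separate, throttled run)
counts = df["domain"].value_counts()
print(counts.head(20))
print("Top-10 domain share:", counts.head(10).sum() / len(df))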

⚠️ Critical Safety Rules

RULE #1: Never Bypass Robots.txt or X-Robots-Tags

# ❌ WRONG: Ignoring all directives
img2dataset --disallowed_header_directives '[]'

# ✅ RIGHT: Respect creator opt-outs
img2dataset --disallowed_header_directives '["noai","noimageai"]'

RULE #2: Use Transparent User-Agent

# ✅ Include contact/project info
--user_agent_token="StanfordCVLab_ProjectName_Contact@stanford.edu"

RULE #3: Implement Incremental Mode

# ✅ Allows resuming without re-downloading
--incremental_mode=incremental

RULE #4: Monitor Bandwidth Impact

# ✅ Limit concurrent connections if needed
--thread_count=128  # Reduce from default 256 if causing issues

RULE #5: Validate Content Post-Download

# ✅ Check for corrupted images
from PIL import Image

def validate_image(filepath):
    """Return True if the file is a readable, uncorrupted image."""
    try:
        with Image.open(filepath) as img:
            img.verify()
        return True
    except Exception:
        return False
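
For example, sweeping a folder of extracted images (the directory path is illustrative; with WebDataset output you would extract or stream the tar shards first):

from pathlib import Path

corrupt = [p for p in Path("test_output_extracted").rglob("*.jpg")
           if not validate_image(p)]
print(f"{len(corrupt)} corrupted images found")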

💡 Real-World Use Cases

Use Case 1: Training Multimodal AI Models

Scenario: Building a CLIP-like vision-language model
Dataset: 400M image-text pairs (LAION-400M)
Pipeline: Download → CLIP filtering → Resizing → WebDataset → Training
Result: State-of-the-art zero-shot classification

Use Case 2: Generative AI Development

Scenario: Training Stable Diffusion on specific domains
Dataset: 50M high-resolution art images (filtered from LAION-Aesthetic)
Pipeline: Download → Aesthetic scoring → Watermark filtering → Augmentation
Result: Domain-specific image generation model

Use Case 3: Academic Research

Scenario: Studying visual representation learning
Dataset: 10M curated subset of ImageNet + LAION
Pipeline: Download → ResNet preprocessing → TFRecord format
Result: Reproducible computer vision experiments

Use Case 4: E-commerce Analysis

Scenario: Product image classification at scale
Dataset: 20M product images from public listings
Pipeline: Download → Standardize format → Metadata extraction
Result: Automated product tagging system

Use Case 5: Medical Imaging Research

Scenario: Collecting public medical images for AI diagnostics
Dataset: 2M de-identified radiology images
Pipeline: Download → Privacy checking → DICOM conversion
Result: Disease detection model (with IRB approval)


🎯 Performance Optimization Tips

For Maximum Speed:

  1. DNS is King: Use a local bind9 or knot resolver (run several instances, e.g. 4)
  2. SSD is Essential: The pipeline sustains 30-130MB/s of writes; NVMe handles this easily
  3. Thread Tuning: Start with 256 threads, adjust based on CPU usage
  4. Format Matters: WebDataset format reduces filesystem overhead by 10x (see the reading sketch after this list)
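
A minimal sketch of streaming img2dataset's WebDataset output for training, using the webdataset library (pip install webdataset); the shard range and metadata keys assume img2dataset's default naming:

import webdataset as wds

dataset = (
    wds.WebDataset("dataset/{00000..00009}.tar")
    .decode("pil")             # decode images to PIL
    .to_tuple("jpg", "json")   # img2dataset stores image + json metadata
)

for image, meta in dataset:
    print(meta.get("caption"), image.size)
    break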

For Cost Efficiency:

  1. Use Spot Instances: Save 70% on cloud costs
  2. Compress Smart: JPG quality 95 offers the best size/quality trade-off (2.9TB vs 9.8TB for PNG at 100M images)
  3. Filter Early: Remove duplicates and low-quality URLs before download
  4. Incremental Saves: Resume failed downloads instead of starting over

For Quality Control:

  1. Hash Verification: Use --compute_hash=sha256 for integrity
  2. Size Filtering: --min_image_size=100 removes thumbnails
  3. Aspect Ratio: --max_aspect_ratio=3.0 filters banners
  4. Metadata Preservation: Always keep the .parquet files for analysis (see the audit sketch after this list)
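
A minimal sketch of auditing a finished run from those metadata files; the status and error_message columns follow img2dataset's documented output schema, but verify against your version:

from pathlib import Path
import pandas as pd

stats = pd.concat(
    pd.read_parquet(p) for p in Path("/mnt/nvme/dataset").glob("*.parquet")
)
success_rate = (stats["status"] == "success").mean()
print(f"Success rate: {success_rate:.1%}")
# Most common failure reasons (timeouts, robots directives, 404s, ...)
print(stats.loc[stats["status"] != "success", "error_message"]
      .value_counts().head(10))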

🔬 Technical Deep Dive: How It Achieves 1,350 img/s

The secret lies in a shard-based multiprocessing architecture (a toy sketch follows the list):

  1. URL Sharding: URLs split into 10K-sample shards → 10,000 shards for 100M
  2. Process Pool: Each CPU core gets its own shard and output tar file
  3. Thread Explosion: 256+ threads per process handle async I/O
  4. CPU Efficiency: Only 1 resize thread per core prevents overload
  5. Bandwidth Saturation: Thousands of parallel connections maximize pipe
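
To make the model concrete, here is a toy sketch of the shard/process/thread pattern; it is not img2dataset's actual implementation, just an illustration of the concurrency layout:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List, Optional
import urllib.request

def fetch(url: str) -> Optional[bytes]:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except Exception:
        return None  # failures are recorded, not retried inline

def process_shard(shard_urls: List[str]) -> int:
    # Each process saturates the network with many I/O threads;
    # CPU-heavy resizing would stay at ~1 worker per core
    with ThreadPoolExecutor(max_workers=256) as pool:
        return sum(r is not None for r in pool.map(fetch, shard_urls))

def download_all(urls: List[str], shard_size: int = 10_000, processes: int = 16) -> int:
    shards = [urls[i:i + shard_size] for i in range(0, len(urls), shard_size)]
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return sum(pool.map(process_shard, shards))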

Benchmark Results:

  • 18M images: 3.7 hours (1350 img/s)
  • 36M images: 7.4 hours (1345 img/s)
  • 190M images: 41 hours (1280 img/s)
  • 100M images: ~20.8 hours (theoretical optimal)

📊 Expected Resource Consumption

For 100M images at 256x256 resolution:

| Resource | Usage | Notes |
|---|---|---|
| Bandwidth | ~5TB total | 50KB average per image |
| Disk Write | 50-100MB/s sustained | NVMe recommended |
| CPU | 80-95% across 16 cores | Mostly resizing operations |
| Memory | 8-12GB peak | Per-process overhead |
| DNS Queries | 100M+ | Requires local resolver |
| Storage | 2.9TB (JPG Q95) | or 9.8TB (PNG) |

🎓 Best Practices for Ethical AI Training

1. Transparency Documentation

Create a DATASET_CARD.md:

# Dataset Card
- **Source:** LAION-400M subset
- **Filtering:** CLIP similarity >0.3, aesthetic >7
- **Opt-out respected:** ✓ X-Robots-Tags honored
- **Contact:** research-team@university.edu
- **License:** CC-BY-4.0 (where applicable)

2. Attribution

When publishing models, cite:

@misc{beaumont-2021-img2dataset,
  author = {Romain Beaumont},
  title = {img2dataset: Easily turn large sets of image urls to an image dataset},
  year = {2021},
  howpublished = {\url{https://github.com/rom1504/img2dataset}}
}

3. Data Minimization

Only download what you'll actually use:

# Filter by relevance BEFORE download
--max_image_area 1048576  # Max 1024x1024
--min_image_size 100      # Remove too-small images

4. Regular Audits

# Monthly check: Are we respecting new opt-outs?
grep -r "noai" /var/log/img2dataset/

⚡ Quick Start Commands

For Beginners (10K images):

pip install img2dataset
echo "https://picsum.photos/200/300" > urls.txt
img2dataset --url_list=urls.txt --output_folder=images

For Researchers (1M images):

img2dataset --url_list=urls.parquet \
            --input_format=parquet \
            --output_format=webdataset \
            --processes_count=8 \
            --image_size=224

For Production (100M+ images):

# See full command in Step 6 above
# Run in screen/tmux for persistence
screen -S download
# ... command ...
# Detach with Ctrl+A then D

🚨 Troubleshooting Common Issues

| Problem | Cause | Solution |
|---|---|---|
| Slow downloads (<500 img/s) | Poor DNS resolution | Set up a local bind9 resolver |
| Corrupted images | Network timeouts | Increase --timeout=15, add retries |
| Disk full | Underestimated storage | Use WebDataset, lower quality to 90 |
| CPU bottleneck | Too many resize threads | Reduce --processes_count |
| Memory errors | Shard size too large | Lower --number_sample_per_shard |
| Legal concerns | Ignoring opt-outs | Enable --disallowed_header_directives |

🏆 Final Checklist: Before You Hit Enter

  • Reviewed robots.txt for target domains
  • Configured local DNS resolver (bind9/knot)
  • Tested on 1K sample first
  • Enabled incremental mode
  • Set transparent user_agent_token
  • Respecting X-Robots-Tags with --disallowed_header_directives
  • Have 3x storage space available (9TB for 100M)
  • Running in tmux/screen for persistence
  • Enabled W&B for monitoring
  • Documented dataset sources and filtering

📣 Share This Guide

Found this useful? Share the infographic above with your team! Tag #img2dataset and #MLCommunity to help others build ethical AI datasets.


Created with insights from LAION's official benchmarks and ethical AI practices. For the latest updates, follow the project at https://github.com/rom1504/img2dataset/
