
Pipet: Command-Line Scraping with JavaScript & Unix Pipes

By Bright Coding


Tired of bloated web scraping frameworks that require Python virtual environments, dozens of dependencies, and complex configuration? Meet Pipet—the revolutionary command-line scraper that treats web data like Unix treats everything else: as streams of text ready for composition, transformation, and piping. Built for hackers who value simplicity and power, Pipet transforms the tedious task of data extraction into a delightful exercise in elegant shell scripting.

In this deep dive, you'll discover how Pipet's unique architecture leverages JavaScript evaluation, CSS selectors, and Unix pipes to create scraping workflows that are both readable and infinitely extensible. We'll explore real-world examples, from monitoring Hacker News headlines to tracking package deliveries, and show you why this Go-powered tool is rapidly becoming the secret weapon of developers who refuse to overcomplicate their stack.

What Is Pipet?

Pipet is a Swiss-army command-line tool for scraping and extracting data from online assets, engineered specifically for hackers who live in their terminals. Created by developer bjesus and written in Go, Pipet reimagines web scraping as a pipeline-based operation where each step is a simple, composable command.

Unlike traditional scraping frameworks that trap you in complex object models and asynchronous callbacks, Pipet embraces the Unix philosophy: do one thing well, and work seamlessly with other tools. It supports three distinct modes of operation—HTML parsing, JSON parsing, and client-side JavaScript evaluation—giving you the flexibility to tackle any data source without switching tools.

What makes Pipet genuinely revolutionary is its pipe-first architecture. Every data extraction step can be extended with standard Unix commands like grep, sed, awk, jq, or any other CLI tool in your arsenal. This means you're not learning a new API; you're applying skills you already possess. The tool has gained significant traction in the developer community precisely because it eliminates the friction between "I need this data" and "I have this data"—turning what used to be a 50-line Python script into a 5-line text file.

Key Features That Make Pipet Irresistible

Multi-Modal Scraping Engine

Pipet doesn't force you into a single extraction method. It intelligently handles three distinct data sources through a unified syntax:

  • HTML Parsing: Use familiar CSS selectors to navigate DOM structures. The whitespace-sensitive nesting system creates natural parent-child relationships, making iteration intuitive.
  • JSON Parsing: Navigate JSON APIs using dot notation (current_condition.0.FeelsLikeC) to drill into nested objects and arrays effortlessly.
  • JavaScript Evaluation: For modern SPAs and dynamically rendered content, Pipet integrates with Playwright to execute custom JavaScript in a headless browser, giving you access to the fully rendered DOM.
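All three modes share one file format: blocks are separated by blank lines, and the first line of each block names the resource. A minimal sketch of how they sit side by side — the URLs and selectors here are illustrative placeholders, not examples from the Pipet docs:

```text
// HTML mode: CSS selectors against the fetched page
curl https://example.com/news
.article
  h2 a

// JSON mode: dot notation against an API response
curl https://api.example.com/status
services.0.uptime

// JavaScript mode: code evaluated in a headless browser
playwright https://example.com/spa
document.title
```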

Native Unix Pipe Integration

This is Pipet's superpower. Any query line can be piped through external commands using the | operator, exactly as you would in Bash. Want to count characters in extracted titles? Append | wc -c. Need to extract an attribute from HTML? Pipe to htmlq. This transforms Pipet from a scraper into a data processing orchestrator that leverages the entire Unix ecosystem.

Curl Compatibility & Browser Fidelity

Resource lines starting with curl accept complete curl commands, including headers, cookies, and authentication. You can literally copy a request from Chrome DevTools' "Copy as cURL" feature and paste it directly into your Pipet file. This makes bypassing anti-bot measures and accessing authenticated sessions trivial—no manual header reconstruction required.
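In practice, that means a resource line can be an entire copied request. A hedged sketch, kept on one line — the host, cookie value, and selector are placeholders, and real "Copy as cURL" output will carry many more headers:

```text
curl 'https://example.com/dashboard' -H 'User-Agent: Mozilla/5.0' -H 'Cookie: session=PLACEHOLDER'
.account .balance
```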

Declarative Data Structure

Pipet files use indentation to define data hierarchies. A parent selector runs as an iterator, executing child selectors for each match. This creates clean, visual mappings between DOM structure and output format. The result? Your scraper definition looks like the data it produces.

Template-Driven Output

Beyond raw text and JSON, Pipet supports Go text/template for custom formatting. Drop a .tpl file next to your .pipet file, and Pipet automatically renders results through it. This enables generating HTML reports, CSV files, or any structured format without post-processing.

Change Monitoring Built-In

The --interval and --on-change flags turn Pipet into a monitoring daemon. It reruns your scraper at specified intervals and executes commands only when data changes—perfect for price tracking, stock alerts, or content change detection.

Real-World Use Cases Where Pipet Dominates

1. Real-Time Price Monitoring

E-commerce sites constantly change prices. With Pipet, create a scraper that checks competitor pricing every 30 minutes and sends a Slack notification when discounts appear:

pipet --interval 1800 --on-change 'curl -X POST -d "Price changed: {}" https://hooks.slack.com/...' prices.pipet

The .pipet file uses a copied curl command with your session cookies to access vendor portals, extracts prices with CSS selectors, and pipes results through jq for normalization.
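A hypothetical prices.pipet along those lines — the vendor URL, cookie value, and selectors are placeholders, and the sed step stands in for whatever normalization you need:

```text
// prices.pipet — check a competitor's product page (all values are placeholders)
curl 'https://vendor.example.com/product/123' -H 'Cookie: session=PLACEHOLDER'
.price .amount | sed 's/[^0-9.]//g'
```

Paired with the `--interval`/`--on-change` command above, this becomes a self-contained price watcher in two files.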

2. Content Aggregation & News Tracking

Media analysts need to track breaking stories across multiple sources. Pipet's multi-block files let you scrape Hacker News, Reddit, and niche forums simultaneously, outputting a unified JSON feed for your dashboard. The Unix pipe integration means you can deduplicate entries with sort | uniq or filter keywords with grep before the data even hits your database.
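The filter-and-dedup step is ordinary shell, so it can be appended to Pipet's output like any other pipeline. Simulated here on sample lines so the pipeline itself is visible — in real use, the printf would be replaced by `pipet feeds.pipet`:

```shell
# Duplicate headlines from two sources, filtered by keyword and deduplicated.
# In real use, replace the printf with: pipet feeds.pipet
printf 'Rust 1.80 released\nGo 1.23 released\nRust 1.80 released\n' \
  | grep -i 'rust' | sort | uniq
# Output: Rust 1.80 released
```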

3. Shipment & Logistics Tracking

That "track your package" page requires JavaScript execution and session cookies. Pipet handles both: use Playwright mode to render the tracking page, execute JavaScript to extract the delivery status, and pipe it through sed to format a clean status message. Set a 5-minute interval, and you'll never miss a delivery update.
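A sketch of what such a tracker might look like — the carrier URL, tracking ID, and selector are hypothetical, and this assumes JavaScript-mode query lines can be piped like any other query line:

```text
// track.pipet — render the page, pull the status, tidy the whitespace
playwright https://carrier.example.com/track?id=1Z999PLACEHOLDER
document.querySelector('.delivery-status').innerText | sed 's/^ *//;s/ *$//'
```

Run it with `pipet --interval 300 --on-change 'notify-send "Package: {}"' track.pipet` and the notification fires only when the status actually changes.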

4. GitHub Repository Analytics

Monitor your open-source project's popularity by scraping star counts, fork numbers, and issue activity. The example in Pipet's documentation shows extracting metrics from GitHub's SPA interface using JavaScript evaluation:

playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )

This runs in a headless browser, executes the JavaScript to parse the DOM, and returns clean metrics—all from a two-line text file.

5. Weather & Environmental Data Collection

The README's weather example demonstrates JSON API consumption:

curl https://wttr.in/Alert%20Canada?format=j1
current_condition.0.FeelsLikeC
current_condition.0.FeelsLikeF

Perfect for IoT projects, home automation, or data science pipelines needing reliable weather feeds without heavy SDKs.

Step-by-Step Installation & Setup Guide

Method 1: Pre-Built Binary (Fastest)

Download the latest release for your platform from the Releases page:

# Example for Linux amd64
wget https://github.com/bjesus/pipet/releases/latest/download/pipet_linux_amd64.tar.gz
tar -xzf pipet_linux_amd64.tar.gz
chmod +x pipet
sudo mv pipet /usr/local/bin/

Method 2: Go Install (Recommended for Developers)

Requires Go 1.19+:

go install github.com/bjesus/pipet/cmd/pipet@latest

This compiles Pipet from source and installs it to $GOPATH/bin. Verify with:

pipet --version

Method 3: Package Managers

Arch Linux (AUR):

yay -S pipet-git

Homebrew (macOS/Linux):

brew install pipet

Nix:

nix-env -iA nixpkgs.pipet

Method 4: Run Without Installing

For one-off usage or testing:

go run github.com/bjesus/pipet/cmd/pipet@latest your-scraper.pipet

Environment Setup

  1. For JavaScript rendering: Install Playwright dependencies:

    pipet --verbose your-js-scraper.pipet  # Auto-downloads on first run
    
  2. For pipe integration: Ensure your favorite CLI tools are installed:

    # Recommended companions: jq ships in most distro repos;
    # htmlq and pup are typically installed via cargo / go
    sudo apt install jq
    cargo install htmlq
    go install github.com/ericchiang/pup@latest
    
  3. Create a project directory:

    mkdir ~/scrapers && cd ~/scrapers
    

REAL Code Examples from the Repository

Example 1: Basic Hacker News Scraper

This is the canonical Pipet example. Create hackernews.pipet:

curl https://news.ycombinator.com/
.title .titleline
  span > a
  .sitebit a

Line-by-line breakdown:

  1. curl https://news.ycombinator.com/ - Fetches the raw HTML using curl. You could paste a full curl command with headers here.

  2. .title .titleline - CSS selector targeting each news item container. The leading dot indicates a class. This becomes our iterator—Pipet runs the child selectors for each match.

  3. span > a - Indented selector that runs within each .titleline. The > means direct child. This extracts the article title link.

  4. .sitebit a - Second indented selector at the same level, extracting the domain name from the .sitebit container.

Run it:

pipet hackernews.pipet

Output shows title and domain pairs. Add --json for structured data:

pipet --json hackernews.pipet

Example 2: Multi-Source Data Aggregation

This advanced example from the README demonstrates Pipet's multi-block capability:

// Read Wikipedia's "On This Day" and the subject of today's featured article
curl https://en.wikipedia.org/wiki/Main_Page
div#mp-otd li
  body
div#mp-tfa > p > b > a

// Get the weather in Alert, Canada
curl https://wttr.in/Alert%20Canada?format=j1
current_condition.0.FeelsLikeC
current_condition.0.FeelsLikeF

// Check how popular the Pipet repo is
playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )

Technical insights:

  • Blocks are separated by empty lines—each block runs independently and outputs sequentially.
  • Comments start with // and are ignored, perfect for documentation.
  • Wikipedia block uses CSS selectors: div#mp-otd li selects list items from "On This Day", while div#mp-tfa > p > b > a drills down to the featured article link.
  • Weather block shows JSON parsing: current_condition.0.FeelsLikeC navigates to the first array item and extracts the Celsius value.
  • GitHub block uses Playwright mode to render the SPA, then executes JavaScript to extract numeric metrics (stars, forks) using Array.from() and filter() with a regex.

Example 3: Template Rendering

Create hackernews.tpl alongside your .pipet file:

<ul>
  {{range $index, $item := index (index . 0) 0}}
    <li>{{index $item 0}} ({{index $item 1}})</li>
  {{end}}
</ul>

Template breakdown:

  • {{range ...}} loops through the results. The complex index (index . 0) 0 navigates Pipet's nested data structure: first block, first query set.
  • {{index $item 0}} accesses the first element (title), {{index $item 1}} the second (domain).
  • Pipet auto-detects the .tpl file and renders output through it, producing clean HTML.

Example 4: Unix Pipe Integration

Enhance the Hacker News scraper with post-processing:

curl https://news.ycombinator.com/
.title .titleline
  span > a
  span > a | wc -c      # Count characters in each title
  .sitebit a
  .sitebit a | htmlq --attribute href a  # Extract full URL

Pipe magic explained:

  • | wc -c pipes each title through wc -c, outputting character counts instead of text.
  • | htmlq --attribute href a pipes the domain link through htmlq to extract the href attribute, giving you the full URL.
  • Pipes execute in your actual shell, so you can use any installed command. This makes Pipet infinitely extensible without plugins.

Example 5: Change Monitoring Daemon

Track when the #1 Hacker News story changes:

curl https://news.ycombinator.com/
.title .titleline a

Run with monitoring flags:

pipet --interval 60 --on-change "notify-send {}" hackernews.pipet

How it works:

  • --interval 60 reruns the scraper every 60 seconds.
  • --on-change executes only when output differs from the previous run.
  • {} in the command gets replaced with the new data.
  • notify-send creates a desktop notification—perfect for staying informed without constant polling.

Advanced Usage & Best Practices

Leverage Browser DevTools

Pro tip: Right-click any network request in Chrome/Firefox DevTools, select "Copy as cURL", and paste directly into your Pipet file. This preserves authentication, headers, and cookies, making it trivial to scrape authenticated pages.

Optimize Selector Performance

Use specific selectors (div#mp-otd li) over broad ones (div li). For large pages, combine Pipet's --max-pages flag with precise selectors to minimize bandwidth and processing time.

Handle Pagination Elegantly

Add a "next page" line at the block's end:

curl https://example.com/page1
.item .title
> .next-page a  # Selector for the "Next" button

Pipet automatically follows pagination until --max-pages is reached.

Secure Credential Management

Never hardcode secrets. Use environment variables:

curl -H "Authorization: Bearer $API_TOKEN" https://api.example.com

Combine with Cron for Scheduling

While --interval is great for monitoring, use cron for scheduled scrapes:

# Run daily at 9 AM
0 9 * * * /usr/local/bin/pipet --json /home/user/daily-scrape.pipet > /var/data/$(date +\%Y-\%m-\%d).json

Debug with Verbose Mode

Stuck? Use --verbose to see exactly what Pipet is doing:

pipet --verbose --json debug.pipet

This shows request details, selector matches, and pipe executions.

Comparison with Alternatives

| Feature | Pipet | Beautiful Soup | Scrapy | Puppeteer |
| --- | --- | --- | --- | --- |
| Interface | CLI / text files | Python API | Python framework | Node.js API |
| Learning curve | Minimal (CSS/shell) | Moderate | Steep | Moderate |
| JavaScript rendering | Yes (Playwright) | No | Via plugins | Yes (native) |
| Unix pipe integration | Native | No | No | No |
| Authentication | Copy-paste curl | Manual setup | Middleware | Browser context |
| Output formats | Text/JSON/template | Custom | JSON/CSV/XML | Custom |
| Resource usage | Very low | Medium | High | Very high |
| Scalability | Medium | Medium | High | Medium |
| Setup time | Seconds | Minutes | Hours | Minutes |
| Best for | Quick hacks, monitoring | Simple HTML parsing | Large crawls | Complex SPA automation |

Why choose Pipet? When you need to scrape something in the next 5 minutes, Pipet wins. No virtual environments, no dependency hell, no boilerplate. It's the difference between writing a 50-line script and a 5-line text file. For large-scale distributed crawling, Scrapy remains superior. For complex browser automation, Puppeteer offers more control. But for 99% of daily scraping tasks, Pipet's simplicity and pipe integration make it unbeatable.

FAQ: Everything You Need to Know

Q: How is Pipet different from just using curl and grep?

A: While curl | grep works for simple cases, Pipet provides structured data extraction through CSS selectors, handles pagination automatically, supports JavaScript rendering, and outputs JSON or templated formats. The indentation-based nesting creates parent-child relationships that grep simply cannot express.

Q: Can Pipet handle websites that require login?

A: Absolutely. Copy the authenticated request from your browser as a curl command (right-click → Copy as cURL) and paste it into your Pipet file. All cookies, headers, and session tokens are preserved, giving you the same access as your browser.

Q: Does Pipet work on Windows?

A: Yes. While designed with Unix philosophy in mind, Pipet runs on Windows via WSL (Windows Subsystem for Linux) or natively with PowerShell. The pipe integration works best in WSL where standard Unix tools are available.

Q: How does the JavaScript evaluation mode work?

A: When a resource line starts with playwright, Pipet launches a headless browser, navigates to the URL, and executes your JavaScript code in the page context. You have full access to document.querySelector, fetch, and modern browser APIs. If Playwright isn't installed, Pipet downloads it automatically on first use.

Q: Is Pipet suitable for large-scale web crawling?

A: Pipet excels at targeted extraction and monitoring tasks. For massive crawls spanning millions of pages, dedicated frameworks like Scrapy with distributed queues are more appropriate. However, for scraping hundreds of pages with complex JavaScript requirements, Pipet's simplicity often outweighs the overhead of heavier tools.

Q: Can I use XPath instead of CSS selectors?

A: Currently, Pipet supports CSS selectors for HTML queries. For most use cases, modern CSS selectors are sufficiently powerful. If you absolutely need XPath, you can pipe through tools like xmllint or xidel that support XPath expressions.
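For example, a block could hand the fetched markup to xmllint, which accepts XPath via its --xpath flag. The URL and expression below are placeholders, and this assumes the selected fragment reaches the pipe as HTML:

```text
curl https://example.com/
body | xmllint --html --xpath '//a/@href' - 2>/dev/null
```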

Q: How do I debug when selectors aren't working?

A: Use the --verbose flag to see detailed logs. Additionally, test selectors directly in your browser's DevTools console with document.querySelectorAll(). For JavaScript mode, you can console.log() values and see them in verbose output.

Conclusion: The Scraper You've Been Waiting For

Pipet represents a paradigm shift in how we approach web scraping. By embracing Unix pipes instead of fighting them, by treating curl commands as first-class citizens, and by making JavaScript evaluation as simple as writing a selector, it removes the barriers between idea and implementation.

Whether you're a data journalist tracking public records, a developer monitoring API changes, or a hacker automating personal workflows, Pipet gives you superpowers without the super-bloat. Its declarative syntax reads like documentation, its pipe integration leverages your existing tool knowledge, and its multi-modal support means you'll never need another scraper for 99% of tasks.

The beauty of Pipet lies in its respect for your time. That 5-minute scraping task actually takes 5 minutes. That monitoring script doesn't become a maintenance nightmare. Your scrapers become shareable, version-controlled text files that any team member can understand and modify.

Ready to transform your scraping workflow? Head to the Pipet GitHub repository, star it for later reference, and try the Hacker News example in the next 5 minutes. Your future self will thank you every time you need to extract data from the web—which, let's be honest, is every day.

Install Pipet now. Write your first .pipet file. Join the growing community of developers who've discovered that the best tool is often the simplest one.
