Pipet: Command-Line Scraping with JavaScript & Unix Pipes
Tired of bloated web scraping frameworks that require Python virtual environments, dozens of dependencies, and complex configuration? Meet Pipet—the revolutionary command-line scraper that treats web data like Unix treats everything else: as streams of text ready for composition, transformation, and piping. Built for hackers who value simplicity and power, Pipet transforms the tedious task of data extraction into a delightful exercise in elegant shell scripting.
In this deep dive, you'll discover how Pipet's unique architecture leverages JavaScript evaluation, CSS selectors, and Unix pipes to create scraping workflows that are both readable and infinitely extensible. We'll explore real-world examples, from monitoring Hacker News headlines to tracking package deliveries, and show you why this Go-powered tool is rapidly becoming the secret weapon of developers who refuse to overcomplicate their stack.
What Is Pipet?
Pipet is a Swiss-army command-line tool for scraping and extracting data from online assets, engineered specifically for hackers who live in their terminals. Created by developer bjesus and written in Go, Pipet reimagines web scraping as a pipeline-based operation where each step is a simple, composable command.
Unlike traditional scraping frameworks that trap you in complex object models and asynchronous callbacks, Pipet embraces the Unix philosophy: do one thing well, and work seamlessly with other tools. It supports three distinct modes of operation—HTML parsing, JSON parsing, and client-side JavaScript evaluation—giving you the flexibility to tackle any data source without switching tools.
What makes Pipet genuinely revolutionary is its pipe-first architecture. Every data extraction step can be extended with standard Unix commands like grep, sed, awk, jq, or any other CLI tool in your arsenal. This means you're not learning a new API; you're applying skills you already possess. The tool has gained significant traction in the developer community precisely because it eliminates the friction between "I need this data" and "I have this data"—turning what used to be a 50-line Python script into a 5-line text file.
Key Features That Make Pipet Irresistible
Multi-Modal Scraping Engine
Pipet doesn't force you into a single extraction method. It intelligently handles three distinct data sources through a unified syntax:
- HTML Parsing: Use familiar CSS selectors to navigate DOM structures. The whitespace-sensitive nesting system creates natural parent-child relationships, making iteration intuitive.
- JSON Parsing: Navigate JSON APIs using dot notation (e.g. `current_condition.0.FeelsLikeC`) to drill into nested objects and arrays effortlessly.
- JavaScript Evaluation: For modern SPAs and dynamically rendered content, Pipet integrates with Playwright to execute custom JavaScript in a headless browser, giving you access to the fully rendered DOM.
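To make the three modes concrete, here's a minimal sketch of a single .pipet file with one block per mode (the URLs, selectors, and the document.title query are illustrative, not taken from the repository):

// HTML mode: CSS selectors against fetched markup
curl https://example.com
h1

// JSON mode: dot notation against an API response
curl https://wttr.in/?format=j1
current_condition.0.FeelsLikeC

// JavaScript mode: an expression evaluated in a headless browser
playwright https://example.com
document.title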
Native Unix Pipe Integration
This is Pipet's superpower. Any query line can be piped through external commands using the | operator, exactly as you would in Bash. Want to count characters in extracted titles? Append | wc -c. Need to extract an attribute from HTML? Pipe to htmlq. This transforms Pipet from a scraper into a data processing orchestrator that leverages the entire Unix ecosystem.
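For instance, to keep only Hacker News titles mentioning Rust, append a shell pipeline to the query line (a sketch; the selector is assumed):

curl https://news.ycombinator.com/
.titleline a | grep -i rust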
Curl Compatibility & Browser Fidelity
Resource lines starting with curl accept complete curl commands, including headers, cookies, and authentication. You can literally copy a request from Chrome DevTools' "Copy as cURL" feature and paste it directly into your Pipet file. This makes bypassing anti-bot measures and accessing authenticated sessions trivial—no manual header reconstruction required.
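A sketch of what a copied request might look like as a resource line (hypothetical URL, headers trimmed, cookie value elided):

curl 'https://example.com/account/orders' -H 'User-Agent: Mozilla/5.0' -H 'Cookie: session=...'
.order .status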
Declarative Data Structure
Pipet files use indentation to define data hierarchies. A parent selector runs as an iterator, executing child selectors for each match. This creates clean, visual mappings between DOM structure and output format. The result? Your scraper definition looks like the data it produces.
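A sketch of this shape (selectors assumed): the unindented `.product` line iterates over matches, and each indented line runs inside one match:

curl https://example.com/shop
.product
  .name
  .price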
Template-Driven Output
Beyond raw text and JSON, Pipet supports Go text/template for custom formatting. Drop a .tpl file next to your .pipet file, and Pipet automatically renders results through it. This enables generating HTML reports, CSV files, or any structured format without post-processing.
Change Monitoring Built-In
The --interval and --on-change flags turn Pipet into a monitoring daemon. It reruns your scraper at specified intervals and executes commands only when data changes—perfect for price tracking, stock alerts, or content change detection.
Real-World Use Cases Where Pipet Dominates
1. Real-Time Price Monitoring
E-commerce sites constantly change prices. With Pipet, create a scraper that checks competitor pricing every 30 minutes and sends a Slack notification when discounts appear:
pipet --interval 1800 --on-change 'curl -X POST -d "Price changed: {}" https://hooks.slack.com/...' prices.pipet
The .pipet file uses a copied curl command with your session cookies to access vendor portals, extracts prices with CSS selectors, and pipes results through jq for normalization.
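A hypothetical prices.pipet along those lines (the vendor URL, cookie, and selectors are all placeholders):

curl 'https://vendor.example.com/catalog' -H 'Cookie: session=...'
.product-row
  .product-name
  .price | jq -Rr 'ltrimstr("$")'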
2. Content Aggregation & News Tracking
Media analysts need to track breaking stories across multiple sources. Pipet's multi-block files let you scrape Hacker News, Reddit, and niche forums simultaneously, outputting a unified JSON feed for your dashboard. The Unix pipe integration means you can deduplicate entries with sort | uniq or filter keywords with grep before the data even hits your database.
3. Shipment & Logistics Tracking
That "track your package" page requires JavaScript execution and session cookies. Pipet handles both: use Playwright mode to render the tracking page, execute JavaScript to extract the delivery status, and pipe it through sed to format a clean status message. Set a 5-minute interval, and you'll never miss a delivery update.
4. GitHub Repository Analytics
Monitor your open-source project's popularity by scraping star counts, fork numbers, and issue activity. The example in Pipet's documentation shows extracting metrics from GitHub's SPA interface using JavaScript evaluation:
playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )
This runs in a headless browser, executes the JavaScript to parse the DOM, and returns clean metrics—all from a two-line text file.
5. Weather & Environmental Data Collection
The README's weather example demonstrates JSON API consumption:
curl https://wttr.in/Alert%20Canada?format=j1
current_condition.0.FeelsLikeC
current_condition.0.FeelsLikeF
Perfect for IoT projects, home automation, or data science pipelines needing reliable weather feeds without heavy SDKs.
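Save the curl line and the two queries above as weather.pipet, then run it plainly or with --json for machine-readable output:

pipet weather.pipet
pipet --json weather.pipet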
Step-by-Step Installation & Setup Guide
Method 1: Pre-Built Binary (Fastest)
Download the latest release for your platform from the Releases page:
# Example for Linux amd64
wget https://github.com/bjesus/pipet/releases/latest/download/pipet_linux_amd64.tar.gz
tar -xzf pipet_linux_amd64.tar.gz
chmod +x pipet
sudo mv pipet /usr/local/bin/
Method 2: Go Install (Recommended for Developers)
Requires Go 1.19+:
go install github.com/bjesus/pipet/cmd/pipet@latest
This compiles Pipet from source and installs it to $GOPATH/bin. Verify with:
pipet --version
Method 3: Package Managers
Arch Linux (AUR):
yay -S pipet-git
Homebrew (macOS/Linux):
brew install pipet
Nix:
nix-env -iA nixpkgs.pipet
Method 4: Run Without Installing
For one-off usage or testing:
go run github.com/bjesus/pipet/cmd/pipet@latest your-scraper.pipet
Environment Setup
- For JavaScript rendering: Playwright dependencies are installed automatically the first time they're needed:
  pipet --verbose your-js-scraper.pipet  # Auto-downloads on first run
- For pipe integration: ensure your favorite CLI tools are installed:
  # Recommended tools
  sudo apt install htmlq jq pup
- Create a project directory:
  mkdir ~/scrapers && cd ~/scrapers
Real Code Examples from the Repository
Example 1: Basic Hacker News Scraper
This is the canonical Pipet example. Create hackernews.pipet:
curl https://news.ycombinator.com/
.title .titleline
span > a
.sitebit a
Line-by-line breakdown:
- `curl https://news.ycombinator.com/` - fetches the raw HTML using curl. You could paste a full curl command with headers here.
- `.title .titleline` - CSS selector targeting each news item container. The leading dot indicates a class. This becomes our iterator; Pipet runs the child selectors for each match.
- `span > a` - indented selector that runs within each `.titleline`. The `>` means direct child. This extracts the article title link.
- `.sitebit a` - second indented selector at the same level, extracting the domain name from the `.sitebit` container.
Run it:
pipet hackernews.pipet
Output shows title and domain pairs. Add --json for structured data:
pipet --json hackernews.pipet
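The JSON is nested by block, then by iterator match; an illustrative shape (consistent with the template navigation in Example 3, not literal output):

[
  [
    [
      ["Example article title", "example.com"],
      ["Another article title", "example.org"]
    ]
  ]
]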
Example 2: Multi-Source Data Aggregation
This advanced example from the README demonstrates Pipet's multi-block capability:
// Read Wikipedia's "On This Day" and the subject of today's featured article
curl https://en.wikipedia.org/wiki/Main_Page
div#mp-otd li
body
div#mp-tfa > p > b > a
// Get the weather in Alert, Canada
curl https://wttr.in/Alert%20Canada?format=j1
current_condition.0.FeelsLikeC
current_condition.0.FeelsLikeF
// Check how popular the Pipet repo is
playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )
Technical insights:
- Blocks are separated by empty lines; each block runs independently and outputs sequentially.
- Comments start with `//` and are ignored, perfect for documentation.
- The Wikipedia block uses CSS selectors: `div#mp-otd li` selects list items from "On This Day", while `div#mp-tfa > p > b > a` drills down to the featured article link.
- The weather block shows JSON parsing: `current_condition.0.FeelsLikeC` navigates to the first array item and extracts the Celsius value.
- The GitHub block uses Playwright mode to render the SPA, then executes JavaScript to extract numeric metrics (stars, forks) using `Array.from()` and `filter()` with a regex.
Example 3: Template Rendering
Create hackernews.tpl alongside your .pipet file:
<ul>
{{range $index, $item := index (index . 0) 0}}
<li>{{index $item 0}} ({{index $item 1}})</li>
{{end}}
</ul>
Template breakdown:
- `{{range ...}}` loops through the results. The expression `index (index . 0) 0` navigates Pipet's nested data structure: first block, first query set.
- `{{index $item 0}}` accesses the first element (title), `{{index $item 1}}` the second (domain).
- Pipet auto-detects the `.tpl` file and renders output through it, producing clean HTML.
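With the .tpl file in place, running `pipet hackernews.pipet` would emit something like (illustrative values):

<ul>
  <li>Example article title (example.com)</li>
  <li>Another article title (example.org)</li>
</ul>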
Example 4: Unix Pipe Integration
Enhance the Hacker News scraper with post-processing:
curl https://news.ycombinator.com/
.title .titleline
span > a
span > a | wc -c # Count characters in each title
.sitebit a
.sitebit a | htmlq --attribute href a # Extract full URL
Pipe magic explained:
- `| wc -c` pipes each title through `wc -c`, outputting character counts instead of text.
- `| htmlq --attribute href a` pipes the domain link through `htmlq` to extract the `href` attribute, giving you the full URL.
- Pipes execute in your actual shell, so you can use any installed command. This makes Pipet infinitely extensible without plugins.
Example 5: Change Monitoring Daemon
Track when the #1 Hacker News story changes:
curl https://news.ycombinator.com/
.title .titleline a
Run with monitoring flags:
pipet --interval 60 --on-change "notify-send {}" hackernews.pipet
How it works:
- `--interval 60` reruns the scraper every 60 seconds.
- `--on-change` executes only when output differs from the previous run.
- `{}` in the command gets replaced with the new data.
- `notify-send` creates a desktop notification, perfect for staying informed without constant polling.
Advanced Usage & Best Practices
Leverage Browser DevTools
Pro tip: Right-click any network request in Chrome/Firefox DevTools, select "Copy as cURL", and paste directly into your Pipet file. This preserves authentication, headers, and cookies, making it trivial to scrape authenticated pages.
Optimize Selector Performance
Use specific selectors (div#mp-otd li) over broad ones (div li). For large pages, combine Pipet's --max-pages flag with precise selectors to minimize bandwidth and processing time.
Handle Pagination Elegantly
Add a "next page" line at the block's end:
curl https://example.com/page1
.item .title
// Selector for the "Next" button
> .next-page a
Pipet automatically follows pagination until --max-pages is reached.
Secure Credential Management
Never hardcode secrets. Use environment variables:
curl -H "Authorization: Bearer $API_TOKEN" https://api.example.com
Combine with Cron for Scheduling
While --interval is great for monitoring, use cron for scheduled scrapes:
# Run daily at 9 AM
0 9 * * * /usr/local/bin/pipet --json /home/user/daily-scrape.pipet > /var/data/$(date +\%Y-\%m-\%d).json
Debug with Verbose Mode
Stuck? Use --verbose to see exactly what Pipet is doing:
pipet --verbose --json debug.pipet
This shows request details, selector matches, and pipe executions.
Comparison with Alternatives
| Feature | Pipet | Beautiful Soup | Scrapy | Puppeteer |
|---|---|---|---|---|
| Interface | CLI / Text files | Python API | Python Framework | Node.js API |
| Learning Curve | Minimal (CSS/Shell) | Moderate | Steep | Moderate |
| JavaScript Rendering | Yes (Playwright) | No | Via plugins | Yes (Native) |
| Unix Pipe Integration | Native | No | No | No |
| Authentication | Copy-paste curl | Manual setup | Middleware | Browser context |
| Output Formats | Text/JSON/Template | Custom | JSON/CSV/XML | Custom |
| Resource Usage | Very Low | Medium | High | Very High |
| Scalability | Medium | Medium | High | Medium |
| Setup Time | Seconds | Minutes | Hours | Minutes |
| Best For | Quick hacks, monitoring | Simple HTML parsing | Large crawls | Complex SPA automation |
Why choose Pipet? When you need to scrape something in the next 5 minutes, Pipet wins. No virtual environments, no dependency hell, no boilerplate. It's the difference between writing a 50-line script and a 5-line text file. For large-scale distributed crawling, Scrapy remains superior. For complex browser automation, Puppeteer offers more control. But for 99% of daily scraping tasks, Pipet's simplicity and pipe integration make it unbeatable.
FAQ: Everything You Need to Know
Q: How is Pipet different from just using curl and grep?
A: While curl | grep works for simple cases, Pipet provides structured data extraction through CSS selectors, handles pagination automatically, supports JavaScript rendering, and outputs JSON or templated formats. The indentation-based nesting creates parent-child relationships that grep simply cannot express.
Q: Can Pipet handle websites that require login?
A: Absolutely. Copy the authenticated request from your browser as a curl command (right-click → Copy as cURL) and paste it into your Pipet file. All cookies, headers, and session tokens are preserved, giving you the same access as your browser.
Q: Does Pipet work on Windows?
A: Yes. While designed with Unix philosophy in mind, Pipet runs on Windows via WSL (Windows Subsystem for Linux) or natively with PowerShell. The pipe integration works best in WSL where standard Unix tools are available.
Q: How does the JavaScript evaluation mode work?
A: When a resource line starts with playwright, Pipet launches a headless browser, navigates to the URL, and executes your JavaScript code in the page context. You have full access to document.querySelector, fetch, and modern browser APIs. If Playwright isn't installed, Pipet downloads it automatically on first use.
Q: Is Pipet suitable for large-scale web crawling?
A: Pipet excels at targeted extraction and monitoring tasks. For massive crawls spanning millions of pages, dedicated frameworks like Scrapy with distributed queues are more appropriate. However, for scraping hundreds of pages with complex JavaScript requirements, Pipet's simplicity often outweighs the overhead of heavier tools.
Q: Can I use XPath instead of CSS selectors?
A: Currently, Pipet supports CSS selectors for HTML queries. For most use cases, modern CSS selectors are sufficiently powerful. If you absolutely need XPath, you can pipe through tools like xmllint or xidel that support XPath expressions.
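For example, a sketch piping matched elements through xmllint for an XPath query (URL and selectors assumed; it presumes, as with the htmlq example earlier, that the element's HTML reaches the pipe's stdin, and 2>/dev/null silences HTML parser warnings):

curl https://example.com
body | xmllint --html --xpath '//a/@href' - 2>/dev/null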
Q: How do I debug when selectors aren't working?
A: Use the --verbose flag to see detailed logs. Additionally, test selectors directly in your browser's DevTools console with document.querySelectorAll(). For JavaScript mode, you can console.log() values and see them in verbose output.
Conclusion: The Scraper You've Been Waiting For
Pipet represents a paradigm shift in how we approach web scraping. By embracing Unix pipes instead of fighting them, by treating curl commands as first-class citizens, and by making JavaScript evaluation as simple as writing a selector, it removes the barriers between idea and implementation.
Whether you're a data journalist tracking public records, a developer monitoring API changes, or a hacker automating personal workflows, Pipet gives you superpowers without the super-bloat. Its declarative syntax reads like documentation, its pipe integration leverages your existing tool knowledge, and its multi-modal support means you'll never need another scraper for 99% of tasks.
The beauty of Pipet lies in its respect for your time. That 5-minute scraping task actually takes 5 minutes. That monitoring script doesn't become a maintenance nightmare. Your scrapers become shareable, version-controlled text files that any team member can understand and modify.
Ready to transform your scraping workflow? Head to the Pipet GitHub repository, star it for later reference, and try the Hacker News example in the next 5 minutes. Your future self will thank you every time you need to extract data from the web—which, let's be honest, is every day.
Install Pipet now. Write your first .pipet file. Join the growing community of developers who've discovered that the best tool is often the simplest one.