Pipet: Command-Line Scraping with JavaScript & Unix Pipes
Tired of bloated web scraping frameworks that require Python virtual environments, dozens of dependencies, and complex configuration? Meet Pipet—the revolutionary command-line scraper that treats web data like Unix treats everything else: as streams of text ready for composition, transformation, and piping. Built for hackers who value simplicity and power, Pipet transforms the tedious task of data extraction into a delightful exercise in elegant shell scripting.
In this deep dive, you'll discover how Pipet's unique architecture leverages JavaScript evaluation, CSS selectors, and Unix pipes to create scraping workflows that are both readable and infinitely extensible. We'll explore real-world examples, from monitoring Hacker News headlines to tracking package deliveries, and show you why this Go-powered tool is rapidly becoming the secret weapon of developers who refuse to overcomplicate their stack.
What Is Pipet?
Pipet is a Swiss-army command-line tool for scraping and extracting data from online assets, engineered specifically for hackers who live in their terminals. Created by developer bjesus and written in Go, Pipet reimagines web scraping as a pipeline-based operation where each step is a simple, composable command.
Unlike traditional scraping frameworks that trap you in complex object models and asynchronous callbacks, Pipet embraces the Unix philosophy: do one thing well, and work seamlessly with other tools. It supports three distinct modes of operation—HTML parsing, JSON parsing, and client-side JavaScript evaluation—giving you the flexibility to tackle any data source without switching tools.
What makes Pipet genuinely revolutionary is its pipe-first architecture. Every data extraction step can be extended with standard Unix commands like grep, sed, awk, jq, or any other CLI tool in your arsenal. This means you're not learning a new API; you're applying skills you already possess. The tool has gained significant traction in the developer community precisely because it eliminates the friction between "I need this data" and "I have this data"—turning what used to be a 50-line Python script into a 5-line text file.
Key Features That Make Pipet Irresistible
Multi-Modal Scraping Engine
Pipet doesn't force you into a single extraction method. It intelligently handles three distinct data sources through a unified syntax:
- HTML Parsing: Use familiar CSS selectors to navigate DOM structures. The whitespace-sensitive nesting system creates natural parent-child relationships, making iteration intuitive.
- JSON Parsing: Navigate JSON APIs using dot notation (e.g. `current_condition.0.FeelsLikeC`) to drill into nested objects and arrays effortlessly.
- JavaScript Evaluation: For modern SPAs and dynamically rendered content, Pipet integrates with Playwright to execute custom JavaScript in a headless browser, giving you access to the fully rendered DOM.
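To make the three modes concrete, here's a minimal sketch of a single .pipet file with one block per mode (the URLs, selectors, and the document.title query are illustrative, not taken from the repository):

// HTML mode: CSS selectors against fetched markup
curl https://example.com
h1

// JSON mode: dot notation against an API response
curl https://wttr.in/?format=j1
current_condition.0.FeelsLikeC

// JavaScript mode: an expression evaluated in a headless browser
playwright https://example.com
document.title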
Native Unix Pipe Integration
This is Pipet's superpower. Any query line can be piped through external commands using the | operator, exactly as you would in Bash. Want to count characters in extracted titles? Append | wc -c. Need to extract an attribute from HTML? Pipe to htmlq. This transforms Pipet from a scraper into a data processing orchestrator that leverages the entire Unix ecosystem.
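For instance, to keep only Hacker News titles mentioning Rust, append a shell pipeline to the query line (a sketch; the selector is assumed):

curl https://news.ycombinator.com/
.titleline a | grep -i rust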
Curl Compatibility & Browser Fidelity
Resource lines starting with curl accept complete curl commands, including headers, cookies, and authentication. You can literally copy a request from Chrome DevTools' "Copy as cURL" feature and paste it directly into your Pipet file. This makes bypassing anti-bot measures and accessing authenticated sessions trivial—no manual header reconstruction required.
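A sketch of what a copied request might look like as a resource line (hypothetical URL, headers trimmed, cookie value elided):

curl 'https://example.com/account/orders' -H 'User-Agent: Mozilla/5.0' -H 'Cookie: session=...'
.order .status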
Declarative Data Structure
Pipet files use indentation to define data hierarchies. A parent selector runs as an iterator, executing child selectors for each match. This creates clean, visual mappings between DOM structure and output format. The result? Your scraper definition looks like the data it produces.
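A sketch of this shape (selectors assumed): the unindented `.product` line iterates over matches, and each indented line runs inside one match:

curl https://example.com/shop
.product
  .name
  .price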
Template-Driven Output
Beyond raw text and JSON, Pipet supports Go text/template for custom formatting. Drop a .tpl file next to your .pipet file, and Pipet automatically renders results through it. This enables generating HTML reports, CSV files, or any structured format without post-processing.
Change Monitoring Built-In
The --interval and --on-change flags turn Pipet into a monitoring daemon. It reruns your scraper at specified intervals and executes commands only when data changes—perfect for price tracking, stock alerts, or content change detection.
Real-World Use Cases Where Pipet Dominates
1. Real-Time Price Monitoring
E-commerce sites constantly change prices. With Pipet, create a scraper that checks competitor pricing every 30 minutes and sends a Slack notification when discounts appear:
pipet --interval 1800 --on-change 'curl -X POST -d "Price changed: {}" https://hooks.slack.com/...' prices.pipet
The .pipet file uses a copied curl command with your session cookies to access vendor portals, extracts prices with CSS selectors, and pipes results through jq for normalization.
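A hypothetical prices.pipet along those lines (the vendor URL, cookie, and selectors are all placeholders):

curl 'https://vendor.example.com/catalog' -H 'Cookie: session=...'
.product-row
  .product-name
  .price | jq -Rr 'ltrimstr("$")'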
2. Content Aggregation & News Tracking
Media analysts need to track breaking stories across multiple sources. Pipet's multi-block files let you scrape Hacker News, Reddit, and niche forums simultaneously, outputting a unified JSON feed for your dashboard. The Unix pipe integration means you can deduplicate entries with sort | uniq or filter keywords with grep before the data even hits your database.
3. Shipment & Logistics Tracking
That "track your package" page requires JavaScript execution and session cookies. Pipet handles both: use Playwright mode to render the tracking page, execute JavaScript to extract the delivery status, and pipe it through sed to format a clean status message. Set a 5-minute interval, and you'll never miss a delivery update.
4. GitHub Repository Analytics
Monitor your open-source project's popularity by scraping star counts, fork numbers, and issue activity. The example in Pipet's documentation shows extracting metrics from GitHub's SPA interface using JavaScript evaluation:
playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )
This runs in a headless browser, executes the JavaScript to parse the DOM, and returns clean metrics—all from a two-line text file.
5. Weather & Environmental Data Collection
The README's weather example demonstrates JSON API consumption:
curl https://wttr.in/Alert%20Canada?format=j1
current_condition.0.FeelsLikeC
current_condition.0.FeelsLikeF
Perfect for IoT projects, home automation, or data science pipelines needing reliable weather feeds without heavy SDKs.
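Save the curl line and the two queries above as weather.pipet, then run it plainly or with --json for machine-readable output:

pipet weather.pipet
pipet --json weather.pipet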
Step-by-Step Installation & Setup Guide
Method 1: Pre-Built Binary (Fastest)
Download the latest release for your platform from the Releases page:
# Example for Linux amd64
wget https://github.com/bjesus/pipet/releases/latest/download/pipet_linux_amd64.tar.gz
tar -xzf pipet_linux_amd64.tar.gz
chmod +x pipet
sudo mv pipet /usr/local/bin/
Method 2: Go Install (Recommended for Developers)
Requires Go 1.19+:
go install github.com/bjesus/pipet/cmd/pipet@latest
This compiles Pipet from source and installs it to $GOPATH/bin. Verify with:
pipet --version
Method 3: Package Managers
Arch Linux (AUR):
yay -S pipet-git
Homebrew (macOS/Linux):
brew install pipet
Nix:
nix-env -iA nixpkgs.pipet
Method 4: Run Without Installing
For one-off usage or testing:
go run github.com/bjesus/pipet/cmd/pipet@latest your-scraper.pipet
Environment Setup
- For JavaScript rendering: Playwright dependencies are installed automatically the first time they're needed:
  pipet --verbose your-js-scraper.pipet  # Auto-downloads on first run
- For pipe integration: ensure your favorite CLI tools are installed:
  # Recommended tools
  sudo apt install htmlq jq pup
- Create a project directory:
  mkdir ~/scrapers && cd ~/scrapers
Real Code Examples from the Repository
Example 1: Basic Hacker News Scraper
This is the canonical Pipet example. Create hackernews.pipet:
curl https://news.ycombinator.com/
.title .titleline
span > a
.sitebit a
Line-by-line breakdown:
- `curl https://news.ycombinator.com/` - fetches the raw HTML using curl. You could paste a full curl command with headers here.
- `.title .titleline` - CSS selector targeting each news item container. The leading dot indicates a class. This becomes our iterator; Pipet runs the child selectors for each match.
- `span > a` - indented selector that runs within each `.titleline`. The `>` means direct child. This extracts the article title link.
- `.sitebit a` - second indented selector at the same level, extracting the domain name from the `.sitebit` container.
Run it:
pipet hackernews.pipet
Output shows title and domain pairs. Add --json for structured data:
pipet --json hackernews.pipet
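The JSON is nested by block, then by iterator match; an illustrative shape (consistent with the template navigation in Example 3, not literal output):

[
  [
    [
      ["Example article title", "example.com"],
      ["Another article title", "example.org"]
    ]
  ]
]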
Example 2: Multi-Source Data Aggregation
This advanced example from the README demonstrates Pipet's multi-block capability:
// Read Wikipedia's "On This Day" and the subject of today's featured article
curl https://en.wikipedia.org/wiki/Main_Page
div#mp-otd li
body
div#mp-tfa > p > b > a
// Get the weather in Alert, Canada
curl https://wttr.in/Alert%20Canada?format=j1
current_condition.0.FeelsLikeC
current_condition.0.FeelsLikeF
// Check how popular the Pipet repo is
playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )
Technical insights:
- Blocks are separated by empty lines; each block runs independently and outputs sequentially.
- Comments start with `//` and are ignored, perfect for documentation.
- The Wikipedia block uses CSS selectors: `div#mp-otd li` selects list items from "On This Day", while `div#mp-tfa > p > b > a` drills down to the featured article link.
- The weather block shows JSON parsing: `current_condition.0.FeelsLikeC` navigates to the first array item and extracts the Celsius value.
- The GitHub block uses Playwright mode to render the SPA, then executes JavaScript to extract numeric metrics (stars, forks) using `Array.from()` and `filter()` with a regex.
Example 3: Template Rendering
Create hackernews.tpl alongside your .pipet file:
<ul>
{{range $index, $item := index (index . 0) 0}}
<li>{{index $item 0}} ({{index $item 1}})</li>
{{end}}
</ul>
Template breakdown:
- `{{range ...}}` loops through the results. The expression `index (index . 0) 0` navigates Pipet's nested data structure: first block, first query set.
- `{{index $item 0}}` accesses the first element (title), `{{index $item 1}}` the second (domain).
- Pipet auto-detects the `.tpl` file and renders output through it, producing clean HTML.
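With the .tpl file in place, running `pipet hackernews.pipet` would emit something like (illustrative values):

<ul>
  <li>Example article title (example.com)</li>
  <li>Another article title (example.org)</li>
</ul>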
Example 4: Unix Pipe Integration
Enhance the Hacker News scraper with post-processing:
curl https://news.ycombinator.com/
.title .titleline
span > a
span > a | wc -c # Count characters in each title
.sitebit a
.sitebit a | htmlq --attribute href a # Extract full URL
Pipe magic explained:
- `| wc -c` pipes each title through `wc -c`, outputting character counts instead of text.
- `| htmlq --attribute href a` pipes the domain link through `htmlq` to extract the `href` attribute, giving you the full URL.
- Pipes execute in your actual shell, so you can use any installed command. This makes Pipet infinitely extensible without plugins.
Example 5: Change Monitoring Daemon
Track when the #1 Hacker News story changes:
curl https://news.ycombinator.com/
.title .titleline a
Run with monitoring flags:
pipet --interval 60 --on-change "notify-send {}" hackernews.pipet
How it works:
- `--interval 60` reruns the scraper every 60 seconds.
- `--on-change` executes only when output differs from the previous run.
- `{}` in the command gets replaced with the new data.
- `notify-send` creates a desktop notification, perfect for staying informed without constant polling.
Advanced Usage & Best Practices
Leverage Browser DevTools
Pro tip: Right-click any network request in Chrome/Firefox DevTools, select "Copy as cURL", and paste directly into your Pipet file. This preserves authentication, headers, and cookies, making it trivial to scrape authenticated pages.
Optimize Selector Performance
Use specific selectors (div#mp-otd li) over broad ones (div li). For large pages, combine Pipet's --max-pages flag with precise selectors to minimize bandwidth and processing time.
Handle Pagination Elegantly
Add a "next page" line at the block's end:
curl https://example.com/page1
.item .title
// Selector for the "Next" button
> .next-page a
Pipet automatically follows pagination until --max-pages is reached.
Secure Credential Management
Never hardcode secrets. Use environment variables:
curl -H "Authorization: Bearer $API_TOKEN" https://api.example.com
Combine with Cron for Scheduling
While --interval is great for monitoring, use cron for scheduled scrapes:
# Run daily at 9 AM
0 9 * * * /usr/local/bin/pipet --json /home/user/daily-scrape.pipet > /var/data/$(date +\%Y-\%m-\%d).json
Debug with Verbose Mode
Stuck? Use --verbose to see exactly what Pipet is doing:
pipet --verbose --json debug.pipet
This shows request details, selector matches, and pipe executions.
Comparison with Alternatives
| Feature | Pipet | Beautiful Soup | Scrapy | Puppeteer |
|---|---|---|---|---|
| Interface | CLI / Text files | Python API | Python Framework | Node.js API |
| Learning Curve | Minimal (CSS/Shell) | Moderate | Steep | Moderate |
| JavaScript Rendering | Yes (Playwright) | No | Via plugins | Yes (Native) |
| Unix Pipe Integration | Native | No | No | No |
| Authentication | Copy-paste curl | Manual setup | Middleware | Browser context |
| Output Formats | Text/JSON/Template | Custom | JSON/CSV/XML | Custom |
| Resource Usage | Very Low | Medium | High | Very High |
| Scalability | Medium | Medium | High | Medium |
| Setup Time | Seconds | Minutes | Hours | Minutes |
| Best For | Quick hacks, monitoring | Simple HTML parsing | Large crawls | Complex SPA automation |
Why choose Pipet? When you need to scrape something in the next 5 minutes, Pipet wins. No virtual environments, no dependency hell, no boilerplate. It's the difference between writing a 50-line script and a 5-line text file. For large-scale distributed crawling, Scrapy remains superior. For complex browser automation, Puppeteer offers more control. But for 99% of daily scraping tasks, Pipet's simplicity and pipe integration make it unbeatable.
FAQ: Everything You Need to Know
Q: How is Pipet different from just using curl and grep?
A: While curl | grep works for simple cases, Pipet provides structured data extraction through CSS selectors, handles pagination automatically, supports JavaScript rendering, and outputs JSON or templated formats. The indentation-based nesting creates parent-child relationships that grep simply cannot express.
Q: Can Pipet handle websites that require login?
A: Absolutely. Copy the authenticated request from your browser as a curl command (right-click → Copy as cURL) and paste it into your Pipet file. All cookies, headers, and session tokens are preserved, giving you the same access as your browser.
Q: Does Pipet work on Windows?
A: Yes. While designed with Unix philosophy in mind, Pipet runs on Windows via WSL (Windows Subsystem for Linux) or natively with PowerShell. The pipe integration works best in WSL where standard Unix tools are available.
Q: How does the JavaScript evaluation mode work?
A: When a resource line starts with playwright, Pipet launches a headless browser, navigates to the URL, and executes your JavaScript code in the page context. You have full access to document.querySelector, fetch, and modern browser APIs. If Playwright isn't installed, Pipet downloads it automatically on first use.
Q: Is Pipet suitable for large-scale web crawling?
A: Pipet excels at targeted extraction and monitoring tasks. For massive crawls spanning millions of pages, dedicated frameworks like Scrapy with distributed queues are more appropriate. However, for scraping hundreds of pages with complex JavaScript requirements, Pipet's simplicity often outweighs the overhead of heavier tools.
Q: Can I use XPath instead of CSS selectors?
A: Currently, Pipet supports CSS selectors for HTML queries. For most use cases, modern CSS selectors are sufficiently powerful. If you absolutely need XPath, you can pipe through tools like xmllint or xidel that support XPath expressions.
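For example, a sketch piping matched elements through xmllint for an XPath query (URL and selectors assumed; it presumes, as with the htmlq example earlier, that the element's HTML reaches the pipe's stdin, and 2>/dev/null silences HTML parser warnings):

curl https://example.com
body | xmllint --html --xpath '//a/@href' - 2>/dev/null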
Q: How do I debug when selectors aren't working?
A: Use the --verbose flag to see detailed logs. Additionally, test selectors directly in your browser's DevTools console with document.querySelectorAll(). For JavaScript mode, you can console.log() values and see them in verbose output.
Conclusion: The Scraper You've Been Waiting For
Pipet represents a paradigm shift in how we approach web scraping. By embracing Unix pipes instead of fighting them, by treating curl commands as first-class citizens, and by making JavaScript evaluation as simple as writing a selector, it removes the barriers between idea and implementation.
Whether you're a data journalist tracking public records, a developer monitoring API changes, or a hacker automating personal workflows, Pipet gives you superpowers without the super-bloat. Its declarative syntax reads like documentation, its pipe integration leverages your existing tool knowledge, and its multi-modal support means you'll never need another scraper for 99% of tasks.
The beauty of Pipet lies in its respect for your time. That 5-minute scraping task actually takes 5 minutes. That monitoring script doesn't become a maintenance nightmare. Your scrapers become shareable, version-controlled text files that any team member can understand and modify.
Ready to transform your scraping workflow? Head to the Pipet GitHub repository, star it for later reference, and try the Hacker News example in the next 5 minutes. Your future self will thank you every time you need to extract data from the web—which, let's be honest, is every day.
Install Pipet now. Write your first .pipet file. Join the growing community of developers who've discovered that the best tool is often the simplest one.