# Scraper Web API

This is a dumb little tool which ingests raw HTML files, does some parsing on them, and serves the results over a web API.

```bash
export URL_BASE="http://scraper.homelab.hak8or.com:8080"; \
echo run0 && http POST "$URL_BASE/page/parse/ssd" && \
echo run1 && http POST "$URL_BASE/listing/parse" && \
echo run2 && http GET "$URL_BASE/listing/since/12345678/2" && \
echo run3 && http GET "$URL_BASE/listing/388484391867" && \
echo run4 && http GET "$URL_BASE/listing/286605201240/history"
```

And some jq usage for interacting with the raw data:
```bash
# Download a bunch of listings.
http https://scraper.hak8or.com/api/listings since==0 limit==20 > listings.json

# Show what a single listing looks like.
cat listings.json | jq '.[0]'
{
  "listing": {
    "id": 22563,
    "item_id": 286707621236,
    "title": "WD_BLACK SN770M 2TB M.2 NVMe Internal SSD (WDBDNH0020BBK-WRSN)",
    "buy_it_now_price_cents": null,
    "has_best_offer": false,
    "image_url": "https://i.ebayimg.com/images/g/It4AAeSwzz5oddoa/s-l140.jpg"
  },
  "history": [
    {
      "item": 286707621236,
      "timestamp": "2025-07-15T04:46:54Z",
      "category": "ssd",
      "current_bid_usd_cents": 12900
    }
  ],
  "parsed": [
    {
      "id": 6,
      "item": 286707621236,
      "total_gigabytes": 2048,
      "quantity": 1,
      "individual_size_gigabytes": 2048,
      "parse_engine": 0,
      "needed_description_check": false
    }
  ]
}

# Show the items at index 1 and 2 (zero-based), but only grab a few specific entries.
cat listings_small.json | jq '[.[1:3][] | {
  item_id: .listing.item_id,
  title: .listing.title,
  parsed: .parsed[] | {
    total_gigabytes,
    quantity,
    individual_size_gigabytes
  }
}]'
[
  {
    "item_id": 297545995095,
    "title": "Crucial P3 Plus 1TB NVMe Internal M.2 SSD (CT1000P3PSSD8) - Barely used!",
    "parsed": {
      "total_gigabytes": 1024,
      "quantity": 1,
      "individual_size_gigabytes": 1024
    }
  },
  {
    "item_id": 127220979797,
    "title": "Kingston NV2 2TB M.2 3500MG/S NVMe Internal SSD PCIe 4.0 Gen SNV2S/2000G C-/#qWT",
    "parsed": {
      "total_gigabytes": 2048,
      "quantity": 1,
      "individual_size_gigabytes": 2048
    }
  }
]
```

And now an LLM-based parse, using the following prompt (189 tokens for Gemini 2.5 Flash Lite):
````
I will provide you with a listing title I want you to analyse. Then you will tell me the total gigabytes of all drives listed in the listing, how many drives are specified in the title, and the gigabytes of each drive in the listing. Here is an example for a title of "Crucial P3 Plus 1TB NVMe Internal M.2 SSD (CT1000P3PSSD8) - Barely used!";
```
{
  "total_gigabytes": 1024,
  "quantity": 1,
  "individual_size_gigabytes": 1024
}
```
Reply with "OK" (and _only_ "OK") if you understand this. After you reply with that, I will provide you with a title, and then you will reply with solely the requested json (and ONLY said json).
````

Passing in this title (30 tokens):
```
Lot Of 3 Western Digital PC SN740 512GB M.2 2230 NVMe Internal SSD
```
returns the following JSON (41 tokens):
```json
{
  "total_gigabytes": 1536,
  "quantity": 3,
  "individual_size_gigabytes": 512
}
```

And another example, sending this title (49 tokens):
```
(Lot of 6) Samsung MZ-VLB2560 256GB M.2 NVMe Internal SSD (MZVLB256HBHQ-000H1)
```
returns the following JSON (42 tokens):
```json
{
  "total_gigabytes": 1536,
  "quantity": 6,
  "individual_size_gigabytes": 256
}
```
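
Since the reply is generated text rather than something computed, a cheap sanity check before storing it seems worthwhile. Here is a minimal sketch, assuming the reply has been captured in a shell variable (`reply` and `reply_status` are hypothetical names, not something the scraper defines) and that `total_gigabytes` should always equal `quantity` times `individual_size_gigabytes`, as it does in both examples above. A reply that is not valid JSON also lands in the `else` branch, because jq exits non-zero on a parse error.

```bash
# Hypothetical: the model's reply, captured as a string.
reply='{"total_gigabytes": 1536, "quantity": 6, "individual_size_gigabytes": 256}'

# jq -e sets the exit status from the result: 0 for true, 1 for false or null.
if echo "$reply" | jq -e '.total_gigabytes == .quantity * .individual_size_gigabytes' > /dev/null; then
  reply_status="consistent"
else
  reply_status="inconsistent"
fi

echo "$reply_status"
```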

So for 1 listing we have a 189-token "system prompt", a ~45-token title prompt, and a 42-token parsed reply. Given 30,000 listings, that's 5,670,000 tokens of system prompt (input), 1,350,000 tokens of title prompts (input), and 1,260,000 tokens of parsed replies (output). Assuming Gemini 2.5 Flash Lite at $0.10/M for input and $0.40/M for output, we would pay $0.702 for input and $0.504 for output, or $1.206 total.
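
The arithmetic above can be reproduced with a few lines of shell, which makes it easy to plug in a different listing count or price. This is only a sketch: the variable names are made up for illustration, the per-token prices are the assumed ones quoted above, and the system prompt is counted once per listing, as in the estimate.

```bash
# Figures taken from the estimate above.
listings=30000
system_prompt_tokens=189
title_tokens=45            # rough average per title
reply_tokens=42
input_usd_per_mtok=0.10    # assumed price, USD per million input tokens
output_usd_per_mtok=0.40   # assumed price, USD per million output tokens

# Shell arithmetic is integer-only, so the token totals are computed here...
input_tokens=$(( listings * (system_prompt_tokens + title_tokens) ))
output_tokens=$(( listings * reply_tokens ))

# ...and the fractional dollar amounts are left to awk.
input_cost=$(awk -v t="$input_tokens" -v p="$input_usd_per_mtok" 'BEGIN { printf "%.3f", t * p / 1000000 }')
output_cost=$(awk -v t="$output_tokens" -v p="$output_usd_per_mtok" 'BEGIN { printf "%.3f", t * p / 1000000 }')
total_cost=$(awk -v a="$input_cost" -v b="$output_cost" 'BEGIN { printf "%.3f", a + b }')

echo "input:  $input_tokens tokens, \$$input_cost"
echo "output: $output_tokens tokens, \$$output_cost"
echo "total:  \$$total_cost"
```

With the numbers above this reports 7020000 input tokens ($0.702), 1260000 output tokens ($0.504), and $1.206 in total.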