# Scraper Web API
This is a dumb little tool which ingests raw HTML files, does some parsing on them, and serves the results over a web API. For example, exercising a few of the endpoints with httpie:
```bash
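# Hit each endpoint in sequence: trigger page parsing for the ssd category,
# trigger listing parsing, fetch listings since a timestamp (limit 2), fetch
# a single listing by item id, and fetch a listing's price history.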
export URL_BASE="http://scraper.homelab.hak8or.com:8080"; \
echo run0 && http POST "$URL_BASE/page/parse/ssd" && \
echo run1 && http POST "$URL_BASE/listing/parse" && \
echo run2 && http GET "$URL_BASE/listing/since/12345678/2" && \
echo run3 && http GET "$URL_BASE/listing/388484391867" && \
echo run4 && http GET "$URL_BASE/listing/286605201240/history"
```
And some jq usage for raw interaction with the data:
```bash
# Download a bunch of listings.
http https://scraper.hak8or.com/api/listings since==0 limit==20 > listings.json
# Show what a single listing looks like.
jq '.[0]' listings.json
{
  "listing": {
    "id": 22563,
    "item_id": 286707621236,
    "title": "WD_BLACK SN770M 2TB M.2 NVMe Internal SSD (WDBDNH0020BBK-WRSN)",
    "buy_it_now_price_cents": null,
    "has_best_offer": false,
    "image_url": "https://i.ebayimg.com/images/g/It4AAeSwzz5oddoa/s-l140.jpg"
  },
  "history": [
    {
      "item": 286707621236,
      "timestamp": "2025-07-15T04:46:54Z",
      "category": "ssd",
      "current_bid_usd_cents": 12900
    }
  ],
  "parsed": [
    {
      "id": 6,
      "item": 286707621236,
      "total_gigabytes": 2048,
      "quantity": 1,
      "individual_size_gigabytes": 2048,
      "parse_engine": 0,
      "needed_description_check": false
    }
  ]
}
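# Each entry pairs the "listing" row with its "history" of scraped bid
# snapshots over time and any "parsed" interpretations of its title.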
# Show the 2nd and 3rd items, but only grab a few specific fields.
cat listings.json | jq '[.[1:3][] | {
item_id: .listing.item_id,
title: .listing.title,
parsed: .parsed[] | {
total_gigabytes,
quantity,
individual_size_gigabytes
}
}]'
[
  {
    "item_id": 297545995095,
    "title": "Crucial P3 Plus 1TB NVMe Internal M.2 SSD (CT1000P3PSSD8) - Barely used!",
    "parsed": {
      "total_gigabytes": 1024,
      "quantity": 1,
      "individual_size_gigabytes": 1024
    }
  },
  {
    "item_id": 127220979797,
    "title": "Kingston NV2 2TB M.2 3500MG/S NVMe Internal SSD PCIe 4.0 Gen SNV2S/2000G C-/#qWT",
    "parsed": {
      "total_gigabytes": 2048,
      "quantity": 1,
      "individual_size_gigabytes": 2048
    }
  }
]
```
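Since each entry carries both a bid history and a parsed capacity, derived figures fall out of a single filter. For example, a rough cents-per-gigabyte query (my own sketch, assuming the same `listings.json` and field layout shown above):
```bash
# Divide the most recent bid by the parsed total capacity for each listing.
# Entries missing a bid or a parse are filtered out up front.
jq '[.[]
  | select((.history | last | .current_bid_usd_cents) != null
           and (.parsed[0].total_gigabytes // 0) > 0)
  | { item_id: .listing.item_id,
      cents_per_gb: ((.history | last | .current_bid_usd_cents)
                     / .parsed[0].total_gigabytes) }]' listings.json
```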
And now an LLM-based parse, where the system prompt is the following (189 tokens for Gemini 2.5 Flash Lite):
````
I will provide you with a listing title I want you to analyse. Then you will tell me the total gigabytes of all drives listed in the listing, how many drives are specified in the title, and the gigabytes of each drive in the listing. Here is an example for a title of "Crucial P3 Plus 1TB NVMe Internal M.2 SSD (CT1000P3PSSD8) - Barely used!";
```
{
  "total_gigabytes": 1024,
  "quantity": 1,
  "individual_size_gigabytes": 1024
}
```
Reply with "OK" (and _only_ "OK") if you understand this. After you reply with that, I will provide you with a title, and then you will reply with solely the requested json (and ONLY said json).
````
And passing the following title (30 tokens):
```
Lot Of 3 Western Digital PC SN740 512GB M.2 2230 NVMe Internal SSD
```
returns the following JSON (41 tokens):
```json
{
"total_gigabytes": 1536,
"quantity": 3,
"individual_size_gigabytes": 512
}
```
And another example, sending this (49 tokens):
```
(Lot of 6) Samsung MZ-VLB2560 256GB M.2 NVMe Internal SSD (MZVLB256HBHQ-000H1)
```
returns the following JSON (42 tokens):
```json
{
"total_gigabytes": 1536,
"quantity": 6,
"individual_size_gigabytes": 256
}
```
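For reference, one such exchange could be driven with curl against the Gemini REST API. This is only a sketch under assumptions: the model id, the endpoint, and the `$SYSTEM_PROMPT`/`$GEMINI_API_KEY` variables are mine, not something this repo sets up.
```bash
# Send the system prompt, the canned "OK" turn, and a single title, then pull
# the model's JSON reply out of the response. $SYSTEM_PROMPT and
# $GEMINI_API_KEY are assumed to be set; the model id is an assumption.
TITLE="Lot Of 3 Western Digital PC SN740 512GB M.2 2230 NVMe Internal SSD"
jq -n --arg sys "$SYSTEM_PROMPT" --arg title "$TITLE" '{
  contents: [
    {role: "user",  parts: [{text: $sys}]},
    {role: "model", parts: [{text: "OK"}]},
    {role: "user",  parts: [{text: $title}]}
  ]
}' | curl -s -H "Content-Type: application/json" -d @- \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent?key=$GEMINI_API_KEY" \
  | jq -r '.candidates[0].content.parts[0].text'
```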
So for one listing we have a 189-token system prompt, a ~45-token title prompt, and a 42-token parsed reply. Given 30,000 listings, that's 5,670,000 tokens of system prompt as input, 1,350,000 tokens of titles as input, and 1,260,000 tokens of parsed information as output. Assuming Gemini 2.5 Flash Lite, at $0.10/M for input and $0.40/M for output, we would pay $0.702 for input and $0.504 for output, or $1.206 total.
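And a quick sanity check of that arithmetic:
```bash
# Recompute the cost estimate: 30,000 listings, 189-token system prompt,
# ~45-token title (input), 42-token reply (output), at Flash Lite pricing.
awk 'BEGIN {
  listings = 30000
  in_tok  = listings * (189 + 45)
  out_tok = listings * 42
  printf "input: $%.3f, output: $%.3f, total: $%.3f\n",
         in_tok / 1e6 * 0.10, out_tok / 1e6 * 0.40,
         in_tok / 1e6 * 0.10 + out_tok / 1e6 * 0.40
}'
# input: $0.702, output: $0.504, total: $1.206
```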