This page teaches web scraping in Python by solving one concrete problem: "What does an MSI EdgeXpert mini-PC cost, and where is it cheapest?" We'll try to answer it with the usual tools, watch them fail, and then reach for CloakBrowser — a stealth browser that loads pages the way a real person's browser does, so the data is actually there to read.
Everything here is built on real, working code. The krueng.ai site has a rental-finder report that pulls listings from FazWaz, DDproperty, Hipflat and RentHub — all sites that flatly refuse a plain Python request. We'll walk through that pipeline, then build a fresh price-comparison bot for the EdgeXpert across Amazon, Shopee, and Lazada.
If you're new to Python, read Python Basics first — this page assumes you can read a function, an import, and an f-string.
การแปลภาษาไทยกำลังจะมา (Thai translation coming soon)
本页通过解决一个具体问题来教你用 Python 抓取网页:"一台 MSI EdgeXpert 迷你电脑要多少钱,在哪买最便宜?" 我们先用常规工具去回答,看着它们失败,然后请出 CloakBrowser —— 一个隐身浏览器,它像真人的浏览器一样加载页面,于是数据真的就在那里、可以读取。
这里的一切都建立在真实可运行的代码之上。krueng.ai 网站有一个租房查询报告,从 FazWaz、DDproperty、Hipflat、RentHub 抓取房源 —— 这些网站都会直接拒绝普通的 Python 请求。我们会走一遍那条流水线,然后为 EdgeXpert 重新搭一个比价机器人,覆盖 Amazon、Shopee、Lazada。
如果你刚接触 Python,请先读 Python 基础 —— 本页假设你能看懂函数、导入和 f-string。
1. Why curl and requests hit a wall · ทำไม curl กับ requests ถึงโดนบล็อก · 为什么 curl 和 requests 会撞墙
The simplest way to fetch a web page in Python is the requests library:
import requests
r = requests.get("https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai")
print(r.status_code) # hoped-for 200 ... but you get 403
print(r.text[:300]) # "Just a moment..." — a Cloudflare challenge page
You asked for a list of condos and got back 403 Forbidden and a page that says "Checking your browser…". You never saw a single listing. Here's why.
Big commercial sites sit behind a bot-protection service — most often Cloudflare. Before letting you see the real page, it scores how "human" your request looks. A bare requests call fails almost every check:
- It sends a giveaway User-Agent like python-requests/2.31.0.
- It runs no JavaScript, so the challenge that real browsers solve silently never gets solved.
- It has no browser fingerprint: no canvas, no WebGL, no fonts, no screen size. A real Chrome leaks hundreds of these signals; requests leaks none.
So the firewall concludes "this is a script" and serves the challenge page instead of the data. curl, wget, and even the WebFetch tool inside AI assistants fail for the exact same reasons.
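To make that failure visible in code instead of silently parsing a challenge page, a small check can sit between the fetch and the parser. This is a minimal sketch, not part of the krueng.ai scraper: the function name looks_like_challenge is hypothetical, the two marker strings are the ones quoted above, and treating 429 and 503 as block signals is an assumption.

```python
# Marker strings quoted earlier on this page; real challenge pages may use others.
CHALLENGE_MARKERS = ("Just a moment", "Checking your browser")

# 403 is the status shown above; 429 and 503 are assumed to be block signals too.
BLOCK_STATUSES = (403, 429, 503)


def looks_like_challenge(status_code: int, body: str) -> bool:
    """Return True when a response looks like a bot check rather than real content."""
    if status_code in BLOCK_STATUSES:
        return True
    head = body[:2000]  # challenge text sits near the top of the page
    return any(marker in head for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    print(looks_like_challenge(403, ""))
    print(looks_like_challenge(200, "<html><body>12 condos found</body></html>"))
```

With requests you would call it as looks_like_challenge(r.status_code, r.text) before handing r.text to any parser.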
การแปลกำลังจะมา
在 Python 里抓网页最简单的办法是 requests 库:
import requests
r = requests.get("https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai")
print(r.status_code) # 本想要 200 …… 结果是 403
print(r.text[:300]) # "Just a moment..." —— 一个 Cloudflare 验证页
你想要一份公寓列表,拿回来的却是 403 Forbidden 和一句"正在检查你的浏览器……"。一条房源都没看到。原因如下。
大型商业网站背后都有反爬服务 —— 最常见的就是 Cloudflare。在放你看真正的页面之前,它会给你的请求打一个"像不像真人"的分。一个裸的 requests 调用几乎过不了任何一项:
- 它发送一个暴露身份的 User-Agent,比如 python-requests/2.31.0。
- 它不执行 JavaScript,真实浏览器悄悄解开的验证它根本没法解。
- 它没有浏览器指纹:没有 canvas、WebGL、字体、屏幕尺寸。真正的 Chrome 会泄露上百个这样的信号,而 requests 一个都没有。
于是防火墙判定"这是脚本",返回验证页而不是数据。curl、wget,甚至 AI 助手里的 WebFetch 工具,都因为完全相同的原因失败。
2. What CloakBrowser is · CloakBrowser คืออะไร · CloakBrowser 是什么
CloakBrowser is a Python package that drives a patched, stealthy build of Chromium. Because it's a real browser, it runs the site's JavaScript and presents a complete, believable fingerprint — so the bot check passes and the page loads normally. You then read the finished HTML out of the browser.
It matters where the stealth lives. Many anti-detection tricks inject JavaScript into the page to lie about the browser — but that injection is itself detectable. CloakBrowser instead forges the canvas, WebGL, audio, font, GPU and WebRTC fingerprints in the browser's C++ core, below the JavaScript layer, where the bot check can't tell the difference.
If you've used Selenium or Playwright before, the API will feel familiar — CloakBrowser is shaped almost identically. The difference is purely the stealth: a default Playwright Chromium often gets flagged; CloakBrowser usually doesn't.
การแปลกำลังจะมา
CloakBrowser 是一个 Python 包,它驱动一个经过改造的隐身版 Chromium。因为它是真实浏览器,会执行网站的 JavaScript,并呈现一套完整、可信的指纹 —— 于是机器人检测通过,页面正常加载。然后你从浏览器里读取已经渲染好的 HTML。
隐身藏在哪一层很关键。很多反检测技巧是往页面里注入 JavaScript 去伪装浏览器 —— 但注入本身就能被检测到。CloakBrowser 改为在浏览器的 C++ 内核里伪造 canvas、WebGL、音频、字体、GPU 和 WebRTC 指纹,位于 JavaScript 层之下,机器人检测分辨不出真假。
如果你用过 Selenium 或 Playwright,这个 API 会很熟悉 —— CloakBrowser 的形状几乎一模一样。区别纯粹在隐身:默认的 Playwright Chromium 常被标记,CloakBrowser 通常不会。
3. Installing it · การติดตั้ง · 安装
One command. On first run it downloads its patched Chromium binary (~200 MB), so the first launch is slow; after that it's cached.
pip install cloakbrowser
If you read Python Basics §8, do this inside a virtual environment so it doesn't pollute your system Python:
python -m venv .venv
.venv\Scripts\activate # Windows PowerShell
source .venv/bin/activate # macOS / Linux (use this instead of the line above)
pip install cloakbrowser
# or, with uv (much faster):
uv venv
uv pip install cloakbrowser
การแปลกำลังจะมา
一条命令。首次运行时会下载它改过的 Chromium 二进制(约 200 MB),所以第一次启动很慢;之后会被缓存。
pip install cloakbrowser
如果你读过 Python 基础 §8,在虚拟环境里装,免得污染系统 Python:
python -m venv .venv
.venv\Scripts\activate # Windows PowerShell
source .venv/bin/activate # macOS / Linux(用它替代上一行)
pip install cloakbrowser
# 或者用 uv(快很多):
uv venv
uv pip install cloakbrowser
4. The minimal recipe · สูตรพื้นฐานที่สุด · 最小可用配方
Five lines. This is the whole shape of a CloakBrowser scrape:
from cloakbrowser import launch
browser = launch() # start the stealth Chromium
page = browser.new_page() # open a tab
page.goto("https://example.com", timeout=60000, wait_until="domcontentloaded")
page.wait_for_timeout(4000) # let JavaScript-rendered content settle
html = page.content() # the fully-rendered HTML, as a string
browser.close() # always close — it's a real process
Read it line by line:
- launch() starts the browser process and hands you a browser object.
- browser.new_page() opens a tab. You work through this page object.
- page.goto(url, ...) navigates. timeout=60000 is 60 seconds (in milliseconds). wait_until="domcontentloaded" means "return once the HTML document is parsed" rather than waiting for every last image.
- page.wait_for_timeout(4000) pauses 4 seconds. Modern sites render their content after the initial HTML, with JavaScript; the pause gives that a moment to finish so the listings are actually present when you grab the page.
- page.content() returns the current HTML as a Python string. This is the prize: the same HTML a human would see, ready to parse.
- browser.close() shuts the browser down. Skip it and you leak a running Chromium process.
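The same calls can be folded into one helper that guarantees the cleanup step. This is a sketch under assumptions: fetch_html is a hypothetical name, and the launcher is passed in as a parameter (rather than imported) so the helper can be exercised with a stand-in object; it uses only the calls shown in the recipe above.

```python
def fetch_html(launch, url: str, settle_ms: int = 4000) -> str:
    """Open url in a fresh browser, wait for scripts to settle, return the HTML.

    `launch` is any zero-argument callable returning a browser object with the
    new_page() / close() methods used in the recipe above.
    """
    browser = launch()
    try:
        page = browser.new_page()
        try:
            page.goto(url, timeout=60000, wait_until="domcontentloaded")
            page.wait_for_timeout(settle_ms)  # let JavaScript-rendered content appear
            return page.content()
        finally:
            page.close()      # the tab is closed even if goto() raises
    finally:
        browser.close()       # the browser process is closed exactly once
```

In real use you would write `from cloakbrowser import launch` and call `fetch_html(launch, "https://example.com")`. Launching a browser per call is wasteful for more than one URL, which is the subject of the next section.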
การแปลกำลังจะมา
五行代码。这就是一次 CloakBrowser 抓取的全部形状:
from cloakbrowser import launch
browser = launch() # 启动隐身 Chromium
page = browser.new_page() # 打开一个标签页
page.goto("https://example.com", timeout=60000, wait_until="domcontentloaded")
page.wait_for_timeout(4000) # 等 JavaScript 渲染的内容稳定下来
html = page.content() # 完整渲染后的 HTML 字符串
browser.close() # 一定要关 —— 它是一个真实进程
逐行读:
- launch() 启动浏览器进程,返回一个 browser 对象。
- browser.new_page() 打开一个标签页。你通过这个 page 对象操作。
- page.goto(url, ...) 导航。timeout=60000 是 60 秒(单位毫秒)。wait_until="domcontentloaded" 意思是"HTML 文档解析完就返回",而不是等到每一张图片都加载完。
- page.wait_for_timeout(4000) 暂停 4 秒。现代网站的内容是在初始 HTML 之后用 JavaScript 渲染出来的;给它一点时间完成,等你抓页面时房源才真的在那里。
- page.content() 返回当前 HTML 字符串。这就是战利品:和真人看到的一样,可以拿去解析。
- browser.close() 关闭浏览器。不关就会泄露一个还在跑的 Chromium 进程。
5. One browser, many pages · เบราว์เซอร์เดียว หลายหน้า · 一个浏览器,多个页面
Starting a browser is expensive (it's a whole Chromium process). When you scrape several URLs, launch once and reuse it — but open a fresh new_page() for each URL. Reusing the same page across navigations triggers "navigation is changing" races.
from cloakbrowser import launch
from pathlib import Path
URLS = [
"https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai/mueang-chiang-mai/tha-sala",
"https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai/mueang-chiang-mai/wat-ket",
]
browser = launch()
try:
    for i, url in enumerate(URLS):
        page = browser.new_page()  # fresh tab per URL
        try:
            if i:
                page.wait_for_timeout(2000)  # courtesy gap between same-site hits
            page.goto(url, timeout=60000, wait_until="domcontentloaded")
            page.wait_for_timeout(5000)
            html = page.content()
            name = url.rstrip("/").rsplit("/", 1)[-1]
            Path(f"raw_{name}.html").write_text(html, encoding="utf-8")
            print(f"ok {name} ({len(html):,} bytes)")
        except Exception as e:
            print(f"FAIL {url}: {e}")  # one bad URL shouldn't kill the run
        finally:
            page.close()  # close the tab, keep the browser
finally:
    browser.close()  # close the browser exactly once
Three habits worth copying from this, all visible in krueng.ai's real scraper:
- Save the raw HTML to disk (raw_*.html) before you parse it. Fetching is the slow, fragile, rate-limited part; parsing is fast and you'll redo it many times as you refine your code. Separating them means you scrape once and re-parse for free.
- Wrap each URL in try/except so one failure (a 404, a timeout) doesn't abort the whole batch.
- Use try/finally for cleanup: the tab and the browser get closed even if something throws.
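The first habit can go one step further: check the disk before touching the network at all. The sketch below is an illustration, not code from the krueng.ai scraper; cache_path and cached_html are hypothetical names, and fetch stands for any function that takes a URL and returns HTML (for example a CloakBrowser-backed one).

```python
from pathlib import Path


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to raw_<last-path-segment>.html, the naming used above."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    return cache_dir / f"raw_{name}.html"


def cached_html(url: str, fetch, cache_dir: Path) -> str:
    """Return the saved page if it exists; otherwise fetch once and save it."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")   # no network hit at all
    html = fetch(url)                              # the slow, fragile part
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Deleting a raw_*.html file is then the way to force a re-fetch of that one page. Note that two URLs ending in the same segment would share a cache file; a real scraper would need a more careful naming scheme.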
การแปลกำลังจะมา
启动浏览器代价很大(它是一整个 Chromium 进程)。抓多个 URL 时,只启动一次并复用它 —— 但为每个 URL 开一个全新的 new_page()。在同一个 page 上反复导航会触发 "navigation is changing" 竞态。
from cloakbrowser import launch
from pathlib import Path
URLS = [
"https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai/mueang-chiang-mai/tha-sala",
"https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai/mueang-chiang-mai/wat-ket",
]
browser = launch()
try:
    for i, url in enumerate(URLS):
        page = browser.new_page()  # 每个 URL 一个新标签
        try:
            if i:
                page.wait_for_timeout(2000)  # 同一站点之间礼貌性停顿
            page.goto(url, timeout=60000, wait_until="domcontentloaded")
            page.wait_for_timeout(5000)
            html = page.content()
            name = url.rstrip("/").rsplit("/", 1)[-1]
            Path(f"raw_{name}.html").write_text(html, encoding="utf-8")
            print(f"ok {name} ({len(html):,} bytes)")
        except Exception as e:
            print(f"FAIL {url}: {e}")  # 一个坏 URL 不该拖垮整批
        finally:
            page.close()  # 关标签,留浏览器
finally:
    browser.close()  # 浏览器只关一次
三个值得照搬的习惯,都能在 krueng.ai 真实抓取代码里看到:
- 先把原始 HTML 存到磁盘(raw_*.html)再解析。抓取是慢、脆弱、被限速的那部分;解析很快,而且你会随着改代码反复重做。把两者分开,意味着抓一次、之后免费重解析。
- 每个 URL 用 try/except 包住,一次失败(404、超时)不会中断整批。
- 用 try/finally 做清理:即使出错,标签和浏览器也会被关掉。
6. A real four-stage pipeline · ไปป์ไลน์จริง 4 ขั้น · 一条真实的四阶段流水线
The rental map on krueng.ai answers "what can I rent near Wat Don Chan?" It's built from four scripts, and the split is the most important lesson in this whole page: separate fetching from parsing from rendering. Each stage writes a file the next stage reads.
| Stage | Script | Reads → Writes |
|---|---|---|
| 1. Collect | rentals_collect.py | 6 listing pages (CloakBrowser) → raw_*.html |
| 2. Expand | rentals_expand.py | 10 more pages (CloakBrowser) → raw2_*.html |
| 3. Parse | rentals_parse_v3.py | all raw*.html → rentals_v3.json |
| 4. Render | rentals_final.py | rentals_v3.json → rentals_full.html |
Only stages 1 and 2 touch the network, and they're the only ones that need CloakBrowser. Once the raw HTML is on disk, you can re-run parsing and rendering as many times as you like — instantly, offline, free. To regenerate the report after tweaking the layout, you run only python rentals_final.py; you re-scrape (collect → expand → parse → final) only when the data itself is stale.
This is also why you don't fork a new script per site. Notice rentals_collect.py drives its six URLs from one TARGETS list and one loop — not six copy-pasted functions. We'll keep that discipline in the capstone.
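To show what "each stage writes a file the next stage reads" looks like for stage 3, here is a deliberately tiny stand-in for a parse step. It is a sketch under assumptions, not the contents of rentals_parse_v3.py: the function name parse_stage and the output shape are invented for illustration, and it only collects baht prices.

```python
import json
import re
from pathlib import Path

THB = re.compile(r'฿\s?([\d,]{3,12})')  # baht sign followed by digits and commas


def parse_stage(raw_dir: Path, out_file: Path) -> list:
    """Read every raw*.html in raw_dir, pull baht prices, write one JSON file."""
    rows = []
    for path in sorted(raw_dir.glob("raw*.html")):      # offline: files only
        html = path.read_text(encoding="utf-8")
        prices = []
        for raw in THB.findall(html):
            digits = raw.replace(",", "")
            if digits.isdigit():                         # skip comma-only matches
                prices.append(int(digits))
        rows.append({"source": path.name, "prices": prices})
    out_file.write_text(json.dumps(rows, ensure_ascii=False, indent=2),
                        encoding="utf-8")
    return rows
```

Nothing here imports CloakBrowser or opens a connection, so it can be re-run as often as the parsing logic changes.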
การแปลกำลังจะมา
krueng.ai 上的租房地图回答"在 Wat Don Chan 附近能租到什么?"。它由四个脚本搭成,而这种拆分是整页最重要的一课:把抓取、解析、渲染分开。每一阶段写一个文件,下一阶段去读。
| 阶段 | 脚本 | 读 → 写 |
|---|---|---|
| 1. 采集 | rentals_collect.py | 6 个房源页(CloakBrowser)→ raw_*.html |
| 2. 扩展 | rentals_expand.py | 再 10 个页面(CloakBrowser)→ raw2_*.html |
| 3. 解析 | rentals_parse_v3.py | 所有 raw*.html → rentals_v3.json |
| 4. 渲染 | rentals_final.py | rentals_v3.json → rentals_full.html |
只有第 1、2 阶段碰网络,也只有它们需要 CloakBrowser。原始 HTML 一旦落盘,解析和渲染想跑多少次都行 —— 瞬间、离线、免费。调完版式后只需跑 python rentals_final.py;只有当数据本身过期时,才重新抓取(collect → expand → parse → final)。
这也是为什么不要为每个网站新开一个脚本。注意 rentals_collect.py 用一个 TARGETS 列表加一个循环来驱动它的六个 URL —— 而不是六个复制粘贴的函数。这个纪律我们会在压轴示例里继续保持。
7. Pulling data out of the HTML · การดึงข้อมูลออกจาก HTML · 从 HTML 里把数据抠出来
Once content() hands you the HTML string, you have to find the actual data inside it. There are two good approaches, in order of preference.
7a. Read the structured data the site already embeds
Many listing and shopping sites embed a machine-readable copy of their data as JSON-LD: a <script type="application/ld+json"> block they put there so Google can show rich results. It's usually far more stable than the visible markup, which changes constantly. This is exactly what the real scraper does:
import json, re
def extract_jsonld(html: str) -> list:
    """Pull every application/ld+json block out of a page as Python objects."""
    blocks = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html, flags=re.DOTALL | re.IGNORECASE,
    )
    out = []
    for raw in blocks:
        try:
            out.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            continue  # skip malformed blocks, don't crash
    return out
From those objects you read clean fields like name, price, geo.latitude — no fragile HTML digging.
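Reading those fields usually means walking nested JSON, because sites wrap their objects in lists or an "@graph" array. The sketch below repeats extract_jsonld so it runs on its own and adds a hypothetical find_typed helper; the SAMPLE page is invented for illustration and is not real listing data.

```python
import json
import re


def extract_jsonld(html: str) -> list:
    """Same extraction as above: every ld+json block, parsed."""
    blocks = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html, flags=re.DOTALL | re.IGNORECASE,
    )
    out = []
    for raw in blocks:
        try:
            out.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            continue
    return out


def find_typed(node, wanted: str):
    """Yield every dict whose "@type" equals `wanted`, at any nesting depth."""
    if isinstance(node, list):
        for item in node:
            yield from find_typed(item, wanted)
    elif isinstance(node, dict):
        if node.get("@type") == wanted:
            yield node
        for value in node.values():
            yield from find_typed(value, wanted)


# Invented sample page, for illustration only.
SAMPLE = '''<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "Product", "name": "Example Condo",
   "offers": {"@type": "Offer", "price": "15000", "priceCurrency": "THB"}},
  {"@type": "Place", "geo": {"@type": "GeoCoordinates", "latitude": 18.78}}
]}
</script></head></html>'''

if __name__ == "__main__":
    blocks = extract_jsonld(SAMPLE)
    for product in find_typed(blocks, "Product"):
        print(product["name"], product["offers"]["price"])
    for place in find_typed(blocks, "Place"):
        print(place["geo"]["latitude"])
```

Real pages sometimes give "@type" as a list of types; this sketch only matches the plain-string form.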
7b. Regex the visible text as a fallback
When there's no JSON-LD, you fall back to pattern-matching the rendered text. The rental scraper finds Thai-baht prices with one small regex:
import re
THB = re.compile(r'฿\s?([\d,]{3,12})')
def first_price(html: str):
    m = THB.search(html)
    return int(m.group(1).replace(",", "")) if m else None
The [\d,]{3,12} says "3 to 12 digits-or-commas" so it catches 15,000 but not a stray single digit. .replace(",", "") strips the thousands separators before int(). We'll lean on this same trick in the capstone, for dollars as well as baht.
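The same idea extends to other currencies by building the pattern from the symbol. This is a sketch, not the capstone's code: price_pattern and all_prices are hypothetical names, and the pattern is slightly stricter than the one above (it must start with a digit and may end in two decimals).

```python
import re


def price_pattern(symbol: str) -> re.Pattern:
    """Currency symbol, optional space, then 3 to 12 digits-or-commas starting
    with a digit, optionally followed by two decimal places."""
    return re.compile(re.escape(symbol) + r'\s?(\d[\d,]{2,11}(?:\.\d{2})?)')


def all_prices(text: str, symbol: str) -> list:
    """Every price for `symbol` found in `text`, as floats, in page order."""
    return [float(raw.replace(",", ""))
            for raw in price_pattern(symbol).findall(text)]


if __name__ == "__main__":
    print(all_prices("from ฿15,000 to ฿ 9,500 per month", "฿"))
    print(all_prices("Now $3,999.00 (was $4,299.99)", "$"))
```

re.escape matters for the dollar case: an unescaped $ would mean "end of string" to the regex engine.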
การแปลกำลังจะมา
content() 把 HTML 字符串交给你之后,你得在里面找到真正的数据。有两种好办法,按优先级排列。
7a. 读网站自己嵌进去的结构化数据
大多数房源和购物网站会把数据的机器可读副本以 JSON-LD 形式嵌进页面 —— 一个 <script type="application/ld+json"> 块,放在那里好让 Google 展示富结果。它比不停变化的可见标记稳定得多。真实抓取代码做的正是这件事:
import json, re
def extract_jsonld(html: str) -> list:
    """把页面里每个 application/ld+json 块解析为 Python 对象。"""
    blocks = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html, flags=re.DOTALL | re.IGNORECASE,
    )
    out = []
    for raw in blocks:
        try:
            out.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            continue  # 跳过损坏的块,别崩
    return out
从这些对象里你能读到干净的字段,比如 name、price、geo.latitude —— 不用在脆弱的 HTML 里翻找。
7b. 退而求其次:用正则匹配可见文本
没有 JSON-LD 时,就退回到对渲染文本做模式匹配。租房抓取代码用一个小正则找泰铢价格:
import re
THB = re.compile(r'฿\s?([\d,]{3,12})')
def first_price(html: str):
    m = THB.search(html)
    return int(m.group(1).replace(",", "")) if m else None
[\d,]{3,12} 表示"3 到 12 个数字或逗号",于是能抓到 15,000 而不会抓到一个孤零零的数字。.replace(",", "") 在 int() 之前去掉千位分隔符。压轴示例里我们会用同一招,同时处理美元和泰铢。
8. Capstone — an MSI EdgeXpert price-comparison bot · โปรเจกต์รวบยอด — บอทเทียบราคา · 压轴 —— MSI EdgeXpert 比价机器人
Now the payoff. We want the price of the MSI EdgeXpert — MSI's build of the NVIDIA DGX Spark mini-AI-supercomputer (the same class of box krueng.ai runs its local avatar tutor on) — across three storefronts: Amazon (global, prices in USD), Shopee and Lazada (Southeast Asia, prices in Thai baht). All three are JavaScript-heavy and bot-protected; a plain requests.get on any of their search pages returns an empty shell or a block page. CloakBrowser loads the real, rendered search results.
Following the rule from §6, this is one config-driven loop, not three copy-pasted scrapers. Each site is a dict carrying its own extraction pattern; the loop pulls the same three fields from each — a product name, its price, and a link to the item page:
"""The heart of the capstone. The full cloakbrowser_tutorial.py (Appendix A) wraps
this as the compare() step, with a JSON-LD fallback and a page-price range. Here we
do the core job: for each storefront, pull the product results — name, price, and a
link to the item page.
A plain requests.get() returns a Cloudflare/JS shell on all three storefronts;
CloakBrowser loads the real, rendered search results.
"""
import re
import sys
from cloakbrowser import launch
sys.stdout.reconfigure(encoding="utf-8") # so printing ฿ doesn't crash on Windows
# Each site buries product data differently, so the extraction pattern lives in
# the config (named groups: url, name, price) and the loop stays generic.
SITES = [
    {"name": "Amazon", "url": "https://www.amazon.com/s?k=MSI+EdgeXpert",
     "item_pattern": None},  # usually blocks scripted hits
    {"name": "Shopee", "url": "https://shopee.co.th/search?keyword=MSI%20EdgeXpert",
     "item_pattern": None},  # results load via API, not in the HTML
    {"name": "Lazada", "url": "https://www.lazada.co.th/catalog/?q=MSI+EdgeXpert",
     # in Lazada's grid the link, title and price sit right next to each other:
     "item_pattern": r'href="(?P<url>//www\.lazada\.co\.th/products/[^"]+)"\s+'
                     r'title="(?P<name>[^"]+)".{0,1500}?'
                     r'<span class="ooOxS">฿(?P<price>[\d,]+(?:\.\d{2})?)'},
]
def detail_items(html, pattern, limit=6):
    """Pull product results (name, price, link) out of the rendered page."""
    if not pattern:
        return []
    items, seen = [], set()
    for m in re.finditer(pattern, html, re.DOTALL):
        g = m.groupdict()
        url = "https:" + g["url"] if g["url"].startswith("//") else g["url"]
        if url in seen:
            continue
        seen.add(url)
        items.append({"name": g["name"].strip(),
                      "price": float(g["price"].replace(",", "")),
                      "url": url})
        if len(items) >= limit:
            break
    return items
BLOCK = ("Robot Check", "validateCaptcha", "automated access")  # bot-wall markers
def blocked(html):
    return len(html) < 5000 or any(m in html for m in BLOCK)
def shop(browser, site, tries=3):
    """Fetch a storefront, retrying past a transient block, then extract."""
    for i in range(tries):
        page = browser.new_page()
        try:
            page.goto(site["url"], timeout=60000, wait_until="domcontentloaded")
            page.wait_for_timeout(6000)  # let results render
            html = page.content()
            if not blocked(html):
                return detail_items(html, site["item_pattern"])
            if i < tries - 1:  # no point waiting after the final attempt
                page.wait_for_timeout(6000 * (i + 1))  # transient block: back off, retry
        finally:
            page.close()
    return []  # still blocked (Amazon rate-limits); see the note on its API
browser = launch()
try:
    for site in SITES:
        items = shop(browser, site)
        print(f"\n{site['name']}: {len(items)} result(s)")
        for it in items:
            print(f" ฿{it['price']:>10,.0f} {it['name'][:50]}")
            print(f" {'':>12} {it['url']}")  # link to the item page
finally:
    browser.close()
What makes this work where requests can't:
- Each storefront's search page is rendered by JavaScript after the HTML arrives. requests sees the empty shell; CloakBrowser waits (wait_for_timeout(6000)) and reads the populated grid.
- All three sit behind bot protection that 403s a scripted request. The stealth browser passes the check.
- The extraction pattern lives in each site's dict, so the loop is identical for all of them: adding a fourth storefront is one new entry, not a new function. The named groups (url, name, price) mean the assembling code doesn't care what order a site lists them in.
- The sys.stdout.reconfigure(encoding="utf-8") line is not decoration: without it, printing a ฿ sign crashes on a default Windows console (cp1252). The first run of this script taught us that the hard way.
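The opening question was "where is it cheapest?", and answering it across USD and THB needs a common currency. The sketch below assumes each storefront carries a rate in THB per unit of its currency, as the SITES config in Appendix A does (36.0 for Amazon there). The function name cheapest, the offer dicts, and every price in the demo are placeholders, not real quotes.

```python
def cheapest(offers: list, rates: dict):
    """Return (offer, price_in_thb) for the lowest THB-equivalent price.

    offers: dicts with "site", "name", "price" (price may be None).
    rates:  THB per unit of each site's currency, keyed by site name.
    Returns None when no offer has a usable price.
    """
    priced = [(o["price"] * rates[o["site"]], o)
              for o in offers if o.get("price") is not None]
    if not priced:
        return None
    thb, offer = min(priced, key=lambda pair: pair[0])
    return offer, thb


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    demo = [
        {"site": "Amazon", "name": "unit A", "price": 3999.0},
        {"site": "Lazada", "name": "unit B", "price": 139900.0},
        {"site": "Shopee", "name": "unit C", "price": None},
    ]
    print(cheapest(demo, {"Amazon": 36.0, "Lazada": 1.0, "Shopee": 1.0}))
```

A fixed rate is only a rough guide; shipping, import duty, and the day's exchange rate can change the ranking.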
การแปลกำลังจะมา
到了收获的时候。我们想知道 MSI EdgeXpert 的价格 —— 它是 MSI 版的 NVIDIA DGX Spark 迷你 AI 超算(krueng.ai 跑本地虚拟人老师用的就是同一类机器)—— 在三个店面上的报价:Amazon(全球,美元)、Shopee 和 Lazada(东南亚,泰铢)。三家都重度依赖 JavaScript 且有反爬;对它们任何一个搜索页做普通的 requests.get,拿回的都是空壳或拦截页。CloakBrowser 能加载真正渲染好的搜索结果。
遵循 §6 的规则,这是一个配置驱动的函数,不是三段复制粘贴的抓取器。每个网站是列表里的一个字典;循环对所有网站都一样:
"""压轴的核心。完整的 cloakbrowser_tutorial.py(附录 A)把它包成 compare() 那一步,
还加了 JSON-LD 兜底和页面价格区间。这里做核心工作:对每个店面,抠出商品结果 ——
名称、价格,以及商品页的链接。
普通的 requests.get() 在三家都只拿到 Cloudflare/JS 空壳;CloakBrowser
能加载真正渲染好的搜索结果。
"""
import re
import sys
from cloakbrowser import launch
sys.stdout.reconfigure(encoding="utf-8") # 这样在 Windows 上打印 ฿ 才不崩
# 每家把商品数据藏在不同地方,所以提取模式放进配置里(命名组:url、name、price),
# 循环保持通用。
SITES = [
    {"name": "Amazon", "url": "https://www.amazon.com/s?k=MSI+EdgeXpert",
     "item_pattern": None},  # 通常会拦截脚本请求
    {"name": "Shopee", "url": "https://shopee.co.th/search?keyword=MSI%20EdgeXpert",
     "item_pattern": None},  # 结果通过 API 加载,不在 HTML 里
    {"name": "Lazada", "url": "https://www.lazada.co.th/catalog/?q=MSI+EdgeXpert",
     # 在 Lazada 的结果网格里,链接、标题、价格挨在一起:
     "item_pattern": r'href="(?P<url>//www\.lazada\.co\.th/products/[^"]+)"\s+'
                     r'title="(?P<name>[^"]+)".{0,1500}?'
                     r'<span class="ooOxS">฿(?P<price>[\d,]+(?:\.\d{2})?)'},
]
def detail_items(html, pattern, limit=6):
    """从渲染后的页面里抠出商品结果(名称、价格、链接)。"""
    if not pattern:
        return []
    items, seen = [], set()
    for m in re.finditer(pattern, html, re.DOTALL):
        g = m.groupdict()
        url = "https:" + g["url"] if g["url"].startswith("//") else g["url"]
        if url in seen:
            continue
        seen.add(url)
        items.append({"name": g["name"].strip(),
                      "price": float(g["price"].replace(",", "")),
                      "url": url})
        if len(items) >= limit:
            break
    return items
BLOCK = ("Robot Check", "validateCaptcha", "automated access")  # 反爬墙的标志串
def blocked(html):
    return len(html) < 5000 or any(m in html for m in BLOCK)
def shop(browser, site, tries=3):
    """抓取一个店面,遇到临时拦截就重试,然后提取。"""
    for i in range(tries):
        page = browser.new_page()
        try:
            page.goto(site["url"], timeout=60000, wait_until="domcontentloaded")
            page.wait_for_timeout(6000)  # 等结果渲染
            html = page.content()
            if not blocked(html):
                return detail_items(html, site["item_pattern"])
            if i < tries - 1:  # 最后一次尝试之后不必再等
                page.wait_for_timeout(6000 * (i + 1))  # 临时拦截:退避后重试
        finally:
            page.close()
    return []  # 仍被拦(Amazon 会限速);见关于它 API 的说明
browser = launch()
try:
    for site in SITES:
        items = shop(browser, site)
        print(f"\n{site['name']}:{len(items)} 条结果")
        for it in items:
            print(f" ฿{it['price']:>10,.0f} {it['name'][:50]}")
            print(f" {'':>12} {it['url']}")  # 商品页链接
finally:
    browser.close()
它能成而 requests 不能的原因:
- 每个店面的搜索页是在 HTML 到达之后由 JavaScript 渲染的。requests 看到的是空壳;CloakBrowser 等待(wait_for_timeout(6000))后读到已填充的网格。
- 三家都有反爬,会对脚本请求返回 403。隐身浏览器能通过检测。
- 提取模式放在每个网站的字典里,所以循环对三家都一样:加第四家只是多一个条目,不是多一个函数。命名组(url、name、price)让组装代码不在乎各家把它们排成什么顺序。
- sys.stdout.reconfigure(encoding="utf-8") 这行不是装饰:没有它,在默认 Windows 控制台(cp1252)打印 ฿ 会崩溃。这个脚本第一次运行就用崩溃教会了我们这一点。
9. Scraping responsibly (not optional) · การ scrape อย่างมีความรับผิดชอบ · 负责任地抓取(非可选)
A stealth browser makes it possible to ignore a site's wishes. That doesn't make it right. The line is simple: read what a human visitor could read, at a human pace, for a legitimate purpose.
- Check robots.txt and the Terms of Service. Some sites forbid scraping outright. Personal price-checking is one thing; redistributing or reselling someone's data is another.
- Go slow. Add page.wait_for_timeout() between requests. A few pages spaced out by seconds looks like a person; hundreds per minute is an attack and can knock a small site over.
- Cache aggressively. This is the real reason §6 saves raw HTML: re-parsing a file you already fetched puts zero load on anyone's server. Never re-fetch what you already have.
- Take only what you need, and prefer an official API when one exists. Amazon, for instance, has a Product Advertising API; scraping is the fallback, not the first choice.
- Don't scrape personal data or anything behind a login you're not entitled to bypass.
Used this way, CloakBrowser is just a more capable version of opening the page yourself — which is exactly what it is.
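The robots.txt check can be automated with Python's standard library. A minimal sketch: the helper name allowed and the SAMPLE_ROBOTS rules are invented for illustration; in real use you would first fetch the site's own /robots.txt text and pass it in.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real site's robots.txt will differ.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
"""


def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse text we already have
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(SAMPLE_ROBOTS, "https://example.com/search?q=x"))
    print(allowed(SAMPLE_ROBOTS, "https://example.com/checkout/cart"))
```

robots.txt covers crawling rules only; the Terms of Service still have to be read by a person.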
การแปลกำลังจะมา
隐身浏览器让"无视网站的意愿"成为可能,但这不代表它正当。界线很简单:读一个真人访客本来就能读的内容,用真人的节奏,为正当的目的。
- 查 robots.txt 和服务条款。有些网站明令禁止抓取。个人查价是一回事;转发或转售别人的数据是另一回事。
- 放慢。在请求之间加 page.wait_for_timeout()。几页、间隔几秒,像个人;每分钟几百次就是攻击,能把小网站打垮。
- 积极缓存。这正是 §6 保存原始 HTML 的真正原因:重新解析一个你已经抓过的文件,对任何人的服务器都是零负载。永远不要重抓你已经有的东西。
- 只取所需,有官方 API 就优先用。比如 Amazon 有 Product Advertising API;抓取是备选,不是首选。
- 不要抓个人数据,也不要抓你无权绕过的登录之后的内容。
这样使用,CloakBrowser 不过是"你自己打开页面"的一个更能干的版本 —— 它本来就是这个。
10. What to read next · อ่านอะไรต่อ · 接下来读什么
- Get the script & run it: download the whole tutorial as one runnable script (this page's capstone plus the earlier lessons). Setup and run commands are in that section.
- Python Basics: if any of the syntax above was new (functions, f-strings, try/except).
- python_apis_async.ipynb: when a site does offer an API, you'll often want it over scraping. This covers calling HTTP APIs with requests.
- The requests quickstart and BeautifulSoup docs: the two libraries you'll pair with CloakBrowser most often.
- The official Python tutorial: docs.python.org/3/tutorial.
The deepest lesson here isn't about any one library. It's the pipeline shape: fetch slowly and save raw, parse fast and often, render last. Once you internalise that split, every scraper you write gets easier to debug.
การแปลกำลังจะมา
- 下载脚本并运行:把整个教程作为一个可运行脚本下载(本页的压轴加上前面的小节)。安装和运行命令都在那一节。
- Python 基础:如果上面有什么语法对你是新的(函数、f-string、try/except)。
- python_apis_async.ipynb:当网站确实提供 API 时,往往优先用 API 而不是抓取。这篇讲用 requests 调 HTTP API。
- requests 的 quickstart 和 BeautifulSoup 文档:最常和 CloakBrowser 搭配的两个库。
- Python 官方教程:docs.python.org/3/tutorial。
这里最深的一课不在于某个库,而在于流水线的形状:慢慢抓、存原始;快快解析、反复解析;最后才渲染。 一旦你把这种拆分内化,你写的每个抓取器都会更好调试。
11. Get the script & run it · ดาวน์โหลดสคริปต์แล้วรัน · 下载脚本并运行
Everything on this page is bundled into one runnable script. Two copies with identical code are provided; they differ only in line endings, so grab the one that matches your OS:
Install CloakBrowser
In your terminal (PowerShell, Terminal, or your IDE's) — not inside Python. The first install downloads a patched Chromium (~200 MB), so it's slow once, then cached. Section 3 above shows the virtual-environment version.
pip install cloakbrowser
Run it
With no argument it runs the MSI EdgeXpert price comparison. Pass a step name to run a single lesson instead:
python cloakbrowser_tutorial.py # the price comparison (default)
python cloakbrowser_tutorial.py minimal # the 5-line recipe
python cloakbrowser_tutorial.py multipage # one browser, many pages -> raw_*.html
python cloakbrowser_tutorial.py jsonld # extract the embedded JSON-LD
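A command line like this is typically wired up with a small name-to-function table. The sketch below shows one way to do it; it is an assumption about the shape, not the script's actual entry point, and pick_step is a hypothetical name. The step names mirror the commands above, with compare as the default price comparison.

```python
import sys


def pick_step(argv: list, steps: dict, default: str):
    """Return the function for the step named in argv[1], or the default step."""
    name = argv[1] if len(argv) > 1 else default
    if name not in steps:
        raise SystemExit(f"unknown step {name!r}; choose from: {', '.join(steps)}")
    return steps[name]


if __name__ == "__main__":
    # Stand-in functions; the real script would map these names to its lessons.
    steps = {
        "compare": lambda: print("running the price comparison"),
        "minimal": lambda: print("running the minimal recipe"),
        "multipage": lambda: print("running the multi-page lesson"),
        "jsonld": lambda: print("running the JSON-LD lesson"),
    }
    pick_step(sys.argv, steps, default="compare")()
```

Raising SystemExit with a message prints it to stderr and exits with a non-zero status, which is the conventional behaviour for a bad argument.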
การแปลกำลังจะมา
本页的全部内容打包成了一个可运行脚本。提供两份代码完全相同的副本,它们只有行尾不同,挑选与你操作系统匹配的那个:
安装 CloakBrowser
在终端里运行(PowerShell、Terminal 或 IDE 的终端)—— 不是在 Python 里。首次安装会下载改过的 Chromium(约 200 MB),慢一次,之后缓存。虚拟环境版本见上面第 3 节。
pip install cloakbrowser
运行
不带参数时跑 MSI EdgeXpert 比价。传一个小节名就只跑那一节:
python cloakbrowser_tutorial.py # 比价(默认)
python cloakbrowser_tutorial.py minimal # 五行配方
python cloakbrowser_tutorial.py multipage # 一个浏览器多个页面 -> raw_*.html
python cloakbrowser_tutorial.py jsonld # 提取嵌入的 JSON-LD
Appendix A: Full source: cloakbrowser_tutorial.py · ภาคผนวก A · 附录 A:cloakbrowser_tutorial.py 完整源码
"""cloakbrowser_tutorial.py — learn CloakBrowser by running it.
Setup, usage, and the "why a script and not a notebook" note all live on the
lesson page: https://krueng.ai/cloakbrowser.html
"""
import re
import sys
import json
import statistics
from pathlib import Path
from cloakbrowser import launch
# Windows consoles default to cp1252, which can't encode ฿ or — . Force UTF-8 so
# printing prices doesn't crash. (Same family as the Thai-font codec gotchas.)
sys.stdout.reconfigure(encoding="utf-8")
# Step 1: the minimal recipe
def minimal():
    """Launch -> open a tab -> navigate -> wait for JS -> read the HTML."""
    browser = launch()
    try:
        page = browser.new_page()
        page.goto("https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai",
                  timeout=60000, wait_until="domcontentloaded")
        page.wait_for_timeout(4000)  # let JavaScript-rendered content settle
        html = page.content()
        page.close()
    finally:
        browser.close()  # always close: it's a real process
    print(len(html), "bytes of real, rendered HTML")
    print("contains a baht price:", "฿" in html)
# Step 2: one browser, many pages, save raw
def multipage():
    """Launch once, open a FRESH page per URL, save each raw page before parsing."""
    urls = [
        "https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai/mueang-chiang-mai/tha-sala",
        "https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai/mueang-chiang-mai/wat-ket",
    ]
    polite_delay = 2000  # ms; courtesy gap between hits to the SAME site. Etiquette
                         # and load-reduction, NOT a rate-limit cure (see fetch()).
    browser = launch()
    try:
        for i, u in enumerate(urls):
            page = browser.new_page()  # fresh tab per URL (avoids nav races)
            try:
                if i:
                    page.wait_for_timeout(polite_delay)  # don't hammer the same host
                page.goto(u, timeout=60000, wait_until="domcontentloaded")
                page.wait_for_timeout(5000)
                name = u.rstrip("/").rsplit("/", 1)[-1]
                Path(f"raw_{name}.html").write_text(page.content(), encoding="utf-8")
                print(f"ok {name}")
            except Exception as e:
                print(f"FAIL {u}: {e}")  # one bad URL shouldn't kill the run
            finally:
                page.close()  # close the tab, keep the browser
    finally:
        browser.close()  # close the browser exactly once
# Step 3: pull data out — JSON-LD first
def extract_jsonld(html: str) -> list:
    """Pull every application/ld+json block out of a page as Python objects.
    Most listing/shopping sites embed a machine-readable copy of their data for
    Google. It's far more stable than the visible markup, which changes constantly.
    """
    blocks = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html, flags=re.DOTALL | re.IGNORECASE,
    )
    out = []
    for raw in blocks:
        try:
            out.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            continue  # skip malformed blocks, don't crash
    return out
def jsonld_products(html: str) -> list:
    """Every schema.org Product in the page's JSON-LD, as {name, price, url}.
    A clean, fairly site-agnostic way to get product names + links: many shops
    embed Product objects for Google. Price is often absent here (see detail_items).
    """
    out, seen = [], set()
    for block in extract_jsonld(html):
        stack = [block]
        while stack:  # iterative walk over nested lists and dicts
            d = stack.pop()
            if isinstance(d, list):
                stack.extend(d)
            elif isinstance(d, dict):
                if d.get("@type") == "Product":
                    url = d.get("url") or d.get("@id") or ""
                    if url.startswith("//"):
                        url = "https:" + url
                    offers = d.get("offers") or {}
                    if isinstance(offers, list):
                        offers = offers[0] if offers else {}
                    price = offers.get("price") if isinstance(offers, dict) else None
                    name = (d.get("name") or "").strip()
                    if name and url and url not in seen:
                        seen.add(url)
                        out.append({"name": name,
                                    "price": float(price) if price else None,
                                    "url": url})
                stack.extend(d.values())
    return out
def jsonld():
    """Fetch one page and show the first structured-data block."""
    browser = launch()
    try:
        page = browser.new_page()
        page.goto("https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai",
                  timeout=60000, wait_until="domcontentloaded")
        page.wait_for_timeout(4000)
        blocks = extract_jsonld(page.content())
        page.close()
    finally:
        browser.close()
    print(f"found {len(blocks)} JSON-LD block(s)")
    if blocks:
        print(json.dumps(blocks[0], indent=2, ensure_ascii=False)[:600])
# Step 4 (capstone): MSI EdgeXpert shopping results, with links
QUERY = "MSI EdgeXpert"
# One row per storefront. `patterns` finds bare prices (for a page price range);
# `item_pattern` (optional) captures name + price + product link together via
# named groups. Each site buries data differently, so this lives in the config
# and the scraping loop stays generic. rate = THB per unit.
SITES = [
    {"name": "Amazon", "rate": 36.0,
     "url": "https://www.amazon.com/s?k=MSI+EdgeXpert",
     "patterns": [r'a-price-whole[^>]*>([\d,]{3,9})', r'\$\s?([\d,]{3,9}(?:\.\d{2})?)'],
     "item_pattern": None},
    {"name": "Shopee", "rate": 1.0,
     "url": "https://shopee.co.th/search?keyword=MSI%20EdgeXpert",
     "patterns": [r'฿\s?([\d,]{3,9})'],
     "item_pattern": None},
    {"name": "Lazada", "rate": 1.0,
     "url": "https://www.lazada.co.th/catalog/?q=MSI+EdgeXpert",
     "patterns": [r'฿\s?([\d,]{3,9})'],
     # name + price + link are adjacent in Lazada's result grid:
     "item_pattern": r'href="(?P<url>//www\.lazada\.co\.th/products/[^"]+)"\s+'
                     r'title="(?P<name>[^"]+)".{0,1500}?'
                     r'<span class="ooOxS">฿(?P<price>[\d,]+(?:\.\d{2})?)'},
]
def prices_in(html: str, patterns: list) -> list:
    """Collect every bare price the site's patterns match, as a list of floats."""
    out = []
    for pat in patterns:
        for raw in re.findall(pat, html):
            try:
                out.append(float(raw.replace(",", "")))
            except ValueError:
                continue
    return out
def detail_items(html: str, site: dict, limit: int = 8) -> list:
    """Top product results as {name, price, url}.
    Uses the site's grid pattern when given (name + price + link together);
    otherwise falls back to JSON-LD products (name + link, price if present).
    """
    pat = site.get("item_pattern")
    if pat:
        items, seen = [], set()
        for m in re.finditer(pat, html, re.DOTALL):
            g = m.groupdict()
            url = g["url"]
            if url.startswith("//"):
                url = "https:" + url
            if url in seen:
                continue
            seen.add(url)
            try:
                price = float(g["price"].replace(",", ""))
            except (KeyError, ValueError, TypeError, AttributeError):
                price = None
            items.append({"name": g.get("name", "").strip(), "price": price, "url": url})
            if len(items) >= limit:
                break
        if items:
            return items
    return jsonld_products(html)[:limit]
# Markers a storefront serves when it turns a scripted hit away.
BLOCK_MARKERS = ("Robot Check", "validateCaptcha", "api-services-support@amazon",
                 "Enter the characters you see below", "automated access")

def is_blocked(html: str) -> bool:
    """A tiny page or a CAPTCHA / Robot-Check marker means we were turned away."""
    return len(html) < 5000 or any(m in html for m in BLOCK_MARKERS)
def fetch(browser, url: str, tries: int = 3) -> str:
    """Load a URL, retrying past a transient bot-block with growing back-off.

    This catches LIGHT, transient blocks (a fresh-ish IP that gets challenged
    once). It will NOT clear a deep rate-limit — hit Amazon enough times from one
    IP and it puts you in a cooldown that no amount of back-off fixes in seconds.
    For anything you actually depend on, use a storefront's official API instead.
    """
    html = ""
    for i in range(tries):
        page = browser.new_page()
        try:
            page.goto(url, timeout=60000, wait_until="domcontentloaded")
            page.wait_for_timeout(6000)                 # search grids render late
            html = page.content()
            if not is_blocked(html):
                return html
            if i < tries - 1:
                page.wait_for_timeout(6000 * (i + 1))   # 6s, then 12s back-off
        except Exception:
            pass                                        # timeout/nav error → just retry
        finally:
            page.close()
    return html                                         # still blocked; caller checks is_blocked()
def scrape(browser, site: dict) -> dict:
    """Load one storefront's search page; return its product results + status."""
    html = fetch(browser, site["url"])
    # Convert bare page prices to THB via the site's rate, since compare() prints them with ฿.
    prices = [p * site["rate"] for p in prices_in(html, site["patterns"])]
    return {"name": site["name"],
            "items": detail_items(html, site),
            "prices": prices,
            "blocked": is_blocked(html),
            "note": ""}
def compare():
    """Scrape all three storefronts and print product results with links."""
    browser = launch()
    try:
        rows = [scrape(browser, s) for s in SITES]
    finally:
        browser.close()
    print(f"\nMSI EdgeXpert — shopping results ('{QUERY}')\n" + "=" * 60)
    for r in rows:
        nm = r["name"]
        if r["blocked"]:
            print(f"\n{nm} — blocked even after retries: an IP rate-limit from too many")
            print(f"{'':>9}recent runs. It clears on its own — wait a while, or use the API.")
            continue
        if not r["items"]:
            print(f"\n{nm} — no product details in the page HTML "
                  f"(results load via API after render)")
            continue
        rng = ""
        if r["prices"]:
            rng = f" · page prices ฿{min(r['prices']):,.0f}–฿{max(r['prices']):,.0f}"
        shown = r["items"][:6]
        print(f"\n{nm} — {len(r['items'])}+ result(s){rng}; top {len(shown)}:")
        for it in shown:
            pr = f"฿{it['price']:,.0f}" if it.get("price") else "฿—"
            print(f" {pr:>11} {it['name'][:52]}")
            print(f" {'':>11} {it['url']}")
        priced = [it for it in r["items"] if it.get("price")]
        if priced:
            c = min(priced, key=lambda it: it["price"])
            print(f" → cheapest listed: ฿{c['price']:,.0f} {c['url']}")
    print("\nThis is a demonstration of scraping, blocks and all. When a storefront")
    print("offers an official API (e.g. Amazon's Product Advertising API), prefer it —")
    print("it's stable and won't rate-limit you like this. But APIs often need approval,")
    print("cost money, or don't exist for the site you care about; that's exactly where")
    print("a stealth browser like CloakBrowser earns its place. API when you can,")
    print("scrape when you can't.")
    print("\nResults are a mix and shift run to run: the real MSI EdgeXpert (a rebadged")
    print("NVIDIA DGX Spark) shows up on Lazada when in stock, next to related GPUs.")
    print("Open a link to confirm the actual product and its current price.")
STEPS = {"minimal": minimal, "multipage": multipage, "jsonld": jsonld, "compare": compare}

def main():
    step = sys.argv[1] if len(sys.argv) > 1 else "compare"
    if step not in STEPS:
        print(f"Unknown step '{step}'. Choose one of: {', '.join(STEPS)}")
        return
    STEPS[step]()

if __name__ == "__main__":
    main()
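The Lazada `item_pattern` in `SITES` is the least obvious part of the script, so here is a small offline sketch of how its named groups turn one result card into a `{name, price, url}` row. The pattern is copied from the config above; `SAMPLE_HTML` and `parse_items` are made up for illustration, and real Lazada markup is longer and may change at any time.

```python
import re

# Same Lazada item pattern as in the SITES config.
ITEM_PATTERN = (r'href="(?P<url>//www\.lazada\.co\.th/products/[^"]+)"\s+'
                r'title="(?P<name>[^"]+)".{0,1500}?'
                r'<span class="ooOxS">฿(?P<price>[\d,]+(?:\.\d{2})?)')

# Hypothetical, hand-written stand-in for one result card (not real Lazada output).
SAMPLE_HTML = (
    '<a href="//www.lazada.co.th/products/example-i123.html" '
    'title="Example Mini PC"><img src="x.jpg"></a>'
    '<div class="price"><span class="ooOxS">฿129,900.00</span></div>'
)

def parse_items(html: str) -> list:
    """Illustrative helper: one dict per match of ITEM_PATTERN."""
    items = []
    # re.DOTALL lets the lazy .{0,1500}? gap span line breaks between link and price.
    for m in re.finditer(ITEM_PATTERN, html, re.DOTALL):
        url = m.group("url")
        if url.startswith("//"):          # protocol-relative link → absolute
            url = "https:" + url
        items.append({"name": m.group("name").strip(),
                      "price": float(m.group("price").replace(",", "")),
                      "url": url})
    return items

items = parse_items(SAMPLE_HTML)
print(items)
```

Testing a pattern against a saved snippet like this is cheaper than re-running the browser, and it spares the storefront (and your IP) a few extra hits while you tune the regex.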
Appendix B — The minimal recipe, copy-paste ready
from cloakbrowser import launch

browser = launch()
try:
    page = browser.new_page()
    page.goto("https://www.fazwaz.com/condo-for-rent/thailand/chiang-mai",
              timeout=60000, wait_until="domcontentloaded")
    page.wait_for_timeout(4000)
    html = page.content()
    print(len(html), "bytes of real, rendered HTML")
    page.close()
finally:
    browser.close()
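A byte count alone does not tell you whether you got the real page or a challenge page. As a rough follow-up, the sketch below applies the same idea as `is_blocked` from the full script to two made-up strings. `looks_blocked`, the shortened marker list, and both sample pages are illustrative assumptions, not part of CloakBrowser.

```python
# Subset of the BLOCK_MARKERS used in the full script.
BLOCK_MARKERS = ("Robot Check", "validateCaptcha", "automated access")

def looks_blocked(html: str, min_bytes: int = 5000) -> bool:
    """Heuristic: a tiny page or a known challenge marker suggests a block page."""
    return len(html) < min_bytes or any(m in html for m in BLOCK_MARKERS)

# Made-up inputs: a short challenge page and a long page of listing-like markup.
challenge = "<html><title>Robot Check</title></html>"
real_page = "<html>" + "<div class='listing'>condo</div>" * 400 + "</html>"

challenge_blocked = looks_blocked(challenge)
real_blocked = looks_blocked(real_page)
print(challenge_blocked, real_blocked)
```

In the recipe above you would call `looks_blocked(html)` right after `page.content()` and only parse when it returns `False`. The 5000-byte floor is a heuristic taken from the full script, so adjust it for sites whose real pages are small.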