CatEye

Preface

Last time I used Claude (a relayed Claude Opus 4.6) to do the refactor, but the web UI it produced was ugly, and the money I had loaded onto the relay site ran out.

After that I planned to keep modifying it myself, but the code style really didn't suit me and editing it felt miserable, so I simply rewrote it from scratch. Money down the drain.

The goal is still to refactor the threat-sail project, decouple it, and have AI write a nicer frontend along the way.

The web UI was written with Codex, which is good and cheap: 5 RMB a month gets you into a team, and I bought two seats to alternate between, far cheaper than the relayed Opus.


Spider

Same design as last time: each crawler is a standalone script that produces data, using `.running` / `.done` marker files to indicate its state.
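
The marker-file handoff can be sketched like this (a minimal reconstruction under my own names and file layout, not the project's actual code): the spider touches `.running` while it works, and only touches `.done` after the data file is fully written, so a downstream consumer never reads a half-finished file.

```python
# Sketch of the ".running" / ".done" marker-file convention for one spider run.
# Names (run_spider, out_dir, produce) are illustrative, not the project's API.
from pathlib import Path


def run_spider(out_dir: str, name: str, produce) -> None:
    """Write items to <name>.json, marking progress with .running/.done files."""
    d = Path(out_dir)
    d.mkdir(parents=True, exist_ok=True)
    running = d / f"{name}.running"
    done = d / f"{name}.done"
    running.touch()                    # signal: spider started, data incomplete
    try:
        (d / f"{name}.json").write_text(produce(), encoding="utf-8")
        done.touch()                   # signal: data complete, safe to consume
    finally:
        running.unlink(missing_ok=True)
```

A consumer then only picks up `<name>.json` files whose `.done` marker exists, which is what makes each spider safely restartable as an independent script.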


For fetching article content, the previous project took browser screenshots and ran a vision model over them. Then on a WeChat public account I came across the Crawl4AI project, which claims to turn pages directly into clean Markdown.

So I gave it a try, using it straight from the instructions on GitHub.


It kept failing on WeChat articles, probably a request-header issue, and it was slow too, so I decided to look around and write my own solution.

The core requirement is just HTML => Markdown. I found trafilatura, and it works quite well; at least for ordinary blog posts the extraction is flawless.

```python
def get_html_browser(url: str, timeout=45) -> str:
    try:
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            browser = p.chromium.launch(
                headless=True,
                args=[
                    "--no-sandbox",
                    "--disable-dev-shm-usage",
                    "--disable-blink-features=AutomationControlled"
                ]
            )
            context = browser.new_context(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
            )
            page = context.new_page()
            page.goto(url, timeout=timeout * 1000, wait_until="load")
            page.wait_for_timeout(30 * 1000)
            text = page.content()
            browser.close()
            return text
    except Exception as e:
        print('[-] Use Browser get {} html failed, error is {}'.format(url, e))
        return ''


def get_html_requests(url: str, timeout=45) -> str:
    try:
        import requests
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
        }
        res = requests.get(url, headers=headers, timeout=timeout, verify=False)
        res.raise_for_status()
        return res.text
    except Exception as e:
        print('[-] Use Requests get {} html failed, error is {}'.format(url, e))
        return ''


def fetch_webpage(url: str, use_browser=False, timeout=45) -> str:
    try:
        print('[+] Fetching {} webpage ...'.format(url))
        from trafilatura import extract
        if use_browser:
            html = get_html_browser(url, timeout)
        else:
            html = get_html_requests(url, timeout)
        extracted = extract(
            html,
            output_format='markdown',
            include_links=True,
            include_images=False,
            with_metadata=True,
            url=url
        )
        return extracted if extracted is not None else ''
    except Exception as e:
        print("[-] Fetch {} webpage failed, error is {}".format(url, e))
        return ''
```

For the blog crawlers, the old project had one .py per blog. Now it's a unified XPath config: copy the HTML of a site's article list, and have the AI output the matching XPath expressions in the expected format.

```yaml
blog:
  - url: https://horizon3.ai/category/attack-research/attack-blogs/
    use_browser: false
    xpath:
      article: "//div[@id='feed1']/div[contains(@class, 'brxe-fvkcoy')]"
      title: ".//h4[contains(@class, 'brxe-heading')]/a/text()"
      link: ".//h4[contains(@class, 'brxe-heading')]/a/@href"
      desc: ".//div[contains(@class, 'brxe-giogpl')]/text()"
  - url: https://labs.watchtowr.com/
    use_browser: false
    xpath:
      article: "//div[contains(@class, 'gh-feed')]/article[contains(@class, 'gh-card')]"
      title: ".//h2[contains(@class, 'gh-card-title')]/text()"
      link: ".//a[contains(@class, 'gh-card-link')]/@href"
      desc: ".//div[contains(@class, 'gh-card-excerpt')]/text()"
  - url: "https://www.rapid7.com/blog/?blog_tags=Vulnerability%20Management,Research,Zero-Day,Vulnerability%20disclosure,Emergent%20Threat%20Response&blog_category=Threat%20Research,Vulnerabilities%20and%20Exploits"
    use_browser: true
    xpath:
      article: "//div[@id='blog-cards-list']//a[contains(@class, 'group')]"
      title: ".//h3[contains(@class, 'text-[23px]')]/text()"
      link: "./@href"
      desc: ".//p[contains(@class, 'eyebrow-card')]/text()"
  - url: https://projectdiscovery.io/blog/category/vulnerability-research/1
    use_browser: false
    xpath:
      article: "//div[contains(@class, 'grid')]/div[contains(@style, 'opacity')]"
      title: ".//h3[contains(@class, 'text-xl')]/text()"
      link: ".//a[contains(@class, 'group')]/@href"
      desc: ".//p[contains(@class, 'line-clamp-3')]/text()"
  - url: https://unit42.paloaltonetworks.com/category/threat-research/
    use_browser: true
    xpath:
      article: "//div[contains(@class, 'l-card') and contains(@class, 'l-card--transparent')]"
      title: ".//h5[contains(@class, 'post-title')]/text()"
      link: ".//a[.//h5[contains(@class, 'post-title')]]/@href"
      desc: ".//span[contains(@class, 'post-pub-date')]/time/text()"
  - url: https://unit42.paloaltonetworks.com/category/top-cyberthreats/
    use_browser: true
    xpath:
      article: "//div[contains(@class, 'l-card') and contains(@class, 'l-card--transparent')]"
      title: ".//h5[contains(@class, 'post-title')]/text()"
      link: ".//a[.//h5[contains(@class, 'post-title')]]/@href"
      desc: ".//span[contains(@class, 'post-pub-date')]/time/text()"
  - url: https://projectzero.google/archive.html
    use_browser: false
    xpath:
      article: "//section[contains(@class, 'post-content')]/div"
      title: ".//a[contains(@class, 'archive-link')]/text()"
      link: ".//a[contains(@class, 'archive-link')]/@href"
      desc: ".//p[contains(@class, 'post-date')]/text()"
```
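
On the consuming side, one generic crawler can apply the four XPaths from any entry above. A sketch with lxml (the function names are mine, and fetching the HTML is left to the `fetch_webpage` helpers shown earlier):

```python
# Apply one config entry's article/title/link/desc XPaths to a fetched page.
# parse_articles is an illustrative name, not the project's actual function.
from lxml import html as lxml_html


def parse_articles(page_html: str, xp: dict) -> list:
    """Return [{title, link, desc}, ...] for each article node on the page."""
    tree = lxml_html.fromstring(page_html)
    items = []
    for node in tree.xpath(xp["article"]):
        def first(expr):
            # XPath returns a list; take the first hit or an empty string.
            r = node.xpath(expr)
            return r[0].strip() if r else ""
        items.append({
            "title": first(xp["title"]),
            "link": first(xp["link"]),
            "desc": first(xp["desc"]),
        })
    return items
```

This is what makes "one .py per blog" collapse into "one YAML entry per blog": adding a site is just pasting its list-page HTML to the AI and dropping the returned XPaths into the config.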

I'll skip the Twitter and CVE crawlers for now and add them later if needed; I don't really use them myself at the moment, and it's enough that the pipeline runs end to end.

Pipeline

The old processing got split into separate stages:

  1. Ingest into the database and clean the data
  2. LLM summarization
  3. Notifications for important findings
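
The three stages above can be wired as plain functions; a minimal sketch (all names are mine, the LLM call is stubbed, and the keyword filter is just an illustrative stand-in for the real "important data" rule):

```python
# Three decoupled pipeline stages: clean -> summarize -> notify.
def clean(raw_items: list) -> list:
    """Stage 1: drop items without a link, strip whitespace, dedupe by link."""
    seen, out = set(), []
    for it in raw_items:
        link = (it.get("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            out.append({**it, "link": link, "title": (it.get("title") or "").strip()})
    return out


def summarize(item: dict, llm=lambda text: text[:60]) -> dict:
    """Stage 2: attach an LLM summary (stubbed here as simple truncation)."""
    return {**item, "summary": llm(item.get("desc", ""))}


def notify_important(items: list, keywords=("RCE", "0day")) -> list:
    """Stage 3: pick the items whose title mentions a watch keyword."""
    return [it for it in items if any(k.lower() in it["title"].lower() for k in keywords)]
```

Because each stage only takes and returns plain dicts, any one of them can be rerun or swapped out without touching the spiders, which is the decoupling the rewrite was after.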

WEB

Not much to introduce here: I just had Codex implement it with reference to my earlier threat-sail-view project, only more polished.

This time the whole frontend and backend were generated outright, and it worked on the first pass; after that it was just adjusting styling details.

The result is pretty good, at least nicer than what I built myself with Element.

The old version:

[screenshot: the old UI]

The new one looks much more polished, and it even got a detail page:

[screenshots: the new UI list and detail pages]

Closing Thoughts

From plain chat at the beginning to coding tools like Cursor and Claude Code, AI has developed incredibly fast. If you don't need to maintain the result, you can absolutely let AI write a complete project.

But if you do need to maintain it yourself, it's best to read through everything once it's written. Like the earlier refactor with CC: the features were fully implemented, but when I came back and wanted to change things myself, it was painful.


CatEye
https://liancccc.github.io/2026/03/24/技术/AI/CatEye/
Author: 守心
Published: March 24, 2026
License