
How to Block AI Spiders and Prevent Your Articles from Being Scraped

Published: 2024-10-16 12:49:57 | Category: Nginx

I'll start with the cheapest and most straightforward options; no padding, let's get straight to it.

Method 1: Host your domain's DNS on Cloudflare and block AI crawlers with one click

If you can't reach Cloudflare, you'll have to sort out a proxy yourself.
(For sites in mainland China this has almost no effect on access speed; some people assume a domestic DNS provider is faster, but in practice the speeds are about the same.)

Method 2: Block AI crawlers in the BT Panel (宝塔) firewall by adding the following User-Agent keywords to its UA blacklist (I'm running a cracked copy of BT Panel, so I'm not sure whether the free edition exposes this setting). A rough nginx equivalent follows the list, for anyone not using the panel.

Amazonbot
ClaudeBot
PetalBot
gptbot
Ahrefs
Semrush
Imagesift
Teoma
ia_archiver
twiceler
MSNBot
Scrubby
Robozilla
Gigabot
yahoo-mmcrawler
yahoo-blogs/v3.9
psbot
Scrapy
SemrushBot
AhrefsBot
Applebot
AspiegelBot
DotBot
DataForSeoBot
java
MJ12bot
python
seo
Censys
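If you're not using the BT Panel firewall, the same blacklist can be enforced directly in nginx. Below is a minimal sketch (not part of the panel setup): it matches the keywords above case-insensitively against the User-Agent and returns 403, and it goes inside the server block of the site's configuration. Ahrefs and Semrush already match AhrefsBot and SemrushBot as substrings, and broad keywords such as java, python and seo may also catch legitimate clients.

# Minimal sketch: block the UA keywords listed above at the nginx level
# (place inside the server { } block; broad terms like "java"/"python"/"seo" can cause false positives)
if ($http_user_agent ~* "Amazonbot|ClaudeBot|PetalBot|gptbot|Ahrefs|Semrush|Imagesift|Teoma|ia_archiver|twiceler|MSNBot|Scrubby|Robozilla|Gigabot|yahoo-mmcrawler|yahoo-blogs|psbot|Scrapy|Applebot|AspiegelBot|DotBot|DataForSeoBot|java|MJ12bot|python|seo|Censys") {
    return 403;
}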




Method 3: Copy the block below, save it as robots.txt, and upload it to your site's root directory.

User-agent: Ahrefs
Disallow: /
User-agent: Semrush
Disallow: /
User-agent: Imagesift
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: gptbot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: Baiduspider
Disallow:
User-agent: Sosospider
Disallow:
User-agent: sogou spider
Disallow:
User-agent: YodaoBot
Disallow:
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Slurp
Disallow:
User-agent: Teoma
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: twiceler
Disallow: /
User-agent: MSNBot
Disallow: /
User-agent: Scrubby
Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: Gigabot
Disallow: /
User-agent: googlebot-image
Disallow:
User-agent: googlebot-mobile
Disallow:
User-agent: yahoo-mmcrawler
Disallow: /
User-agent: yahoo-blogs/v3.9
Disallow: /
User-agent: psbot
Disallow:
User-agent: dotbot
Disallow: /
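Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, but the scrapers you most want to stop simply ignore it, which is why Methods 2 and 4 exist. Optionally, you can also keep the constant crawler requests for robots.txt out of your access log with a small nginx location block; a minimal sketch, assuming a standard server block:

# Optional sketch: serve robots.txt without logging the constant crawler polling
location = /robots.txt {
    access_log off;
    log_not_found off;
}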



Method 4: Prevent your articles from being scraped (save the following in the site's nginx configuration file in BT Panel)

# Block scraping by Scrapy and similar tools
if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
    return 403;
}

# Block the listed User-Agents as well as requests with an empty User-Agent
if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms|^$") {
    return 403;
}

# Block request methods other than GET, HEAD and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}



After adding the snippet, save and restart nginx; these spiders and scanning tools will then get 403 Forbidden whenever they hit the site.
Note: if you publish to your site with 火车头 (LocoySpider), the snippet above will also return 403 and publishing will fail. If you want to keep publishing with 火车头, use the variant below instead; the only difference is that requests with an empty User-Agent are no longer blocked:

# Block scraping by Scrapy and similar tools
if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
    return 403;
}

# Block the listed User-Agents (empty User-Agents are left alone so 火车头 can still publish)
if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms") {
    return 403;
}

# Block request methods other than GET, HEAD and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
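As a side note, nginx's if directive is easy to get wrong and long UA regexes are simpler to maintain in one place, so an alternative worth considering is a map. This is only a sketch, not part of the original tutorial; example.com is a placeholder server name, and the pattern shown is just the short list from the first rule:

# Sketch of a map-based variant: define the flag once in the http { } block
map $http_user_agent $bad_bot {
    default 0;
    "~*(Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)" 1;
    # extra regex lines can be added here for the longer UA lists above
}

# ...then return 403 from each server { } block that should be protected
server {
    listen 80;
    server_name example.com;  # placeholder

    if ($bad_bot) {
        return 403;
    }
}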

Once everything is configured, run a simulated crawl (for example, send a request with one of the blocked User-Agents and check that it gets a 403) to make sure no legitimate spiders were caught by mistake. Note that the rules above deliberately leave out the six most common search-engine spiders: Baidu (Baiduspider), Google (Googlebot), Bing (bingbot), Sogou (Sogou web spider), 360 (360Spider) and Shenma (YisouSpider). Common crawler User-Agents and what they are used for:

FeedDemon: content scraping
BOT/0.1 (BOT for JCE): SQL injection
CrawlDaddy: SQL injection
Java: content scraping
Jullo: content scraping
Feedly: content scraping
UniversalFeedParser: content scraping
ApacheBench: CC attack tool
Swiftbot: useless crawler
YandexBot: useless crawler
AhrefsBot: useless crawler
jikeSpider: useless crawler
MJ12bot: useless crawler
ZmEu: phpMyAdmin vulnerability scanning
WinHttp: scraping / CC attacks
EasouSpider: useless crawler
HttpClient: TCP attacks
Microsoft URL Control: scanning
YYSpider: useless crawler
jaunty: WordPress brute-force scanner
oBot: useless crawler
Python-urllib: content scraping
Indy Library: scanning
FlightDeckReports Bot: useless crawler
Linguee Bot: useless crawler





