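php - How to detect fake users / crawlers / cURL
Question
Some other websites may use cURL and a fake HTTP Referer to copy my website's content.
Is there a way to detect that a request comes from cURL rather than from a real web browser?
Best answer
There is no perfect way to prevent automated scraping: anything a human can do, a bot can imitate. There are, however, many measures that make scraping harder, enough to stop the vast majority of people, though they have limited effect against highly skilled geeks.
Here are several different kinds of anti-scraping techniques.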
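1. Sessions per IP
If a user opens 50 new sessions per minute, you can assume this user is probably a crawler that does not handle cookies. Of course, curl manages cookies perfectly well, but if you combine this check with a per-session visit counter (explained later), or if the crawler handles cookies badly, this method can be effective.
It is unlikely that 50 people sharing the same connection will be on your website at the same time. If that happens, treat it as a crawler and lock the site's pages until a captcha is filled in.
Steps:
1) Create 2 tables: one to store banned IPs, one to store IPs and sessions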
create table if not exists sessions_per_ip (
    ip int unsigned,
    session_id varchar(32),
    creation timestamp default current_timestamp,
    primary key(ip, session_id)
);

create table if not exists banned_ips (
    ip int unsigned,
    creation timestamp default current_timestamp,
    primary key(ip)
);
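2) At the beginning of your script, delete entries that are too old from both tables
3) Next, check whether the user's IP is banned (and set a flag to true if it is)
4) If it is not, count how many sessions exist for this IP
5) If there are too many sessions, insert the IP into the banned table and set the flag
6) If the IP and session are not yet in the sessions_per_ip table, insert them
I wrote a code sample to show my idea more clearly.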
<?php
try {
    // Some configuration (small values for demo)
    $max_sessions = 5;    // 5 sessions/ip simultaneously allowed
    $check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
    $lock_duration = 60;  // time to lock your website for this ip if max_sessions is reached

    // Mysql connection
    require_once("config.php");
    $dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Delete old entries in tables
    $query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
    $dbh->exec($query);
    $query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
    $dbh->exec($query);

    // Get useful info attached to our user...
    session_start();
    $ip = ip2long($_SERVER['REMOTE_ADDR']);
    $session_id = session_id();

    // Check if IP is already banned
    $banned = false;
    $count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
    if ($count > 0) {
        $banned = true;
    } else {
        // Count entries in our db for this ip
        $query = "select count(*) from sessions_per_ip where ip = '{$ip}'";
        $count = $dbh->query($query)->fetchColumn();
        if ($count >= $max_sessions) {
            // Lock website for this ip
            $query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
            $dbh->exec($query);
            $banned = true;
        }

        // Insert a new entry on our db if user's session is not already recorded
        $query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
        $dbh->exec($query);
    }

    // At this point you have a $banned if your user is banned or not.
    // The following code will allow us to test it...

    // We do not display anything now because we'll play with sessions:
    // to make the demo more readable I prefer going step by step like
    // this.
    ob_start();

    // Displays your current sessions
    echo "Your current sessions keys are : <br/>";
    $query = "select session_id from sessions_per_ip where ip = '{$ip}'";
    foreach ($dbh->query($query) as $row) {
        echo "{$row['session_id']}<br/>";
    }

    // Display and handle a way to create new sessions
    echo str_repeat('<br/>', 2);
    echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
    if (isset($_GET['new'])) {
        session_regenerate_id();
        session_destroy();
        header("Location: " . basename(__FILE__));
        die();
    }

    // Display if you're banned or not
    echo str_repeat('<br/>', 2);
    if ($banned) {
        echo '<span style="color:red;">You are banned: wait 60secs to be unbanned... a captcha must be more friendly of course!</span>';
        echo '<br/>';
        echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
    } else {
        echo '<span style="color:blue;">You are not banned!</span>';
        echo '<br/>';
        echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
    }
    ob_end_flush();
} catch (PDOException $e) {
    /*echo*/ $e->getMessage();
}
?>
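One caveat with the sample above: ip2long() only understands IPv4 and returns false for an IPv6 address, so every IPv6 visitor would end up under the same key. The following is a minimal sketch of one possible workaround, not part of the original answer. The helper name ip_to_key is hypothetical, and it assumes the ip columns of both tables are changed to something like varchar(32) so they can hold a hex string.

```php
<?php
// Hypothetical helper (not from the original answer): turns an IPv4 or IPv6
// address into a hex key that can be stored in a varchar(32) column.
function ip_to_key(string $address): ?string
{
    // Reject anything that is not a syntactically valid IP address
    if (filter_var($address, FILTER_VALIDATE_IP) === false) {
        return null;
    }

    // inet_pton() packs IPv4 into 4 bytes and IPv6 into 16 bytes
    $packed = inet_pton($address);

    // bin2hex() gives 8 hex characters for IPv4 and 32 for IPv6
    return bin2hex($packed);
}
```

With such a helper, $ip = ip_to_key($_SERVER['REMOTE_ADDR']); could replace the ip2long() call, and a null result could simply be treated as a request to reject.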
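2. Visit counter
If your user uses the same cookie to crawl your pages, you can use their session to block them. The idea is simple: is it plausible that your user visits 60 pages in 60 seconds?
Steps:
- Create an array in the user's session that will hold the time of each visit.
- Remove visits older than X seconds from this array
- Add a new entry for the current visit
- Count the entries in this array
- Ban the user if they have visited more than Y pages
Sample code: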
<?php
$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits

session_start();

// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false) {
    $_SESSION['visit_counter'] = array();
}

// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time) {
    if ((time() - $time) > $visit_counter_secs) {
        unset($_SESSION['visit_counter'][$key]);
    }
}

// we add the current visit into our array
$_SESSION['visit_counter'][] = time();

// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages) {
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);

echo <<< EOT
<a id="reload" href="#">Reload</a>
<script type="text/javascript">
$('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
});
</script>
EOT;
echo str_repeat('<br/>', 2);

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned) {
    echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
} else {
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
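The sliding-window logic of the visit counter can also be isolated in a small pure function, which makes it easier to test without a real session or clock. This is only a sketch under assumptions: the function name register_visit and its parameters are hypothetical, and it mirrors the sample's rule that a visit is dropped once it is more than the window's number of seconds old.

```php
<?php
// Hypothetical helper (not from the original answer): records one visit and
// reports whether the page limit for the time window has been exceeded.
// Returns [updated list of visit timestamps, banned flag].
function register_visit(array $visits, int $now, int $window_secs, int $max_pages): array
{
    // Keep only the visits that are at most $window_secs seconds old
    $recent = array_values(array_filter(
        $visits,
        function ($time) use ($now, $window_secs) {
            return ($now - $time) <= $window_secs;
        }
    ));

    // Add the current visit
    $recent[] = $now;

    // Banned when the window holds more than $max_pages visits
    return [$recent, count($recent) > $max_pages];
}
```

In a page, the call would be something like list($_SESSION['visit_counter'], $banned) = register_visit($_SESSION['visit_counter'] ?? [], time(), $visit_counter_secs, $visit_counter_pages);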
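3. Image download
Crawlers usually need to fetch a large amount of data in a short time, so they generally do not download the images on a page: images take too much bandwidth and slow the crawl down.
Here is how this method works (I think it is the most elegant and the easiest to implement):
Use mod_rewrite to hide a PHP script behind an image URL (.jpg / .png / … or another image format). This image should be present on every page you want to protect: it could be your website's logo, but pick a small image (because this image must not be cached).
Steps:
1. Add these lines to your .htaccess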
RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php
2. Create your logo.php with the security check
<?php
// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;

// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");

// displays image (image/jpeg is the registered MIME type for .jpg files)
header("Content-type: image/jpeg");
readfile("logo.jpg");
die();
3. Increment no_logo_count on every page that needs the added security, and check whether it has reached the limit.
Sample code:
<?php
$no_logo_limit = 5; // number of allowed pages without logo

// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false) {
    $_SESSION['no_logo_count'] = 0;
} else {
    $_SESSION['no_logo_count']++;
}

// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit) {
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "You did not load the image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT
<a id="reload" href="#">Reload</a>
<script type="text/javascript">
$('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
});
</script>
EOT;
echo str_repeat('<br/>', 2);

// Display "show image" link : note that we're using .jpg file
echo <<< EOT
<div id="image_container">
<a id="image_load" href="#">Load image</a>
</div>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course use <img src="logo.jpg" />
$('#image_load').click(function(e) {
    e.preventDefault();
    $('#image_load').html('<img src="logo.jpg" />');
});
</script>
EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned) {
    echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
} else {
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
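4. Cookie check
You can create a cookie on the JavaScript side to check whether your user executes JavaScript (a scraper using curl, for example, does not).
The idea is simple: it is roughly the same as the image check.
- Keep a counter in $_SESSION and increment it on each visit
- If the cookie (set in JavaScript) exists, reset the session value to 0
- If this value reaches a limit, ban the user
Code: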
<?php
$no_cookie_limit = 5; // number of allowed pages without cookie set check

// Start session and reset counter
session_start();
if (array_key_exists('cookie_check_count', $_SESSION) == false) {
    $_SESSION['cookie_check_count'] = 0;
}

// Initializes cookie (note: rename it to a more discreet name of course) or check cookie value
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42)) {
    // Cookie does not exist or is incorrect...
    $_SESSION['cookie_check_count']++;
} else {
    // Cookie is properly set so we reset counter
    $_SESSION['cookie_check_count'] = 0;
}

// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit) {
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT
<br/>
<a id="reload" href="#">Reload</a>
<br/>
<script type="text/javascript">
$('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
});
</script>
EOT;

// Display "set cookie" link
echo <<< EOT
<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course put the cookie set on a $(document).ready()
$('#cookie_link').click(function(e) {
    e.preventDefault();
    var expires = new Date();
    expires.setTime(new Date().getTime() + 3600000);
    document.cookie="cookie_check=42;expires=" + expires.toGMTString();
});
</script>
EOT;

// Display "unset cookie" link
echo <<< EOT
<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course put the cookie set on a $(document).ready()
$('#unset_cookie').click(function(e) {
    e.preventDefault();
    document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
});
</script>
EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned) {
    echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
} else {
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
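The demo sets the cookie on a click, while its comments say a real page would set it as soon as the page loads. A minimal sketch of that variant follows, assuming a hypothetical helper named cookie_check_script that builds the inline script on the PHP side; it does not depend on jQuery and uses toUTCString() in place of the deprecated toGMTString().

```php
<?php
// Hypothetical helper (not from the original answer): builds an inline script
// that sets the check cookie as soon as the browser executes it.
function cookie_check_script(string $name, string $value, int $ttl_ms = 3600000): string
{
    // json_encode() produces a safely quoted JavaScript string literal
    $pair = json_encode($name . '=' . $value . ';expires=');

    return '<script type="text/javascript">'
        . '(function () {'
        . 'var expires = new Date(Date.now() + ' . $ttl_ms . ');'
        . 'document.cookie = ' . $pair . ' + expires.toUTCString();'
        . '})();'
        . '</script>';
}
```

A protected page could then simply echo cookie_check_script('cookie_check', '42'); in its output. A client that never runs JavaScript will not send the cookie back, so its counter keeps increasing.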
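5. Protection against proxies
Here is some information about the different kinds of proxies that can be found on the web:
- A "normal" proxy exposes information about the user's connection (notably their IP).
- An anonymous proxy does not expose the IP, but announces in its headers that a proxy is being used.
- A high-anonymous proxy exposes neither the user's IP nor any information that a browser would not send.
It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymous proxies.
Some $_SERVER variables may contain the following keys, particularly if your user is behind a proxy (list taken from another question):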
- CLIENT_IP
- FORWARDED
- FORWARDED_FOR
- FORWARDED_FOR_IP
- HTTP_CLIENT_IP
- HTTP_FORWARDED
- HTTP_FORWARDED_FOR
- HTTP_FORWARDED_FOR_IP
- HTTP_PC_REMOTE_ADDR
- HTTP_PROXY_CONNECTION
- HTTP_VIA
- HTTP_X_FORWARDED
- HTTP_X_FORWARDED_FOR
- HTTP_X_FORWARDED_FOR_IP
- HTTP_X_IMFORWARDS
- HTTP_XROXY_CONNECTION
- VIA
- X_FORWARDED
- X_FORWARDED_FOR
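If you detect any of the above keys in your $_SERVER variable, you can apply a corresponding anti-proxy policy as part of your anti-scraping measures.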
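A check for these keys can be sketched as follows. The function name find_proxy_headers is hypothetical, and taking the array as a parameter (instead of reading $_SERVER directly) is a choice made here only to keep the example testable. Note that if your own site sits behind a reverse proxy or load balancer, some of these keys will be present on every request, so a match is a hint rather than proof.

```php
<?php
// Hypothetical helper (not from the original answer): returns the
// proxy-related keys that are present in a $_SERVER-like array.
function find_proxy_headers(array $server): array
{
    // Same keys as in the list above
    $proxy_keys = [
        'CLIENT_IP', 'FORWARDED', 'FORWARDED_FOR', 'FORWARDED_FOR_IP',
        'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_FORWARDED_FOR',
        'HTTP_FORWARDED_FOR_IP', 'HTTP_PC_REMOTE_ADDR',
        'HTTP_PROXY_CONNECTION', 'HTTP_VIA', 'HTTP_X_FORWARDED',
        'HTTP_X_FORWARDED_FOR', 'HTTP_X_FORWARDED_FOR_IP',
        'HTTP_X_IMFORWARDS', 'HTTP_XROXY_CONNECTION', 'VIA',
        'X_FORWARDED', 'X_FORWARDED_FOR',
    ];

    // Keep the keys that exist in $server, in the order of the list above
    return array_values(array_intersect($proxy_keys, array_keys($server)));
}
```

For example, $suspect = count(find_proxy_headers($_SERVER)) > 0; could feed into the same banning logic as the earlier checks, or merely lower the limits applied to that visitor.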