Web Server Log Analysis

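Today a friend handed me another challenge: apparently the web server's IP records had to be turned over to the police. Long ago, my university project did include a part on Apache log analysis. That was a long time back, but this was a good chance to get back into practice. It was also my first time working on a FreeBSD machine!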

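It turned out my friend maintains the web pages and does front-end GUI work, and doesn't know much about back-end administration and setup. When I heard the keyword PuTTY I was relieved: a command line meant it was surely a Unix-like environment. I asked for a TeamViewer session and, sure enough, the screen was sitting in ee with access_log open, and it was in the Apache log format. Then the problems started. The log file was a frightening 1.6 GB, not split into monthly files the way I was used to. On top of that, the requester wanted about one month's worth of records, but the range ran from the middle of one month to the middle of the next. I thought of many ways to cut it, only to find the process was eating too much time. In the end the result was not perfect and only got the job half done, so I am writing down the process to make future work easier. All the logs below come from my own machine, used to recreate the situation.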

Sample Apache log from my own server

192.0.84.33 - - [26/Mar/2015:23:36:14 +0800] "HEAD /blog HTTP/1.1" 301 204 "-" "jetmon/1.0 (Jetpack Site Uptime Monitor by WordPress.com)"
192.0.84.33 - - [26/Mar/2015:23:36:14 +0800] "HEAD /blog/ HTTP/1.1" 200 275 "-" "jetmon/1.0 (Jetpack Site Uptime Monitor by WordPress.com)"
66.249.65.173 - - [26/Mar/2015:23:38:00 +0800] "GET /blog/2014/03/29/%E6%96%B0%E7%AB%B9-sogo-%E4%B8%80%E8%8A%B1%E4%BA%AD%E4%B8%BC%E9%A3%AF%E5%AE%9A%E9%A3%9F-%E6%97%A5%E5%BC%8F%E6%96%99%E7%90%86/ HTTP/1.1" 200 15707 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.177 - - [26/Mar/2015:23:38:18 +0800] "GET /blog/wp-content/gallery/200501010/1127532817.jpg HTTP/1.1" 200 217337 "-" "Googlebot-Image/1.0"
123.125.71.58 - - [26/Mar/2015:23:38:27 +0800] "GET / HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2"

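1. Cut out the months
My friend's system keeps all the logs in one file, so the first problem was that merely opening it with vi or ee was trouble, all the more so on a production machine. So I quickly cut out the two months that bracket the requested range. This relies on the date format of the Apache log: grep matches the month and year, and the output is redirected to a temporary area.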

ubuntu:~$ grep '../Mar/2015' /var/log/apache2/access.log > /tmp/Mar.raw
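Grepping one month at a time still leaves the mid-month to mid-month range to trim by hand. As a minimal sketch, assuming the timestamp always appears in the `[dd/Mon/yyyy:hh:mm:ss +zzzz]` form shown in the sample log, the range could instead be cut in one pass by parsing each date and comparing it. The function name `filter_by_date` and the sample dates are hypothetical and not part of the original procedure. Note that `%b` parses English month names only under an English or C locale.

```python
import re
from datetime import datetime

# Matches the timestamp of an Apache access log line,
# e.g. [26/Mar/2015:23:36:14 +0800]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def filter_by_date(lines, start, end):
    """Yield the lines whose timestamp t satisfies start <= t < end.

    The time zone offset is ignored, so start and end are compared with
    the local time exactly as it is written in the log.
    """
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognisable timestamp
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")
        if start <= stamp < end:
            yield line


# Two lines modelled on the sample log (user agents shortened).
sample = [
    '192.0.84.33 - - [26/Mar/2015:23:36:14 +0800] "HEAD /blog HTTP/1.1" 301 204 "-" "jetmon/1.0"\n',
    '123.125.71.58 - - [26/Mar/2015:23:38:27 +0800] "GET / HTTP/1.1" 302 0 "-" "Mozilla/5.0"\n',
]

# Hypothetical range: from 23:38:00 on 26 March up to, but not including, 15 April.
kept = list(filter_by_date(sample, datetime(2015, 3, 26, 23, 38), datetime(2015, 4, 15)))
```

Because the function is a generator, passing an open file object such as `open('/var/log/apache2/access.log', errors='replace')` reads the log line by line, so even a 1.6 GB file is never loaded into memory at once.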

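2. Apache log to CSV
This was the most time-consuming part. I originally wanted to use awk to build a comma-separated CSV file myself, but with the default whitespace field separator there were too many $ fields to line up. Unable to find a consistent pattern, I went looking online instead: the Apache log is a fixed format, so someone must have walked this path before. Sure enough, there is a Perl script that does the job, and as luck would have it the web server already had Perl installed. At that point the problem was essentially solved. The output looks like the header picture of this post!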

ubuntu:~$ perl accesslog2csv.pl /tmp/Mar.raw > /tmp/Mar.csv

Source code:

/accesslog2csv/blob/master/accesslog2csv.pl
#!/usr/bin/perl
 
# 
# @file
# Converter tool, from Apache Common Log file to CSV.
# 
# All code is released under the GNU General Public License.
# See COPYRIGHT.txt and LICENSE.txt.
#
 
if (defined $ARGV[0] && $ARGV[0] =~ /^(?:-h|--help)$/) {
  print "Usage: $0 access_log_file > csv_output_file.csv\n";
  print "   Or, $0 < access_log_file > csv_output_file.csv\n";
  print "   Or, $0 < access_log_file > csv_output_file.csv 2> invalid_lines.txt\n";
  exit(0);
}
 
%MONTHS = ( 'Jan' => '01', 'Feb' => '02', 'Mar' => '03', 'Apr' => '04', 'May' => '05', 'Jun' => '06',
  'Jul' => '07', 'Aug' => '08', 'Sep' => '09', 'Oct' => '10', 'Nov' => '11', 'Dec' => '12' );
 
print STDOUT "\"Host\",\"Log Name\",\"Date Time\",\"Time Zone\",\"Method\",\"URL\",\"Response Code\",\"Bytes Sent\",\"Referer\",\"User Agent\"\n";
$line_no = 0;
 
while (<>) {
  ++$line_no;
  if (/^([\w\.:-]+)\s+([\w\.:-]+)\s+([\w\.-]+)\s+\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+)\s?([\w:\+-]+)]\s+"(\w+)\s+(\S+)\s+HTTP\/1\.\d"\s+(\d+)\s+([\d-]+)((\s+"([^"]+)"\s+")?([^"]+)")?$/) {
    $host = $1;
    $other = $2;
    $logname = $3;
    $day = $4;
    $month = $MONTHS{$5};
    $year = $6;
    $hour = $7;
    $min = $8;
    $sec = $9;
    $tz = $10;
    $method = $11;
    $url = $12;
    $code = $13;
    if ($14 eq '-') {
      $bytesd = 0;
    } else {
      $bytesd = $14;
    }
    $referer = $17;
    $ua = $18;
 
    print STDOUT "\"$host\",\"$logname\",\"$year-$month-$day $hour:$min:$sec\",\"GMT$tz\",\"$method\",\"$url\",$code,$bytesd,\"$referer\"\,\"$ua\"\n";
  } else {
    print STDERR "Invalid Line at $line_no: $_";
  }
}

Reference: Converting Apache/Tomcat Access Logs to CSV
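The awk difficulty mentioned earlier comes from splitting on whitespace: the timestamp, the request and the user agent all contain spaces, so the field count shifts from line to line. As a rough Python counterpart to the Perl script (a sketch under the assumption that lines follow the combined format of the sample log, not a port of the script), a regular expression can capture the bracketed and quoted fields whole, and the standard `csv` module takes care of quoting. The names `LOG_LINE`, `FIELDS`, `parse_line` and `to_csv` are hypothetical.

```python
import csv
import io
import re

# One named group per field; the referer and user agent pair is optional
# so that plain common-format lines are accepted as well.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)

# Columns written to the CSV; ident and protocol are parsed but not exported.
FIELDS = ["host", "user", "time", "method", "url", "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    row = match.groupdict(default="")
    if row["size"] == "-":
        row["size"] = "0"  # "-" means no body was sent
    return row


def to_csv(lines, out):
    """Write every matching line to `out` as CSV and return the number skipped."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    skipped = 0
    for line in lines:
        row = parse_line(line)
        if row is None:
            skipped += 1
        else:
            writer.writerow(row)
    return skipped
```

Unlike the hand-built `print` in the Perl script, `csv.DictWriter` escapes embedded quotes and commas, which matters if a URL or user agent happens to contain them.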

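3. Getting the file off the server
Next came getting the .csv file out. I tried tftp, but there seemed to be NAT between my computer and the server, and I couldn't get around it. In the end I put the file under the web directory and fetched it through the browser. The only catch was that the path differs from the /var/www/ I'm used to on Linux, so it took a while to find the right folder.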

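Postscript:
In my last job interview it already came up that I need to get more familiar with FreeBSD; it is something I have wanted to play with but never got around to. Also, the tools I've found lately are all shared on GitHub, so that should be my next thing to learn, and then there are Perl and Python too... it seems there is still a lot to learn!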

Written by: Scottj

Scott (史考特). Likes 3C gadgets and taking photos.