Hadoop Website Log Analysis
Published: 2019-06-18


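I. Project Requirements

  • In this article, "logs" means web logs only. There is no precise definition; the term may include, but is not limited to, user access logs produced by front-end web servers such as apache, lighttpd, nginx, and tomcat, as well as logs written by the web applications themselves.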

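II. Requirements Analysis: KPI Design

 PV (PageView): page view count

 IP: count of unique IPs visiting a page
 Time: PV per hour
 Source: count by referring domain
 Browser: count by the user's access device (browser)

Below I focus on the browser statistics.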

III. Analysis Process

1. One nginx record from the log

222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

2. Breaking down the record above

remote_addr: the client IP address, 222.68.172.190

remote_user: the client user name, -
time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]
request: the request method, URL, and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
status: the response status (200 means success), 200
body_bytes_sent: the size of the response body sent to the client, 19939
http_referer: the page the visitor followed a link from, "http://www.angularjs.cn/A00n"
http_user_agent: information about the client browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
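The field list above can also be captured with a single regular expression, which keeps the quoted fields (request, referer, user agent) intact even though they contain spaces. This is only a sketch: the class name AccessLogRegexSketch and the pattern are my own assumptions, written against the one sample record above rather than every nginx log_format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class AccessLogRegexSketch {

    // One capture group per field of the sample record; the second column
    // (always "-" in the sample) is matched literally and not captured.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+)\\s+-\\s+(\\S+)\\s+\\[([^\\]]+)\\]\\s+\"([^\"]*)\"\\s+(\\d{3})\\s+(\\d+|-)"
        + "\\s+\"([^\"]*)\"\\s+\"([^\"]*)\"\\s*$");

    /** Returns the fields keyed by nginx variable name, or null if the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("remote_addr", m.group(1));
        fields.put("remote_user", m.group(2));
        fields.put("time_local", m.group(3));
        fields.put("request", m.group(4));
        fields.put("status", m.group(5));
        fields.put("body_bytes_sent", m.group(6));
        fields.put("http_referer", m.group(7));
        fields.put("http_user_agent", m.group(8));
        return fields;
    }

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
            + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" "
            + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/29.0.1547.66 Safari/537.36\"";
        Map<String, String> fields = parse(line);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Returning null for lines that do not match lets a caller skip malformed records instead of failing on them.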

3. Parsing the record above in Java (splitting on spaces)

String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
String[] elementList = line.split(" ");
// Print each token with its index.
for (int i = 0; i < elementList.length; i++) {
    System.out.println(i + " : " + elementList[i]);
}

Test result:

0 : 222.68.172.190
1 : -
2 : -
3 : [18/Sep/2013:06:49:57
4 : +0000]
5 : "GET
6 : /images/my.jpg
7 : HTTP/1.1"
8 : 200
9 : 19939
10 : "http://www.angularjs.cn/A00n"
11 : "Mozilla/5.0
12 : (Windows
13 : NT
14 : 6.1)
15 : AppleWebKit/537.36
16 : (KHTML,
17 : like
18 : Gecko)
19 : Chrome/29.0.1547.66
20 : Safari/537.36"
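Tokens 3 and 4 in the result above are the two halves of time_local. For the Time KPI (PV per hour), they can be rejoined and parsed into an hour bucket. The sketch below is one possible way to do it with java.time; the class name HourBucketSketch and the yyyyMMddHH key format are assumptions of mine, not something the original code defines.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of the original project.
class HourBucketSketch {

    // nginx time_local, e.g. 18/Sep/2013:06:49:57 +0000 (English month names).
    private static final DateTimeFormatter NGINX_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Assumed key format for the hourly bucket.
    private static final DateTimeFormatter HOUR_KEY =
        DateTimeFormatter.ofPattern("yyyyMMddHH", Locale.ENGLISH);

    /** Rebuilds time_local from tokens 3 and 4 and returns its hour bucket. */
    static String hourKey(String line) {
        String[] tokens = line.split(" ");
        if (tokens.length < 5) {
            throw new IllegalArgumentException("not an access log line: " + line);
        }
        String datePart = tokens[3].substring(1);                         // drop leading '['
        String zonePart = tokens[4].substring(0, tokens[4].length() - 1); // drop trailing ']'
        return OffsetDateTime.parse(datePart + " " + zonePart, NGINX_TIME).format(HOUR_KEY);
    }

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
            + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 \"-\" \"-\"";
        System.out.println(hourKey(line)); // prints 2013091806
    }
}
```

A mapper for the Time KPI could emit this key with a value of 1, in the same way the browser mapper emits the browser family.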
4. Code for the Kpi entity class:
public class Kpi {

    private String remote_addr;      // client IP address
    private String remote_user;      // client user name; "-" means not set and is ignored
    private String time_local;       // access time and time zone
    private String request;          // requested URL and HTTP protocol
    private String status;           // response status; 200 means success
    private String body_bytes_sent;  // size of the response body sent to the client
    private String http_referer;     // page the visitor followed a link from
    private String http_user_agent;  // information about the client browser
    private String method;           // request method, e.g. GET or POST
    private String http_version;     // HTTP version

    public String getMethod() {
        return method;
    }

    public void setMethod(String method) {
        this.method = method;
    }

    public String getHttp_version() {
        return http_version;
    }

    public void setHttp_version(String http_version) {
        this.http_version = http_version;
    }

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }

    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    @Override
    public String toString() {
        return "Kpi [remote_addr=" + remote_addr + ", remote_user="
                + remote_user + ", time_local=" + time_local + ", request="
                + request + ", status=" + status + ", body_bytes_sent="
                + body_bytes_sent + ", http_referer=" + http_referer
                + ", http_user_agent=" + http_user_agent + ", method=" + method
                + ", http_version=" + http_version + "]";
    }
}
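5. The Kpi utility class
package org.aaa.kpi;

public class KpiUtil {

    /**
     * Converts one log line into a Kpi object.
     *
     * @param line one record from the log
     * @return the parsed Kpi, or null if the line has too few fields
     * @author tianbx
     */
    public static Kpi transformLineKpi(String line) {
        String[] elementList = line.split(" ");
        // A well-formed record yields at least 12 space-separated tokens.
        if (elementList.length < 12) {
            return null;
        }
        Kpi kpi = new Kpi();
        kpi.setRemote_addr(elementList[0]);
        kpi.setRemote_user(elementList[1]);
        kpi.setTime_local(elementList[3].substring(1)); // drop the leading '['
        kpi.setMethod(elementList[5].substring(1));     // drop the leading '"'
        kpi.setRequest(elementList[6]);
        kpi.setHttp_version(elementList[7]);
        kpi.setStatus(elementList[8]);
        kpi.setBody_bytes_sent(elementList[9]);
        kpi.setHttp_referer(elementList[10]);
        // The user agent itself contains spaces, so rejoin every token from
        // index 11 onward instead of keeping only the first two.
        StringBuilder userAgent = new StringBuilder(elementList[11]);
        for (int i = 12; i < elementList.length; i++) {
            userAgent.append(' ').append(elementList[i]);
        }
        kpi.setHttp_user_agent(userAgent.toString());
        return kpi;
    }
}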

6. Algorithm model: parallel algorithm

Browser: count by the user's access device (browser)

- Map: {key: $http_user_agent, value: 1}
- Reduce: {key: $http_user_agent, value: sum of the values}
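Before wiring this into Hadoop, the same map and reduce steps can be tried in memory. The sketch below is only an illustration of the counting model: the class name BrowserCountSketch and the sample browser names are made up, and a real job distributes the two phases across the cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the Map/Reduce model above.
class BrowserCountSketch {

    /**
     * "Map" emits the pair (browser, 1) for every record;
     * "reduce" adds up the ones that share a key.
     */
    static Map<String, Integer> count(List<String> browsers) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String browser : browsers) {
            totals.merge(browser, 1, Integer::sum); // emit 1, then sum per key
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up input: one browser name per log record.
        List<String> sample = Arrays.asList("Chrome", "Firefox", "Chrome", "IE", "Chrome");
        System.out.println(count(sample)); // prints {Chrome=3, Firefox=1, IE=1}
    }
}
```

Because addition is associative and commutative, the same summing step can also run as a combiner on each mapper's local output.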
7. The map-reduce analysis code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
// Kpi and KpiUtil are the classes from steps 4 and 5; adjust these two
// imports to the packages they actually live in.
import org.hmahout.kpi.entity.Kpi;
import org.hmahout.kpi.util.KpiUtil;

import cz.mallat.uasparser.UASparser;
import cz.mallat.uasparser.UserAgentInfo;

public class KpiBrowserSimpleV {

    public static class KpiBrowserSimpleMapper extends MapReduceBase
            implements Mapper<Object, Text, Text, IntWritable> {

        UASparser parser = null;

        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            Kpi kpi = KpiUtil.transformLineKpi(value.toString());
            if (kpi != null && kpi.getHttp_user_agent() != null) {
                // Create the parser lazily, once per mapper instance.
                if (parser == null) {
                    parser = new UASparser();
                }
                UserAgentInfo info = parser.parseBrowserOnly(kpi.getHttp_user_agent());
                // Emit (browser, 1); unrecognized agents are grouped under "unknown".
                if ("unknown".equals(info.getUaName())) {
                    out.collect(new Text(info.getUaName()), new IntWritable(1));
                } else {
                    out.collect(new Text(info.getUaFamily()), new IntWritable(1));
                }
            }
        }
    }

    public static class KpiBrowserSimpleReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> value, OutputCollector<Text, IntWritable> out,
                Reporter reporter) throws IOException {
            // Sum all the counts emitted for this browser.
            IntWritable sum = new IntWritable(0);
            while (value.hasNext()) {
                sum.set(sum.get() + value.next().get());
            }
            out.collect(key, sum);
        }
    }

    public static void main(String[] args) throws IOException {
        String input = "hdfs://127.0.0.1:9000/user/tianbx/log_kpi/input";
        String output = "hdfs://127.0.0.1:9000/user/tianbx/log_kpi/browerSimpleV";

        JobConf conf = new JobConf(KpiBrowserSimpleV.class);
        conf.setJobName("KpiBrowserSimpleV");

        String url = "classpath:";
        conf.addResource(url + "/hadoop/core-site.xml");
        conf.addResource(url + "/hadoop/hdfs-site.xml");
        conf.addResource(url + "/hadoop/mapred-site.xml");

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(KpiBrowserSimpleMapper.class);
        // The reducer only sums, so it can double as the combiner.
        conf.setCombinerClass(KpiBrowserSimpleReducer.class);
        conf.setReducerClass(KpiBrowserSimpleReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        JobClient.runJob(conf);
        System.exit(0);
    }
}

8. Contents of the output file log_kpi/browerSimpleV

AOL Explorer 1

Android Webkit 123
Chrome 4867
CoolNovo 23
Firefox 1700
Google App Engine 5
IE 1521
Jakarta Commons-HttpClient 3
Maxthon 27
Mobile Safari 273
Mozilla 130
Openwave Mobile Browser 2
Opera 2
Pale Moon 1
Python-urllib 4
Safari 246
Sogou Explorer 157
unknown 4685
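The output above is sorted by key, so the busiest browsers are scattered through the list. A small post-processing step can rank the rows by count. The sketch below assumes each row is a browser name followed by whitespace and a number, as shown above; the class name BrowserRankingSketch is hypothetical, and the sample rows in main are copied from the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of the original project.
class BrowserRankingSketch {

    // Browser names may contain spaces, so the count is whatever digits end the row.
    private static final Pattern ROW = Pattern.compile("^(.*\\S)\\s+(\\d+)$");

    static final class Row {
        final String browser;
        final int count;

        Row(String browser, int count) {
            this.browser = browser;
            this.count = count;
        }
    }

    /** Parses "name count" rows, skips rows that do not fit, and sorts by count, highest first. */
    static List<Row> rank(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                rows.add(new Row(m.group(1), Integer.parseInt(m.group(2))));
            }
        }
        rows.sort(Comparator.comparingInt((Row r) -> r.count).reversed());
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Android Webkit 123", "Chrome 4867", "Firefox 1700",
            "IE 1521", "Mobile Safari 273", "unknown 4685");
        for (Row row : rank(sample)) {
            System.out.println(row.count + "\t" + row.browser);
        }
    }
}
```

Matching on whitespace rather than a fixed delimiter means the same code accepts both the tab-separated file Hadoop writes and a space-separated copy of it.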

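9. Plotting the result with R

library(ggplot2)
# borwer.txt is assumed to be a comma-separated copy of the job output; use sep="\t" for the raw file.
data <- read.table(file="borwer.txt", header=FALSE, sep=",")
names(data) <- c("browser", "num")
# num already holds the totals, so the bars need stat="identity".
ggplot(data, aes(x=browser, y=num)) + geom_bar(stat="identity")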

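Problems to Solve

1. Exclude crawlers and automated clicks to counter cheating.

Solution: have the page detect whether the mouse moves.

2. How can images be excluded from the page view count?

3. How can fake clicks be excluded from the page view count?

4. Which search engine did the visit come from?

5. Which keyword did the visitor click to arrive?

6. Where did the visit come from?

7. Which browser was used for the visit?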
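Two of the questions above can be approached directly from fields already in the log. For question 2, requests for static files can be dropped before counting PV; for the questions about where a visit came from, the host of http_referer is a natural starting point. The sketch below is one possible approach, not the author's solution: the class name PageViewFilterSketch and the extension list are my assumptions and would need tuning for a real site.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the original project.
class PageViewFilterSketch {

    // Assumed list of static-resource extensions; extend it as needed.
    private static final Set<String> STATIC_EXT = new HashSet<>(
        Arrays.asList("jpg", "jpeg", "png", "gif", "ico", "css", "js"));

    /** Returns false for requests whose path ends in a static-resource extension. */
    static boolean isPageView(String requestPath) {
        String path = requestPath;
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query); // ignore the query string
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) {
            return true; // no file extension in the last path segment
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return !STATIC_EXT.contains(ext);
    }

    /** Returns the referer's host, or "-" when it is missing or not a valid URI. */
    static String refererHost(String referer) {
        String cleaned = referer.replace("\"", ""); // the split tokens keep their quotes
        try {
            String host = new URI(cleaned).getHost();
            return host == null ? "-" : host;
        } catch (URISyntaxException e) {
            return "-";
        }
    }

    public static void main(String[] args) {
        System.out.println(isPageView("/images/my.jpg")); // prints false
        System.out.println(isPageView("/A00n"));          // prints true
        System.out.println(refererHost("\"http://www.angularjs.cn/A00n\"")); // prints www.angularjs.cn
    }
}
```

In the mapper, isPageView could be checked against the value from kpi.getRequest() before emitting anything, and refererHost could supply the key for the Source KPI.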

Reposted from: https://my.oschina.net/winHerson/blog/211570
