CCF BDCI / Alibaba Tianchi competition summary: accurately locating the shop a user is in inside a mall

Posted on 2018-01-03 | Category: Machine learning |

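Competition background

With the rapid spread of mobile payment, everyday life keeps getting more convenient. When you walk into a restaurant in a mall, your phone can pop up that restaurant's coupons; when you walk into a clothing store, it can recommend items in that store you are likely to like; when you pass a jewelry store, it can tell you that a diamond ring you have wanted for a long time is now in stock; and when you leave the mall's parking garage, it can pay the parking fee automatically, with your permission. All of these services rely on big-data mining and machine learning behind the scenes. Delivering the most useful service to users at the right time and in the right place is the new battleground for internet companies as they expand their intelligent services.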

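The task

Mine the provided data along the shop, user, Wi-Fi and other dimensions to build features, construct negative samples for the training data yourself, and train a suitable machine learning model. For the test data, use an algorithm or model to determine accurately which shop the user is in, based on the user's location and on environment information such as Wi-Fi at that moment.

For this competition we provide detailed data for roughly 100 malls in August 2017 (roughly 500 malls in July and August 2017 for the second round), including user location behavior and in-mall shop data (anonymized). Teams need to mine this data and carry out the necessary machine learning training. We also provide in-mall user data from September 2017 for evaluation, to test whether your algorithm can accurately identify the shop the user was in at the time.

This is an indoor positioning problem: given the environment information at the time of a transaction (including GPS and Wi-Fi information), determine the shop where the transaction took place.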
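
One common way to build the negative samples the task mentions is to recast shop prediction as binary classification: pair each transaction with a few candidate shops and label the pair 1 only when the candidate is the shop where the transaction really happened. The sketch below illustrates that idea only. The record layout, the helper names and the candidate rule (shops previously seen with any Wi-Fi access point in the record) are hypothetical; they are not taken from the competition data or from the solution described in this post.

```python
from collections import defaultdict

# Hypothetical toy records: (true shop id, {wifi bssid: signal strength in dBm}).
history = [
    ("s_1", {"b_1": -40, "b_2": -70}),
    ("s_2", {"b_2": -45, "b_3": -60}),
    ("s_1", {"b_1": -42}),
]


def build_wifi_index(records):
    """Map each Wi-Fi access point to the set of shops it was observed in."""
    index = defaultdict(set)
    for shop_id, wifi in records:
        for bssid in wifi:
            index[bssid].add(shop_id)
    return index


def build_samples(records, index):
    """Expand each record into (record number, candidate shop, label) rows.

    Label 1 marks the true shop (positive sample); label 0 marks every other
    candidate (negative sample).
    """
    samples = []
    for row_id, (true_shop, wifi) in enumerate(records):
        candidates = set()
        for bssid in wifi:
            candidates |= index.get(bssid, set())
        for shop_id in sorted(candidates):
            samples.append((row_id, shop_id, int(shop_id == true_shop)))
    return samples


wifi_index = build_wifi_index(history)
samples = build_samples(history, wifi_index)
for sample in samples:
    print(sample)
```

A binary classifier trained on rows like these can then score every candidate of a test record, and the highest-scoring candidate becomes the predicted shop.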


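From linear regression to least squares and gradient descent

Posted on 2017-12-29 | Category: Machine learning |

Linear regression

As the figure below shows, we want to fit a line to the sample points, finding the function that makes the predictions as accurate as possible. This process is linear regression.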

$$Y = aX + b$$
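
With a single feature, the least-squares line can be computed in closed form: a is the covariance of X and Y divided by the variance of X, and b is the mean of Y minus a times the mean of X. The following is a minimal sketch on made-up sample points; `fit_line` is an illustrative name, not code from the post.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the sum of squared errors of Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # a = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Made-up points that lie exactly on Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

Because the points were generated from Y = 2X + 1 without noise, the fit recovers a = 2 and b = 1; with noisy data the same formulas give the line with the smallest total squared error.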


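A summary of Java IO streams

Posted on 2017-11-11 | Category: Java |

This post reviews the concepts and operations of IO streams in Java and gives a fix for garbled Chinese text caused by character-encoding problems.

1. What is a stream

A stream in Java is an abstraction over a sequence of bytes. Picture a pipe, except that what flows through it is a sequence of bytes rather than water. Like flowing water, a Java stream has a direction: an object from which a sequence of bytes can be read is called an input stream, and an object to which a sequence of bytes can be written is called an output stream.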


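Incremental synchronization schemes for big data

Posted on 2017-10-19 | Category: Big data |

My current project uses Alibaba's DataX to synchronize data between different data sources. A one-off bulk import is fairly simple; incremental data needs a different scheme for each scenario.

Incremental sync for mutable data

Daily full sync

Data such as a personnel table or an order table changes over time. One of the four defining characteristics of a data warehouse is that it reflects historical change, so we recommend a full sync of such data every day. Each day then stores a complete snapshot, which makes both historical and current data easy to retrieve.

Set up a daily partition and sync the full data every day.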

-- Full sync: one complete snapshot per daily partition (ds)
create table ods_user_full(
    uid bigint,
    uname string,
    deptno bigint,
    gender string,
    optime datetime
) partitioned by (ds string);

To query a full snapshot, filter on the partition column, for example where ds = "2017-10-19".
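
A scheduled job usually derives the partition value from the run date. The sketch below shows one way to build the snapshot query for a given day; the `full_snapshot_query` helper is hypothetical, and plain string formatting like this is only acceptable when the table name comes from trusted code rather than user input.

```python
from datetime import date


def full_snapshot_query(table, day):
    """Build a query that reads one day's full snapshot from a ds-partitioned table."""
    ds = day.isoformat()  # yields a value such as 2017-10-19
    return f'select * from {table} where ds = "{ds}"'


query = full_snapshot_query("ods_user_full", date(2017, 10, 19))
print(query)
```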


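A summary of fixing website vulnerabilities

Posted on 2017-09-24 | Category: Computer tips |

A scan recently found vulnerabilities in our lab's website that needed fixing. This post summarizes the fixes.

1. XSS attacks

XSS stands for cross-site scripting. It is abbreviated XSS rather than CSS to avoid confusion with Cascading Style Sheets (CSS). XSS is a security vulnerability in web applications that lets a malicious user inject code into pages served to other users.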
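
A standard defence is to escape user-supplied text before writing it into a page, so that injected markup is displayed as text instead of being executed. The excerpt does not say which stack the lab website uses or how it was actually fixed, so the following is only an illustration in Python using the standard library's `html.escape`; the input string is made up.

```python
import html

# A comment submitted by a malicious user (made-up example).
user_input = '<script>alert("xss")</script>'

# Escaping turns markup characters into HTML entities, so the browser
# renders the text instead of executing it.
safe = html.escape(user_input)
page = "<p>" + safe + "</p>"
print(page)
```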


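Date formatting in Oracle and MySQL

Posted on 2017-09-16 | Category: Database |

MySQL

When a MySQL query returns a timestamp column, the raw result is hard to read because you cannot tell at a glance what time the value represents. MySQL's date formatting functions make the formatted time easy to see.

1. The DATE_FORMAT() function displays date/time data in different formats.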

DATE_FORMAT(date,format)
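
For a column that stores a Unix timestamp, MySQL's FROM_UNIXTIME(unix_timestamp, format) accepts the same format specifiers as DATE_FORMAT. The same conversion can also be done on the client side; the sketch below does it in Python for illustration, with a made-up timestamp. Note that Python's `strftime` writes minutes as %M where MySQL uses %i.

```python
from datetime import datetime, timezone

ts = 1505520000  # made-up Unix timestamp (seconds)

# MySQL format '%Y-%m-%d %H:%i:%s' corresponds to '%Y-%m-%d %H:%M:%S' here.
formatted = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2017-09-16 00:00:00 (UTC)
```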


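DataX user-defined function: dx_groovy

Posted on 2017-09-16 | Category: Database |

DataX is an offline data synchronization tool open-sourced by Alibaba Cloud. It aims to provide stable, efficient synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, MaxCompute (formerly ODPS), HBase and FTP.

DataX data transformation supports user-defined functions.

The official usage notes are as follows:

dx_groovy

Parameters:
- First parameter: groovy code
- Second parameter (a list, or empty): extraPackage
Notes:
- dx_groovy can be called only once; it cannot be called multiple times.
- The groovy code can use the java.lang and java.util packages. The objects it can reference directly are record and the column types under element (BoolColumn.class, BytesColumn.class, DateColumn.class, DoubleColumn.class, LongColumn.class, StringColumn.class). Other packages are not supported; if you need one, set extraPackage. Note that extraPackage does not support third-party jars.
- The groovy code returns either the updated Record (for example after record.setColumn(columnIndex, new StringColumn(newValue));) or null. Returning null filters out the row.
- You can call the static utility methods (GroovyTransformerStaticUtil) directly. The current list of GroovyTransformerStaticUtil methods (to be extended as needed):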


MapReduce error: The required MAP capability is more than the supported max container capability in the cluster

Posted on 2017-08-05 | Category: Big data |

The error


The required MAP capability is more than the supported max container capability in the cluster. Killing the Job. mapResourceRequest: <memory:3072, vCores:1> maxContainerCapability:<memory:1460, vCores:1>
Job received Kill while in RUNNING state.

This error causes the job to be killed.

Solutions

Answer 1

https://stackoverflow.com/questions/25878458/rhadoop-reduce-capability-required-is-more-than-the-supported-max-container-cap

I have not used RHadoop. However I’ve had a very similar problem on my cluster, and this problem seems to be linked only to MapReduce.

The maxContainerCapability in this log refers to the yarn.scheduler.maximum-allocation-mb property of your yarn-site.xml configuration. It is the maximum amount of memory that can be used in any container.

The mapResourceReqt and reduceResourceReqt in your log refer to the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties of your mapred-site.xml configuration. It is the memory size of the containers that will be created for a Mapper or a Reducer in mapreduce.

If the size of your Reducer’s container is set to be greater than yarn.scheduler.maximum-allocation-mb, which seems to be the case here, your job will be killed because it is not allowed to allocate so much memory to a container.

Check your configuration at http://[your-resource-manager]:8088/conf and you should normally find these values and see that this is the case.

Maybe your new environment has these values set to 4096 MB (which is quite big, the default in Hadoop 2.7.1 being 1024).

Solution

You should either lower the mapreduce.[map|reduce].memory.mb values down to 1024, or if you have lots of memory and want huge containers, raise the yarn.scheduler.maximum-allocation-mb value to 4096. Only then will MapReduce be able to create containers.

I hope this helps.

Answer 2

https://stackoverflow.com/questions/25753983/how-do-you-change-the-max-container-capability-in-hadoop-cluster

To do this on Hortonworks 2.1, I had to

- increase VirtualBox memory from 4096 to 8192 (don't know if that was strictly necessary)
- enable Ambari from http://my.local.host:8000
- log in to Ambari from http://my.local.host:8080
- change the values of yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb from the defaults to 4096
- save and restart everything (via Ambari)

This got me past the "capability required" errors, but the actual wordcount.R doesn't seem to want to complete. Things like hdfs.ls("/data") do work, however.

In short: the values of yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb in yarn-site.xml must be greater than or equal to the values of mapreduce.map.memory.mb and mapreduce.reduce.memory.mb in mapred-site.xml.
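
The rule can be expressed as a small check. The sketch below uses the two numbers from the error log above (a 3072 MB map request against a 1460 MB container limit); the `oversized_requests` helper is hypothetical, and in practice the values would be read from yarn-site.xml and mapred-site.xml.

```python
def oversized_requests(max_container_mb, requests_mb):
    """Return the container requests that exceed the per-container limit."""
    return {
        name: mb for name, mb in requests_mb.items() if mb > max_container_mb
    }


# Values from the error log above: the map task asks for 3072 MB,
# but the cluster reports a maxContainerCapability of 1460 MB.
problems = oversized_requests(
    max_container_mb=1460,
    requests_mb={"mapreduce.map.memory.mb": 3072},
)
print(problems)
```

An empty result means every request fits; any entry returned must either be lowered or the cluster limit raised.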

References:

YARN best practices http://blog.csdn.net/jiangshouzhuang/article/details/52595781

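Kylin API roundup (including some not listed on the official site)

Posted on 2017-07-31 | Category: Big data |

The Kylin website does not list the REST APIs for saving cube information, model information, projects and so on. This post lists the APIs I actually used in a real project, found by reading the source code.

Official documentation
http://kylin.apache.org/docs16/howto/howto_use_restapi.html#build-cube

Save a project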

POST /kylin/api/projects

ProjectController.java

Request Body

name - required String, the project name
description - optional String, the project description
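
As an illustration, the request could be built like this with Python's standard library. This is a sketch under several assumptions that the excerpt does not confirm: a local Kylin instance on port 7070, HTTP Basic authentication with the default ADMIN/KYLIN account, and the two fields above sent as a JSON object. The code only constructs the request; it does not send it.

```python
import base64
import json
import urllib.request

# Assumptions for illustration: a local Kylin instance on port 7070 with the
# default ADMIN/KYLIN account; replace both for a real deployment.
BASE_URL = "http://localhost:7070"
USER, PASSWORD = "ADMIN", "KYLIN"


def build_save_project_request(name, description=None):
    """Build (but do not send) the POST /kylin/api/projects request."""
    body = {"name": name}
    if description is not None:
        body["description"] = description
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    return urllib.request.Request(
        BASE_URL + "/kylin/api/projects",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_save_project_request("demo_project", "created via REST")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```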


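Installing Kylin on CDH and troubleshooting errors

Posted on 2017-06-21 | Category: Big data |

Apache Kylin

Kylin, whose Chinese name is 麒麟 (the qilin, a mythical "divine beast"), is an important member of the Hadoop zoo. Apache Kylin is an open-source distributed analytics engine, originally developed at eBay and contributed to the open-source community. It provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop for very large datasets. It can handle analysis jobs at the TB and even PB scale, query huge Hive tables with sub-second latency, and support high concurrency.

Installation

Environment

CDH 5.10.0
apache-kylin-1.6.0-cdh5.7-bin

The official site recommends Kylin 2.0 for CDH 5.10, but after trying it I found that some queries did not work, so I switched to version 1.6.0.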
