Attached here is the link to the instructor's course slides:

Hive

  • Start Hive from a Terminal: hive
  • Execute an HDFS command from within Hive: dfs -ls /;
  • Exit Hive: exit;
  • A Hive script can be saved as an .hql file and run with hive -f /path/to/file/withqueries.hql (see the sketch below)
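
As a minimal sketch, such a withqueries.hql file might contain a short batch of statements like the following (the statements are purely illustrative):

SHOW DATABASES;
USE default;
SHOW TABLES;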

DDL

SHOW DATABASES;

CREATE DATABASE IF NOT EXISTS myname;
CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
-- Location here is the HDFS path where the data for the database will be stored. Default is /user/hive/warehouse/.
DROP DATABASE IF EXISTS myname;

SHOW DATABASES;
SHOW DATABASES LIKE "d*";
USE myname;
CREATE TABLE IF NOT EXISTS myname.employees (
  name STRING COMMENT 'Employee name',
  salary FLOAT COMMENT 'Employee salary'
) COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');

SHOW TABLES;
DESCRIBE myname.employees;

ALTER TABLE employees RENAME TO employees_new;
ALTER TABLE employees_new ADD COLUMNS (age int);

DROP TABLE employees_new;
SHOW FUNCTIONS [like "str*"];
SHOW DATABASES [like "h*"];
SHOW TABLES [like "m*"];
DESCRIBE DATABASE myname;
DESCRIBE employees;
DESCRIBE FUNCTION like;

DML

Suppose we now have a file /home/usr/data/stocks.txt:

use myname;

create external table if not exists stocks (
  symbol string,
  ymd string,
  price_open float,
  price_high float,
  price_low float,
  price_close float,
  volume int,
  price_adj_close float )
row format delimited fields terminated by ','
LOCATION '/data/';

LOAD DATA LOCAL INPATH '/home/usr/data/stocks.txt' OVERWRITE INTO TABLE stocks;

-- LOCAL means the source file is on the local (server) filesystem; without LOCAL, the path is read from HDFS
  • Hive stores table files by default under /user/hive/warehouse on HDFS.
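
As a short sketch of the LOCAL distinction (the HDFS path /data/stocks.txt below is hypothetical): LOAD DATA LOCAL copies the file into the table's storage location, while the non-LOCAL form moves a file that is already on HDFS.

-- Local file: Hive copies it into the table's storage location
LOAD DATA LOCAL INPATH '/home/usr/data/stocks.txt' OVERWRITE INTO TABLE stocks;
-- File already on HDFS: Hive moves it into the table's storage location
LOAD DATA INPATH '/data/stocks.txt' OVERWRITE INTO TABLE stocks;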

Hive & Python

#!/usr/bin/env python3
# A mapper that turns each record's Unix timestamp into a weekday variable
# Located at: hdfs:///user/code/weekday_mapper.py
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
use myname;

-- Create a new table to hold the Python-processed data
CREATE EXTERNAL TABLE if not exists u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- Create a table to hold the raw data
CREATE EXTERNAL TABLE if not exists u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
-- Load the raw data
LOAD DATA LOCAL INPATH '/home/usr/data/u.data'
OVERWRITE INTO TABLE u_data;

-- Register the Python script with Hive
add FILE hdfs:///user/code/weekday_mapper.py;

-- Transform the data with Python and insert it into the new table
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python3 weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

-- Query the new table
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Key Points

  1. Difference between Hive internal (managed) and external tables: a table created without the EXTERNAL keyword is an internal table; one created with EXTERNAL is an external table. An internal table's data is managed by Hive itself and stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); an external table's data is managed by HDFS, at a storage location you specify yourself (defaulting to /user/hive/warehouse as well). Dropping an internal table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left untouched (see the first sketch after this list).
  2. Difference between sort by and order by: order by performs a global sort over the entire input, so it runs with a single reducer (multiple reducers cannot guarantee a global order), which leads to long run times on large inputs. sort by is not a global sort; it sorts the data before it enters each reducer. So if you sort with sort by and set mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is sorted, not that the result is globally ordered (see the second sketch after this list).
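
A minimal sketch of point 1 (the table names and the /data/demo/ location are hypothetical):

-- Internal (managed) table: data is stored under /user/hive/warehouse/managed_demo
CREATE TABLE managed_demo (id INT);
-- External table: data stays wherever LOCATION points
CREATE EXTERNAL TABLE external_demo (id INT) LOCATION '/data/demo/';

DROP TABLE managed_demo;   -- removes the metadata AND the files under the warehouse
DROP TABLE external_demo;  -- removes only the metadata; /data/demo/ is untouched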
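
And a sketch of point 2, reusing the stocks table from the DML section (the reducer count is set only for illustration):

-- ORDER BY: one reducer, output is globally sorted
SELECT symbol, price_close FROM stocks ORDER BY price_close;

-- SORT BY: each of the 3 reducers emits sorted output,
-- but the combined result is not globally ordered
SET mapred.reduce.tasks=3;
SELECT symbol, price_close FROM stocks SORT BY price_close;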