TF 2.0 - 时间序列预测入门TF 2.0 - 时间序列预测入门

160 阅读 0 评论 106 点赞

我是靠谱客的博主危机小天鹅，这篇文章主要介绍TF 2.0 - 时间序列预测入门TF 2.0 - 时间序列预测入门，现在分享给大家，希望可以做个参考。

TF 2.0 - 时间序列预测入门

本文链接：https://blog.lucien.ink/archives/483/

最近 Google 正式将 TensorFlow 2.0 作为默认 TensorFlow 版本了，作为一名初学者，决定用相对易用的新版的 TensorFlow 来进行实践。

在接下来的内容中，我将记录我用 LSTM 和 Beijing PM2.5 Data Set 来进行时间序列预测的过程。

因为 ipynb 文件里都包含图片，所以在文章里就不上图了哈。

0. 环境

Package	Version
`tensorflow`	`2.0.0`
`numpy`	`1.17.3`
`pandas`	`0.25.3`
`matplotlib`	`3.1.1`
`sklearn`	`0.21.3`

1. 过程

1.1 数据集

Beijing PM2.5 Data Set 源自位于北京的美国大使馆在 2010 ~ 2014 年每小时采集的天气及空气污染指数。
　　数据集包括日期、PM2.5 浓度、露点、温度、风向、风速、累积小时雪量和累积小时雨量。

原始数据中完整的特征如下：

复制代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
No 编号
year 年
month 月
day 日
hour 小时
pm2.5 PM2.5浓度
DEWP 露点
TEMP 温度
PRES 大气压
cbwd 风向
lws 风速
ls 累积雪量
lr 累积雨量

可以用此数据集搭建预测 PM 2.5 的模型，利用前 x 个小时来预测后 y 个小时的 PM 2.5 数值。

复制代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
from TensorFlow import random
import TensorFlow.keras as keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import r2_score

# 固定随机种子
np.random.seed(10086)
random.set_seed(10010)

csv_data = keras.utils.get_file("PRSA_data.csv", "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv")

raw_df = pd.read_csv(csv_data)

raw_df.head()

1.2 数据预处理

1.2.1 删除时间戳

目前的我认为，时间戳对于连续的时间序列预测来说并不重要，所以在这里先删掉。

复制代码

1
2
3
4
5
6
# 删除时间戳
df = raw_df.drop(["No", "year", "month", "day", "hour"], axis=1, inplace=False)

print(df.shape)
df.head()

1.2.2 删除 nan

pm2.5 列有的值是空值，由于数量不多，所以考虑直接将包括 nan 的行删掉。

复制代码

1
2
3
4
5
6
# 删除 pm2.5 列的 nan 值
df = df[pd.notna(df["pm2.5"])]

print(df.shape)
df.head()

1.2.3 打印当前状态的数据

复制代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 查看每列的 unique
for i in range(df.shape[1]):
    if df.columns[i] in ["pm2.5", "TEMP", "DEWP", "PRES"]:
        continue
    print("{}: {}".format(df.columns[i], df[df.columns[i]].unique()))

# 画个图
columns = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]

plt.figure(figsize=(15, 15))
for i, each in enumerate(columns):
    plt.subplot(len(columns), 1, i + 1)
    plt.plot(df[each])
    plt.title(each, y=0.5, loc="right")  # center, left, right

plt.show()

1.2.4 将非数值类型的 label 转化为数值类型

复制代码

1
2
3
4
5
6
7
# 将 label id 化
def label_fit_transform(data_frame, col_name):
    data_frame[col_name] = preprocessing.LabelEncoder().fit_transform(data_frame[col_name])
    return data_frame

label_fit_transform(df, "cbwd").head()

1.2.5 将数值归一化

归一化之后模型收敛会快一些，效果大概率会好一些，从感性角度去理解的话，我觉得知乎上的这个回答说的非常好。

复制代码

1
2
3
4
5
6
7
8
9
10
11
def standardization(data_frame):
    buffer = data_frame.copy()
    min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
    standard_values = min_max_scaler.fit_transform(buffer)
    for i, col_name in enumerate(buffer.columns):
        buffer[col_name] = standard_values[:, i]

    return buffer

standardization(df).head()

1.2.6 将时间序列转化为有监督训练数据

原始的时间序列并不能直接 feed 给模型，需要处理为 input -> label 形式的数据才可以。

复制代码

# 转化为监督序列
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    from pandas import DataFrame, concat

n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('%s(t-%d)' % (data.columns[j], i)) for j in range(n_vars)]

# forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('%s(t)' % (data.columns[j])) for j in range(n_vars)]
        else:
            names += [('%s(t+%d)' % (data.coumns[j], i)) for j in range(n_vars)]

# put it all together
    agg = concat(cols, axis=1)
    agg.columns = names

# drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# 通过过去 2 小时的数据来预测未来 1 小时的数据
look_back = 2
predict_forward = 1

# standard (supervised) data frame
sdf = series_to_supervised(
    standardization(
        label_fit_transform(df, "cbwd")), look_back, predict_forward).drop(
        [
         "DEWP(t)", "TEMP(t)", "PRES(t)", "cbwd(t)", "Iws(t)", "Is(t)", "Ir(t)"
         ], axis=1, inplace=False).astype('float32')

sdf.head()

sdf.info()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
# 转化为监督序列
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    from pandas import DataFrame, concat

    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('%s(t-%d)' % (data.columns[j], i)) for j in range(n_vars)]

    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('%s(t)' % (data.columns[j])) for j in range(n_vars)]
        else:
            names += [('%s(t+%d)' % (data.coumns[j], i)) for j in range(n_vars)]

    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names

    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# 通过过去 2 小时的数据来预测未来 1 小时的数据
look_back = 2
predict_forward = 1

# standard (supervised) data frame
sdf = series_to_supervised(
    standardization(
        label_fit_transform(df, "cbwd")), look_back, predict_forward).drop(
        [
         "DEWP(t)", "TEMP(t)", "PRES(t)", "cbwd(t)", "Iws(t)", "Is(t)", "Ir(t)"
         ], axis=1, inplace=False).astype('float32')

sdf.head()

sdf.info()

1.2.7 划分数据集

从网上了解到， $t r a i n$ 、 $v a l i d$ 、 $t e s t$ 三个集合的比例一般为 $6 : 2 : 2$ 。

复制代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# train, valid, test 6:2:2 划分
total = sdf.shape[0]
split_point = [total * 60 // 100, total * 80 // 100]

print("total = {}, split_point = {}".format(total, split_point))

def transform(values):
    return values.reshape(values.shape[0], 1, values.shape[1])

train_data = sdf[:split_point[0]].values

valid_data = sdf[split_point[0]: split_point[1]].values

test_data = sdf[split_point[1]: ].values

print("train_data.shape = {}, valid_data.shape = {}, test_data.shape = {}".format(
    train_data.shape, valid_data.shape, test_data.shape))

train_x, train_y = transform(train_data[:, : -1]), train_data[:, -1]

valid_x, valid_y = transform(valid_data[:, : -1]), valid_data[:, -1]

test_x, test_y = transform(test_data[:, : -1]), test_data[:, -1]

print("train_x.shape = {}, train_y = {}".format(train_x.shape, train_y.shape))
print("valid_x.shape = {}, valid_y = {}".format(valid_x.shape, valid_y.shape))
print("test_x.shape = {}, test_y = {}".format(test_x.shape, test_y.shape))

1.3 模型

1.3.1 构建网络

复制代码

1
2
3
4
5
6
model = keras.Sequential()
model.add(keras.layers.LSTM(64, input_shape=(train_x.shape[1], train_x.shape[2])))
model.add(keras.layers.Dense(1))

model.compile(loss="mae", optimizer="adam")

1.3.2 训练并记录历史

复制代码

1
2
3
4
5
6
7
8
history = model.fit(train_x,
                    train_y,
                    validation_data=(valid_x, valid_y),
                    epochs=32,
                    batch_size=32,
                    verbose=1,
                    shuffle=False)

1.4 模型效果评估

1.4.1 loss 图

先画一下 train 和 valid 数据集的 loss 图，看起来没有 overfitting。

复制代码

1
2
3
4
5
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="valid loss")
plt.legend()
plt.show()

1.4.2 在 test 数据集上进行评估

1.4.2.1 loss

复制代码

1
2
3
# test 集上的 loss
model.evaluate(test_x, test_y, verbose=0)

看起来很低的样子。

1.4.2.2 将预测值和真值进行比较

1.4.2.2.1 获取预测结果

复制代码

1
2
prediction = model.predict(test_x)

1.4.2.2.2 对预测出来的结果进行反归一化

由于用的是 MinMaxScaler，所以直接按照公式逆着计算一下就可以。

复制代码

1
2
3
4
max_value = np.max(df["pm2.5"])
min_value = np.min(df["pm2.5"])
prediction = prediction[:, 0] * (max_value - min_value) + min_value

1.4.2.2.3 评估拟合能力

复制代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 因为 look_back 处理时会去掉值为 nan 的 input，所以这里要加上 look_back
expectation = df["pm2.5"][split_point[1] + look_back: ].values

print("prediction's shape = {}, expectation's shape = {}".format(prediction.shape, expectation.shape))

# 计算一下 R-square
print(r2_score(expectation, prediction, multioutput="raw_values"))

plt.figure(figsize=(30, 17))
plt.plot(expectation, label="expectation")
plt.plot(prediction, label="predict", color="yellow", alpha=0.5)
plt.legend()
plt.show()

2. All in One

ipynb 文件可在我的 GitHub 或者 Google CoLab 看到。

py 文件可在 pasteme.cn/21403 查看。

最后

以上就是危机小天鹅最近收集整理的关于TF 2.0 - 时间序列预测入门TF 2.0 - 时间序列预测入门的全部内容，更多相关TF内容请搜索靠谱客的其他文章。

本图文内容来源于网友提供，作为学习参考使用，或来自网络收集整理，版权属于原作者所有。

本文分类：算法
浏览次数：160 次浏览
发布日期：2023-09-10 13:45:19
本文链接：https://www.kaopuke.com/article/k-p-k_14_uzo_6_fy_13__23__22_w.html

TF 2.0 - 时间序列预测入门TF 2.0 - 时间序列预测入门

TF 2.0 - 时间序列预测入门

0. 环境

1. 过程

1.1 数据集

1.2 数据预处理

1.2.1 删除时间戳

1.2.2 删除 nan

1.2.3 打印当前状态的数据

1.2.4 将非数值类型的 label 转化为数值类型

1.2.5 将数值归一化

1.2.6 将时间序列转化为有监督训练数据

1.2.7 划分数据集

1.3 模型

1.3.1 构建网络

1.3.2 训练并记录历史

1.4 模型效果评估

1.4.1 loss 图

1.4.2 在 test 数据集上进行评估

1.4.2.1 loss

1.4.2.2 将预测值和真值进行比较

1.4.2.2.1 获取预测结果

1.4.2.2.2 对预测出来的结果进行反归一化

1.4.2.2.3 评估拟合能力

2. All in One

最后

评论列表共有 0 条评论

发表评论取消回复

TF 2.0 - 时间序列预测入门TF 2.0 - 时间序列预测入门

TF 2.0 - 时间序列预测入门

0. 环境

1. 过程

1.1 数据集

1.2 数据预处理

1.2.1 删除时间戳

1.2.2 删除 nan

1.2.3 打印当前状态的数据

1.2.4 将非数值类型的 label 转化为数值类型

1.2.5 将数值归一化

1.2.6 将时间序列转化为有监督训练数据

1.2.7 划分数据集

1.3 模型

1.3.1 构建网络

1.3.2 训练并记录历史

1.4 模型效果评估

1.4.1 loss 图

1.4.2 在 test 数据集上进行评估

1.4.2.1 loss

1.4.2.2 将预测值和真值进行比较

1.4.2.2.1 获取预测结果

1.4.2.2.2 对预测出来的结果进行反归一化

1.4.2.2.3 评估拟合能力

2. All in One

最后

微信扫一扫：分享

相关文章

评论列表共有 0 条评论

发表评论 取消回复

微信扫一扫：分享

发表评论取消回复