Losing money on stocks again and again: can artificial intelligence help us predict stock prices?

2020-04-13

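"Predicting stock prices with AI in ten minutes" is a deep-learning practice project. It takes several years of a stock's K-line (candlestick) history, together with sentiment analysis of news reports about the company, as its dataset. It then trains a machine learning model on that data and uses the model to predict the stock price. The project tries several different algorithms (linear regression, neural networks, and random forests) and compares their results. Running the project requires basic Python programming skills; understanding its code requires some knowledge of machine learning.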

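How humans invest in the stock market

Before writing any AI program, we need to analyze how humans decide how to invest. Anyone who has traded stocks will pick this up faster. The goal of investing is profit, so before deciding which stock to buy we look up information about the company, search for recent and even older news about it, browse stock-trading forums, and check what is being said about the company on Weibo. If the company's prospects look bright (lots of positive coverage), the return on investing in its stock may be higher.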

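A stock's K-line chart

In addition, investing in stocks requires reading various kinds of data, such as K-line charts. Sometimes a stock has been falling for a while and starts to show signs of rising; that may be the best time to buy, because the stock is likely to rebound from its bottom. From this analysis we know which data we need to train such a machine learning model: 1. stock price data; 2. sentiment data about the stock (company).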

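Getting historical data and doing simple processing

Data is essential to machine learning. Without suitable data we cannot train a model to make predictions. This project needs two kinds of data: (1) stock price data and (2) sentiment data. We analyze the stock price data with Pandas and process the sentiment data with NLTK (Natural Language Toolkit).

For an introduction to Pandas, I have written a tutorial: "Machine Learning from Scratch-8: Learn Pandas in Five Minutes".

First, we import the required Python packages.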

import numpy as np
import pandas as pd
import unicodedata
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from treeinterpreter import treeinterpreter as ti
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

Next we read the historical stock price data, process it, and build a Pandas DataFrame.

df_stocks = pd.read_pickle('data/pickled_ten_year_filtered_data.pkl')
df_stocks['prices'] = df_stocks['adj close'].apply(np.int64)
df_stocks = df_stocks[['prices', 'articles']]
df_stocks['articles'] = df_stocks['articles'].map(lambda x: x.lstrip('.-'))
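
One detail of the last line is easy to misread: `lstrip('.-')` removes any run of leading `.` and `-` characters, not the literal prefix `.-`. A minimal sketch on made-up strings (the sample strings are assumptions, not rows from the dataset):

```python
# Hypothetical article strings; the real 'articles' column holds concatenated headlines.
samples = ['.-What Sticks', '--. Heart Health', 'Typo.com. Jumbo']

# str.lstrip treats its argument as a set of characters to drop from the left,
# and stops at the first character that is not in that set.
cleaned = [s.lstrip('.-') for s in samples]
print(cleaned)
```

Note that the space after the stripped characters in the second sample is kept, because a space is not in the set.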

Note: the data here is a Python object that has already been serialized to a file. Use

print(df_stocks)

to inspect our df_stocks DataFrame. The output is as follows:

            prices                                           articles
2007-01-01   12469   What Sticks from '06. Somalia Orders Islamist...
2007-01-02   12472   Heart Health: Vitamin Does Not Prevent Death ...
2007-01-03   12474   google Answer to Filling Jobs Is an Algorithm...
2007-01-04   12480   Helping Make the Shift From Combat to Commerc...
2007-01-05   12398   Rise in Ethanol Raises Concerns About Corn as...
2007-01-06   12406   A Status Quo Secretary General. Best Buy and ...
2007-01-07   12414   THE COMMON APPLICATION; Typo.com. Jumbo Bonus...
...            ...              ...
2016-12-31   19762  Terrorist Attack at Nightclub in Istanbul Kill...

[3653 rows x 2 columns]

Process finished with exit code 0

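As we can see, we have successfully obtained the stock prices and the text of the related articles. The next step is to analyze the sentiment data together with the price data. We first split the prices Series out of df_stocks into a separate DataFrame, because we want to analyze the price data without damaging the original DataFrame. After that we add several new Series and then use NLTK to run sentiment analysis on the articles.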

df = df_stocks[['prices']].copy()

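df["compound"] = ''  # compound (overall) score
df["neg"] = ''  # negative
df["neu"] = ''  # neutral
df["pos"] = ''  # positive

We use NLTK's sentiment intensity analyzer to score each day's articles and write the scores into the new DataFrame df. The neg Series stores the negative score of the news, neu the neutral score, and pos the positive score. The compound Series stores a single normalized score that summarizes the overall sentiment.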

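import nltk

nltk.download('vader_lexicon')  # fetch the VADER lexicon if it is not already present
sid = SentimentIntensityAnalyzer()
# iterrows() replaces df_stocks.T.iteritems(), which newer pandas versions no longer provide
for date, row in df_stocks.iterrows():
    try:
        # Normalize Unicode before scoring the text
        sentence = unicodedata.normalize('NFKD', df_stocks.loc[date, 'articles'])
        ss = sid.polarity_scores(sentence)
        df.at[date, 'compound'] = ss['compound']
        df.at[date, 'neg'] = ss['neg']
        df.at[date, 'neu'] = ss['neu']
        df.at[date, 'pos'] = ss['pos']
    except TypeError:
        # Show the offending entry (for example, a non-string value) and its date
        print(df_stocks.loc[date, 'articles'])
        print(date)
print(df)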

The output is as follows:

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:...nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
            prices compound    neg    neu    pos
2007-01-01   12469  -0.9814  0.159  0.749  0.093
2007-01-02   12472  -0.8179  0.114  0.787  0.099
2007-01-03   12474  -0.9993  0.198  0.737  0.065
...          ...          ...     ...     ...      ...
2016-12-28   19833   0.2869  0.128  0.763  0.108
2016-12-29   19819  -0.9789  0.138  0.764  0.097
2016-12-30   19762   -0.995  0.168  0.734  0.098
2016-12-31   19762  -0.2869  0.173  0.665  0.161

[3653 rows x 5 columns]

Process finished with exit code 0

With this output, we have successfully obtained sentiment scores for the historical articles.
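
To help read these columns: VADER reports neg, neu, and pos as proportions of the text, so they sum to roughly 1, while compound is a normalized score between -1 and 1. A small sanity check on the rows printed above (the tolerance of 0.005 is an assumption chosen to absorb rounding in the printed values):

```python
# Rows copied from the output above: (compound, neg, neu, pos).
rows = [
    (-0.9814, 0.159, 0.749, 0.093),
    (-0.8179, 0.114, 0.787, 0.099),
    (-0.9993, 0.198, 0.737, 0.065),
    (0.2869, 0.128, 0.763, 0.108),
    (-0.9789, 0.138, 0.764, 0.097),
    (-0.9950, 0.168, 0.734, 0.098),
    (-0.2869, 0.173, 0.665, 0.161),
]

for compound, neg, neu, pos in rows:
    total = neg + neu + pos
    # neg/neu/pos are proportions, so they sum to about 1 (up to rounding).
    assert abs(total - 1.0) < 0.005
    # compound is a normalized score in [-1, 1].
    assert -1.0 <= compound <= 1.0
    print(f"{compound:+.4f}  neg+neu+pos = {total:.3f}")
```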

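Splitting the dataset

From the output above, the data starts on January 1, 2007 and ends on December 31, 2016. We split it into a training set and a test set at a ratio of 8:2.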

train_start_date = '2007-01-01'
train_end_date = '2014-12-31'
test_start_date = '2015-01-01'
test_end_date = '2016-12-31'
# .loc replaces .ix, which has been removed from pandas
train = df.loc[train_start_date:train_end_date]
test = df.loc[test_start_date:test_end_date]
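
As a quick check of the 8:2 claim, the sketch below rebuilds a stand-in frame with one row per calendar day (the `demo` frame and its values are assumptions, not the article's data) and counts the rows on each side of the split:

```python
import pandas as pd

# Stand-in frame with one row per calendar day, like the article's df.
idx = pd.date_range('2007-01-01', '2016-12-31')
demo = pd.DataFrame({'prices': range(len(idx))}, index=idx)

# .loc slicing on a DatetimeIndex includes both end points.
train_demo = demo.loc['2007-01-01':'2014-12-31']
test_demo = demo.loc['2015-01-01':'2016-12-31']

print(len(train_demo), len(test_demo))
print(round(len(train_demo) / len(demo), 3))
```

The total of 3653 rows matches the row count in the printed DataFrame, and the 731 test rows match the `prediction[0:731]` slice used later.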

After splitting df, we build a list of sentiment scores for each date, once for the training set and once for the test set.

sentiment_score_list = []

# Training features: negative and positive scores for each date
for date, row in train.iterrows():
    sentiment_score = np.asarray([df.loc[date, 'neg'], df.loc[date, 'pos']])
    sentiment_score_list.append(sentiment_score)
numpy_df_train = np.asarray(sentiment_score_list)

# Test features, built the same way from the test set
sentiment_score_list = []
for date, row in test.iterrows():
    sentiment_score = np.asarray([df.loc[date, 'neg'], df.loc[date, 'pos']])
    sentiment_score_list.append(sentiment_score)
numpy_df_test = np.asarray(sentiment_score_list)
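
The row-by-row loops can also be replaced by a column selection. This is only a sketch on a toy frame (the `demo` values are made up); it shows that both approaches give the same (n_samples, 2) feature array:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the article's df (values are made up).
idx = pd.date_range('2015-01-01', periods=3)
demo = pd.DataFrame({'prices': [100, 101, 99],
                     'neg': [0.1, 0.2, 0.3],
                     'pos': [0.3, 0.2, 0.1]}, index=idx)

# Row-by-row construction, as in the article.
looped = np.asarray([[demo.loc[d, 'neg'], demo.loc[d, 'pos']] for d in demo.index])

# Column selection gives the same array in one step.
vectorized = demo[['neg', 'pos']].to_numpy(dtype=float)

print(vectorized.shape)
```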

Because the program's prediction target is the stock price, the y labels are the stock prices.

y_train = pd.DataFrame(train['prices'])
y_test = pd.DataFrame(test['prices'])

Predicting the stock price with the random forest algorithm

We use the random forest algorithm packaged in Scikit Learn to predict the stock.

rf = RandomForestRegressor()
rf.fit(numpy_df_train, y_train)

# print(rf.feature_importances_)
# treeinterpreter returns the prediction plus its bias and per-feature contributions
prediction, bias, contributions = ti.predict(rf, numpy_df_test)
print(prediction)

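If the console shows output and it looks correct, the random forest prediction has run successfully. To see more directly how far our predictions deviate from reality, we plot them with Matplotlib.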

# Matplotlib
idx = pd.date_range(test_start_date, test_end_date)
predictions_df = pd.DataFrame(data=prediction[0:731], index=idx, columns=['prices'])
print(predictions_df)

# Plot predicted and actual prices on the same axes
# (the column is named 'prices', so that is the key to rename)
ax = predictions_df.rename(columns={"prices": "Predicted Price"}).plot(title='Random Forest Predict Stock Price')
ax.set_xlabel("Date")
ax.set_ylabel("Price")
fig = y_test.rename(columns={"prices": "Actual Price"}).plot(ax=ax).get_figure()
fig.savefig("RF_noSmoothing.png")

With the code above we plot the unsmoothed random forest prediction and save it as "RF_noSmoothing.png".
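
The plot shows the gap visually; a single number can make it easier to compare models. The article itself only compares plots, so the choice of metrics here (mean absolute error and root mean squared error) is an assumption, and the prices are made-up values:

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted prices."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared gap; penalizes large misses more."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

# Made-up prices for illustration only.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [98.0, 103.0, 104.0, 101.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

With the article's variables, the same functions could be applied to the actual test prices and the predicted prices once both are aligned on the same dates.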

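Visualizing the prediction results

In the figure above, the blue line is the predicted price and the orange line is the actual price. Our prediction clearly deviates greatly from reality, so we process the data further: we add a constant to the predicted prices so that they line up with the actual closing prices at the start of the test period.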

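# Average the actual prices over the first total_days days of the test period.
# The variable names say "5 days", but the window actually used is total_days = 10.
temp_date = test_start_date
average_last_5_days_test = 0
total_days = 10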
for i in range(total_days):
    average_last_5_days_test += test.loc[temp_date, 'prices']
    temp_date = datetime.strptime(temp_date, "%Y-%m-%d").date()
    difference = temp_date + timedelta(days=1)
    temp_date = difference.strftime('%Y-%m-%d')
average_last_5_days_test = average_last_5_days_test / total_days
print(average_last_5_days_test)

temp_date = test_start_date
average_upcoming_5_days_predicted = 0
for i in range(total_days):
    average_upcoming_5_days_predicted += predictions_df.loc[temp_date, 'prices']
    temp_date = datetime.strptime(temp_date, "%Y-%m-%d").date()
    difference = temp_date + timedelta(days=1)
    temp_date = difference.strftime('%Y-%m-%d')
    print(temp_date)
average_upcoming_5_days_predicted = average_upcoming_5_days_predicted / total_days
print(average_upcoming_5_days_predicted)
difference_test_predicted_prices = average_last_5_days_test - average_upcoming_5_days_predicted
print(difference_test_predicted_prices)

predictions_df['prices'] = predictions_df['prices'] + difference_test_predicted_prices
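
The later functions call a helper named `offset_value(test_start_date, test, predictions_df_list)`, whose definition does not appear in the article. A minimal sketch of what it might look like, assuming it wraps the same averaging logic as the code above (the body and the `total_days` parameter are a reconstruction, not the author's code):

```python
import pandas as pd

def offset_value(test_start_date, test, predictions_df, total_days=10):
    """Mean actual price minus mean predicted price over the first
    total_days days of the test period (hypothetical reconstruction)."""
    window = pd.date_range(test_start_date, periods=total_days)
    actual_mean = test.loc[window, 'prices'].mean()
    predicted_mean = predictions_df.loc[window, 'prices'].mean()
    return actual_mean - predicted_mean

# Toy check: the predictions sit exactly 50 below the actual prices.
idx = pd.date_range('2015-01-01', periods=20)
test_demo = pd.DataFrame({'prices': [1000.0 + i for i in range(20)]}, index=idx)
pred_demo = pd.DataFrame({'prices': [950.0 + i for i in range(20)]}, index=idx)

offset = offset_value('2015-01-01', test_demo, pred_demo)
print(offset)
```

Adding the returned offset to the predicted prices shifts the whole predicted line up or down without changing its shape.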

We plot the corrected predictions again with Matplotlib.

# RF plot aligned
ax = predictions_df.rename(columns={"prices": "predicted_price"}).plot(title='Random Forest Predict Stock Price Aligned')
ax.set_xlabel("Dates")
ax.set_ylabel("Stock Prices")
fig = y_test.rename(columns={"prices": "actual_price"}).plot(ax = ax).get_figure()
fig.savefig("RF_aligned.png")

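Corrected prediction line and actual line

After correcting the predictions, the predicted line moves closer to the actual line, but it jitters up and down too much and needs smoothing. For smoothing we use Pandas' EWMA (Exponentially Weighted Moving Average) method.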

# Pandas EWMA
# Older pandas versions used: pd.ewma(predictions_df["prices"], span=60, freq="D")
# The index is already daily, so the freq argument (no longer accepted by ewm) is dropped.
predictions_df['ewm'] = predictions_df["prices"].ewm(
    span=60, min_periods=0, adjust=True, ignore_na=False).mean()

predictions_df['actual_value'] = test['prices']
predictions_df['actual_value_ewm'] = predictions_df["actual_value"].ewm(
    span=60, min_periods=0, adjust=True, ignore_na=False).mean()
# Rename the four columns: prices, ewm, actual_value, actual_value_ewm
predictions_df.columns = ['predicted_price', 'average_predicted_price', 'actual_price', 'average_actual_price']
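
To make the smoothing step less of a black box, the sketch below computes the exponentially weighted mean by hand and compares it with pandas. With `adjust=True`, each value is a weighted average of all earlier points with weights (1 - alpha)^i, where alpha = 2 / (span + 1). The helper name `ewm_mean_adjusted` and the sample prices are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def ewm_mean_adjusted(values, span):
    """Exponentially weighted mean with adjust=True, computed explicitly."""
    alpha = 2.0 / (span + 1.0)
    out = []
    for t in range(len(values)):
        # The newest point gets weight 1; a point i steps back gets (1 - alpha) ** i.
        weights = (1.0 - alpha) ** np.arange(t, -1, -1)
        out.append(np.dot(weights, values[:t + 1]) / weights.sum())
    return np.array(out)

prices = np.array([10.0, 12.0, 11.0, 15.0, 14.0])  # made-up prices
manual = ewm_mean_adjusted(prices, span=3)
from_pandas = pd.Series(prices).ewm(span=3, adjust=True).mean().to_numpy()

print(np.allclose(manual, from_pandas))
```

A larger span gives a smaller alpha, so older points keep more weight and the line becomes smoother, which is why span=60 flattens the jitter in the predictions.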

We plot the random forest predictions once more.

# RF smoothed
predictions_plot = predictions_df.plot(title='Random Forest Predict Stock Price Aligned and Smoothed')
predictions_plot.set_xlabel("Dates")
predictions_plot.set_ylabel("Stock Prices")
fig = predictions_plot.get_figure()
fig.savefig("RF_smoothed.png")

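Stock trend predicted by the random forest algorithm

As we can see, the random forest does not fit the stock's trend well. In the figure above, the green and red lines are the actual stock trend, while the orange line, the smoothed prediction, even moves in the opposite direction to the real stock toward the end. Let's plot only the smoothed actual trend and the smoothed predicted trend.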

# Plot only the smoothed actual and predicted trends
# (column names are lowercase, matching the rename above)
predictions_df_average = predictions_df[['average_predicted_price', 'average_actual_price']]
predictions_plot = predictions_df_average.plot(title='Random Forest Predict Stock Price Aligned and Smoothed')
predictions_plot.set_xlabel("Dates")
predictions_plot.set_ylabel("Prices")
fig = predictions_plot.get_figure()
fig.savefig("RF_smoothed_and_actual_price.png")

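Predicted trend versus actual trend

Clearly, the random forest's predictions are not as good as hoped. Next we try the most widely used model, linear regression.

Predicting the stock price with linear regression

Linear regression models are efficient. In my "Machine Learning from Scratch" series, the article "Machine Learning from Scratch-10: Basic Usage of TensorFlow" uses linear regression as its example for explaining TensorFlow. We will not go through the details of the prediction process again here.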

def LR_prediction():
    years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    prediction_list = []
    for year in years:
        # Split into training and test sets (.loc replaces the removed .ix)
        train_start_date = str(year) + '-01-01'
        train_end_date = str(year) + '-10-31'
        test_start_date = str(year) + '-11-01'
        test_end_date = str(year) + '-12-31'
        train = df.loc[train_start_date:train_end_date]
        test = df.loc[test_start_date:test_end_date]

        # Build the sentiment score features
        sentiment_score_list = []
        for date, row in train.iterrows():
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_train = np.asarray(sentiment_score_list)

        sentiment_score_list = []
        for date, row in test.iterrows():
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_test = np.asarray(sentiment_score_list)

        # The article calls this a linear regression model, but LogisticRegression
        # is a classifier: each distinct integer price is treated as a class label.
        lr = LogisticRegression()
        lr.fit(numpy_df_train, train['prices'])

        prediction = lr.predict(numpy_df_test)
        prediction_list.append(prediction)
        idx = pd.date_range(test_start_date, test_end_date)
        predictions_df_list = pd.DataFrame(data=prediction[0:], index=idx, columns=['prices'])

        difference_test_predicted_prices = offset_value(test_start_date, test, predictions_df_list)
        # Align the predictions with the actual prices
        predictions_df_list['prices'] = predictions_df_list['prices'] + difference_test_predicted_prices

        # Smooth (the index is already daily, so no freq argument is needed)
        predictions_df_list['ewm'] = predictions_df_list["prices"].ewm(span=10).mean()
        predictions_df_list['actual_value'] = test['prices']
        predictions_df_list['actual_value_ewma'] = predictions_df_list["actual_value"].ewm(span=10).mean()
        # Rename the Series
        predictions_df_list.columns = ['predicted_price', 'average_predicted_price', 'actual_price',
                                       'average_actual_price']
        predictions_df_list.plot()
        predictions_df_list_average = predictions_df_list[['average_predicted_price', 'average_actual_price']]
        predictions_df_list_average.plot()

        # Plot only the smoothed actual and predicted trends
        predictions_plot = predictions_df_list_average.plot(title='Linear Regression Predict Stock Price Aligned and Smoothed')
        predictions_plot.set_xlabel("Dates")
        predictions_plot.set_ylabel("Prices")
        fig = predictions_plot.get_figure()
        fig.savefig("LR_smoothed_and_actual_price.png")

        plt.show()
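
The function above calls `LogisticRegression`, which is a classifier, even though the text describes linear regression. For comparison, here is a sketch of an ordinary least-squares fit done directly with NumPy. This is a swapped-in technique, not the article's code: the data is synthetic, and the neu feature is left out on the assumption that, since neg + neu + pos is about 1, it carries no information beyond the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features; columns stand for compound, neg, pos (made-up data).
X = rng.uniform(0.0, 1.0, size=(50, 3))
true_coef = np.array([50.0, -30.0, 20.0])
y = 100.0 + X @ true_coef  # noise-free "prices" generated from a linear rule

# Append a column of ones so the fit includes an intercept.
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

predicted = X1 @ coef
print(np.round(coef, 3))
```

Unlike a classifier, a regression of this kind can output prices that never appeared in the training data.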

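Linear regression model prediction results

Looking at all of the output plots (the long period is predicted and plotted segment by segment), the linear regression predictions are even somewhat better than the random forest's, but they still do not offer much reference value.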
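
The segment-by-segment scheme trains on January through October of each year and tests on November and December of the same year. A small standalone sketch of those windows (the helper name `yearly_windows` is an assumption; the dates follow the function above):

```python
from datetime import date

def yearly_windows(years):
    """Return (train_start, train_end, test_start, test_end) ISO date strings per year."""
    windows = []
    for year in years:
        windows.append((
            date(year, 1, 1).isoformat(),
            date(year, 10, 31).isoformat(),
            date(year, 11, 1).isoformat(),
            date(year, 12, 31).isoformat(),
        ))
    return windows

windows = yearly_windows(range(2007, 2017))

# Each test window covers November and December: 30 + 31 days.
test_days = (date(2007, 12, 31) - date(2007, 11, 1)).days + 1
print(windows[0], test_days)
```

Each model therefore sees only about ten months of data and is evaluated on a 61-day window, which is a much shorter horizon than the two-year test set used for the random forest.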

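Predicting the stock price with a neural network

Neural networks are covered in my "Machine Learning from Scratch" series. Below is the code that predicts the stock price with Scikit Learn's MLP (multilayer perceptron):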

def MLP_prediction():
    years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    prediction_list = []
    for year in years:
        # Split into training and test sets (.loc replaces the removed .ix)
        train_start_date = str(year) + '-01-01'
        train_end_date = str(year) + '-10-31'
        test_start_date = str(year) + '-11-01'
        test_end_date = str(year) + '-12-31'
        train = df.loc[train_start_date:train_end_date]
        test = df.loc[test_start_date:test_end_date]

        # Build the sentiment score features
        sentiment_score_list = []
        for date, row in train.iterrows():
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_train = np.asarray(sentiment_score_list)

        sentiment_score_list = []
        for date, row in test.iterrows():
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_test = np.asarray(sentiment_score_list)

        # Create the MLP model. MLPClassifier is a classifier, so each distinct
        # integer price is treated as a class label.
        mlpc = MLPClassifier(hidden_layer_sizes=(100, 200, 100), activation='relu',
                             solver='lbfgs', alpha=0.005, learning_rate_init=0.001, shuffle=False)  # span = 20 # best 1
        mlpc.fit(numpy_df_train, train['prices'])
        prediction = mlpc.predict(numpy_df_test)

        prediction_list.append(prediction)
        idx = pd.date_range(test_start_date, test_end_date)
        predictions_df_list = pd.DataFrame(data=prediction[0:], index=idx, columns=['prices'])

        difference_test_predicted_prices = offset_value(test_start_date, test, predictions_df_list)
        # Align the predictions with the actual prices
        predictions_df_list['prices'] = predictions_df_list['prices'] + difference_test_predicted_prices

        # Smooth (the index is already daily, so no freq argument is needed)
        predictions_df_list['ewma'] = predictions_df_list["prices"].ewm(span=20).mean()
        predictions_df_list['actual_value'] = test['prices']
        predictions_df_list['actual_value_ewma'] = predictions_df_list["actual_value"].ewm(span=20).mean()

        predictions_df_list.columns = ['predicted_price', 'average_predicted_price', 'actual_price',
                                       'average_actual_price']
        predictions_df_list.plot()
        predictions_df_list_average = predictions_df_list[['average_predicted_price', 'average_actual_price']]
        predictions_df_list_average.plot()

        plt.show()

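Prediction results of the model trained with a neural network

Comparing the prediction results, we find that the neural network performs best, which we attribute to the strong representational power of neural networks. For background on neural networks, see the articles "Machine Learning from Scratch-16: A First Look at Neural Networks (Neural Network)" and "Machine Learning from Scratch-17: The Training Process of Neural Networks".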

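Conclusion

AI applications have become increasingly widespread in recent years. Thanks to the explosion of computing power and data, machine learning has also developed enormously. Using stock prediction as a starting point, this article showed what machine learning can do in Data Science. In everyday life, by choosing suitable algorithms we can build AI applications that make life more convenient, such as Weibo sentiment analysis, chatbots, image recognition, speech recognition, and weather forecasting.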