Analyzing Time Series Data with Python

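Predicting the next item in a given input sequence is another important concept in machine learning. This chapter covers the analysis of time series data in detail.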

Introduction

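Time series data is data recorded over a series of particular time intervals. To build sequence prediction in machine learning, we have to deal with sequential data and time. Time series data is one form of sequential data. The ordering of the data is an important feature of sequential data.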

Basic Concept of Sequence Analysis or Time Series Analysis

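Sequence analysis, or time series analysis, means predicting the next item in a given input sequence based on the previously observed items. The prediction can be anything that may come next: a symbol, a number, the next day's weather, the next term in speech, and so on. Sequence analysis is useful in applications such as stock market analysis, weather forecasting, and product recommendations.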

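Example

Consider the following example to understand sequence prediction. Here A, B, C, D are the given values and you have to predict the value E using a sequence prediction model.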
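As a minimal sketch of the idea, and not the model this chapter goes on to use, the following block counts which symbol follows which in a made-up training sequence and predicts the most frequent successor. The function names `fit_transitions` and `predict_next` and the training data are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(sequence):
    """Count how often each symbol is followed by each other symbol."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, last_symbol):
    """Return the most frequent successor of last_symbol, or None if unseen."""
    followers = counts.get(last_symbol)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training data: the run A..E observed twice.
history = list("ABCDEABCDE")
counts = fit_transitions(history)
prediction = predict_next(counts, "D")
print(prediction)
```

A counting model like this only looks one step back; the hidden Markov models discussed later in the chapter add unobserved states on top of the same idea.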

Installing Useful Packages

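For time series data analysis with Python, we need to install the following packages:

Pandas

Pandas is an open source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for Python. You can install Pandas with the following command:

pip install pandas

If you are using Anaconda and want to install by using the conda package manager, you can use the following command: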

conda install -c anaconda pandas

hmmlearn

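It is an open source, BSD-licensed library that consists of simple algorithms and models for learning Hidden Markov Models (HMM) in Python. You can install it with the following command:

pip install hmmlearn

If you are using Anaconda and want to install by using the conda package manager, you can use the following command: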

conda install -c omnia hmmlearn

PyStruct

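It is a structured learning and prediction library. The learning algorithms implemented in PyStruct go by names such as conditional random fields (CRF), Maximum-Margin Markov Random Networks (M3N), and structural support vector machines. You can install it with the following command: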

pip install pystruct

CVXOPT

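It is used for convex optimization based on the Python programming language. It is also a free software package. You can install it with the following command:

pip install cvxopt

If you are using Anaconda and want to install by using the conda package manager, you can use the following command:

conda install -c anaconda cvxopt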
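After installing, it can help to confirm which of these packages Python can actually locate. The following is a small sketch using only the standard library; the helper name `installed_packages` is hypothetical, and the list assumes the usual import names of the four packages.

```python
import importlib.util

# Import names of the packages listed in this section (assumed to match
# the names used with pip above).
REQUIRED = ["pandas", "hmmlearn", "pystruct", "cvxopt"]

def installed_packages(names):
    """Map each import name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = installed_packages(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```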

Pandas: Handling, Slicing and Extracting Statistic from Time Series Data

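Pandas is a very useful tool if you have to work with time series data. With the help of Pandas, you can perform the following:

Create a range of dates by using pd.date_range
Index data with dates by using pd.Series
Perform re-sampling by using ts.resample
Change the frequency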
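The operations listed above can be sketched on a small made-up series. The data below is invented for illustration, and the "MS" (month start) frequency alias is chosen here only because it is accepted across pandas versions.

```python
import numpy as np
import pandas as pd

# 60 consecutive days starting 1 January 2020 (31 days of January + 29 of February).
dates = pd.date_range("2020-01-01", periods=60, freq="D")

# Index the values 0..59 by those dates.
daily = pd.Series(np.arange(60, dtype=float), index=dates)

# Re-sample from daily to monthly frequency, labelling each bin by its month start.
monthly = daily.resample("MS").mean()
print(monthly)
```

Re-sampling here changes the frequency from daily to monthly by averaging all the days that fall in the same month.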

Example

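The following example shows how to handle and slice time series data by using Pandas. Note that here we are using the monthly Arctic Oscillation data, which can be downloaded from Monthly.ao.index.b50.current.ascii and converted to text format for our use.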

Handling time series data

To handle time series data, you will have to perform the following steps:

The first step is to import the following packages:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

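Next, define a function that reads the data from the input file, as shown in the code below:

def read_data(input_file, index = 2):
   # index selects the column that holds the values; 2 is an assumption
   # for a file laid out as year, month, value
   input_data = np.loadtxt(input_file, delimiter = None)

Now, convert this data to a time series. To do this, create the range of dates for the time series. In this example, we keep one month as the frequency of the data. Our file holds data starting from January 1950.

   # on pandas 2.2 and later, 'ME' is the preferred alias for month-end frequency
   dates = pd.date_range('1950-01', periods = input_data.shape[0], freq = 'M')

In this step, we create the time series data with the help of a Pandas Series, as shown below:

   output = pd.Series(input_data[:, index], index = dates)
   return output

if __name__=='__main__':

Enter the path of the input file, as shown here:

   input_file = "/Users/admin/AO.txt"

Now, convert the column to time series format, as shown here:

   timeseries = read_data(input_file)

Finally, plot and visualize the data, using the commands shown:

   plt.figure()
   timeseries.plot()
   plt.show()

Running this code displays a line plot of the series.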
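To try the loading step without downloading anything, here is a self-contained sketch that reads a few made-up rows from an in-memory text buffer. The "year month value" layout, the helper name `read_series`, and the sample numbers are assumptions for illustration and are not taken from the real data file.

```python
import io
import numpy as np
import pandas as pd

def read_series(source, column=2, start="1950-01"):
    """Load whitespace-delimited rows and index one column by month."""
    data = np.loadtxt(source)
    # "MS" labels each observation with the first day of its month.
    dates = pd.date_range(start, periods=data.shape[0], freq="MS")
    return pd.Series(data[:, column], index=dates)

# Made-up rows in an assumed "year month value" layout.
sample_text = io.StringIO(
    "1950 1 -0.06\n"
    "1950 2 0.63\n"
    "1950 3 -0.01\n"
)
series = read_series(sample_text)
print(series)
```

Passing a file path instead of the buffer works the same way, because np.loadtxt accepts both.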

Slicing time series data

Slicing retrieves only a part of the time series data. In this example, we slice the data from 1980 to 1990. The following code performs this task:

timeseries['1980':'1990'].plot()
plt.show()

When you run the slicing code, the resulting graph shows only the 1980 to 1990 portion of the series.
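The slice boundaries are worth checking on data whose dates are known. The sketch below uses a made-up monthly series; it suggests that slicing with year strings keeps every observation from the start of the first year through the end of the last one.

```python
import numpy as np
import pandas as pd

# Fifty years of made-up monthly values starting January 1950.
dates = pd.date_range("1950-01-01", periods=600, freq="MS")
series = pd.Series(np.arange(600, dtype=float), index=dates)

# Slicing with year strings keeps every observation from 1980 through 1990.
decade = series["1980":"1990"]
print(decade.index[0], decade.index[-1], len(decade))
```

Unlike ordinary Python slices, both endpoints are included, so the result spans eleven full years.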

Extracting Statistic from Time Series Data

You will have to extract some statistics from the data when you need to draw conclusions from it. Mean, variance, correlation, maximum value, and minimum value are some such statistics. The following code extracts them from a given time series:

Mean

You can use the mean() function to find the mean, as shown here:

timeseries.mean()

The example output is:

-0.11143128165238671

Maximum

You can use the max() function to find the maximum, as shown here:

timeseries.max()

The example output is:

3.4952999999999999

Minimum

You can use the min() function to find the minimum, as shown here:

timeseries.min()

The example output is:

-4.2656999999999998

Getting everything at once

If you want to calculate all the statistics at once, you can use the describe() function, as shown here:

timeseries.describe()

The example output is:

count   817.000000
mean     -0.111431
std       1.003151
min      -4.265700
25%      -0.649430
50%      -0.042744
75%       0.475720
max       3.495300
dtype: float64
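Variance and correlation are mentioned among the statistics but not demonstrated. A minimal sketch on made-up values, using the var() and corr() methods of a Pandas Series:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])
doubled = values * 2

# Sample variance (divides by n - 1).
variance = values.var()
# Pearson correlation with another Series of the same length.
correlation = values.corr(doubled)
print(variance, correlation)
```

Because the second series is an exact multiple of the first, the two are perfectly correlated.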

Re-sampling

You can re-sample the data to a different time frequency. The two parameters for performing re-sampling are:

Time period
Method

Re-sampling with mean()

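You can use the following code to re-sample the data with the mean() method, which was the default aggregation in older pandas versions: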

timeseries_mm = timeseries.resample("A").mean()
timeseries_mm.plot(style = 'g--')
plt.show()

You can then observe a graph of the output of re-sampling with mean().

Re-sampling with median()

You can use the following code to re-sample the data with the median() method:

timeseries_mm = timeseries.resample("A").median()
timeseries_mm.plot()
plt.show()

You can then observe a graph of the output of re-sampling with median().
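The difference between the two aggregation methods shows up when a bin contains an outlier. The following sketch uses made-up daily data and monthly bins (the "MS" alias is chosen here because it is accepted across pandas versions); it is an illustration and does not use the Arctic Oscillation data.

```python
import numpy as np
import pandas as pd

# Daily made-up data for January and February 2021 (31 + 28 days).
dates = pd.date_range("2021-01-01", periods=59, freq="D")
values = np.concatenate([np.full(31, 1.0), np.full(28, 2.0)])
values[10] = 311.0  # a single outlier in January
daily = pd.Series(values, index=dates)

monthly_mean = daily.resample("MS").mean()
monthly_median = daily.resample("MS").median()
print(monthly_mean.tolist(), monthly_median.tolist())
```

The January mean is pulled far above the typical value by the single outlier, while the January median is not, which is one reason to prefer the median for noisy data.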

Rolling Mean

You can use the following code to calculate the rolling (moving) mean:

timeseries.rolling(window = 12, center = False).mean().plot(style = '-g')
plt.show()

You can then observe a graph of the output of the rolling (moving) mean.
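To see what the window does, here is a sketch on five made-up values with a window of 3 instead of 12. Each output value averages the current observation and the two before it, so the first two positions have no complete window.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# With center=False the window ends at the current observation.
rolling_mean = series.rolling(window=3, center=False).mean()
print(rolling_mean.tolist())
```

With window = 12 on monthly data, the same logic leaves the first eleven months empty and smooths each later point over the preceding year.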

Analyzing Sequential Data by Hidden Markov Model (HMM)

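HMM is a statistical model that is widely used for data that exhibits continuity and extensibility, as in time series stock market analysis, health checkups, and speech recognition. This section deals in detail with analyzing sequential data using the Hidden Markov Model (HMM).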

Hidden Markov Model (HMM)

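HMM is a stochastic model built on the concept of the Markov chain, which assumes that the probability of a future state depends only on the current state of the process and not on any state that preceded it. For example, when tossing a coin, we cannot say from the earlier tosses that the fifth toss will come up heads. This is because a coin has no memory and the next result does not depend on the previous result.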
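The Markov assumption can be sketched with a small transition matrix. The two-state weather chain below is made up for illustration; the point is that each update uses only the current distribution over states.

```python
import numpy as np

# Hypothetical two-state weather chain: state 0 = sunny, state 1 = rainy.
# Row i holds the probabilities of moving from state i to each state.
transition = np.array([[0.9, 0.1],
                       [0.5, 0.5]])

# Start from a sunny day with certainty.
distribution = np.array([1.0, 0.0])

# Each step uses only the current distribution, never the earlier history.
for _ in range(2):
    distribution = distribution @ transition
print(distribution)
```

After two steps the chain is sunny with probability 0.86 and rainy with probability 0.14.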

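Mathematically, an HMM consists of the following variables:

States (S)

It is the set of hidden or latent states present in an HMM. It is denoted by S.

Output symbols (O)

It is the set of possible output symbols present in an HMM. It is denoted by O.

State Transition Probability Matrix (A)

It is the probability of making a transition from one state to each of the other states. It is denoted by A.

Observation Emission Probability Matrix (B)

It is the probability of emitting or observing a symbol at a particular state. It is denoted by B.

Prior Probability Matrix (Π)

It is the probability of starting at a particular state from the various states of the system. It is denoted by Π, written π in the definition that follows.

Hence, an HMM may be defined as λ = (S, O, A, B, π),

where,

S = {s₁,s₂,…,s_N} is a set of N possible states,
O = {o₁,o₂,…,o_M} is a set of M possible observation symbols,
A is an N×N state Transition Probability Matrix (TPM),
B is an N×M observation or Emission Probability Matrix (EPM),
π is an N dimensional initial state probability distribution vector.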
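To make the roles of A, B, and π concrete, the following sketch builds a made-up HMM with N = 2 states and M = 3 symbols and computes the probability of an observation sequence with the forward algorithm, a standard technique that this chapter does not otherwise cover. All numbers and the function name `sequence_likelihood` are hypothetical.

```python
import itertools
import numpy as np

# Made-up parameters for N = 2 hidden states and M = 3 output symbols.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # N x N state transition probabilities
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])   # N x M emission probabilities
pi = np.array([0.6, 0.4])         # initial state distribution

def sequence_likelihood(observations, A, B, pi):
    """Forward algorithm: P(observations | lambda), summed over hidden paths."""
    alpha = pi * B[:, observations[0]]
    for symbol in observations[1:]:
        alpha = (alpha @ A) * B[:, symbol]
    return float(alpha.sum())

likelihood = sequence_likelihood([0, 1, 2], A, B, pi)

# Sanity check: the likelihoods of all length-3 sequences sum to one.
total = sum(sequence_likelihood(list(seq), A, B, pi)
            for seq in itertools.product(range(3), repeat=3))
print(likelihood, total)
```

Each row of A and B sums to 1, as does π, which is what makes the likelihoods over all sequences of a fixed length add up to 1.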

Example: Analysis of Stock Market data

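In this example, we analyze stock market data step by step to get an idea of how the HMM works with sequential or time series data. Note that we are implementing this example in Python.

Import the necessary packages, as shown below: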

import datetime
import warnings

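Now, use the stock market data from the matplotlib.finance package, as shown here:

import numpy as np
from matplotlib import cm, pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
# matplotlib.finance was deprecated in matplotlib 2.0 and later removed,
# so these imports only work on older matplotlib versions
try:
   from matplotlib.finance import quotes_historical_yahoo_ochl
except ImportError:
   # older matplotlib versions expose the function under this name
   from matplotlib.finance import (
      quotes_historical_yahoo as quotes_historical_yahoo_ochl)

from hmmlearn.hmm import GaussianHMM

Load the data between a start date and an end date, that is, between the two specific dates shown here:

start_date = datetime.date(1995, 10, 10)
end_date = datetime.date(2015, 4, 25)
quotes = quotes_historical_yahoo_ochl('INTC', start_date, end_date)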

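In this step, we extract the closing quote for each day. For this, use the following command:

closing_quotes = np.array([quote[2] for quote in quotes])

Now, we extract the volume of shares traded each day. For this, use the following command:

# drop the first day, which has no previous close to compare with
volumes = np.array([quote[5] for quote in quotes])[1:]

Here, take the percentage difference of the closing stock prices, using the code shown below:

diff_percentages = 100.0 * np.diff(closing_quotes) / closing_quotes[:-1]
dates = np.array([quote[0] for quote in quotes], dtype = int)[1:]
training_data = np.column_stack([diff_percentages, volumes])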
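The percentage-difference step can be checked by hand on a few made-up prices. The arrays below are invented for illustration and stand in for the closing quotes and volumes.

```python
import numpy as np

# Made-up closing prices and volumes for three consecutive days.
closing = np.array([100.0, 110.0, 99.0])
volume = np.array([1000.0, 1200.0, 900.0])

# Change from one day to the next, relative to the earlier day's close.
pct_change = 100.0 * np.diff(closing) / closing[:-1]

# The first day has no previous close, so drop its volume to keep rows aligned.
features = np.column_stack([pct_change, volume[1:]])
print(features)
```

A rise from 100 to 110 is a 10 percent gain and a fall from 110 to 99 is a 10 percent loss, so the first column holds 10.0 and -10.0.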

In this step, create and train the Gaussian HMM. For this, use the following code:

hmm = GaussianHMM(n_components = 7, covariance_type = 'diag', n_iter = 1000)
with warnings.catch_warnings():
   warnings.simplefilter('ignore')
   hmm.fit(training_data)

Now, generate data using the HMM model, using the commands shown:

num_samples = 300
samples, _ = hmm.sample(num_samples)

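Finally, in this step, we plot and visualize the difference percentages and the volume of shares traded as output in the form of graphs.

Use the following code to plot and visualize the difference percentages: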

plt.figure()
plt.title('Difference percentages')
plt.plot(np.arange(num_samples), samples[:, 0], c = 'black')

Use the following code to plot and visualize the volume of shares traded:
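The text does not include the listing for this step. A minimal sketch, assuming it mirrors the difference-percentage plot and that column 1 of samples holds the volume (the order in which training_data was stacked); the stand-in samples array and the plot title are made up so that the block runs on its own.

```python
import numpy as np
from matplotlib import pyplot as plt

# Stand-in for the array returned by hmm.sample(num_samples):
# column 0 = percentage difference, column 1 = volume (made-up values).
num_samples = 300
rng = np.random.default_rng(0)
samples = np.column_stack([rng.normal(0.0, 1.0, num_samples),
                           rng.uniform(1e7, 5e7, num_samples)])

fig = plt.figure()
plt.title('Volume of shares')
(line,) = plt.plot(np.arange(num_samples), samples[:, 1], c='black')
# Volumes cannot be negative, so anchor the y-axis at zero.
plt.ylim(bottom=0)
```

Call plt.show() afterwards to display the figure. With the real model, replace the stand-in array with the samples returned by hmm.sample(num_samples).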
