Non-engineer memoblog

A memo blog by a non-engineer (kept as a personal reference and as an outlet for writing things up)

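Notes from before a proper introduction to Python data science (Part 2)

Continuing from last time. These are notes to myself, written while going through the course material.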
www.udemy.com

About the pandas DataFrame

# The usual boilerplate.
import numpy as np
import pandas as pd

# pd.Series and pd.DataFrame are used so often that they are imported directly, so the bare names Series and DataFrame can be used
from pandas import Series, DataFrame

# To create a new DataFrame, call DataFrame(hoge) (this comes up further down).
## NFL data is used as the sample.

import webbrowser
website = 'http://en.wikipedia.hogehoge'
webbrowser.open(website)

## A table can be read straight from the clipboard.

nfl_frame = pd.read_clipboard()

# nfl_frame holds a table, so the notebook displays it as a table
# The column names can be checked with .columns.

nfl_frame.columns

# To access a single column, put the column name in square brackets [ ].

# nfl_frame['First Season'] returns the following Series.
    0    1960
    1    1920
    2    1921
    3    1966
    4    1996
    5    1950
    Name: First Season, dtype: int64

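'''
Attribute access such as .Team (a dot followed by the column name) also works. However, First Season is two words separated by a space, so it has to use square brackets.
'''

# nfl_frame.Team returns the following Series.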
    0       Dallas Cowboys
    1        Chicago Bears
    2    Green Bay Packers
    3       Miami Dolphins
    4     Baltimore Ravens
    Name: Team, dtype: object
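
The two access styles above can be tried without the clipboard. A minimal sketch, assuming a small hand-built frame (the name `sample_frame` is made up here; the values are the first five rows shown above):

```python
import pandas as pd

# Hypothetical stand-in for the clipboard table: the first five rows shown above.
sample_frame = pd.DataFrame({
    'Team': ['Dallas Cowboys', 'Chicago Bears', 'Green Bay Packers',
             'Miami Dolphins', 'Baltimore Ravens'],
    'First Season': [1960, 1920, 1921, 1966, 1996],
})

# Bracket access works for any column name, including names with spaces.
first_season = sample_frame['First Season']

# Attribute access only works when the name is a valid Python identifier.
team = sample_frame.Team

# A single column comes back as a Series, not a DataFrame.
print(type(first_season).__name__)
print(team.equals(sample_frame['Team']))
```

Bracket access is the safer default, since it keeps working if a column name later gains a space or clashes with a DataFrame method name.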

## Extract several columns to make a DataFrame

nfl_frame[['Team', 'First Season']]

'''
This produces a two-column DataFrame (table) containing Team and First Season.
Selecting only some columns creates a new DataFrame, so the original DataFrame is not overwritten.
'''

DataFrame(nfl_frame,columns=['Team','First Season','Total Games'])

'''
This produces a DataFrame (table) with the three columns Team, First Season and Total Games.
'''

## A new column can be added the same way. Here Stadium is added.

DataFrame(nfl_frame,columns=['Team','First Season','Total Games','Stadium'])

# However, Stadium has no data, so every cell in that column is NaN (null).
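
A minimal sketch of the behaviour just described, assuming a made-up two-row frame in place of nfl_frame (`base` and `with_stadium` are names invented for this example):

```python
import pandas as pd

# Hypothetical two-row frame standing in for nfl_frame.
base = pd.DataFrame({'Team': ['Dallas Cowboys', 'Chicago Bears'],
                     'First Season': [1960, 1920]})

# Passing columns= keeps the existing columns that are listed
# and creates any missing one filled with NaN.
with_stadium = pd.DataFrame(base, columns=['Team', 'First Season', 'Stadium'])

print(with_stadium['Stadium'].isna().all())
print(list(base.columns))  # the original frame is unchanged
```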

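## The notebook output gets long, so use .head() to display only the first N rows.

# nfl_frame.head(3) displays the first three rows of the table.

# .tail() is the reverse: it displays the last N rows.

## Access a row by index number with .ix[]

'''
nfl_frame.ix[3] accesses the row with index number 3 (the 4th row, counting from 0). .ix was removed in pandas 1.0; nfl_frame.loc[3] (by label) or nfl_frame.iloc[3] (by position) does the same on current versions.

To insert data, nfl_frame['Stadium'] = "Levi’s Stadium" assigns the value to the entire column.

Because Levi’s Stadium contains an apostrophe, the whole string is wrapped in double quotes.

That sets every row of the column to Levi’s Stadium, though. nfl_frame["Stadium"] = np.arange(6) assigns the consecutive numbers 0 through 5 instead.

Take care that the length passed to np.arange() matches the length of the column.
'''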
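
The column assignments described above differ in how they treat length. A minimal sketch, assuming a made-up three-row frame (`frame` and its Team values are placeholders, not the NFL data):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the team names are placeholders.
frame = pd.DataFrame({'Team': ['T0', 'T1', 'T2']})

# A scalar is broadcast to every row of the new column.
frame['Stadium'] = "Levi's Stadium"

# An array must have exactly as many elements as the frame has rows.
frame['Number'] = np.arange(3)

# A length mismatch raises ValueError instead of padding or truncating.
try:
    frame['Bad'] = np.arange(5)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(frame['Stadium'].nunique())
print(mismatch_raised)
```

Using `np.arange(len(frame))` instead of a hard-coded number avoids the mismatch altogether.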

## Add a pandas Series to a DataFrame

# Create a Series holding Levi’s Stadium and AT&T Stadium, and give them the index labels 4 and 0.
stadiums = Series(["Levi’s Stadium","AT&T Stadium"],index=[4,0])

# stadiums holds the following.
    4    Levi’s Stadium
    0      AT&T Stadium
    dtype: object

# Assign stadiums to the DataFrame from earlier (nfl_frame):

nfl_frame['Stadium']=stadiums

# In the Stadium column of nfl_frame, row 0 now holds AT&T Stadium and row 4 holds Levi’s Stadium (every other row is NaN).
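
The point of the step above is that assignment matches on index labels, not on order. A minimal sketch, assuming a made-up six-row frame (`frame` and its Team values are placeholders; only the default 0 to 5 index matters):

```python
import pandas as pd

# Hypothetical six-row frame with the default integer index 0..5.
frame = pd.DataFrame({'Team': ['T0', 'T1', 'T2', 'T3', 'T4', 'T5']})

# The Series carries its own index labels (4 and 0), as in the note above.
stadiums = pd.Series(["Levi's Stadium", "AT&T Stadium"], index=[4, 0])

# Assignment aligns on index labels; rows without a match get NaN.
frame['Stadium'] = stadiums

print(frame.loc[0, 'Stadium'])
print(frame.loc[4, 'Stadium'])
print(frame['Stadium'].isna().sum())
```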

### Delete a column

'''
del nfl_frame['Stadium'] deletes the entire Stadium column.
A DataFrame can also be created from a Python dict
'''

# Store a Python dict like the following in the variable data.

data = {'City':['SF','LA','NYC'],
        'Population':[837000,3880000,8400000]}

# city_frame = DataFrame(data) creates a DataFrame and assigns it to city_frame.
# city_frame holds a DataFrame like the following.
      City  Population
    0   SF      837000
    1   LA     3880000
    2  NYC     8400000

About the pandas Index

# First, create a pandas Series as follows.
my_ser = Series([1,2,3,4],index=['A','B','C','D'])

# Extract only the index from that Series.

my_index = my_ser.index

# my_index holds just the index, as follows.

Index(['A', 'B', 'C', 'D'], dtype='object')

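# my_index[2] returns the 3rd index label (positions count from 0).
'C'

# Python slicing also works (everything from the 3rd label onward).
    my_index[2:]
    Index(['C', 'D'], dtype='object')

# However, pandas does not allow individual index values to be changed (to keep the data robust).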

my_index[0] = 'Z'

TypeError: Indexes does not support mutable operations

# Trying to assign 'Z' as above raises an error.
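
A minimal sketch of the immutability described above, together with one way to relabel anyway: replace the whole index on a copy. The names `assignment_failed` and `relabeled` are invented for this example.

```python
import pandas as pd

my_ser = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
my_index = my_ser.index

# Item assignment on an Index raises TypeError.
try:
    my_index[0] = 'Z'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To relabel, assign a complete new index instead of editing one element.
relabeled = my_ser.copy()
relabeled.index = ['Z', 'B', 'C', 'D']

print(assignment_failed)
print(list(relabeled.index))
print(list(my_ser.index))  # the original Series keeps its labels
```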

## Changing a pandas index

# Create a simple Series and assign it to ser1.

ser1 = Series([1,2,3,4],index=['A','B','C','D'])

# ser1 looks like this.
    A    1
    B    2
    C    3
    D    4
    dtype: int64

# Use pandas .reindex() to change the index.

ser2 = ser1.reindex(['A','B','C','D','E','F'])

# ser2 looks like this.
    A     1
    B     2
    C     3
    D     4
    E   NaN  # filled with NaN automatically
    F   NaN  # filled with NaN automatically
    dtype: float64

# The following attaches the index labels 0, 5 and 10 to the values.

ser3 = Series(['USA','Mexico','Canada'],index=[0,5,10])

# ser3 looks like this.
    0        USA
    5     Mexico
    10    Canada
    dtype: object

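# ffill (forward fill) fills in the NaN (null) entries in one go.

# ser3.reindex(range(15),method='ffill') gives the following.
    0        USA
    1        USA  # USA filled in automatically
    2        USA  # USA filled in automatically
    3        USA  # USA filled in automatically
    4        USA  # USA filled in automatically
    5     Mexico
    6     Mexico  # Mexico filled in automatically
    7     Mexico  # Mexico filled in automatically
    8     Mexico  # Mexico filled in automatically
    9     Mexico  # Mexico filled in automatically
    10    Canada
    11    Canada  # Canada filled in automatically
    12    Canada  # Canada filled in automatically
    13    Canada  # Canada filled in automatically
    14    Canada  # Canada filled in automatically
    dtype: object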
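
To see what method='ffill' changes, the same reindex can be run with and without it. A minimal sketch (the names `plain` and `filled` are invented for this example):

```python
import pandas as pd

ser3 = pd.Series(['USA', 'Mexico', 'Canada'], index=[0, 5, 10])

# Without method=, labels that did not exist before become NaN.
plain = ser3.reindex(range(15))

# With method='ffill', each new label takes the value of the
# closest existing label before it.
filled = ser3.reindex(range(15), method='ffill')

print(plain.isna().sum())
print(filled[4], filled[5], filled[14])
```

Forward fill relies on the original index being sorted, which is the case for 0, 5, 10 here.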


## Build a DataFrame using reshape.

# randn draws normally distributed random numbers; use it to build a 5x5 DataFrame. Assign index labels, deliberately leaving out C, and assign column names.
from numpy.random import randn
dframe = DataFrame(randn(25).reshape((5,5)),
                    index=['A','B','D','E','F'],
                    columns=['col1','col2','col3','col4','col5'])

# dframe looks like this.
           col1      col2      col3      col4      col5
    A  0.650892  1.571965  1.726236 -0.423697  0.110843
    B  1.037495  0.037039 -1.368314 -1.041347  1.337007
    D -1.289793  0.652301  0.548770  0.636829 -0.349049
    E -0.573696  0.108592  0.863125 -0.109940 -0.577262
    F  1.160227 -0.546492  1.172700  0.071298  1.222122


## Because C was missing, its values (data) become NaN (null), as shown below.

new_index = ['A','B','C','D','E','F']
dframe2 = dframe.reindex(new_index)

# dframe2 looks like this.
           col1      col2      col3      col4      col5
    A  0.650892  1.571965  1.726236 -0.423697  0.110843
    B  1.037495  0.037039 -1.368314 -1.041347  1.337007
    C       NaN       NaN       NaN       NaN       NaN
    D -1.289793  0.652301  0.548770  0.636829 -0.349049
    E -0.573696  0.108592  0.863125 -0.109940 -0.577262
    F  1.160227 -0.546492  1.172700  0.071298  1.222122


## ix allowed a quick reindex of rows and columns together. .ix was removed in pandas 1.0, so reindex is used here to get the same result.
new_columns = ['col1','col2','col3','col4','col5','col6']
dframe.reindex(index=new_index, columns=new_columns)

           col1      col2      col3      col4      col5  col6
    A  0.650892  1.571965  1.726236 -0.423697  0.110843   NaN
    B  1.037495  0.037039 -1.368314 -1.041347  1.337007   NaN
    C       NaN       NaN       NaN       NaN       NaN   NaN
    D -1.289793  0.652301  0.548770  0.636829 -0.349049   NaN
    E -0.573696  0.108592  0.863125 -0.109940 -0.577262   NaN
    F  1.160227 -0.546492  1.172700  0.071298  1.222122   NaN
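
A minimal sketch of reindexing rows and columns in one call on current pandas. It assumes fixed values from np.arange instead of random ones so the result is reproducible, and the col6 column list is inferred from the output shown above. It also shows that .loc is not a drop-in substitute, because it rejects labels that do not exist.

```python
import numpy as np
import pandas as pd

# Fixed values instead of randn so the result is reproducible.
dframe = pd.DataFrame(np.arange(25).reshape((5, 5)),
                      index=['A', 'B', 'D', 'E', 'F'],
                      columns=['col1', 'col2', 'col3', 'col4', 'col5'])

new_index = ['A', 'B', 'C', 'D', 'E', 'F']
new_columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

# reindex accepts both axes at once; labels that are new become NaN.
both = dframe.reindex(index=new_index, columns=new_columns)

# .loc raises KeyError when a requested label is missing.
try:
    dframe.loc[new_index, new_columns]
    loc_raised = False
except KeyError:
    loc_raised = True

print(both.shape)
print(loc_raised)
```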

Deleting rows and columns from a Series or DataFrame

### The usual imports
import numpy as np
from pandas import Series,DataFrame
import pandas as pd

ser1 = Series(np.arange(3),index=['a','b','c'])

# ser1 looks like this.
    a    0
    b    1
    c    2
    dtype: int64

## Use .drop() to remove an entry from a Series by its index label

# ser1.drop('b') gives the following.
    a    0
    c    2
    dtype: int64


## For a DataFrame

# Use reshape to build a 3x3 DataFrame. Assign index labels (city abbreviations) and column names.
dframe1 = DataFrame(np.arange(9).reshape((3,3)),
                    index=['SF','LA','NY'],
                    columns=['pop','size','year'])

# dframe1 looks like this.

    	pop	size	year
    SF	0	1	2
    LA	3	4	5
    NY	6	7	8

## Delete a row by giving its index label (axis=0, the row axis, is the default and is omitted here)

# dframe1.drop('LA') gives the following.

    	pop	size	year
    SF	0	1	2
    NY	6	7	8

# However, the original DataFrame is not modified. A new DataFrame without the LA row is created and returned.

## A column can be deleted as well (this needs axis=1, which denotes the column axis)

dframe1.drop('year',axis=1)

    	pop	size
    SF	0	1
    LA	3	4
    NY	6	7
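
A minimal sketch confirming that drop leaves the original untouched, for both rows and columns (`without_la` and `without_year` are names invented for this example):

```python
import numpy as np
import pandas as pd

dframe1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                       index=['SF', 'LA', 'NY'],
                       columns=['pop', 'size', 'year'])

without_la = dframe1.drop('LA')              # axis=0 (rows) is the default
without_year = dframe1.drop('year', axis=1)  # axis=1 means columns

# drop returns a new object; the original keeps all rows and columns.
print(dframe1.shape)
print(without_la.shape)
print(without_year.shape)
```

To keep the result, assign it back, for example `dframe1 = dframe1.drop('LA')`.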

Extracting data from a Series or DataFrame

import numpy as np
from pandas import Series,DataFrame
import pandas as pd

ser1 = Series(np.arange(3),index=['A','B','C'])

# Double each element so the values are easier to tell apart.
ser1 = 2*ser1

# ser1 looks like this.
    A    0
    B    2
    C    4
    dtype: int64

ser1['B']  # returns 2
ser1.iloc[1]  # returns 2 (access by position)

ser1[0:3]    # A range of positions can also be given.
    A    0
    B    2
    C    4
    dtype: int64

ser1[['A','B','C']]  # A list of string index labels (A, B, C) also works.
    A    0
    B    2
    C    4
    dtype: int64


# A boolean condition can also be given.
ser1[ser1>3]
    C    4
    dtype: int64

# Values at the positions that match a condition can be changed.
ser1[ser1>3] = 10

# ser1 now looks like this.
    A     0
    B     2
    C    10  # overwritten with 10
    dtype: int64
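
The selection and the assignment above both rest on the same boolean mask. A minimal sketch that makes the mask explicit (`mask` and `selected` are names invented for this example):

```python
import numpy as np
import pandas as pd

ser1 = 2 * pd.Series(np.arange(3), index=['A', 'B', 'C'])

# The comparison produces a boolean Series with the same index.
mask = ser1 > 3

# Indexing with the mask keeps only the entries where it is True.
selected = ser1[mask]

# Assigning through the mask overwrites only those entries.
ser1[mask] = 10

print(list(mask))
print(list(selected.index))
print(ser1['C'])
```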

## For a DataFrame

# Use reshape to build a 5x5 DataFrame. Assign index labels (city abbreviations) and column names.
dframe = DataFrame(np.arange(25).reshape((5,5)),
                    index=['NYC','LA','SF','DC','Chi'],
                    columns=['A','B','C','D','E'])

# dframe looks like this.
    	A	B	C	D	E
    NYC	0	1	2	3	4
    LA	5	6	7	8	9
    SF	10	11	12	13	14
    DC	15	16	17	18	19
    Chi	20	21	22	23	24


# Select by column name (here, column B)
dframe['B']
    NYC     1
    LA      6
    SF     11
    DC     16
    Chi    21
    Name: B, dtype: int64

# A list selects several columns at once (here, columns B and E)
dframe[['B','E']]
    	B	E
    NYC	1	4
    LA	6	9
    SF	11	14
    DC	16	19
    Chi	21	24


# A boolean condition can also be given (show the rows where column C is greater than 8)
dframe[dframe['C']>8]

    	A	B	C	D	E
    SF	10	11	12	13	14
    DC	15	16	17	18	19
    Chi	20	21	22	23	24


# The boolean values themselves can also be displayed.
dframe> 10

    	A	B	C	D	E
    NYC	False	False	False	False	False
    LA	False	False	False	False	False
    SF	False	True	True	True	True
    DC	True	True	True	True	True
    Chi	True	True	True	True	True


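# ix selects a row and displays its data vertically (it is a row, not a column). .ix was removed in pandas 1.0; .loc does the same for labels.
dframe.loc['LA']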
    A    5
    B    6
    C    7
    D    8
    E    9
    Name: LA, dtype: int64


# ix also accepted a positional number; on current pandas use .iloc (here position 1, the 2nd row, which is the same as LA).
dframe.iloc[1]
    A    5
    B    6
    C    7
    D    8
    E    9
    Name: LA, dtype: int64
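
A minimal sketch tying the row selections together on current pandas, where .loc takes labels and .iloc takes positions (`by_label`, `by_position` and `rows` are names invented for this example):

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(25).reshape((5, 5)),
                      index=['NYC', 'LA', 'SF', 'DC', 'Chi'],
                      columns=['A', 'B', 'C', 'D', 'E'])

by_label = dframe.loc['LA']    # row selected by its index label
by_position = dframe.iloc[1]   # row selected by its position (0-based)

# A boolean condition on one column filters whole rows.
rows = dframe[dframe['C'] > 8]

print(by_label.equals(by_position))
print(list(rows.index))
```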