20th Session
Attention
See other people's code uploaded to Kaggle
Please report problem errors and code errors in the comments
Attention
Problem 1
Weather temperature prediction. Dependent variable: actual (maximum temperature)
Data source: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
Data path: /kaggle/input/adp-kr-p2/problem1.csv
temp_1: maximum temperature of the previous day
temp_2: maximum temperature two days earlier
friend: a friend's predicted temperature
Problem 1-1
Data inspection and preprocessing
Perform EDA on the data
Check for missing values and discuss how to handle them
Explain the data splitting method
Argue that the final dataset is appropriate
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # required by plt.show() below
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem1.csv')
print(df.head()) # first 5 rows
print(df.shape) # data shape
display(sns.pairplot(df)) # pairwise relationships between variables
print(df.info()) # data type of each column
print(df.describe()) # basic statistics
print(df.isnull().sum()) # check for missing values
# build a date column and look for calendar days that are absent from the data
df['date'] =df['year'].astype('str')+'-'+df['month'].astype('str')+'-'+df['day'].astype('str')
df['date'] = pd.to_datetime(df['date'])
v = pd.DataFrame(pd.date_range(start=df['date'].dt.strftime('%Y-%m-%d').min(), end=df['date'].dt.strftime('%Y-%m-%d').max()))[0].dt.strftime('%Y-%m-%d').values
a=set(v) - set(df['date'].dt.strftime('%Y-%m-%d'))
print(a)
len(a)
# correlate numeric columns only (week and date are not numeric)
display(df.select_dtypes('number').corr())
# data split
dfd = pd.get_dummies(df)
df_drop = dfd.drop(columns=['year','month','day','friend','date'])
X = df_drop.drop(columns=['actual'])
y = df_drop['actual']
from sklearn.model_selection import train_test_split
X_train,X_test , y_train,y_test = train_test_split(X,y,random_state=2,test_size=0.2)
plt.show()
print('''
Answer
There are no missing numeric values in the data. Viewed as a time series, however, 18 calendar days are missing.
Because the problem is not approached as a time series, the missing dates are not imputed.
If the data were treated as a time series, the missing days could be filled by mean interpolation.
The visualization shows columns that are correlated with one another and data with a periodic pattern.
The year, month, and day columns (and the raw week column) are dropped as unnecessary; week is one-hot encoded and added instead.
The data is split into a train set and a test set at an 8:2 ratio for modeling.
The friend column is excluded from training because its correlation is relatively low.
''')
year month day week temp_2 temp_1 average actual forecast_noaa \
0 2016 1 1 Fri 45 45 45.6 45 43
1 2016 1 2 Sat 44 45 45.7 44 41
2 2016 1 3 Sun 45 44 45.8 41 43
3 2016 1 4 Mon 44 41 45.9 40 44
4 2016 1 5 Tues 41 40 46.0 44 46
forecast_acc forecast_under friend
0 50 44 29
1 50 44 61
2 46 47 56
3 48 46 53
4 46 46 41
(348, 12)
<seaborn.axisgrid.PairGrid at 0x7f93869a7820>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 348 non-null int64
1 month 348 non-null int64
2 day 348 non-null int64
3 week 348 non-null object
4 temp_2 348 non-null int64
5 temp_1 348 non-null int64
6 average 348 non-null float64
7 actual 348 non-null int64
8 forecast_noaa 348 non-null int64
9 forecast_acc 348 non-null int64
10 forecast_under 348 non-null int64
11 friend 348 non-null int64
dtypes: float64(1), int64(10), object(1)
memory usage: 32.8+ KB
None
year month day temp_2 temp_1 average \
count 348.0 348.000000 348.000000 348.000000 348.000000 348.000000
mean 2016.0 6.477011 15.514368 62.652299 62.701149 59.760632
std 0.0 3.498380 8.772982 12.165398 12.120542 10.527306
min 2016.0 1.000000 1.000000 35.000000 35.000000 45.100000
25% 2016.0 3.000000 8.000000 54.000000 54.000000 49.975000
50% 2016.0 6.000000 15.000000 62.500000 62.500000 58.200000
75% 2016.0 10.000000 23.000000 71.000000 71.000000 69.025000
max 2016.0 12.000000 31.000000 117.000000 117.000000 77.400000
actual forecast_noaa forecast_acc forecast_under friend
count 348.000000 348.000000 348.000000 348.000000 348.000000
mean 62.543103 57.238506 62.373563 59.772989 60.034483
std 11.794146 10.605746 10.549381 10.705256 15.626179
min 35.000000 41.000000 46.000000 44.000000 28.000000
25% 54.000000 48.000000 53.000000 50.000000 47.750000
50% 62.500000 56.000000 61.000000 58.000000 60.000000
75% 71.000000 66.000000 72.000000 69.000000 71.000000
max 92.000000 77.000000 82.000000 79.000000 95.000000
year 0
month 0
day 0
week 0
temp_2 0
temp_1 0
average 0
actual 0
forecast_noaa 0
forecast_acc 0
forecast_under 0
friend 0
dtype: int64
{'2016-08-25', '2016-02-14', '2016-08-26', '2016-02-13', '2016-08-29', '2016-08-21', '2016-08-18', '2016-08-27', '2016-09-02', '2016-08-24', '2016-10-30', '2016-08-17', '2016-02-29', '2016-09-01', '2016-08-19', '2016-08-31', '2016-08-22', '2016-08-20'}
| | year | month | day | temp_2 | temp_1 | average | actual | forecast_noaa | forecast_acc | forecast_under | friend |
---|---|---|---|---|---|---|---|---|---|---|---|
year | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
month | NaN | 1.000000 | -0.000412 | 0.047651 | 0.032664 | 0.120806 | 0.004529 | 0.131141 | 0.127436 | 0.119786 | 0.048145 |
day | NaN | -0.000412 | 1.000000 | -0.046194 | -0.000691 | -0.021136 | -0.021675 | -0.021393 | -0.030605 | -0.013727 | 0.024592 |
temp_2 | NaN | 0.047651 | -0.046194 | 1.000000 | 0.857800 | 0.821560 | 0.805835 | 0.813134 | 0.817374 | 0.819576 | 0.583758 |
temp_1 | NaN | 0.032664 | -0.000691 | 0.857800 | 1.000000 | 0.819328 | 0.877880 | 0.810672 | 0.815162 | 0.815943 | 0.541282 |
average | NaN | 0.120806 | -0.021136 | 0.821560 | 0.819328 | 1.000000 | 0.848365 | 0.990340 | 0.990705 | 0.994373 | 0.689278 |
actual | NaN | 0.004529 | -0.021675 | 0.805835 | 0.877880 | 0.848365 | 1.000000 | 0.838639 | 0.842135 | 0.838946 | 0.569145 |
forecast_noaa | NaN | 0.131141 | -0.021393 | 0.813134 | 0.810672 | 0.990340 | 0.838639 | 1.000000 | 0.979863 | 0.985670 | 0.669221 |
forecast_acc | NaN | 0.127436 | -0.030605 | 0.817374 | 0.815162 | 0.990705 | 0.842135 | 0.979863 | 1.000000 | 0.983910 | 0.696054 |
forecast_under | NaN | 0.119786 | -0.013727 | 0.819576 | 0.815943 | 0.994373 | 0.838946 | 0.985670 | 0.983910 | 1.000000 | 0.691177 |
friend | NaN | 0.048145 | 0.024592 | 0.583758 | 0.541282 | 0.689278 | 0.569145 | 0.669221 | 0.696054 | 0.691177 | 1.000000 |
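The answer for 1-1 mentions that, under a time-series reading, the missing calendar days could be filled by interpolation. A minimal sketch of that idea follows, on a made-up three-row frame rather than problem1.csv. It uses linear interpolation, which is an assumption about what "mean interpolation" means here (for a one-day gap it equals the mean of the two neighbours); the names `toy`, `full_range`, `missing`, and `filled` are hypothetical.

```python
import pandas as pd

# Toy daily series with two missing days (values are made up for illustration).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-05"]),
    "actual": [45.0, 44.0, 41.0],
})

# Build the complete calendar and list the dates that are absent.
full_range = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")
missing = full_range.difference(pd.DatetimeIndex(toy["date"]))
print(list(missing.strftime("%Y-%m-%d")))

# Reindex onto the full calendar, then fill the gaps by linear interpolation.
filled = toy.set_index("date").reindex(full_range)
filled["actual"] = filled["actual"].interpolate(method="linear")
print(filled)
```

Reindexing first makes the gaps explicit as NaN rows, so any other fill strategy (forward fill, rolling mean) could be swapped in at the last step.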
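Problem 1-2
Random Forest model fitting and validation
Train a Random Forest and interpret the prediction results
Interpret the validation of the prediction results and identify the important variables
Analyze variable importance and plot it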
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import time
import matplotlib.pyplot as plt
result = []
rf = RandomForestRegressor(random_state=22)
start = time.time()
rf.fit(X_train,y_train)
end = time.time()
pred = rf.predict(X_test)
print('RandomForest r2_score : ',r2_score(y_test,pred))
print('learning time ',end-start)
importances = rf.feature_importances_
forest_importances = pd.Series(importances, index=X_train.columns)
fig, ax = plt.subplots()
forest_importances.plot.bar( ax=ax)
ax.set_title("Feature importances")
fig.tight_layout()
print('Variable importance ranks temp_1, average, forecast_acc, in that order')
result.append([end-start,r2_score(y_test,pred)])
RandomForest r2_score : 0.8399186619591019
learning time 0.1347370147705078
Variable importance ranks temp_1, average, forecast_acc, in that order
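Problem 1-3
SVM (Support Vector Machine) model fitting and validation
Train an SVM and interpret the prediction results
Interpret the validation of the prediction results and identify the important variables
Analyze variable importance and plot it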
from sklearn.svm import SVR
from sklearn.metrics import r2_score
import time
svm = SVR()
start = time.time()
svm.fit(X_train,y_train)
end = time.time()
pred = svm.predict(X_test)
print('svm r2_score : ',r2_score(y_test,pred))
print('learning time ',end-start)
print('SVM does not provide variable importances directly. Its r2_score is lower than that of RandomForest')
result.append([end-start,r2_score(y_test,pred)])
svm r2_score : 0.8138036782618503
learning time 0.008105278015136719
SVM does not provide variable importances directly. Its r2_score is lower than that of RandomForest
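Although SVR exposes no built-in importance attribute, a model-agnostic estimate is still possible. The sketch below, which is not part of the original solution, uses scikit-learn's `permutation_importance` on synthetic data where only the first feature matters. The `StandardScaler` step is an added assumption (the solution above fits SVR on unscaled features), and `X_demo`, `y_demo`, `model`, and `perm` are hypothetical names.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: only the first of three features drives the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Scaling is added here because SVR is sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVR())
model.fit(X_demo, y_demo)

# Shuffle one column at a time and measure how much the R^2 score drops.
perm = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)
for name, score in zip(["x0", "x1", "x2"], perm.importances_mean):
    print(name, round(float(score), 3))
```

On the real data the same call would typically be made with the held-out `X_test` and `y_test`, so that the importances reflect generalisation rather than fit.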
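Problem 1-4
Model comparison and directions for future improvement
Compare the Random Forest and SVM results and select the final model
Analyze the strengths and weaknesses of the two models. Which would you choose from an operational point of view?
Propose directions for improving the modeling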
result_df = pd.DataFrame(result,columns = ['learning time','r2_score'])
result_df.index = ['RandomForest','Svm']
display(result_df)
print('''
With default models and no hyperparameter tuning, the random forest takes longer to train than the SVM.
The random forest has the higher r2 score on the test set.
If training time is the priority, the SVM is preferable. However, the random forest exposes variable importances and is more accurate,
so the random forest is chosen as the final model.
''')
| | learning time | r2_score |
---|---|---|
RandomForest | 0.134737 | 0.839919 |
Svm | 0.008105 | 0.813804 |
With default models and no hyperparameter tuning, the random forest takes longer to train than the SVM.
The random forest has the higher r2 score on the test set.
If training time is the priority, the SVM is preferable. However, the random forest exposes variable importances and is more accurate,
so the random forest is chosen as the final model.
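The comparison above rests on a single 8:2 split with default models. One possible improvement direction, offered here only as a sketch and not as the original author's method, is to compare the two models with k-fold cross-validation so that the ranking depends less on one particular split. The data is synthetic (`make_regression`), and `X_demo`, `y_demo`, `models`, and `cv_scores` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic regression data standing in for the weather features.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five shuffled folds with a fixed seed so both models see the same splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": SVR(),
}

cv_scores = {}
for name, model in models.items():
    # One R^2 value per fold; the mean and spread summarise each model.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(name, round(float(scores.mean()), 3), round(float(scores.std()), 3))
```

Reporting the fold-to-fold spread alongside the mean helps judge whether a gap as small as the one in the table above is meaningful.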
Attention
Problem 2
Household power consumption data at 5-minute intervals
Data source: self-generated
Data path: https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem2.csv
import matplotlib.pyplot as plt
ttt= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem2.csv')
Problem 2-1
Data preprocessing
Compute the sum of each household's power consumption at 15-minute intervals, cluster the resulting data into 5 clusters, and print the result in the form shown below.
Explain the reasoning behind how the data was constructed for clustering
(The values in the Cluster column may differ depending on the clustering method)
tt = ttt.sort_values(['houseCode','date']).reset_index(drop=True)
tt['date'] = pd.to_datetime(tt['date'])
tg = tt.groupby(['houseCode']).resample('15min', on='date')['power consumption'].sum().reset_index()
tg = tg.rename(columns= {'power consumption':'power consumption sum'})
tgg = tg.copy()
tgg['c'] =tgg['houseCode'].str[-2:].astype('int')
tgg['d'] =tgg['date'].dt.hour
tgg['e'] =tgg['date'].dt.day
from sklearn.cluster import KMeans
# run k-means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(tgg.iloc[:,2:].values)
tg['Cluster'] =kmeans.labels_
tg
| | houseCode | date | power consumption sum | Cluster |
---|---|---|---|---|
0 | house_00 | 2050-01-01 00:00:00 | 136.249952 | 4 |
1 | house_00 | 2050-01-01 00:15:00 | 98.283387 | 4 |
2 | house_00 | 2050-01-01 00:30:00 | 53.967679 | 4 |
3 | house_00 | 2050-01-01 00:45:00 | 204.821270 | 1 |
4 | house_00 | 2050-01-01 01:00:00 | 150.760786 | 1 |
... | ... | ... | ... | ... |
133915 | house_44 | 2050-01-31 22:45:00 | 334.675717 | 0 |
133916 | house_44 | 2050-01-31 23:00:00 | 463.419892 | 3 |
133917 | house_44 | 2050-01-31 23:15:00 | 369.930740 | 0 |
133918 | house_44 | 2050-01-31 23:30:00 | 237.713030 | 2 |
133919 | house_44 | 2050-01-31 23:45:00 | 184.888439 | 1 |
133920 rows × 4 columns
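To make the 15-minute aggregation step concrete, here is a small sketch on made-up data: six 5-minute readings for one household collapse into two 15-minute sums. It uses `pd.Grouper` instead of the `groupby(...).resample(...)` call in the solution; that swap is a choice made for this example, and unlike `resample` it does not create rows for empty intervals. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Six 5-minute readings for a single household (values are made up).
toy = pd.DataFrame({
    "houseCode": ["house_00"] * 6,
    "date": pd.date_range("2050-01-01 00:00", periods=6, freq="5min"),
    "power consumption": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Group by household and by 15-minute bin, then sum the readings in each bin.
agg = (
    toy.groupby(["houseCode", pd.Grouper(key="date", freq="15min")])["power consumption"]
    .sum()
    .reset_index()
    .rename(columns={"power consumption": "power consumption sum"})
)
print(agg)
```

The first bin covers 00:00 to 00:10 and the second 00:15 to 00:25, so the sums are 6.0 and 60.0.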
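Problem 2-2
Heatmap visualization
Using the data from 2-1, compute the sum of power consumption for each cluster by day of week and 15-minute interval, then visualize it as shown below
(The values may not be identical; the main check is whether the data from 2-1 is correctly transformed into an image like the one below)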
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
tg['day'] = tg.date.dt.day_name()
tg['min'] = tg.date.dt.strftime('%H:%M')
# sum only the consumption column so that non-numeric columns (houseCode, date) are not aggregated
pv = tg.groupby(['Cluster','day','min'],as_index=False)['power consumption sum'].sum()
for v in range(5):
    plt.figure(figsize=(20,3))
    target = pv.loc[pv.Cluster==v]
    # rows are listed Sunday-first because pcolor draws the first row at the bottom
    pvt = target.pivot(index='day',columns='min',values='power consumption sum').reindex(['Sunday','Saturday','Friday','Thursday','Wednesday','Tuesday','Monday'])
    plt.pcolor(pvt)
    plt.title('Cluster'+str(v))
    plt.xticks(range(len(pvt.columns)),pvt.columns,rotation=90)
    plt.yticks(np.arange(len(pvt.index))+0.5,pvt.index)
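The reshaping that feeds each heatmap can be checked without plotting. The sketch below, on a made-up four-row frame, shows how `pivot` turns long-format rows into a day-by-time grid and how `reindex` fixes the row order. The names `toy`, `order`, and `grid` are hypothetical.

```python
import pandas as pd

# Long-format sums for one cluster (values are made up for illustration).
toy = pd.DataFrame({
    "day": ["Monday", "Monday", "Sunday", "Sunday"],
    "min": ["00:00", "00:15", "00:00", "00:15"],
    "power consumption sum": [1.0, 2.0, 3.0, 4.0],
})

# One row per weekday, one column per 15-minute slot.
order = ["Sunday", "Monday"]
grid = toy.pivot(index="day", columns="min", values="power consumption sum").reindex(order)
print(grid)
```

Without the `reindex`, the rows would come out in alphabetical order of the day names, which is why the solution spells out the weekday order explicitly.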
Attention
Problem 3
Solar power data
Data source: https://www.kaggle.com/cheedcheed/california-renewable-production-20102018
Data path: https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem3.csv
Prediction target: SOLAR PV
import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem3.csv')
df.head()
| | TIMESTAMP | BIOGAS | BIOMASS | GEOTHERMAL | Hour | SMALL HYDRO | SOLAR | SOLAR PV | SOLAR THERMAL | WIND TOTAL |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-26 00:00:00 | 208.0 | 354.0 | 926.0 | 1.0 | 208.0 | NaN | 0.0 | 0.0 | 57.0 |
1 | 2012-11-26 01:00:00 | 207.0 | 354.0 | 927.0 | 2.0 | 207.0 | NaN | 0.0 | 0.0 | 76.0 |
2 | 2012-11-26 02:00:00 | 208.0 | 353.0 | 927.0 | 3.0 | 208.0 | NaN | 0.0 | 0.0 | 100.0 |
3 | 2012-11-26 03:00:00 | 208.0 | 350.0 | 927.0 | 4.0 | 209.0 | NaN | 0.0 | 0.0 | 111.0 |
4 | 2012-11-26 04:00:00 | 209.0 | 352.0 | 927.0 | 5.0 | 209.0 | NaN | 0.0 | 0.0 | 131.0 |
Problem 3-1
Dataset split and result validation
Split the dataset 7:3
Preprocess the data and build a prediction model
Model performance validation: compute RMSE, R-squared, and accuracy (computed as described below)
For accuracy, use (1 - predicted/actual) when actual > predicted and (1 - actual/predicted) when actual < predicted, average these values, and subtract the average from 1.
When the denominator of a fraction is 0, treat the accuracy as 0.5. Final result: round to the third decimal place
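The accuracy rule above is easier to follow with a worked example. The sketch below applies it to three made-up pairs of actual and predicted values. It assumes the same zero handling as the solution in this section (any pair containing a zero contributes 0.5) and assumes non-negative values; `per_sample_error` and `accuracy` are hypothetical helper names.

```python
def per_sample_error(actual, predicted):
    """Error term for one pair: 1 - smaller/larger, or 0.5 if either value is 0."""
    if actual == 0 or predicted == 0:
        return 0.5
    if actual > predicted:
        return 1 - predicted / actual
    return 1 - actual / predicted


def accuracy(actuals, predictions):
    """Average the per-sample error terms, subtract from 1, round to 3 decimals."""
    errors = [per_sample_error(a, p) for a, p in zip(actuals, predictions)]
    return round(1 - sum(errors) / len(errors), 3)


# Error terms: 1 - 80/100 = 0.2, 1 - 50/100 = 0.5, and 0.5 for the zero case.
print(accuracy([100, 50, 0], [80, 100, 10]))  # prints 0.6
```

The three error terms average to 0.4, so the accuracy is 1 - 0.4 = 0.6.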
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
df = df.drop(columns =['SOLAR'])
def suntimeChecker(x):
    # 1 if the hour falls in the assumed daylight window (06 to 17), else 0
    if pd.to_datetime(x).hour in list(range(6,18)):
        return 1
    else:
        return 0
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
df['suntime'] = df['TIMESTAMP'].apply(suntimeChecker)
X = df.drop(columns=['TIMESTAMP','Hour','SOLAR PV'])
y= df['SOLAR PV']
X_train,X_test ,y_train,y_test = train_test_split(X,y,random_state =2 , test_size =0.3)
rf =RandomForestRegressor()
rf.fit(X_train,y_train)
pred = rf.predict(X_test)
def getEachAccuracy(y_true,y_pred):
    # per-sample error term; a zero actual or predicted value is scored as 0.5
    if y_true ==0:
        return 0.5
    if y_pred ==0:
        return 0.5
    if y_true > y_pred:
        return 1-(y_pred/y_true)
    else:
        return 1-(y_true/y_pred)
acc = []
for i,v in enumerate(y_test):
    acc.append(getEachAccuracy(v,pred[i]))
# Preprocessing: the date column is excluded, and the column containing only NaN values is dropped.
# A derived variable flags the hours when the sun is up (06 to 17).
# The accuracy is computed as follows
print('RMSE',round(mean_squared_error(y_test, pred)**0.5,3))
print('r2',round(r2_score(y_test, pred),3))
print('acc',round(1 - sum(acc)/len(acc),3))
RMSE 702.12
r2 0.914
acc 0.623