23rd Round#

Machine Learning (50 points)#

Problem 1

Determine whether a room is occupied from its temperature, humidity, illuminance, and CO2 concentration.
Dependent variable: Occupancy (0: empty, 1: occupied)
Data source: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Data URL: https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p1/problem1.csv

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p1/problem1.csv')
df.head()
                  date  Temperature  Humidity       Light     CO2  HumidityRatio  Occupancy
0  2015-02-02 14:19:59      23.7180    26.290  578.400000  760.40       0.004773          1
1  2015-02-02 14:22:00      23.7225    26.125  493.750000  774.75       0.004744          1
2  2015-02-02 14:23:00      23.7540    26.200  488.600000  779.00       0.004767          1
3  2015-02-02 14:23:59      23.7600    26.260  568.666667  790.00       0.004779          1
4  2015-02-02 14:25:59      23.7540    26.290  509.000000  797.00       0.004783          1

Problem 1-1

Perform EDA on the data, then explore it for findings that are meaningful from an analyst's point of view.

import pandas as pd
print(df.info())
print('''
Some values are missing. Every column except date is numeric: Occupancy is int64 and the rest are float64.
''')


display(df.isnull().sum())
print()
print(df[df.CO2.isnull()].date.values)
print('\nMissing values occur only in the CO2 column, and the rows with missing values are not consecutive.')



import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
plt.show()
for v in df.select_dtypes(include='float'):
    target = df[v].dropna()
    plt.boxplot(target)
    plt.title(v)
    plt.show()
    
print('''
Humidity and HumidityRatio are highly correlated. The remaining columns contain relatively many outliers.
''')



display(df[df.Light <0].shape)
display(df.describe())
print('''
The Light column contains 50 values equal to -99.
''')


df['date'] = pd.to_datetime(df['date'])
# Note: .dt.seconds returns only the seconds component (0-86399) of each gap,
# so a gap of one day or more is under-reported; .dt.total_seconds() avoids this.
timedeltas = df['date'].diff().dt.seconds.dropna()
display(timedeltas.describe())
print()
print('''
For 75% of the rows, the gap to the previous row is 61 seconds or less.
The largest reported gap between consecutive rows is 25,680 seconds, roughly 7 hours.
If the data are treated as a time series, these gaps can be regarded as missing values.
''')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17910 entries, 0 to 17909
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           17910 non-null  object 
 1   Temperature    17910 non-null  float64
 2   Humidity       17910 non-null  float64
 3   Light          17910 non-null  float64
 4   CO2            17889 non-null  float64
 5   HumidityRatio  17910 non-null  float64
 6   Occupancy      17910 non-null  int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 979.6+ KB
None

Some values are missing. Every column except date is numeric: Occupancy is int64 and the rest are float64.
date              0
Temperature       0
Humidity          0
Light             0
CO2              21
HumidityRatio     0
Occupancy         0
dtype: int64
['2015-02-03 19:09:59' '2015-02-03 19:31:00' '2015-02-04 18:08:00'
 '2015-02-05 06:08:00' '2015-02-05 16:09:59' '2015-02-08 08:06:00'
 '2015-02-08 11:54:00' '2015-02-08 20:58:59' '2015-02-09 06:04:59'
 '2015-02-09 07:31:00' '2015-02-09 07:49:00' '2015-02-10 07:53:59'
 '2015-02-12 00:34:00' '2015-02-12 10:53:00' '2015-02-12 15:04:00'
 '2015-02-12 20:38:00' '2015-02-13 22:53:59' '2015-02-15 16:41:59'
 '2015-02-16 00:53:59' '2015-02-17 01:56:00' '2015-02-18 06:20:00']

Missing values occur only in the CO2 column, and the rows with missing values are not consecutive.
[Figures: pair plot of the numeric columns, followed by box plots of Temperature, Humidity, Light, CO2, and HumidityRatio]
Humidity and HumidityRatio are highly correlated. The remaining columns contain relatively many outliers.
(50, 7)
        Temperature      Humidity         Light           CO2  HumidityRatio     Occupancy
count  17910.000000  17910.000000  17910.000000  17889.000000   17910.000000  17910.000000
mean      20.749036     27.589163     78.157369    647.700865       0.004175      0.117253
std        0.994012      5.043595    168.574068    285.997340       0.000755      0.321730
min       19.000000     16.745000    -99.000000    412.750000       0.002674      0.000000
25%       20.100000     24.390000      0.000000    453.000000       0.003702      0.000000
50%       20.600000     27.200000      0.000000    532.666667       0.004222      0.000000
75%       21.200000     31.290000     22.000000    722.000000       0.004790      0.000000
max       24.408333     39.500000   1581.000000   2076.500000       0.006461      1.000000
The Light column contains 50 values equal to -99.
count    17909.000000
mean        71.357474
std        241.363584
min         59.000000
25%         60.000000
50%         60.000000
75%         61.000000
max      25680.000000
Name: date, dtype: float64
For 75% of the rows, the gap to the previous row is 61 seconds or less.
The largest reported gap between consecutive rows is 25,680 seconds, roughly 7 hours.
If the data are treated as a time series, these gaps can be regarded as missing values.
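
The gap statistics above come from `.dt.seconds`, which follows the same convention as Python's `datetime.timedelta`: it holds only the within-day seconds component. A minimal sketch with two made-up timestamps (hypothetical, not taken from the dataset) shows how a gap longer than one day would be under-reported, and how `total_seconds()` reports the full length:

```python
from datetime import datetime

# Two hypothetical timestamps more than one day apart (not from the dataset)
earlier = datetime(2015, 2, 1, 9, 0, 0)
later = datetime(2015, 2, 2, 14, 15, 0)

gap = later - earlier  # timedelta of 1 day, 5 hours, 15 minutes

# .seconds keeps only the part of the gap that is left after whole days
seconds_component = gap.seconds
# .total_seconds() converts the whole gap, days included
full_gap = gap.total_seconds()

print(seconds_component)  # 18900
print(full_gap)           # 105300.0
```

If gaps of a day or more are possible in the data, the pandas counterpart `.dt.total_seconds()` is the safer choice for this kind of check.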

Problem 1-2

Choose a method for imputing the missing values, justify the choice, and carry out the imputation.

print('''
A missing CO2 value is replaced with the immediately preceding CO2 value when, comparing the rows just before and just after it,
the Occupancy value does not change and the date values are consecutive to within 3 minutes.
When both criteria hold, every missing CO2 value can be imputed, and the imputation can be judged not to harm data quality.
''')

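for value in df[df.CO2.isnull()].index:
    # Window of three rows: the one before, the missing one, and the one after
    target = df.iloc[value-1:value+2]
    difftime = target['date'].diff().dt.seconds.dropna()
    # Fill only if Occupancy is constant and no gap in the window exceeds 180 seconds
    if target.Occupancy.nunique() ==1 and len(difftime[difftime>180]) ==0:
        df.loc[value,'CO2'] =df.loc[value-1].CO2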
        
display(df.isnull().sum())
A missing CO2 value is replaced with the immediately preceding CO2 value when, comparing the rows just before and just after it,
the Occupancy value does not change and the date values are consecutive to within 3 minutes.
When both criteria hold, every missing CO2 value can be imputed, and the imputation can be judged not to harm data quality.
date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupancy        0
dtype: int64
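
The imputation rule above can also be written without pandas, which makes the two conditions explicit. The sketch below is illustrative only: the `rows` list and the `fill_co2` helper are hypothetical names, and the readings are made up.

```python
from datetime import datetime

def fill_co2(rows, max_gap_seconds=180):
    """Fill a missing CO2 reading with the previous one when the neighbours agree.

    rows: list of dicts with keys 'date', 'CO2' (None if missing), 'Occupancy'.
    A missing value at position i is filled only if rows i-1, i, i+1 share the
    same Occupancy and both gaps between them are at most max_gap_seconds.
    """
    filled = [dict(r) for r in rows]  # work on a copy
    for i in range(1, len(filled) - 1):
        if filled[i]['CO2'] is not None:
            continue
        prev_row, row, next_row = filled[i - 1], filled[i], filled[i + 1]
        same_state = prev_row['Occupancy'] == row['Occupancy'] == next_row['Occupancy']
        gap_before = (row['date'] - prev_row['date']).total_seconds()
        gap_after = (next_row['date'] - row['date']).total_seconds()
        close_in_time = gap_before <= max_gap_seconds and gap_after <= max_gap_seconds
        if same_state and close_in_time and prev_row['CO2'] is not None:
            row['CO2'] = prev_row['CO2']
    return filled

# Hypothetical readings one minute apart, with one missing CO2 value
rows = [
    {'date': datetime(2015, 2, 3, 19, 9), 'CO2': 500.0, 'Occupancy': 0},
    {'date': datetime(2015, 2, 3, 19, 10), 'CO2': None, 'Occupancy': 0},
    {'date': datetime(2015, 2, 3, 19, 11), 'CO2': 502.0, 'Occupancy': 0},
]
result = fill_co2(rows)
print(result[1]['CO2'])
```

A row whose neighbours differ in Occupancy, or sit more than three minutes away, is left missing so that it can be handled separately.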

Problem 1-3

Describe any additional steps that could improve data quality and quality control.

print('''
The -99 values in the Light column appear to be an arbitrary placeholder entered for exceptional readings. They are replaced with 0, the column minimum once -99 is excluded. On average the data are recorded at 1-minute intervals.
However, there is a gap of up to 7 hours. Interpolating it is one option, but since no time-series model will be used, modeling proceeds without interpolation.
''')
df.loc[df.Light ==-99,'Light'] = 0
The -99 values in the Light column appear to be an arbitrary placeholder entered for exceptional readings. They are replaced with 0, the column minimum once -99 is excluded. On average the data are recorded at 1-minute intervals.
However, there is a gap of up to 7 hours. Interpolating it is one option, but since no time-series model will be used, modeling proceeds without interpolation.

Problem 2-1

Check whether the data are imbalanced and state the basis for that judgment.

plt.figure(figsize=(15,4))
plt.scatter(df['date'],df['Occupancy'].astype('str'),s=0.03)
plt.show()
df.Occupancy.value_counts()

print('''
Occupancy is imbalanced at a ratio of roughly 7:1.
The plot above shows that the room was empty on February 7-9 and February 14-16. Because the room is empty far more often than it is occupied, the classes are imbalanced.
''')
[Figure: scatter plot of Occupancy (0/1) against date]
Occupancy is imbalanced at a ratio of roughly 7:1.
The plot above shows that the room was empty on February 7-9 and February 14-16. Because the room is empty far more often than it is occupied, the classes are imbalanced.
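
The 7:1 figure can be checked against the summary statistics shown earlier: the mean of Occupancy (0.117253) over 17,910 rows implies about 2,100 occupied rows. A small sketch of that arithmetic follows; the counts are derived from those two printed numbers rather than read from the data.

```python
# Figures taken from the describe() output above
n_rows = 17910
occupancy_mean = 0.117253  # share of rows with Occupancy == 1

# Counts implied by the mean (an approximation derived from the printed summary)
occupied = round(n_rows * occupancy_mean)
empty = n_rows - occupied

ratio = empty / occupied
print(occupied, empty, round(ratio, 2))
```

On the actual DataFrame, `df.Occupancy.value_counts()` gives the exact counts directly.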

Problem 2-2

Choose two oversampling methods and give the reasons for choosing them, including their advantages and disadvantages.

print('''*Random oversampling*
Randomly resamples the minority-class rows until their count matches the majority class.
Advantage: resolves the class imbalance
Disadvantage: creates duplicates without regard to the dependent variable; risk of overfitting

*SMOTE*
Creates synthetic minority-class points on the line segment between a minority-class point and one chosen at random from its k nearest minority-class neighbors.
Advantage: resolves the class imbalance; generates non-duplicate data that reflects the data distribution
Disadvantage: can run into problems with high-dimensional data
''')
*Random oversampling*
Randomly resamples the minority-class rows until their count matches the majority class.
Advantage: resolves the class imbalance
Disadvantage: creates duplicates without regard to the dependent variable; risk of overfitting

*SMOTE*
Creates synthetic minority-class points on the line segment between a minority-class point and one chosen at random from its k nearest minority-class neighbors.
Advantage: resolves the class imbalance; generates non-duplicate data that reflects the data distribution
Disadvantage: can run into problems with high-dimensional data
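
The SMOTE description above amounts to a linear interpolation between a minority-class point and one of its nearest minority-class neighbors. The following is a simplified sketch of that single step, not the imblearn implementation; the function name `smote_sample` and the toy points are hypothetical.

```python
import math
import random

def smote_sample(points, index, k, rng):
    """Create one synthetic point from points[index] and a random near neighbour.

    points: list of minority-class feature vectors (lists of floats).
    k: number of nearest neighbours to choose from.
    rng: random.Random instance, so results are reproducible.
    """
    base = points[index]
    # Sort the other minority points by Euclidean distance to the base point
    others = [p for i, p in enumerate(points) if i != index]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment between base and neighbour
    lam = rng.random()
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class points in two dimensions
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 8.0]]
rng = random.Random(2022)
synthetic = smote_sample(minority, index=0, k=2, rng=rng)
print(synthetic)
```

Because the new point always lies between two existing minority points, it is not a duplicate, which is the property the comparison with random oversampling relies on.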

Problem 2-3

Perform the oversampling, show the result, and assess whether it worked well.

from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Split into train and test sets, then oversample the train set only
X  =df.drop(columns=['Occupancy'])
y  =df['Occupancy']

X_train , X_test , y_train, y_test  = train_test_split(X,y,stratify=y,random_state=43,test_size=0.35)


from imblearn.over_sampling import RandomOverSampler,SMOTE

# Datetime values cannot be resampled directly, so convert them to Unix timestamps (seconds) first
# astype('int64') is used instead of Series.view, which newer pandas versions deprecate
X_train.loc[:,'datetime'] = X_train['date'].astype('int64') // 10**9
X_test.loc[:,'datetime'] = X_test['date'].astype('int64') // 10**9


from sklearn.preprocessing import StandardScaler


X_imb  = X_train.drop(columns = ['date']).reset_index(drop=True).copy()
y_imb  = y_train.reset_index(drop=True).copy()

X_samp, y_samp = RandomOverSampler(random_state=2022).fit_resample(X_imb,y_imb)
total = pd.concat([X_samp,y_samp],axis=1)
total['date'] = pd.to_datetime(total['datetime'], unit='s')


SMOTE_X_samp, SMOTE_y_samp = SMOTE(random_state=2022).fit_resample(X_imb,y_imb)
SMOTE_total = pd.concat([SMOTE_X_samp,SMOTE_y_samp],axis=1)
SMOTE_total['date'] = pd.to_datetime(SMOTE_total['datetime'], unit='s')


plt.figure(figsize=(15,4))
plt.title('RandomSampling')
plt.scatter(total['date'],total['Occupancy'].astype('str'),s=0.03)
plt.show()



plt.figure(figsize=(15,4))
plt.title('SMOTE')
plt.scatter(SMOTE_total['date'],SMOTE_total['Occupancy'].astype('str'),s=0.03)
plt.show()

print('''
Compared with random oversampling, SMOTE produces a more varied set of samples (see around February 18).
''')
[Figures: scatter plots of Occupancy against date after random oversampling and after SMOTE]
Compared with random oversampling, SMOTE produces a more varied set of samples (see around February 18).
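
One simple way to judge whether oversampling worked is to compare class counts before and after. The sketch below implements random oversampling by hand on made-up labels to illustrate that check; `random_oversample` is a hypothetical helper, not the imblearn API.

```python
import random
from collections import Counter

def random_oversample(rows, labels, rng):
    """Duplicate random minority-class rows until both classes have equal counts."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    extra = rng.choices(minority_rows, k=counts[majority] - counts[minority])
    return rows + extra, labels + [minority] * len(extra)

# Hypothetical data: seven empty-room rows and one occupied-room row
rows = [[20.1], [20.3], [20.0], [20.2], [20.4], [20.1], [20.5], [23.7]]
labels = [0, 0, 0, 0, 0, 0, 0, 1]

new_rows, new_labels = random_oversample(rows, labels, random.Random(2022))
print(Counter(labels))      # class counts before
print(Counter(new_labels))  # class counts after
```

The same before-and-after count comparison applies to the resampled training labels produced in the solution.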

Problem 3-1

Choose one model for speed and one for accuracy, and explain the reasons for each choice.

print('''
This is a binary classification problem.
For speed, logistic regression is chosen;
for accuracy, a random forest classifier is chosen.
''')
This is a binary classification problem.
For speed, logistic regression is chosen;
for accuracy, a random forest classifier is chosen.

Problem 3-2

Apply the two models to the two oversampled datasets from above and to the dataset before oversampling, and report their performance.

import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler


# Drop unneeded columns and scale the features

if 'date' in X_train.columns:
    X_train = X_train.drop(columns=['date'])
    
if 'date' in X_test.columns:
    X_test = X_test.drop(columns=['date'])




result_auc_train = []
result_auc_test = []
result_time = []
for train_X,trainy in [(X_train,y_train),(X_samp, y_samp),(SMOTE_X_samp, SMOTE_y_samp)]:
    
    trainX = train_X.copy()
    testX = X_test.copy()
    sc = MinMaxScaler()    
    trainX = sc.fit_transform(trainX)
    testX = sc.transform(testX)
    
    
    lrstart = time.time()
    lr =LogisticRegression()
    lr.fit(trainX,trainy)
    lrend = time.time() - lrstart

    pred_lr = lr.predict(testX)
    auc_lr_train = roc_auc_score(trainy,lr.predict(trainX))
    auc_lr = roc_auc_score(y_test,pred_lr)
    
    rfstart = time.time()
    rf =RandomForestClassifier()
    rf.fit(trainX,trainy)
    rfend = time.time() - rfstart
    
    pred_rf  = rf.predict(testX)
    auc_rf_train  = roc_auc_score(trainy,rf.predict(trainX))
    auc_rf  = roc_auc_score(y_test,pred_rf)
    
    result_auc_test.append([auc_lr,auc_rf])
    result_time.append([lrend,rfend])
    result_auc_train.append([auc_lr_train,auc_rf_train])
    
# Logistic regression and random forest are evaluated on AUC and training time for each sampling method.
# Note: roc_auc_score receives hard class predictions here; passing predict_proba(...)[:, 1] would give the usual score-based AUC.

print('Train set AUC by model')
result_auc_trains = pd.DataFrame(result_auc_train)
result_auc_trains.index = ['raw','randomSampling','SMOTE']
result_auc_trains.columns = ['logistic','randomforest']
display(result_auc_trains)

print('Test set AUC by model')
result_auc_tests = pd.DataFrame(result_auc_test)
result_auc_tests.index = ['raw','randomSampling','SMOTE']
result_auc_tests.columns = ['logistic','randomforest']
display(result_auc_tests)

print('Model training time (sec)')
result_times = pd.DataFrame(result_time)
result_times.index = ['raw','randomSampling','SMOTE']
result_times.columns = ['logistic','randomforest']
result_times
Train set AUC by model
                logistic  randomforest
raw             0.987030           1.0
randomSampling  0.989879           1.0
SMOTE           0.991096           1.0
Test set AUC by model
                logistic  randomforest
raw             0.988823      0.984997
randomSampling  0.987016      0.988807
SMOTE           0.987107      0.990125
Model training time (sec)
                logistic  randomforest
raw             0.034435      0.662965
randomSampling  0.053179      1.137727
SMOTE           0.051847      1.589572

Problem 3-3

Using the prediction results above, describe the effect of oversampling.

print('''
Logistic regression trained faster than random forest, but its AUC on the train set was lower in every case, and on the test set it was lower in every case except the one without upsampling.
Random forest reached an AUC of 1 on the train set, which indicates overfitting. On the test set its AUC increased in the order raw, randomSampling, SMOTE.
''')
Logistic regression trained faster than random forest, but its AUC on the train set was lower in every case, and on the test set it was lower in every case except the one without upsampling.
Random forest reached an AUC of 1 on the train set, which indicates overfitting. On the test set its AUC increased in the order raw, randomSampling, SMOTE.
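
The AUC values above were computed from hard 0/1 predictions. AUC is usually computed from scores or probabilities, and the two can differ. A minimal sketch with made-up labels and scores follows, using a hand-written pairwise AUC (the `pairwise_auc` helper is hypothetical) rather than scikit-learn:

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    total = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]
# Hard predictions at a 0.5 threshold, as predict() would return
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

auc_from_probabilities = pairwise_auc(y_true, y_prob)
auc_from_labels = pairwise_auc(y_true, y_pred)
print(auc_from_probabilities, auc_from_labels)
```

In scikit-learn, the score-based version corresponds to passing `predict_proba(X)[:, 1]` to `roc_auc_score`.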

Statistical Analysis (50 points)#

Problem 2

A factory claims that its vacuum tubes have a lifetime of 10,000 hours, and the quality control team drew 12 samples. Carry out a sign test at the 5% significance level.
Data URL: https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p1/problem2.csv

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p1/problem2.csv')
df.head()
      name  life span
0  sample1      10000
1  sample2       9000
2  sample3       9500
3  sample4      10000
4  sample5      10000

Question 1

State the null and research hypotheses.

print('''
Null hypothesis: the median of the data is 10,000 hours
Research hypothesis: the median of the data is not 10,000 hours''')
Null hypothesis: the median of the data is 10,000 hours
Research hypothesis: the median of the data is not 10,000 hours

Question 2

How many valid observations are there?

print('Values equal to the hypothesized median carry no information in the signed-rank test, so they are excluded. Number of such values: ',df[df['life span']==10000].shape[0])

df_fillter = df[df['life span'] != 10000]
Values equal to the hypothesized median carry no information in the signed-rank test, so they are excluded. Number of such values:  4

Question 3

Report the test statistic and state whether the research hypothesis is adopted.

from scipy.stats import wilcoxon
# Wilcoxon signed-rank test on the differences from the hypothesized median of 10,000
static, pvalue = wilcoxon(df_fillter['life span']-10000)
print('The test statistic is ',static,'. The p-value is ',pvalue,', so the null hypothesis cannot be rejected at the 5% significance level. The research hypothesis is not adopted. ')
The test statistic is  8.5 . The p-value is  0.1953125 , so the null hypothesis cannot be rejected at the 5% significance level. The research hypothesis is not adopted. 
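
The solution above uses the Wilcoxon signed-rank test. The problem statement asks for a sign test, which uses only the signs of the differences from 10,000 and an exact binomial distribution. A minimal sketch follows. The first five lifetimes follow the head() output above; the remaining seven are invented for illustration, chosen only so that 12 samples contain four ties. `sign_test` is a hypothetical helper.

```python
from math import comb

def sign_test(values, hypothesized_median):
    """Two-sided exact sign test. Returns (n_valid, n_above, p_value)."""
    diffs = [v - hypothesized_median for v in values if v != hypothesized_median]
    n = len(diffs)                        # ties are dropped
    above = sum(1 for d in diffs if d > 0)
    smaller = min(above, n - above)
    # P(X <= smaller) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return n, above, min(1.0, 2 * tail)

# Partly hypothetical lifetimes: 12 samples, 4 of them exactly 10,000
lifetimes = [10000, 9000, 9500, 10000, 10000, 8000, 9000,
             9500, 10500, 10000, 11000, 9700]

n_valid, n_above, p_value = sign_test(lifetimes, 10000)
print(n_valid, n_above, p_value)
```

With the real data, the number of valid observations is the 12 samples minus the 4 ties, and the decision follows from comparing the resulting p-value with 0.05.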

Problem 3

COVID-19 time series data
Date / country name / number of confirmed cases
Data source (post-processing not included): https://www.kaggle.com/antgoldbloom/covid19panels?select=country_panel.csv
Data URL: https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p1/problem3_covid2.csv

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p1/problem3_covid2.csv')
df.head()
location date new_cases
0 Austria 2021-01-01 2096.0
1 Austria 2021-01-02 1391.0
2 Austria 2021-01-03 1466.0
3 Austria 2021-01-04 1642.0
4 Austria 2021-01-05 2311.0

Question 1

The data give the daily number of confirmed cases for each country by date. Create a DataFrame holding the cumulative number of confirmed cases for each country by date.

target = df.groupby(['location','date']).sum().groupby(level=0).cumsum()
target.columns = ['cumulative sum']
target = target.reset_index()
target
location date cumulative sum
0 Austria 2021-01-01 2096.0
1 Austria 2021-01-02 3487.0
2 Austria 2021-01-03 4953.0
3 Austria 2021-01-04 6595.0
4 Austria 2021-01-05 8906.0
... ... ... ...
11890 Vanuatu 2021-10-28 5.0
11891 Vanuatu 2021-10-29 5.0
11892 Vanuatu 2021-10-30 5.0
11893 Vanuatu 2021-10-31 5.0
11894 Vanuatu 2021-11-01 5.0

11895 rows × 3 columns

Question 2

Using the data from Question 1, compute the ACF for each country (up to lag 50, excluding the first value), cluster the countries using Euclidean distance, and draw a dendrogram for the hierarchical cluster analysis.

from scipy.spatial import distance
import statsmodels.api as sm
import numpy as np
name =[]
for lo in target.location.unique():
    
    v = sm.tsa.stattools.acf(target[target.location==lo]['cumulative sum'], nlags=50, fft=False)
    name.append([lo]+list(v[1:]))

v = pd.DataFrame(name)


import seaborn as sns
import scipy
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch


data =v.set_index(0)
label = np.array(data.index)

datav = data.values

fig = plt.figure(figsize=(17,10))

ax3 = fig.add_subplot(1,1,1)
dend3 = sch.linkage(datav, method='average', metric='euclidean')
cutoff = 0.3*max(dend3[:,2])
dend_res3 = sch.dendrogram(dend3, color_threshold=cutoff)
ax3.set_xticklabels(label[dend_res3['leaves']], minor=False)

plt.show()
[Figure: dendrogram of countries clustered on their ACF values (average linkage, Euclidean distance)]
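
The clustering above compares countries by the shape of their autocorrelation function rather than by case counts. The sketch below computes a sample ACF by hand, dividing each lag's autocovariance sum by the lag-0 value (this should correspond to the default behavior of the statsmodels `acf` call, which is an assumption here), and then takes the Euclidean distance between two ACF vectors. The series and the helper name `acf` are hypothetical.

```python
import math

def acf(series, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(nlags + 1)]

# Two hypothetical cumulative series with the same shape but different scale
country_a = [1.0, 2.0, 3.0, 4.0, 5.0]
country_b = [10.0, 20.0, 30.0, 40.0, 50.0]

# Drop lag 0 (always 1), as the problem statement asks
acf_a = acf(country_a, nlags=3)[1:]
acf_b = acf(country_b, nlags=3)[1:]

distance = math.dist(acf_a, acf_b)
print(acf_a)
print(distance)
```

The two series differ tenfold in scale yet have the same ACF, so their distance is zero. Countries are therefore grouped by how their cumulative curves evolve, not by how many cases they report.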

Problem 4

The table in the image below gives the number of students in each GPA range by department. Test whether department and grades are related.

[Image: table of student counts by department and GPA range]

Question 1

State the null and research hypotheses.

print('''
Null hypothesis: department and grades are unrelated (independent)
Research hypothesis: department and grades are related (not independent)
''')
Null hypothesis: department and grades are unrelated (independent)
Research hypothesis: department and grades are related (not independent)

Question 2

Compute the expected counts under the assumption that department and grades are independent.

df = pd.DataFrame({'Social Sciences':[15,60,24],'Natural Sciences':[25,69,5],'Engineering':[10,77,13]})
df.index = ['1.5-2.5','2.5-3.5','3.5-4.5']
from scipy.stats import chi2_contingency,fisher_exact
chi2 , p ,dof, expected = chi2_contingency(df)
print(expected)
[[16.61073826 16.61073826 16.77852349]
 [68.43624161 68.43624161 69.12751678]
 [13.95302013 13.95302013 14.09395973]]
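
Under independence, each expected count is the row total times the column total divided by the grand total. A short sketch reproducing the matrix above from the observed table, without SciPy:

```python
# Observed counts: rows are GPA ranges, columns are the three departments
observed = [
    [15, 25, 10],   # 1.5-2.5
    [60, 69, 77],   # 2.5-3.5
    [24, 5, 13],    # 3.5-4.5
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E[i][j] = row_total[i] * col_total[j] / grand_total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(v, 4) for v in row])
```

For example, the first cell is 50 * 99 / 298, which is about 16.61.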

Question 3

Compute the test statistic and state whether the research hypothesis is adopted.

print(p)

# The chi-square test gives a p-value of 0.00018, so the null hypothesis is rejected: department and grades are related.

# If more than 20% of the cells had an expected count below 5, Fisher's exact test would be the better choice. Here every expected count is above 5, so the chi-square test is appropriate. #print(fisher_exact(df))
0.00018822647762421383
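
The printed p-value can be reproduced by hand. The statistic is the sum of (observed - expected)^2 / expected over all cells, with (rows - 1) x (columns - 1) = 4 degrees of freedom. For 4 degrees of freedom the chi-square upper-tail probability has the closed form exp(-x/2) * (1 + x/2), which the sketch below uses in place of SciPy.

```python
import math

observed = [
    [15, 25, 10],
    [60, 69, 77],
    [24, 5, 13],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Pearson chi-square statistic
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected_count = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected_count) ** 2 / expected_count

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid only for dof == 4
p_value = math.exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(chi2_stat, dof, p_value)
```

The statistic, which the solution does not print, is the value the question asks for alongside the decision.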