Task Type 2 (Python)#

Note

For every problem, the y_test values can be loaded from the same URL by swapping in y_test, so you can check the score of the predictions you actually prepared for submission.
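The check described in the note can be sketched as follows. This is only a sketch under assumptions: the answer file is assumed to sit in the same folder as the training files under the name y_test.csv (the page only says to load y_test from the same URL, so the exact file name and capitalisation may differ), and y_test_url and score_submission are hypothetical helper names, not part of the original material. Toy frames stand in for the downloaded files so the example runs offline.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

BASE = "https://raw.githubusercontent.com/Datamanim/datarepo/main"

def y_test_url(dataset: str, filename: str = "y_test.csv") -> str:
    """Build the (assumed) URL of the answer file for one dataset folder."""
    return f"{BASE}/{dataset}/{filename}"

def score_submission(submission, answer, id_col, target_col, metric=accuracy_score):
    """Join predictions and answers on the ID column, then apply the metric."""
    merged = answer.merge(submission, on=id_col, suffixes=("_true", "_pred"))
    return metric(merged[f"{target_col}_true"], merged[f"{target_col}_pred"])

# Toy frames standing in for pd.read_csv(y_test_url("churnk")) and a submission file.
answer = pd.DataFrame({"CustomerId": [1, 2, 3, 4], "Exited": [0, 1, 0, 1]})
submission = pd.DataFrame({"CustomerId": [4, 3, 2, 1], "Exited": [1, 0, 0, 0]})

toy_score = score_submission(submission, answer, "CustomerId", "Exited")
print(toy_score)
```

Joining on the ID column rather than comparing row by row keeps the score correct even if the submission file is sorted differently from the answer file.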

Classification#

Service Churn Prediction Data#

Kaggle shared code repository

Write your own code and collect upvotes to earn a bronze medal: Kaggle notebook link

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv")


display(x_train.head())
display(y_train.head())
CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 15799217 Zetticci 791 Germany Female 35 7 52436.20 1 1 0 161051.75
1 15748986 Bischof 705 Germany Male 42 8 166685.92 2 1 1 55313.51
2 15722004 Hsiung 543 France Female 31 4 138317.94 1 0 0 61843.73
3 15780966 Pritchard 709 France Female 32 2 0.00 2 0 0 109681.29
4 15636731 Ts'ai 714 Germany Female 36 1 101609.01 2 1 1 447.73
CustomerId Exited
0 15799217 0
1 15748986 0
2 15722004 0
3 15780966 0
4 15636731 0
# print(x_train.info())
# print(x_train.nunique())  <- may not work on older pandas versions

drop_col = ['CustomerId','Surname']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Exited']


x_test_dummies = pd.get_dummies(x_test_drop)
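# Give test the same columns, in the same order, as train (get_dummies already orders the dummy columns, so a KeyError here means a category seen in train is missing from test)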
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# accuracy, f1_score, recall, precision -> take the output of model.predict
# auc, or any wording about probability -> take model.predict_proba and keep the second column (positive class): model.predict_proba()[:,1]
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 1.0
validation accuracy : 0.8652680652680653


train f1_score : 1.0
validation f1_score : 0.5912305516265912


train recall_score : 1.0
validation recall_score : 0.4543478260869565


train precision_score : 1.0
validation precision_score : 0.8461538461538461


train auc : 1.0
validation auc : 0.8497613211198555
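The column-matching step in the code above, x_test_dummies[x_train_dummies.columns], fails with a KeyError when a category present in train never appears in test. A minimal sketch of a more forgiving alternative is shown below; it uses DataFrame.reindex, which is not what the original code does, and the toy frames are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"Geography": ["Germany", "France", "Spain"], "Age": [35, 42, 31]})
test = pd.DataFrame({"Geography": ["France", "Germany"], "Age": [50, 28]})  # no "Spain"

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# test_dummies[train_dummies.columns] would raise a KeyError here because
# "Geography_Spain" does not exist in test. reindex adds it as an all-zero
# column and drops any column that train never saw.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```

Whether silently adding an all-zero column is acceptable depends on the task; the explicit KeyError of the original approach is useful as a warning that train and test categories differ.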

Job Change Prediction Data#

Kaggle shared code repository

Write your own code and collect upvotes to earn a bronze medal: Kaggle notebook link

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv")


display(x_train.head())
display(y_train.head())
enrollee_id city city_development_index gender relevent_experience enrolled_university education_level major_discipline experience company_size company_type last_new_job training_hours
0 25298 city_138 0.836 Male No relevent experience Full time course High School NaN 5 100-500 Pvt Ltd 1 45
1 4241 city_160 0.920 Male No relevent experience Full time course High School NaN 5 NaN NaN 1 17
2 24086 city_57 0.866 Male No relevent experience no_enrollment Graduate STEM 10 NaN NaN 1 50
3 26773 city_16 0.910 Male Has relevent experience no_enrollment Graduate STEM >20 50-99 Pvt Ltd >4 135
4 32325 city_143 0.740 NaN No relevent experience Full time course Graduate STEM 5 NaN NaN never 17
enrollee_id target
0 25298 0.0
1 4241 1.0
2 24086 0.0
3 26773 0.0
4 32325 1.0
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
# There are missing values, but they are dummy-encoded without any separate handling


# Drop categorical columns that have a fairly large number of unique values.
drop_col = ['enrollee_id','city','company_type','experience']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['target'].astype('int')


x_test_dummies = pd.get_dummies(x_test_drop)
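# Give test the same columns, in the same order, as train (get_dummies already orders the dummy columns, so a KeyError here means a category seen in train is missing from test)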
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# accuracy, f1_score, recall, precision -> take the output of model.predict
# auc, or any wording about probability -> take model.predict_proba and keep the second column (positive class): model.predict_proba()[:,1]
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'enrollee_id': x_test.enrollee_id, 'target': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'enrollee_id': x_test.enrollee_id, 'target': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 0.9965236154399425
validation accuracy : 0.7535279805352798


train f1_score : 0.9929731039496001
validation f1_score : 0.42736009044657997


train recall_score : 0.9927325581395349
validation recall_score : 0.3631123919308357


train precision_score : 0.9932137663596704
validation precision_score : 0.5192307692307693


train auc : 0.9998907607098495
validation auc : 0.740677513569584
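The code above leaves missing values in the categorical columns and lets pd.get_dummies deal with them. The sketch below, on a made-up two-column frame, shows what that means in practice: by default a missing category becomes a row of zeros, and the optional dummy_na=True argument (not used in the original code) adds an explicit indicator column instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["Male", np.nan, "Female"], "training_hours": [45, 17, 50]})

default = pd.get_dummies(df)                   # NaN row gets 0 in every gender_* column
with_flag = pd.get_dummies(df, dummy_na=True)  # adds an explicit gender_nan indicator

print(default.columns.tolist())
print(with_flag.columns.tolist())
```

This only covers categorical columns. Missing values in numeric columns are untouched by get_dummies and, depending on the scikit-learn version, may still need to be imputed before fitting.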

On-Time Delivery Prediction (2nd exam question)#

Kaggle shared code repository

Write your own code and collect upvotes to earn a bronze medal: Kaggle notebook link

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv")


display(x_train.head())
display(y_train.head())
ID Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating Cost_of_the_Product Prior_purchases Product_importance Gender Discount_offered Weight_in_gms
0 6045 A Flight 4 3 266 5 high F 5 1590
1 44 F Ship 3 1 174 2 low M 44 1556
2 7940 F Road 4 1 154 10 high M 10 5674
3 1596 F Ship 4 3 158 3 medium F 27 1207
4 4395 A Flight 5 3 175 3 low M 7 4833
ID Reached.on.Time_Y.N
0 6045 0
1 44 1
2 7940 1
3 1596 1
4 4395 1
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
# There are missing values, but they are dummy-encoded without any separate handling


# Only the ID column is dropped here.
drop_col = ['ID']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Reached.on.Time_Y.N']


x_test_dummies = pd.get_dummies(x_test_drop)
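# Give test the same columns, in the same order, as train (get_dummies already orders the dummy columns, so a KeyError here means a category seen in train is missing from test)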
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# accuracy, f1_score, recall, precision -> take the output of model.predict
# auc, or any wording about probability -> take model.predict_proba and keep the second column (positive class): model.predict_proba()[:,1]
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'ID': x_test.ID, 'Reached.on.Time_Y.N': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'ID': x_test.ID, 'Reached.on.Time_Y.N': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 1.0
validation accuracy : 0.6395775941230487


train f1_score : 1.0
validation f1_score : 0.6744089589382


train recall_score : 1.0
validation recall_score : 0.630721489526765


train precision_score : 1.0
validation precision_score : 0.7245989304812834


train auc : 1.0
validation auc : 0.7261997118475008
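The rule repeated in the code comments (label metrics are scored on model.predict output, AUC on the positive-class column of model.predict_proba) can be wrapped in a small helper. The sketch below is illustrative only: evaluate and LABEL_METRICS are hypothetical names, and the arrays are toy stand-ins for real model output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

LABEL_METRICS = {"accuracy": accuracy_score, "f1": f1_score,
                 "recall": recall_score, "precision": precision_score}

def evaluate(metric_name, y_true, labels, positive_proba):
    """Score hard labels for label metrics and positive-class probabilities for AUC."""
    if metric_name == "auc":
        return roc_auc_score(y_true, positive_proba)
    return LABEL_METRICS[metric_name](y_true, labels)

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.4, 0.9])  # stand-in for model.predict_proba(X)[:, 1]
labels = (proba >= 0.5).astype(int)     # stand-in for model.predict(X)

acc = evaluate("accuracy", y_true, labels, proba)
auc = evaluate("auc", y_true, labels, proba)
print(acc, auc)
```

Passing hard labels to roc_auc_score still runs, but it throws away the ranking information that AUC is meant to measure, which is why the two kinds of output are kept separate.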

Adult Health Checkup Data#

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_test.csv")


display(x_train.head())
display(y_train.head())
ID 성별코드 연령대코드(5세단위) 신장(5Cm단위) 체중(5Kg단위) 허리둘레 시력(좌) 시력(우) 청력(좌) 청력(우) 수축기혈압 이완기혈압 식전혈당(공복혈당) 총콜레스테롤 트리글리세라이드 HDL콜레스테롤 LDL콜레스테롤 혈색소 요단백 혈청크레아티닌 (혈청지오티)AST (혈청지오티)ALT 감마지티피 구강검진수검여부 치아우식증유무 치석
0 0 F 40 155 60 81.3 1.2 1.0 1.0 1.0 114.0 73.0 94.0 215.0 82.0 73.0 126.0 12.9 1.0 0.7 18.0 19.0 27.0 Y 0.0 Y
1 1 F 40 160 60 81.0 0.8 0.6 1.0 1.0 119.0 70.0 130.0 192.0 115.0 42.0 127.0 12.7 1.0 0.6 22.0 19.0 18.0 Y 0.0 Y
2 2 M 55 170 60 80.0 0.8 0.8 1.0 1.0 138.0 86.0 89.0 242.0 182.0 55.0 151.0 15.8 1.0 1.0 21.0 16.0 22.0 Y 0.0 N
3 3 M 40 165 70 88.0 1.5 1.5 1.0 1.0 100.0 60.0 96.0 322.0 254.0 45.0 226.0 14.7 1.0 1.0 19.0 26.0 18.0 Y 0.0 Y
4 4 F 40 155 60 86.0 1.0 1.0 1.0 1.0 120.0 74.0 80.0 184.0 74.0 62.0 107.0 12.5 1.0 0.6 16.0 14.0 22.0 Y 0.0 N
ID 흡연상태
0 0 0
1 1 0
2 2 1
3 3 0
4 4 0
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
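# There are missing values, but they are dummy-encoded without any separate handling


# Drop categorical columns that have a fairly large number of unique values. 구강검진수검여부 has only one unique value, so it is dropped as well.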
drop_col = ['ID','구강검진수검여부']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['흡연상태']


x_test_dummies = pd.get_dummies(x_test_drop)
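# Give test the same columns, in the same order, as train (get_dummies already orders the dummy columns, so a KeyError here means a category seen in train is missing from test)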
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42,stratify=y)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# accuracy, f1_score, recall, precision -> take the output of model.predict
# auc, or any wording about probability -> take model.predict_proba and keep the second column (positive class): model.predict_proba()[:,1]
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'ID': x_test.ID, '흡연상태': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'ID': x_test.ID, '흡연상태': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 1.0
validation accuracy : 0.75637624974495


train f1_score : 1.0
validation f1_score : 0.6791472590469365


train recall_score : 1.0
validation recall_score : 0.7025574499629355


train precision_score : 1.0
validation precision_score : 0.657246879334258


train auc : 1.0
validation auc : 0.834807387299372
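The split in this section passes stratify=y to train_test_split. As a small illustration of what that argument does, the sketch below uses a made-up target with 10% positives and checks that both parts of the split keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)     # toy target with 10% positives
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature, one column

_, _, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With stratify=y, both parts keep the 10% positive rate of the full data.
print(y_tr.mean(), y_val.mean())
```

Without stratify, the number of positives in the validation part varies from split to split, which makes recall and precision on a small or imbalanced validation set noisier.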

Vehicle Insurance Purchase Prediction Data#

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_test.csv")


display(x_train.head())
display(y_train.head())
ID Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage id
0 0 Female 23 1 8.0 0 < 1 Year Yes 61354.0 152.0 235 NaN
1 1 Male 27 1 28.0 1 < 1 Year No 38036.0 152.0 207 NaN
2 2 Female 23 1 45.0 0 < 1 Year Yes 25984.0 152.0 217 NaN
3 3 Male 22 1 46.0 0 < 1 Year No 39499.0 152.0 277 NaN
4 4 Male 32 1 30.0 1 < 1 Year No 38771.0 152.0 251 NaN
ID Response
0 0 0
1 1 0
2 2 1
3 3 0
4 4 0
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())



# The id column was added to the data by mistake by the author.
drop_col = ['ID','id']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Response']


x_test_dummies = pd.get_dummies(x_test_drop)
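# Give test the same columns, in the same order, as train (get_dummies already orders the dummy columns, so a KeyError here means a category seen in train is missing from test)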
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42,stratify=y)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# accuracy, f1_score, recall, precision -> take the output of model.predict
# auc, or any wording about probability -> take model.predict_proba and keep the second column (positive class): model.predict_proba()[:,1]
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'ID': x_test.ID, 'Response': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'ID': x_test.ID, 'Response': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 0.9999314646014666
validation accuracy : 0.8655143967479352


train f1_score : 0.9997203691127712
validation f1_score : 0.18118003025718607


train recall_score : 0.9995606502376483
validation recall_score : 0.12140134620063255


train precision_score : 0.9998801390387151
validation precision_score : 0.3569384835479256


train auc : 0.9999998704194665
validation auc : 0.8340167689531695

Airline Passenger Satisfaction Data#

Attention

For the test data, compute the probability of predicting neutral or dissatisfied and submit those probability values.

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_test.csv")


display(x_train.head())
display(y_train.head())
ID Gender Customer Type Age Type of Travel Class Flight Distance Inflight wifi service Departure/Arrival time convenient Ease of Online booking Gate location Food and drink Online boarding Seat comfort Inflight entertainment On-board service Leg room service Baggage handling Checkin service Inflight service Cleanliness Departure Delay in Minutes Arrival Delay in Minutes id
0 0 Female Loyal Customer 54 Personal Travel Eco 1068 3 4 3 1 5 4 4 5 5 3 5 3 5 3 47 22.0 NaN
1 2 Male Loyal Customer 20 Personal Travel Eco 1546 4 4 4 4 4 4 4 4 3 3 4 4 4 4 5 2.0 NaN
2 3 Male Loyal Customer 59 Business travel Business 2962 0 4 0 4 2 4 5 1 1 1 1 5 1 4 54 46.0 NaN
3 4 Male Loyal Customer 35 Business travel Eco Plus 106 5 4 4 4 5 5 5 5 2 1 5 4 4 5 130 121.0 NaN
4 5 Female Loyal Customer 9 Business travel Business 2917 3 3 3 3 4 4 4 4 4 4 5 4 3 4 0 0.0 NaN
ID satisfaction
0 0 neutral or dissatisfied
1 2 neutral or dissatisfied
2 3 satisfied
3 4 satisfied
4 5 satisfied
# print(x_train.info())
# print(x_train.nunique())
print(x_train.isnull().sum())


mean_values = x_train['Arrival Delay in Minutes'].mean()
x_train['Arrival Delay in Minutes'] = x_train['Arrival Delay in Minutes'].fillna(mean_values)

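# To avoid data leakage, missing values in test should in principle be filled with values computed from train, as done here; if that is hard to keep track of, fill the missing values in test with the mean of test instead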
x_test['Arrival Delay in Minutes'] = x_test['Arrival Delay in Minutes'].fillna(mean_values)


# The id column was added to the data by mistake by the author.
drop_col = ['ID','id']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)



y = y_train['satisfaction']


x_test_dummies = pd.get_dummies(x_test_drop)
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42,stratify=y)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# The labels were left as strings rather than numbers for training, so pos_label has to be specified.
# accuracy, f1_score, recall, precision -> take the output of model.predict

print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label ,pos_label = 'neutral or dissatisfied'))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label ,pos_label = 'neutral or dissatisfied'))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label,pos_label = 'neutral or dissatisfied'))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label,pos_label = 'neutral or dissatisfied'))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label,pos_label = 'neutral or dissatisfied'))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label,pos_label = 'neutral or dissatisfied'))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)

# 'neutral or dissatisfied' is class 0, so take column [:,0] of predict_proba
# Because this is easy to mix up, label-encoding y at the very start is also a good option
# dic = {'neutral or dissatisfied':1 , 'satisfied':0}
# y = y_train['satisfaction'].map(dic)
# In that case 'neutral or dissatisfied' becomes class 1, so use predict_proba()[:,1]


predict_test_proba = rf.predict_proba(x_test_dummies)[:,0]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'ID': x_test.ID, 'satisfaction': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'ID': x_test.ID, 'satisfaction': predict_test_proba}).to_csv('003000000.csv', index=False)
ID                                       0
Gender                                   0
Customer Type                            0
Age                                      0
Type of Travel                           0
Class                                    0
Flight Distance                          0
Inflight wifi service                    0
Departure/Arrival time convenient        0
Ease of Online booking                   0
Gate location                            0
Food and drink                           0
Online boarding                          0
Seat comfort                             0
Inflight entertainment                   0
On-board service                         0
Leg room service                         0
Baggage handling                         0
Checkin service                          0
Inflight service                         0
Cleanliness                              0
Departure Delay in Minutes               0
Arrival Delay in Minutes                 0
id                                   83123
dtype: int64
train accuracy : 1.0
validation accuracy : 0.9611388574969925


train f1_score : 1.0
validation f1_score : 0.9661221636051611


train recall_score : 1.0
validation recall_score : 0.9778692743180648


train precision_score : 1.0
validation precision_score : 0.954653937947494


train auc : 1.0
validation auc : 0.9936992375795042
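The comments in the code above rely on knowing that 'neutral or dissatisfied' is class 0. Instead of remembering the order, the column can be looked up from the fitted model's classes_ attribute. The sketch below shows the idea on a tiny made-up dataset; it is an alternative to the hard-coded [:,0] in the original code, not the original author's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: one feature, string labels as in the satisfaction column.
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]])
y = np.array(["neutral or dissatisfied"] * 3 + ["satisfied"] * 3)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# classes_ lists the labels in the same order as the predict_proba columns.
target = "neutral or dissatisfied"
col = list(model.classes_).index(target)
proba_target = model.predict_proba(X)[:, col]

print(model.classes_, col)
```

scikit-learn sorts string labels alphabetically, which is why 'neutral or dissatisfied' ends up in column 0 here; looking the index up keeps the code correct if the labels are later re-encoded.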

Water Potability Data#

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_test.csv")


display(x_train.head())
display(y_train.head())
ID ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity
0 0 8.662710 173.531947 20333.079495 5.636388 439.787938 459.633120 16.283311 89.924253 5.120103
1 1 NaN 226.270824 15380.124079 6.661474 NaN 392.558205 14.083110 50.286395 4.516870
2 2 7.583770 217.283262 36343.407055 8.532726 375.964391 393.877683 17.442301 77.722257 3.642289
3 3 6.584813 182.375456 24723.106296 6.238920 NaN 414.350751 17.582615 78.213738 4.404132
4 4 7.179864 180.854211 10859.553752 8.263503 341.302486 358.056264 12.065317 83.329918 3.878447
ID Potability
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
# print(x_train.info())
# print(x_train.nunique())
print(x_train.isnull().sum())


for col in x_train.isnull().sum().where(lambda x : x !=0).dropna().index:
    mean_values = x_train[col].mean()
    x_train[col] = x_train[col].fillna(mean_values)

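    # To avoid data leakage, missing values in test should in principle be filled with values computed from train, as done here; if that is hard to keep track of, fill the missing values in test with the mean of test instead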
    x_test[col] = x_test[col].fillna(mean_values)


drop_col = ['ID']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)



y = y_train['Potability']


x_test_dummies = pd.get_dummies(x_test_drop)
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42,stratify=y)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check model performance according to what the problem asks for
# The labels here are numeric (0/1), so pos_label does not need to be specified.
# accuracy, f1_score, recall, precision -> take the output of model.predict

print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label ))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# accuracy, f1_score, recall, precision 
#pd.DataFrame({'ID': x_test.ID, 'Potability': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability
#pd.DataFrame({'ID': x_test.ID, 'Potability': predict_test_proba}).to_csv('003000000.csv', index=False)
ID                 0
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
dtype: int64
train accuracy : 1.0
validation accuracy : 0.6497109826589595


train f1_score : 1.0
validation f1_score : 0.4116504854368932


train recall_score : 1.0
validation recall_score : 0.314540059347181


train precision_score : 1.0
validation precision_score : 0.5955056179775281


train auc : 1.0
validation auc : 0.6309515780954951
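The imputation loop in this section fills the gaps in both train and test with the train mean. The same idea can be sketched more compactly on made-up data, computing every column mean once from train and reusing it for test; the frame and variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

x_tr = pd.DataFrame({"ph": [6.0, np.nan, 8.0], "Sulfate": [300.0, 340.0, np.nan]})
x_te = pd.DataFrame({"ph": [np.nan, 7.5], "Sulfate": [np.nan, 310.0]})

train_means = x_tr.mean(numeric_only=True)  # statistics come from train only
x_tr_filled = x_tr.fillna(train_means)
x_te_filled = x_te.fillna(train_means)      # test reuses the train statistics

print(x_te_filled)
```

fillna with a Series matches values to columns by name, so columns without missing values are simply left unchanged.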

Drug Classification Data#

import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_test.csv")


display(x_train.head())
display(y_train.head())
ID Age Sex BP Cholesterol Na_to_K
0 0 36 F NORMAL HIGH 16.753
1 1 47 F LOW HIGH 11.767
2 2 69 F NORMAL HIGH 10.065
3 3 35 M LOW NORMAL 9.170
4 4 49 M LOW NORMAL 11.014
ID Drug
0 0 0
1 1 3
2 2 4
3 3 4
4 4 4
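
The Drug target takes more than two values (0, 3 and 4 appear in the first rows), so this looks like a multiclass problem. Below is a minimal baseline sketch under that assumption. The rows are randomly generated stand-ins that only reuse the column names shown above, and the labeling rule is made up purely so the toy target has several classes. With a multiclass target, `f1_score` needs an explicit `average` argument.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same column names as x_train / y_train
rng = np.random.default_rng(0)
n = 300
x_demo = pd.DataFrame({
    'ID': np.arange(n),
    'Age': rng.integers(15, 75, size=n),
    'Sex': rng.choice(['F', 'M'], size=n),
    'BP': rng.choice(['LOW', 'NORMAL', 'HIGH'], size=n),
    'Cholesterol': rng.choice(['NORMAL', 'HIGH'], size=n),
    'Na_to_K': rng.uniform(6, 38, size=n).round(3),
})
# Made-up rule, used only to produce a three-class toy target
drug = np.where(x_demo['Na_to_K'] > 15, 0,
                np.where(x_demo['BP'] == 'HIGH', 3, 4))
y_demo = pd.DataFrame({'ID': x_demo['ID'], 'Drug': drug})

# Drop the identifier and one-hot encode the categorical columns
X = pd.get_dummies(x_demo.drop(columns=['ID']))
y = y_demo['Drug']

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_tr, y_tr)
pred_val = clf.predict(X_val)

acc = accuracy_score(y_val, pred_val)
# average='macro' weights every class equally, whatever its frequency
macro_f1 = f1_score(y_val, pred_val, average='macro')
print(acc, macro_f1)
```

Which average to report (macro, micro, weighted) depends on the metric the problem statement asks for. Macro is used here only as an example.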

Fraudulent Firm Classification Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_test.csv")


display(x_train.head())
display(y_train.head())
ID Sector_score LOCATION_ID PARA_A Score_A Risk_A PARA_B Score_B Risk_B TOTAL ... PROB RiSk_E History Prob Risk_F Score Inherent_Risk CONTROL_RISK Detection_Risk Audit_Risk
0 0 2.37 16 0.01 0.2 0.002 0.007 0.2 0.0014 0.017 ... 0.2 0.4 0 0.2 0.0 2.0 1.4034 0.4 0.5 0.28068
1 2 55.57 9 1.06 0.4 0.424 0.000 0.2 0.0000 1.060 ... 0.2 0.4 0 0.2 0.0 2.2 1.8240 0.4 0.5 0.36480
2 3 55.57 16 2.42 0.6 1.452 3.530 0.6 2.1180 5.950 ... 0.2 0.4 0 0.2 0.0 3.8 7.4940 0.4 0.5 1.49880
3 4 2.37 9 0.31 0.2 0.062 0.690 0.2 0.1380 1.000 ... 0.2 0.4 0 0.2 0.0 2.0 1.6000 0.4 0.5 0.32000
4 5 55.57 6 0.62 0.2 0.124 0.420 0.2 0.0840 1.040 ... 0.2 0.4 0 0.2 0.0 2.0 1.6080 0.4 0.5 0.32160

5 rows × 27 columns

ID Risk
0 0 0
1 2 0
2 3 1
3 4 0
4 5 0

Sensor Data Motion Type Classification Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv")


display(x_train.head())
display(y_train.head())
ID motion_0 motion_1 motion_2 motion_3 motion_4 motion_5 motion_6 motion_7 motion_8 ... motion_54 motion_55 motion_56 motion_57 motion_58 motion_59 motion_60 motion_61 motion_62 motion_63
0 0 1.0 -2.0 -1.0 4.0 -5.0 -4.0 1.0 0.0 -15.0 ... 0.0 -1.0 -13.0 -3.0 1.0 -1.0 -32.0 -22.0 -2.0 -3.0
1 2 20.0 0.0 0.0 1.0 5.0 6.0 -52.0 18.0 15.0 ... -70.0 -55.0 -38.0 -14.0 -12.0 -8.0 -34.0 -63.0 -87.0 -77.0
2 4 1.0 -1.0 1.0 4.0 -5.0 -8.0 1.0 -3.0 -14.0 ... 1.0 12.0 -25.0 0.0 0.0 3.0 2.0 -27.0 1.0 0.0
3 5 13.0 2.0 1.0 -3.0 1.0 3.0 28.0 3.0 12.0 ... 0.0 -21.0 -17.0 -2.0 0.0 -4.0 -17.0 -21.0 -21.0 25.0
4 6 -2.0 -7.0 -4.0 -8.0 16.0 44.0 1.0 3.0 -16.0 ... -1.0 2.0 -1.0 1.0 4.0 4.0 -17.0 -38.0 -3.0 3.0

5 rows × 65 columns

ID pose
0 0 1
1 2 0
2 4 1
3 5 0
4 6 1

Diabetes Diagnosis Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_test.csv")


display(x_train.head())
display(y_train.head())
ID Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0 8 126 88 36 108 38.5 0.349 49
1 1 0 74 52 10 36 27.8 0.269 22
2 2 1 140 74 26 180 24.1 0.828 23
3 3 6 162 62 0 0 24.3 0.178 50
4 4 2 94 68 18 76 26.0 0.561 21
ID Outcome
0 0 0
1 1 0
2 2 0
3 3 1
4 4 0

Regression#

Student Grade Prediction Data#

Kaggle shared code repository

Write your own code, collect upvotes, and earn a bronze medal. Kaggle notebook link

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_test.csv")


display(x_train.head())
display(y_train.head())
StudentID school sex age address famsize Pstatus Medu Fedu Mjob ... romantic famrel freetime goout Dalc Walc health absences G1 G2
0 1714 GP F 18 U GT3 T 4 3 other ... no 4 3 3 1 1 3 0 14 13
1 1254 GP F 17 U GT3 T 4 3 health ... yes 4 4 3 1 3 4 0 13 15
2 1639 GP F 16 R GT3 T 4 4 health ... no 2 4 4 2 3 4 6 10 11
3 1118 GP M 16 U GT3 T 4 4 services ... no 5 3 3 1 3 5 0 15 13
4 1499 GP M 19 U GT3 T 3 2 services ... yes 4 5 4 1 1 4 0 5 0

5 rows × 33 columns

StudentID G3
0 1714 14
1 1254 15
2 1639 11
3 1118 13
4 1499 0
#x_train.isnull().sum()
#x_test.isnull().sum()

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error , mean_absolute_error , mean_absolute_percentage_error ,r2_score
import numpy as np




drop_col = ['StudentID']

x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)
y_train_target = y_train['G3']

x_train_dum = pd.get_dummies(x_train_drop)
x_test_dum = pd.get_dummies(x_test_drop)[x_train_dum.columns]


xtr , xt , ytr, yt = train_test_split(x_train_dum , y_train_target)

rf = RandomForestRegressor(random_state =42)
rf.fit(xtr,ytr)

y_validation_pred = rf.predict(xt)

# Model evaluation
# mse : mean_squared_error / mae : mean_absolute_error  / mape : mean_absolute_percentage_error
# rmse : root mean squared error -> no dedicated function in older scikit-learn, so take np.sqrt of mean_squared_error

# Check the (y_true, y_pred) argument order with help() before using these functions

#mse 
print('validation mse' ,mean_squared_error(yt,y_validation_pred))

#mae 
print('validation mae' ,mean_absolute_error(yt,y_validation_pred))

#mape 
print('validation mape' ,mean_absolute_percentage_error(yt,y_validation_pred))

#rmse
print('validation rmse' ,np.sqrt(mean_squared_error(yt,y_validation_pred)))

#r2
print('validation r2 score' ,r2_score(yt,y_validation_pred))
validation mse 1.2227864771241832
validation mae 0.7917254901960783
validation mape 178554479343983.25
validation rmse 1.1058
validation r2 score 0.8846521985576391
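
The very large validation MAPE above is most likely caused by zeros in the target: G3 is 0 in the fifth sample row, and MAPE divides each error by the true value. Below is a small sketch on hypothetical numbers (the arrays `y_true` and `y_pred` are made up for illustration). It shows this effect and also computes RMSE as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Hypothetical targets; one true value is 0, as can happen with G3
y_true = np.array([14, 15, 11, 13, 0])
y_pred = np.array([13.5, 14.0, 11.5, 12.0, 1.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE

# A single zero in y_true is enough to blow up the average
mape_all = mean_absolute_percentage_error(y_true, y_pred)

# Same metric restricted to rows whose true value is non-zero
nonzero = y_true != 0
mape_nonzero = mean_absolute_percentage_error(y_true[nonzero], y_pred[nonzero])

print('mse', mse)
print('rmse', rmse)
print('mape (all rows)', mape_all)
print('mape (non-zero targets only)', mape_nonzero)
```

When the target can be zero, MAE or RMSE are usually easier to interpret than MAPE. Whether dropping zero-target rows is acceptable depends on the scoring rule you are given.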

Used Car Price Prediction Data#

Kaggle shared code repository

Write your own code, collect upvotes, and earn a bronze medal. Kaggle notebook link

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/X_test.csv")


display(x_train.head())
display(y_train.head())
carID brand model year transmission mileage fuelType tax mpg engineSize
0 13207 hyundi Santa Fe 2019 Semi-Auto 4223 Diesel 145.0 39.8 2.2
1 17314 vauxhall GTC 2015 Manual 47870 Diesel 125.0 60.1 2.0
2 12342 audi RS4 2019 Automatic 5151 Petrol 145.0 29.1 2.9
3 13426 vw Scirocco 2016 Automatic 20423 Diesel 30.0 57.6 2.0
4 16004 skoda Scala 2020 Semi-Auto 3569 Petrol 145.0 47.1 1.0
carID price
0 13207 31995
1 17314 7700
2 12342 58990
3 13426 12999
4 16004 16990
#x_train.isnull().sum()
#x_test.isnull().sum()


from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error , mean_absolute_error , mean_absolute_percentage_error ,r2_score
import numpy as np


drop_col = ['carID']

x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)
y_train_target = y_train['price']


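# Some `model` values exist only in train and not in test, so the two sets are concatenated and one-hot encoded together
# This raises a data leakage concern, but the Big Data Analysis Engineer exam does not penalize solving it this way
# The textbook approach: give test the same columns as train, fill categories missing from test with 0, and drop categories that exist in test but not in train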


combined = pd.concat([x_train_drop, x_test_drop])
combined_encoded = pd.get_dummies(combined)
train_encoded = combined_encoded[:len(x_train_drop)]
test_encoded = combined_encoded[len(x_train_drop):]

xtr , xt , ytr, yt = train_test_split(train_encoded , y_train_target)

rf = RandomForestRegressor(random_state =42)
rf.fit(xtr,ytr)

y_validation_pred = rf.predict(xt)

# Model evaluation
# mse : mean_squared_error / mae : mean_absolute_error  / mape : mean_absolute_percentage_error
# rmse : root mean squared error -> no dedicated function in older scikit-learn, so take np.sqrt of mean_squared_error

# Check the (y_true, y_pred) argument order with help() before using these functions

#mse 
print('validation mse' ,mean_squared_error(yt,y_validation_pred))

#mae 
print('validation mae' ,mean_absolute_error(yt,y_validation_pred))

#mape 
print('validation mape' ,mean_absolute_percentage_error(yt,y_validation_pred))

#rmse
print('validation rmse' ,np.sqrt(mean_squared_error(yt,y_validation_pred)))

#r2
print('validation r2 score' ,r2_score(yt,y_validation_pred))
validation mse 14447219.441083966
validation mae 1990.275561392729
validation mape 0.09895579384297525
validation rmse 3800.95
validation r2 score 0.9385284823923195
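
The comments in the cell above describe a "textbook" alternative to encoding train and test together: encode each set separately, then align test to the train columns. One way to sketch this is with `DataFrame.reindex`. The miniature frames below are hypothetical, and dropping test-only categories at the column level is one interpretation of that comment.

```python
import pandas as pd

# Hypothetical miniature frames; 'model' has different categories in each split
train_demo = pd.DataFrame({'model': ['Santa Fe', 'GTC', 'RS4'],
                           'mileage': [4223, 47870, 5151]})
test_demo = pd.DataFrame({'model': ['GTC', 'Scala'],
                          'mileage': [20423, 3569]})

train_enc = pd.get_dummies(train_demo)

# Encode test on its own, then force it onto the train columns:
# train-only categories become all-zero columns, test-only categories are dropped
test_enc = pd.get_dummies(test_demo).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Because nothing from test influences the train encoding, this avoids the leakage concern mentioned in the comments, at the cost of ignoring categories the model never saw during training.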

Medical Cost Prediction Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/x_test.csv")


display(x_train.head())
display(y_train.head())
ID age sex bmi children smoker region
0 2 35 female 35.860 2 no southeast
1 3 28 female 23.845 2 no northwest
2 4 23 female 32.780 2 yes southeast
3 6 52 female 25.300 2 yes southeast
4 7 63 male 39.800 3 no southwest
ID charges
0 2 5836.52040
1 3 4719.73655
2 4 36021.01120
3 6 24667.41900
4 7 15170.06900

King County House Price Prediction Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_test.csv")


display(x_train.head())
display(y_train.head())
ID id date bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 2 8651400730 20150428T000000 3 1.00 840 5525 1.0 0 0 ... 6 840 0 1969 0 98042 47.3607 -122.085 920 5330
1 3 3163600130 20150317T000000 3 1.00 1250 8000 1.0 0 0 ... 7 1250 0 1956 0 98146 47.5065 -122.337 1040 6973
2 4 5045700330 20140725T000000 4 2.50 2200 6400 2.0 0 0 ... 8 2200 0 2010 0 98059 47.4856 -122.156 2600 5870
3 5 1036100130 20140808T000000 3 2.50 1980 39932 2.0 0 0 ... 8 1980 0 1994 0 98011 47.7433 -122.196 2610 12769
4 6 7696630080 20140506T000000 3 1.75 1690 7735 1.0 0 0 ... 7 1060 630 1976 0 98001 47.3324 -122.280 1580 7503

5 rows × 21 columns

ID price
0 2 191000.0
1 3 234900.0
2 4 460000.0
3 5 442000.0
4 6 197000.0
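
The date column in this dataset is a string such as 20150428T000000. Passing it straight to pd.get_dummies would create one column per distinct date, so a common preprocessing step is to parse it and keep a few numeric parts. Below is a minimal sketch on three hypothetical rows. The derived column names `sale_year` and `sale_month` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows using the date format shown in the table above
demo = pd.DataFrame({'date': ['20150428T000000',
                              '20150317T000000',
                              '20140725T000000']})

# Parse with an explicit format, then keep numeric parts as features
parsed = pd.to_datetime(demo['date'], format='%Y%m%dT%H%M%S')
demo['sale_year'] = parsed.dt.year
demo['sale_month'] = parsed.dt.month

# The raw string is no longer needed once the parts are extracted
demo = demo.drop(columns=['date'])
print(demo)
```

The same transformation would have to be applied to x_test so that both sets end up with identical columns.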

Graduate Admission Chance Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/x_test.csv")


display(x_train.head())
display(y_train.head())
ID Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research
0 0 67 327 114 3 3.0 3.0 9.02 0
1 1 112 321 109 4 4.0 4.0 8.68 1
2 2 495 301 99 3 2.5 2.0 8.45 1
3 3 356 317 106 2 2.0 3.5 8.12 0
4 4 250 321 111 3 3.5 4.0 8.83 1
ID Chance of Admit
0 0 0.61
1 1 0.69
2 2 0.68
3 3 0.73
4 4 0.77

Red Wine Quality Prediction Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/x_test.csv")


display(x_train.head())
display(y_train.head())
ID fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 1 10.6 0.44 0.68 4.1 0.114 6.0 24.0 0.99700 3.06 0.66 13.4
1 2 7.0 0.60 0.30 4.5 0.068 20.0 110.0 0.99914 3.30 1.17 10.2
2 3 8.0 0.43 0.36 2.3 0.075 10.0 48.0 0.99760 3.34 0.46 9.4
3 4 7.9 0.53 0.24 2.0 0.072 15.0 105.0 0.99600 3.27 0.54 9.4
4 5 8.0 0.45 0.23 2.2 0.094 16.0 29.0 0.99620 3.21 0.49 10.2
ID quality
0 1 6
1 2 5
2 3 5
3 4 6
4 5 6

Hyundai Vehicle Price Prediction Data#

import pandas as pd
# Load data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/x_test.csv")


display(x_train.head())
display(y_train.head())
ID model year transmission mileage fuelType tax(£) mpg engineSize
0 0 I30 2019 Manual 21 Petrol 150 34.0 2.0
1 1 Santa Fe 2018 Semi-Auto 10500 Diesel 145 39.8 2.2
2 2 Tucson 2017 Manual 29968 Diesel 30 61.7 1.7
3 3 Kona 2018 Manual 27317 Petrol 145 52.3 1.0
4 4 Tucson 2018 Semi-Auto 31459 Diesel 145 57.7 1.7
ID price
0 0 23995
1 1 28490
2 2 13251
3 3 14990
4 4 17591
