Task Type 2 (Python)#
Note
For every problem, the y_test values can be loaded as y_test from the URL listed with the dataset. Use them to check the score of the predictions you actually built for submission.
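As a quick sketch of that check (the tiny y_test and submission frames below are made up for illustration; in the real problems y_test is read with pd.read_csv from the URL listed under each dataset):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

# hypothetical stand-ins for the real y_test file and your submission file
y_test = pd.DataFrame({'CustomerId': [1, 2, 3, 4], 'Exited': [0, 1, 0, 1]})
submission = pd.DataFrame({'CustomerId': [1, 2, 3, 4], 'Exited': [0.1, 0.8, 0.3, 0.6]})

# label metrics need hard predictions; probability metrics take the raw scores
print('accuracy :', accuracy_score(y_test['Exited'], (submission['Exited'] > 0.5).astype(int)))  # 1.0
print('auc      :', roc_auc_score(y_test['Exited'], submission['Exited']))  # 1.0
```

Label metrics (accuracy, f1, recall, precision) need hard 0/1 predictions, while AUC takes the raw probabilities, mirroring the predict / predict_proba split used throughout this page.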
Classification#
Customer Churn Prediction Data#
Attention
Data description: predicting churn from the company's service using customer profile data (target variable: Exited)
x_train : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv
y_train : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv
x_test : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_test.csv
Data source: https://www.kaggle.com/shubh0799/churn-modelling (modified)
Kaggle shared-code repository
Write your own solution, collect upvotes, and try to earn a bronze medal: Kaggle notebook link
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv")
x_test = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv")
display(x_train.head())
display(y_train.head())
| | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15799217 | Zetticci | 791 | Germany | Female | 35 | 7 | 52436.20 | 1 | 1 | 0 | 161051.75 |
1 | 15748986 | Bischof | 705 | Germany | Male | 42 | 8 | 166685.92 | 2 | 1 | 1 | 55313.51 |
2 | 15722004 | Hsiung | 543 | France | Female | 31 | 4 | 138317.94 | 1 | 0 | 0 | 61843.73 |
3 | 15780966 | Pritchard | 709 | France | Female | 32 | 2 | 0.00 | 2 | 0 | 0 | 109681.29 |
4 | 15636731 | Ts'ai | 714 | Germany | Female | 36 | 1 | 101609.01 | 2 | 1 | 1 | 447.73 |
| | CustomerId | Exited |
|---|---|---|
0 | 15799217 | 0 |
1 | 15748986 | 0 |
2 | 15722004 | 0 |
3 | 15780966 | 0 |
4 | 15636731 | 0 |
# print(x_train.info())
# print(x_train.nunique())  # may not work on older pandas versions
drop_col = ['CustomerId', 'Surname']
x_train_drop = x_train.drop(columns=drop_col)
x_test_drop = x_test.drop(columns=drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Exited']
x_test_dummies = pd.get_dummies(x_test_drop)
# Keep the test columns in the same order as train (get_dummies already sorts the columns,
# so an error here means a category is missing from the test set)
x_test_dummies = x_test_dummies[x_train_dummies.columns]

# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state=23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

# model scores
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:, 1]

# Check the metric the problem actually asks for:
# accuracy, f1_score, recall, precision -> use model.predict
# auc, or anything phrased as a "probability" -> use model.predict_proba and take the
# positive-class column: model.predict_proba()[:, 1]
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_prob))

# Apply the same approach to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 1]
# accuracy, f1_score, recall, precision
# pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probability
# pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 1.0
validation accuracy : 0.8652680652680653
train f1_score : 1.0
validation f1_score : 0.5912305516265912
train recall_score : 1.0
validation recall_score : 0.4543478260869565
train precision_score : 1.0
validation precision_score : 0.8461538461538461
train auc : 1.0
validation auc : 0.8497613211198555
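One caveat about the column-alignment step `x_test_dummies = x_test_dummies[x_train_dummies.columns]`: it raises a KeyError when a category seen in train never occurs in test. A more defensive sketch on toy data (the assumption being that a dummy column absent from test should simply be all zeros):

```python
import pandas as pd

train = pd.DataFrame({'Geography': ['France', 'Germany', 'Spain'], 'Age': [30, 40, 50]})
test = pd.DataFrame({'Geography': ['France', 'France'], 'Age': [25, 35]})  # 'Germany'/'Spain' missing

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# reindex adds any dummy column test lacks (filled with 0) and enforces train's column order
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))  # True
```

Plain bracket indexing still works whenever all categories appear on both sides, which is why the shorter form is used throughout this page.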
Job Change Prediction Data#
Attention
Data description: job-change prediction data (target: 1 = changed jobs, 0 = did not change jobs)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_test.csv
Data source: https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists (reference; data modified)
Kaggle shared-code repository
Write your own solution, collect upvotes, and try to earn a bronze medal: Kaggle notebook link
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv")
x_test = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv")
display(x_train.head())
display(y_train.head())
| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25298 | city_138 | 0.836 | Male | No relevent experience | Full time course | High School | NaN | 5 | 100-500 | Pvt Ltd | 1 | 45 |
1 | 4241 | city_160 | 0.920 | Male | No relevent experience | Full time course | High School | NaN | 5 | NaN | NaN | 1 | 17 |
2 | 24086 | city_57 | 0.866 | Male | No relevent experience | no_enrollment | Graduate | STEM | 10 | NaN | NaN | 1 | 50 |
3 | 26773 | city_16 | 0.910 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | 50-99 | Pvt Ltd | >4 | 135 |
4 | 32325 | city_143 | 0.740 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 17 |
| | enrollee_id | target |
|---|---|---|
0 | 25298 | 0.0 |
1 | 4241 | 1.0 |
2 | 24086 | 0.0 |
3 | 26773 | 0.0 |
4 | 32325 | 1.0 |
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
# There are missing values, but here they get no special handling before dummy-encoding.
# Drop categorical columns that have a fairly large number of unique values.
drop_col = ['enrollee_id', 'city', 'company_type', 'experience']
x_train_drop = x_train.drop(columns=drop_col)
x_test_drop = x_test.drop(columns=drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['target'].astype('int')
x_test_dummies = pd.get_dummies(x_test_drop)
# Keep the test columns in the same order as train (get_dummies already sorts the columns,
# so an error here means a category is missing from the test set)
x_test_dummies = x_test_dummies[x_train_dummies.columns]

# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state=23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

# model scores
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:, 1]

# Check the metric the problem actually asks for:
# accuracy, f1_score, recall, precision -> use model.predict
# auc, or anything phrased as a "probability" -> use model.predict_proba and take the
# positive-class column: model.predict_proba()[:, 1]
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_prob))

# Apply the same approach to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 1]
# accuracy, f1_score, recall, precision
# pd.DataFrame({'enrollee_id': x_test.enrollee_id, 'target': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probability
# pd.DataFrame({'enrollee_id': x_test.enrollee_id, 'target': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 0.9965236154399425
validation accuracy : 0.7535279805352798
train f1_score : 0.9929731039496001
validation f1_score : 0.42736009044657997
train recall_score : 0.9927325581395349
validation recall_score : 0.3631123919308357
train precision_score : 0.9932137663596704
validation precision_score : 0.5192307692307693
train auc : 0.9998907607098495
validation auc : 0.740677513569584
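The drop list above (city, company_type, experience) encodes the rule "drop high-cardinality categorical columns"; that rule can be made explicit before hard-coding drop_col. A toy sketch (the threshold of 20 is an arbitrary assumption, not part of the original solution):

```python
import pandas as pd

df = pd.DataFrame({
    'city': [f'city_{i}' for i in range(100)],   # high cardinality -> drop candidate
    'gender': ['Male', 'Female'] * 50,           # low cardinality -> keep and dummy-encode
    'training_hours': range(100),                # numeric -> keep as-is
})

cat_cols = df.select_dtypes(include='object').columns
high_card = [c for c in cat_cols if df[c].nunique() > 20]
print(high_card)  # ['city']
```

Listing candidates this way avoids silently dummy-encoding a column like city into hundreds of sparse indicator columns.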
On-Time Delivery Prediction (exam round 2)#
Attention
Data description: whether an e-commerce delivery arrives on time (1: on time, 0: not on time)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_test.csv
Data source: https://www.kaggle.com/datasets/prachi13/customer-analytics (reference; data modified)
Kaggle shared-code repository
Write your own solution, collect upvotes, and try to earn a bronze medal: Kaggle notebook link
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv")
x_test = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv")
display(x_train.head())
display(y_train.head())
| | ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6045 | A | Flight | 4 | 3 | 266 | 5 | high | F | 5 | 1590 |
1 | 44 | F | Ship | 3 | 1 | 174 | 2 | low | M | 44 | 1556 |
2 | 7940 | F | Road | 4 | 1 | 154 | 10 | high | M | 10 | 5674 |
3 | 1596 | F | Ship | 4 | 3 | 158 | 3 | medium | F | 27 | 1207 |
4 | 4395 | A | Flight | 5 | 3 | 175 | 3 | low | M | 7 | 4833 |
| | ID | Reached.on.Time_Y.N |
|---|---|---|
0 | 6045 | 0 |
1 | 44 | 1 |
2 | 7940 | 1 |
3 | 1596 | 1 |
4 | 4395 | 1 |
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
# Missing values, if any, get no special handling before dummy-encoding.
# Drop categorical columns that have a fairly large number of unique values.
drop_col = ['ID']
x_train_drop = x_train.drop(columns=drop_col)
x_test_drop = x_test.drop(columns=drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Reached.on.Time_Y.N']
x_test_dummies = pd.get_dummies(x_test_drop)
# Keep the test columns in the same order as train (get_dummies already sorts the columns,
# so an error here means a category is missing from the test set)
x_test_dummies = x_test_dummies[x_train_dummies.columns]

# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state=23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

# model scores
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:, 1]

# Check the metric the problem actually asks for:
# accuracy, f1_score, recall, precision -> use model.predict
# auc, or anything phrased as a "probability" -> use model.predict_proba and take the
# positive-class column: model.predict_proba()[:, 1]
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_prob))

# Apply the same approach to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 1]
# accuracy, f1_score, recall, precision
# pd.DataFrame({'ID': x_test.ID, 'Reached.on.Time_Y.N': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probability
# pd.DataFrame({'ID': x_test.ID, 'Reached.on.Time_Y.N': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 1.0
validation accuracy : 0.6395775941230487
train f1_score : 1.0
validation f1_score : 0.6744089589382
train recall_score : 1.0
validation recall_score : 0.630721489526765
train precision_score : 1.0
validation precision_score : 0.7245989304812834
train auc : 1.0
validation auc : 0.7261997118475008
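Before submitting, it is worth reloading the saved file and sanity-checking what will actually be graded. A minimal sketch, reusing the filename and columns from the commented-out submission lines above (with a few made-up rows):

```python
import pandas as pd

# hypothetical submission frame; in practice this comes from rf.predict / rf.predict_proba
submission = pd.DataFrame({'ID': [6045, 44, 7940], 'Reached.on.Time_Y.N': [0, 1, 1]})
submission.to_csv('003000000.csv', index=False)

# reload and check shape, column names, and that no index column leaked in
check = pd.read_csv('003000000.csv')
print(check.shape)          # (3, 2)
print(list(check.columns))  # ['ID', 'Reached.on.Time_Y.N']
```

Forgetting index=False is a common mistake that adds an unwanted extra column, so this round-trip check catches it immediately.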
Adult Health Checkup Data#
Attention
Data description: 2018 adult health checkup data (target: smoking status, 1 = smoker, 0 = non-smoker)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_test.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_test.csv
Data source: https://www.data.go.kr/data/15007122/fileData.do (reference; data modified)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_train.csv")
x_test = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_test.csv")
display(x_train.head())
display(y_train.head())
| | ID | 성별코드 | 연령대코드(5세단위) | 신장(5Cm단위) | 체중(5Kg단위) | 허리둘레 | 시력(좌) | 시력(우) | 청력(좌) | 청력(우) | 수축기혈압 | 이완기혈압 | 식전혈당(공복혈당) | 총콜레스테롤 | 트리글리세라이드 | HDL콜레스테롤 | LDL콜레스테롤 | 혈색소 | 요단백 | 혈청크레아티닌 | (혈청지오티)AST | (혈청지오티)ALT | 감마지티피 | 구강검진수검여부 | 치아우식증유무 | 치석 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | F | 40 | 155 | 60 | 81.3 | 1.2 | 1.0 | 1.0 | 1.0 | 114.0 | 73.0 | 94.0 | 215.0 | 82.0 | 73.0 | 126.0 | 12.9 | 1.0 | 0.7 | 18.0 | 19.0 | 27.0 | Y | 0.0 | Y |
1 | 1 | F | 40 | 160 | 60 | 81.0 | 0.8 | 0.6 | 1.0 | 1.0 | 119.0 | 70.0 | 130.0 | 192.0 | 115.0 | 42.0 | 127.0 | 12.7 | 1.0 | 0.6 | 22.0 | 19.0 | 18.0 | Y | 0.0 | Y |
2 | 2 | M | 55 | 170 | 60 | 80.0 | 0.8 | 0.8 | 1.0 | 1.0 | 138.0 | 86.0 | 89.0 | 242.0 | 182.0 | 55.0 | 151.0 | 15.8 | 1.0 | 1.0 | 21.0 | 16.0 | 22.0 | Y | 0.0 | N |
3 | 3 | M | 40 | 165 | 70 | 88.0 | 1.5 | 1.5 | 1.0 | 1.0 | 100.0 | 60.0 | 96.0 | 322.0 | 254.0 | 45.0 | 226.0 | 14.7 | 1.0 | 1.0 | 19.0 | 26.0 | 18.0 | Y | 0.0 | Y |
4 | 4 | F | 40 | 155 | 60 | 86.0 | 1.0 | 1.0 | 1.0 | 1.0 | 120.0 | 74.0 | 80.0 | 184.0 | 74.0 | 62.0 | 107.0 | 12.5 | 1.0 | 0.6 | 16.0 | 14.0 | 22.0 | Y | 0.0 | N |
| | ID | 흡연상태 |
|---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | 2 | 1 |
3 | 3 | 0 |
4 | 4 | 0 |
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
# Missing values, if any, get no special handling before dummy-encoding.
# Drop categorical columns with many unique values; 구강검진수검여부 has a single
# unique value, so it carries no information and is dropped as well.
drop_col = ['ID', '구강검진수검여부']
x_train_drop = x_train.drop(columns=drop_col)
x_test_drop = x_test.drop(columns=drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['흡연상태']
x_test_dummies = pd.get_dummies(x_test_drop)
# Keep the test columns in the same order as train (get_dummies already sorts the columns,
# so an error here means a category is missing from the test set)
x_test_dummies = x_test_dummies[x_train_dummies.columns]

# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42, stratify=y)
rf = RandomForestClassifier(random_state=23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

# model scores
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:, 1]

# Check the metric the problem actually asks for:
# accuracy, f1_score, recall, precision -> use model.predict
# auc, or anything phrased as a "probability" -> use model.predict_proba and take the
# positive-class column: model.predict_proba()[:, 1]
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_prob))

# Apply the same approach to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 1]
# accuracy, f1_score, recall, precision
# pd.DataFrame({'ID': x_test.ID, '흡연상태': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probability
# pd.DataFrame({'ID': x_test.ID, '흡연상태': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 1.0
validation accuracy : 0.75637624974495
train f1_score : 1.0
validation f1_score : 0.6791472590469365
train recall_score : 1.0
validation recall_score : 0.7025574499629355
train precision_score : 1.0
validation precision_score : 0.657246879334258
train auc : 1.0
validation auc : 0.834807387299372
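Unlike the earlier sections, the split here passes stratify=y, which keeps the class ratio identical in train and validation. A small sketch of the effect on an imbalanced synthetic label:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'x': range(1000)})
y = pd.Series([1] * 300 + [0] * 700)  # 30% positive, like an imbalanced target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
# both sides keep the 30% positive rate exactly
print(round(y_tr.mean(), 2), round(y_val.mean(), 2))  # 0.3 0.3
```

Without stratify, the positive rate in a random split can drift by a few percentage points, which noticeably shifts recall and f1 on smaller validation sets.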
Car Insurance Purchase Prediction Data#
Attention
Data description: predicting car insurance purchase (target Response: 1 = purchased, 0 = not purchased)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_test.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/y_test.csv
Data source: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction (reference; data modified)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/y_train.csv")
x_test = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_test.csv")
display(x_train.head())
display(y_train.head())
| | ID | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Female | 23 | 1 | 8.0 | 0 | < 1 Year | Yes | 61354.0 | 152.0 | 235 | NaN |
1 | 1 | Male | 27 | 1 | 28.0 | 1 | < 1 Year | No | 38036.0 | 152.0 | 207 | NaN |
2 | 2 | Female | 23 | 1 | 45.0 | 0 | < 1 Year | Yes | 25984.0 | 152.0 | 217 | NaN |
3 | 3 | Male | 22 | 1 | 46.0 | 0 | < 1 Year | No | 39499.0 | 152.0 | 277 | NaN |
4 | 4 | Male | 32 | 1 | 30.0 | 1 | < 1 Year | No | 38771.0 | 152.0 | 251 | NaN |
| | ID | Response |
|---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | 2 | 1 |
3 | 3 | 0 |
4 | 4 | 0 |
# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
# The id column was included by mistake when the dataset was prepared; drop it along with ID.
drop_col = ['ID', 'id']
x_train_drop = x_train.drop(columns=drop_col)
x_test_drop = x_test.drop(columns=drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Response']
x_test_dummies = pd.get_dummies(x_test_drop)
# Keep the test columns in the same order as train (get_dummies already sorts the columns,
# so an error here means a category is missing from the test set)
x_test_dummies = x_test_dummies[x_train_dummies.columns]

# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42, stratify=y)
rf = RandomForestClassifier(random_state=23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

# model scores
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:, 1]

# Check the metric the problem actually asks for:
# accuracy, f1_score, recall, precision -> use model.predict
# auc, or anything phrased as a "probability" -> use model.predict_proba and take the
# positive-class column: model.predict_proba()[:, 1]
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_prob))

# Apply the same approach to the test data
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 1]
# accuracy, f1_score, recall, precision
# pd.DataFrame({'ID': x_test.ID, 'Response': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probability
# pd.DataFrame({'ID': x_test.ID, 'Response': predict_test_proba}).to_csv('003000000.csv', index=False)
train accuracy : 0.9999314646014666
validation accuracy : 0.8655143967479352
train f1_score : 0.9997203691127712
validation f1_score : 0.18118003025718607
train recall_score : 0.9995606502376483
validation recall_score : 0.12140134620063255
train precision_score : 0.9998801390387151
validation precision_score : 0.3569384835479256
train auc : 0.9999998704194665
validation auc : 0.8340167689531695
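The large gap between validation accuracy (≈0.87) and f1 (≈0.18) above comes from heavy class imbalance in Response. One lever worth exploring (not used in the solution above, and not guaranteed to help) is class_weight='balanced'; a toy sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# roughly 90/10 imbalanced binary problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0, stratify=y)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(random_state=0, class_weight='balanced').fit(X_tr, y_tr)

# compare minority-class recall on validation; whether weighting helps varies by dataset
print('plain recall    :', recall_score(y_val, plain.predict(X_val)))
print('balanced recall :', recall_score(y_val, weighted.predict(X_val)))
```

Whatever option is tried, the decision should rest on the validation metric the problem grades, not on training scores.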
Airline Passenger Satisfaction Data#
Attention
Data description: airline passenger satisfaction (satisfaction column: 'neutral or dissatisfied' or 'satisfied'), shape (83123, 24)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_test.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_test.csv
Data source: https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv (reference; data modified)
Attention
For the test data, compute the probability of predicting 'neutral or dissatisfied' and submit that probability.
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv")
x_test = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_test.csv")
display(x_train.head())
display(y_train.head())
| | ID | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Female | Loyal Customer | 54 | Personal Travel | Eco | 1068 | 3 | 4 | 3 | 1 | 5 | 4 | 4 | 5 | 5 | 3 | 5 | 3 | 5 | 3 | 47 | 22.0 | NaN |
1 | 2 | Male | Loyal Customer | 20 | Personal Travel | Eco | 1546 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 3 | 4 | 4 | 4 | 4 | 5 | 2.0 | NaN |
2 | 3 | Male | Loyal Customer | 59 | Business travel | Business | 2962 | 0 | 4 | 0 | 4 | 2 | 4 | 5 | 1 | 1 | 1 | 1 | 5 | 1 | 4 | 54 | 46.0 | NaN |
3 | 4 | Male | Loyal Customer | 35 | Business travel | Eco Plus | 106 | 5 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 2 | 1 | 5 | 4 | 4 | 5 | 130 | 121.0 | NaN |
4 | 5 | Female | Loyal Customer | 9 | Business travel | Business | 2917 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 3 | 4 | 0 | 0.0 | NaN |
| | ID | satisfaction |
|---|---|---|
0 | 0 | neutral or dissatisfied |
1 | 2 | neutral or dissatisfied |
2 | 3 | satisfied |
3 | 4 | satisfied |
4 | 5 | satisfied |
# print(x_train.info())
# print(x_train.nunique())
print(x_train.isnull().sum())
mean_values = x_train['Arrival Delay in Minutes'].mean()
x_train['Arrival Delay in Minutes'] = x_train['Arrival Delay in Minutes'].fillna(mean_values)
# To avoid data leakage, missing values should in principle be filled with statistics
# computed on train; if that is hard to keep track of, filling the test set's missing
# values with the test mean is an acceptable fallback.
x_test['Arrival Delay in Minutes'] = x_test['Arrival Delay in Minutes'].fillna(mean_values)
# The id column was included by mistake when the dataset was prepared; drop it along with ID.
drop_col = ['ID', 'id']
x_train_drop = x_train.drop(columns=drop_col)
x_test_drop = x_test.drop(columns=drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['satisfaction']
x_test_dummies = pd.get_dummies(x_test_drop)
x_test_dummies = x_test_dummies[x_train_dummies.columns]

# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42, stratify=y)
rf = RandomForestClassifier(random_state=23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

# model scores
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:, 1]

# Check the metric the problem actually asks for.
# Because training used string labels instead of 0/1, pos_label must be specified
# for f1 / recall / precision.
# accuracy, f1_score, recall, precision -> use model.predict
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label, pos_label='neutral or dissatisfied'))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label, pos_label='neutral or dissatisfied'))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label, pos_label='neutral or dissatisfied'))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label, pos_label='neutral or dissatisfied'))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label, pos_label='neutral or dissatisfied'))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label, pos_label='neutral or dissatisfied'))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_prob))

# Apply the same approach to the test data
predict_test_label = rf.predict(x_test_dummies)
# 'neutral or dissatisfied' is class 0 (first in alphabetical order), so take predict_proba()[:, 0]
# If this is confusing, it is also fine to label-encode y from the start:
# dic = {'neutral or dissatisfied': 1, 'satisfied': 0}
# y = y_train['satisfaction'].map(dic)
# in that case 'neutral or dissatisfied' becomes class 1, so take predict_proba()[:, 1]
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 0]
# accuracy, f1_score, recall, precision
# pd.DataFrame({'ID': x_test.ID, 'satisfaction': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probability
# pd.DataFrame({'ID': x_test.ID, 'satisfaction': predict_test_proba}).to_csv('003000000.csv', index=False)
ID 0
Gender 0
Customer Type 0
Age 0
Type of Travel 0
Class 0
Flight Distance 0
Inflight wifi service 0
Departure/Arrival time convenient 0
Ease of Online booking 0
Gate location 0
Food and drink 0
Online boarding 0
Seat comfort 0
Inflight entertainment 0
On-board service 0
Leg room service 0
Baggage handling 0
Checkin service 0
Inflight service 0
Cleanliness 0
Departure Delay in Minutes 0
Arrival Delay in Minutes 0
id 83123
dtype: int64
train accuracy : 1.0
validation accuracy : 0.9611388574969925
train f1_score : 1.0
validation f1_score : 0.9661221636051611
train recall_score : 1.0
validation recall_score : 0.9778692743180648
train precision_score : 1.0
validation precision_score : 0.954653937947494
train auc : 1.0
validation auc : 0.9936992375795042
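Rather than memorizing that 'neutral or dissatisfied' maps to predict_proba()[:, 0], the mapping can be read off model.classes_, whose order defines the columns of predict_proba. A small sketch on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0], [0], [1], [1]])
y = np.array(['neutral or dissatisfied', 'neutral or dissatisfied', 'satisfied', 'satisfied'])

rf = RandomForestClassifier(random_state=0).fit(X, y)
# classes_ is sorted, so it shows which predict_proba column belongs to which label
print(rf.classes_)  # ['neutral or dissatisfied' 'satisfied']

col = list(rf.classes_).index('neutral or dissatisfied')  # -> 0
proba = rf.predict_proba(X)[:, col]  # probability of 'neutral or dissatisfied'
```

Looking the column up via classes_ keeps working even if the label strings, or their alphabetical order, change.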
Water Potability Data#
Attention
Data description: water potability (Potability column: 0 or 1)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_test.csv
x_label(평가용) : https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_test.csv
데이터 출처 :https://www.kaggle.com/adityakadiwal/water-potability
import pandas as pd
#데이터 로드
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 8.662710 | 173.531947 | 20333.079495 | 5.636388 | 439.787938 | 459.633120 | 16.283311 | 89.924253 | 5.120103 |
1 | 1 | NaN | 226.270824 | 15380.124079 | 6.661474 | NaN | 392.558205 | 14.083110 | 50.286395 | 4.516870 |
2 | 2 | 7.583770 | 217.283262 | 36343.407055 | 8.532726 | 375.964391 | 393.877683 | 17.442301 | 77.722257 | 3.642289 |
3 | 3 | 6.584813 | 182.375456 | 24723.106296 | 6.238920 | NaN | 414.350751 | 17.582615 | 78.213738 | 4.404132 |
4 | 4 | 7.179864 | 180.854211 | 10859.553752 | 8.263503 | 341.302486 | 358.056264 | 12.065317 | 83.329918 | 3.878447 |
ID | Potability | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | 2 | 0 |
3 | 3 | 0 |
4 | 4 | 0 |
# print(x_train.info())
# print(x_train.nunique())
print(x_train.isnull().sum())
for col in x_train.isnull().sum().where(lambda x : x != 0).dropna().index:
    mean_values = x_train[col].mean()
    x_train[col] = x_train[col].fillna(mean_values)
    # to avoid data leakage, missing values should in principle be filled with train statistics;
    # if that is hard to keep track of, filling test NaNs with the test mean is also accepted here
    x_test[col] = x_test[col].fillna(mean_values)
drop_col = ['ID']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)
# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Potability']
x_test_dummies = pd.get_dummies(x_test_drop)
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))
X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42,stratify=y)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)
# import sklearn.metrics
# print(dir(sklearn.metrics))
from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score
#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]
predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]
# check model performance with whichever metric the question asks for
# if you trained with the labels left as strings rather than numbers, pass pos_label to these metrics
# accuracy, f1_score, recall, precision -> computed from model.predict results
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label ))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))
# apply the same procedure to the test data
predict_test_label = rf.predict(x_test_dummies)
# the positive class here is Potability == 1, so take column [:,1] of predict_proba
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]
# accuracy, f1_score, recall, precision
#pd.DataFrame({'ID': x_test.ID, 'Potability': predict_test_label}).to_csv('003000000.csv', index=False)
# auc, probabilities
#pd.DataFrame({'ID': x_test.ID, 'Potability': predict_test_proba}).to_csv('003000000.csv', index=False)
ID 0
ph 0
Hardness 0
Solids 0
Chloramines 0
Sulfate 0
Conductivity 0
Organic_carbon 0
Trihalomethanes 0
Turbidity 0
dtype: int64
train accuracy : 1.0
validation accuracy : 0.6497109826589595
train f1_score : 1.0
validation f1_score : 0.4116504854368932
train recall_score : 1.0
validation recall_score : 0.314540059347181
train precision_score : 1.0
validation precision_score : 0.5955056179775281
train auc : 1.0
validation auc : 0.6309515780954951
Drug Classification Dataset#
Attention
Data description : classifying which drug to administer (target : Drug)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_test.csv
Data source : https://www.kaggle.com/prathamtripathi/drug-classification (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | Age | Sex | BP | Cholesterol | Na_to_K | |
---|---|---|---|---|---|---|
0 | 0 | 36 | F | NORMAL | HIGH | 16.753 |
1 | 1 | 47 | F | LOW | HIGH | 11.767 |
2 | 2 | 69 | F | NORMAL | HIGH | 10.065 |
3 | 3 | 35 | M | LOW | NORMAL | 9.170 |
4 | 4 | 49 | M | LOW | NORMAL | 11.014 |
ID | Drug | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 3 |
2 | 2 | 4 |
3 | 3 | 4 |
4 | 4 | 4 |
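No worked solution is shown for this dataset, and unlike the previous problems the target Drug has more than two classes. As a minimal sketch, the frame below reuses the five sample rows from the head() output above instead of reading the CSV URLs, so treat it as an illustration only:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# the five rows displayed above; a real solution reads the x_train/y_train CSVs
x_train = pd.DataFrame({
    'Age': [36, 47, 69, 35, 49],
    'Sex': ['F', 'F', 'F', 'M', 'M'],
    'BP': ['NORMAL', 'LOW', 'NORMAL', 'LOW', 'LOW'],
    'Cholesterol': ['HIGH', 'HIGH', 'HIGH', 'NORMAL', 'NORMAL'],
    'Na_to_K': [16.753, 11.767, 10.065, 9.170, 11.014],
})
y = pd.Series([0, 3, 4, 4, 4], name='Drug')

x_dum = pd.get_dummies(x_train)                 # one-hot encode Sex/BP/Cholesterol
rf = RandomForestClassifier(random_state=42).fit(x_dum, y)
pred = rf.predict(x_dum)
# with a multiclass target, f1_score needs an averaging mode such as 'macro'
print(f1_score(y, pred, average='macro'))
```

For submission, predictions on an identically encoded test frame would be written out with to_csv as in the earlier problems.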
Fraudulent Firm Classification Dataset#
Attention
Data description : classifying fraudulent firms (target : Risk, 1 : fraud, 0 : normal)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_test.csv
Data source : https://www.kaggle.com/sid321axn/audit-data (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | Sector_score | LOCATION_ID | PARA_A | Score_A | Risk_A | PARA_B | Score_B | Risk_B | TOTAL | ... | PROB | RiSk_E | History | Prob | Risk_F | Score | Inherent_Risk | CONTROL_RISK | Detection_Risk | Audit_Risk | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2.37 | 16 | 0.01 | 0.2 | 0.002 | 0.007 | 0.2 | 0.0014 | 0.017 | ... | 0.2 | 0.4 | 0 | 0.2 | 0.0 | 2.0 | 1.4034 | 0.4 | 0.5 | 0.28068 |
1 | 2 | 55.57 | 9 | 1.06 | 0.4 | 0.424 | 0.000 | 0.2 | 0.0000 | 1.060 | ... | 0.2 | 0.4 | 0 | 0.2 | 0.0 | 2.2 | 1.8240 | 0.4 | 0.5 | 0.36480 |
2 | 3 | 55.57 | 16 | 2.42 | 0.6 | 1.452 | 3.530 | 0.6 | 2.1180 | 5.950 | ... | 0.2 | 0.4 | 0 | 0.2 | 0.0 | 3.8 | 7.4940 | 0.4 | 0.5 | 1.49880 |
3 | 4 | 2.37 | 9 | 0.31 | 0.2 | 0.062 | 0.690 | 0.2 | 0.1380 | 1.000 | ... | 0.2 | 0.4 | 0 | 0.2 | 0.0 | 2.0 | 1.6000 | 0.4 | 0.5 | 0.32000 |
4 | 5 | 55.57 | 6 | 0.62 | 0.2 | 0.124 | 0.420 | 0.2 | 0.0840 | 1.040 | ... | 0.2 | 0.4 | 0 | 0.2 | 0.0 | 2.0 | 1.6080 | 0.4 | 0.5 | 0.32160 |
5 rows × 27 columns
ID | Risk | |
---|---|---|
0 | 0 | 0 |
1 | 2 | 0 |
2 | 3 | 1 |
3 | 4 | 0 |
4 | 5 | 0 |
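No worked solution is shown here either. Fraud labels are typically imbalanced, so one option worth knowing is RandomForest's class_weight parameter. A minimal sketch using a few of the 27 numeric columns from the head() output above (a real solution reads the CSVs and uses all usable columns):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# a few numeric columns from the five rows displayed above
x = pd.DataFrame({
    'Sector_score': [2.37, 55.57, 55.57, 2.37, 55.57],
    'PARA_A': [0.01, 1.06, 2.42, 0.31, 0.62],
    'Score_A': [0.2, 0.4, 0.6, 0.2, 0.2],
    'TOTAL': [0.017, 1.060, 5.950, 1.000, 1.040],
})
y = pd.Series([0, 0, 1, 0, 0], name='Risk')

# class_weight='balanced' upweights the minority (fraud) class during training
rf = RandomForestClassifier(random_state=42, class_weight='balanced').fit(x, y)
proba = rf.predict_proba(x)[:, 1]   # probability of class 1 (fraud), the column AUC needs
print(roc_auc_score(y, proba))
```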
Sensor Motion Classification Dataset#
Attention
Data description : classifying motion type from sensor data (target : pose, 0 or 1)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_test.csv
Data source : https://www.kaggle.com/kyr7plus/emg-4 (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | motion_0 | motion_1 | motion_2 | motion_3 | motion_4 | motion_5 | motion_6 | motion_7 | motion_8 | ... | motion_54 | motion_55 | motion_56 | motion_57 | motion_58 | motion_59 | motion_60 | motion_61 | motion_62 | motion_63 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1.0 | -2.0 | -1.0 | 4.0 | -5.0 | -4.0 | 1.0 | 0.0 | -15.0 | ... | 0.0 | -1.0 | -13.0 | -3.0 | 1.0 | -1.0 | -32.0 | -22.0 | -2.0 | -3.0 |
1 | 2 | 20.0 | 0.0 | 0.0 | 1.0 | 5.0 | 6.0 | -52.0 | 18.0 | 15.0 | ... | -70.0 | -55.0 | -38.0 | -14.0 | -12.0 | -8.0 | -34.0 | -63.0 | -87.0 | -77.0 |
2 | 4 | 1.0 | -1.0 | 1.0 | 4.0 | -5.0 | -8.0 | 1.0 | -3.0 | -14.0 | ... | 1.0 | 12.0 | -25.0 | 0.0 | 0.0 | 3.0 | 2.0 | -27.0 | 1.0 | 0.0 |
3 | 5 | 13.0 | 2.0 | 1.0 | -3.0 | 1.0 | 3.0 | 28.0 | 3.0 | 12.0 | ... | 0.0 | -21.0 | -17.0 | -2.0 | 0.0 | -4.0 | -17.0 | -21.0 | -21.0 | 25.0 |
4 | 6 | -2.0 | -7.0 | -4.0 | -8.0 | 16.0 | 44.0 | 1.0 | 3.0 | -16.0 | ... | -1.0 | 2.0 | -1.0 | 1.0 | 4.0 | 4.0 | -17.0 | -38.0 | -3.0 | 3.0 |
5 rows × 65 columns
ID | pose | |
---|---|---|
0 | 0 | 1 |
1 | 2 | 0 |
2 | 4 | 1 |
3 | 5 | 0 |
4 | 6 | 1 |
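This dataset is all-numeric, so no dummy encoding is needed; the interesting variation to sketch is using cross-validation instead of a single hold-out split. The frame below uses three of the 64 motion columns from the head() output above, purely as an illustration (a real solution reads the CSVs):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# three of the 64 motion columns, taken from the five rows displayed above
x = pd.DataFrame({
    'motion_0': [1.0, 20.0, 1.0, 13.0, -2.0],
    'motion_1': [-2.0, 0.0, -1.0, 2.0, -7.0],
    'motion_63': [-3.0, -77.0, 0.0, 25.0, 3.0],
})
y = pd.Series([1, 0, 1, 0, 1], name='pose')

# cross_val_score is an alternative to a single train_test_split for model checking;
# it returns one score per fold
scores = cross_val_score(RandomForestClassifier(random_state=42), x, y, cv=2, scoring='roc_auc')
print(scores.mean())
```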
Diabetes Prediction Dataset#
Attention
Data description : predicting diabetes (target : Outcome, 1 : diabetic, 0 : normal)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_test.csv
Data source : https://www.kaggle.com/pritsheta/diabetes-dataset (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 8 | 126 | 88 | 36 | 108 | 38.5 | 0.349 | 49 |
1 | 1 | 0 | 74 | 52 | 10 | 36 | 27.8 | 0.269 | 22 |
2 | 2 | 1 | 140 | 74 | 26 | 180 | 24.1 | 0.828 | 23 |
3 | 3 | 6 | 162 | 62 | 0 | 0 | 24.3 | 0.178 | 50 |
4 | 4 | 2 | 94 | 68 | 18 | 76 | 26.0 | 0.561 | 21 |
ID | Outcome | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | 2 | 0 |
3 | 3 | 1 |
4 | 4 | 0 |
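Note that row 3 of the head() output above has SkinThickness and Insulin both 0, which is physiologically implausible: in this dataset 0 effectively marks a missing measurement. A small sketch, using the sample values shown above, of treating those zeros as missing and imputing the column median:

```python
import numpy as np
import pandas as pd

# SkinThickness/Insulin values from the five rows displayed above
x = pd.DataFrame({
    'SkinThickness': [36, 10, 26, 0, 18],
    'Insulin': [108, 36, 180, 0, 76],
})
for col in ['SkinThickness', 'Insulin']:
    x[col] = x[col].replace(0, np.nan)       # 0 here really means "not measured"
    x[col] = x[col].fillna(x[col].median())  # impute with the column median
print(x['Insulin'].tolist())  # [108.0, 36.0, 180.0, 92.0, 76.0]
```

Whether this helps the final score is worth checking against a plain baseline on a validation split.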
Regression#
Student Grade Prediction Dataset#
Attention
Data description : predicting student grades (target : G3)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_test.csv
Data source : https://www.kaggle.com/datasets/ishandutta/student-performance-data-set (modified from the original)
Kaggle shared-code repository
Write your own solution and collect upvotes to earn a bronze medal: Kaggle notebook link
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_test.csv")
display(x_train.head())
display(y_train.head())
StudentID | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | ... | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1714 | GP | F | 18 | U | GT3 | T | 4 | 3 | other | ... | no | 4 | 3 | 3 | 1 | 1 | 3 | 0 | 14 | 13 |
1 | 1254 | GP | F | 17 | U | GT3 | T | 4 | 3 | health | ... | yes | 4 | 4 | 3 | 1 | 3 | 4 | 0 | 13 | 15 |
2 | 1639 | GP | F | 16 | R | GT3 | T | 4 | 4 | health | ... | no | 2 | 4 | 4 | 2 | 3 | 4 | 6 | 10 | 11 |
3 | 1118 | GP | M | 16 | U | GT3 | T | 4 | 4 | services | ... | no | 5 | 3 | 3 | 1 | 3 | 5 | 0 | 15 | 13 |
4 | 1499 | GP | M | 19 | U | GT3 | T | 3 | 2 | services | ... | yes | 4 | 5 | 4 | 1 | 1 | 4 | 0 | 5 | 0 |
5 rows × 33 columns
StudentID | G3 | |
---|---|---|
0 | 1714 | 14 |
1 | 1254 | 15 |
2 | 1639 | 11 |
3 | 1118 | 13 |
4 | 1499 | 0 |
#x_train.isnull().sum()
#x_test.isnull().sum()
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error , mean_absolute_error , mean_absolute_percentage_error ,r2_score
import numpy as np
drop_col = ['StudentID']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)
y_train_target = y_train['G3']
x_train_dum = pd.get_dummies(x_train_drop)
x_test_dum = pd.get_dummies(x_test_drop)[x_train_dum.columns]
xtr , xt , ytr, yt = train_test_split(x_train_dum , y_train_target)
rf = RandomForestRegressor(random_state =42)
rf.fit(xtr,ytr)
y_validation_pred = rf.predict(xt)
# model evaluation
# mse : mean_squared_error / mae : mean_absolute_error / mape : mean_absolute_percentage_error
# rmse : root mean squared error -> no dedicated function imported here; compute np.sqrt(mean_squared_error(...))
# note: mape blows up when y_true contains zeros (some G3 values are 0, as in the head() above)
# check the (y_true, y_pred) argument order with help() before using these
#mse
print('validation mse' ,mean_squared_error(yt,y_validation_pred))
#mae
print('validation mae' ,mean_absolute_error(yt,y_validation_pred))
#mape
print('validation mape' ,mean_absolute_percentage_error(yt,y_validation_pred))
#rmse
print('validation rmse' ,np.sqrt(mean_squared_error(yt,y_validation_pred)))
#r2
print('validation r2 score' ,r2_score(yt,y_validation_pred))
validation mse 1.2227864771241832
validation mae 0.7917254901960783
validation mape 178554479343983.25
validation rmse 1.1058
validation r2 score 0.8846521985576391
Used Car Price Prediction Dataset#
Attention
Data description : predicting used-car prices (target : price)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/X_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/X_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/y_test.csv
Data source : https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes?select=vw.csv (modified from the original)
Kaggle shared-code repository
Write your own solution and collect upvotes to earn a bronze medal: Kaggle notebook link
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/carsprice/X_test.csv")
display(x_train.head())
display(y_train.head())
carID | brand | model | year | transmission | mileage | fuelType | tax | mpg | engineSize | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13207 | hyundi | Santa Fe | 2019 | Semi-Auto | 4223 | Diesel | 145.0 | 39.8 | 2.2 |
1 | 17314 | vauxhall | GTC | 2015 | Manual | 47870 | Diesel | 125.0 | 60.1 | 2.0 |
2 | 12342 | audi | RS4 | 2019 | Automatic | 5151 | Petrol | 145.0 | 29.1 | 2.9 |
3 | 13426 | vw | Scirocco | 2016 | Automatic | 20423 | Diesel | 30.0 | 57.6 | 2.0 |
4 | 16004 | skoda | Scala | 2020 | Semi-Auto | 3569 | Petrol | 145.0 | 47.1 | 1.0 |
carID | price | |
---|---|---|
0 | 13207 | 31995 |
1 | 17314 | 7700 |
2 | 12342 | 58990 |
3 | 13426 | 12999 |
4 | 16004 | 16990 |
#x_train.isnull().sum()
#x_test.isnull().sum()
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error , mean_absolute_error , mean_absolute_percentage_error ,r2_score
import numpy as np
drop_col = ['carID']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)
y_train_target = y_train['price']
# some car models appear only in train or only in test, so concatenate the two frames before one-hot encoding
# this raises a data-leakage concern, but the exam does not penalize this approach
# the textbook way: align test to train's columns, fill dummies missing from test with 0, and drop categories that exist only in test
combined = pd.concat([x_train_drop, x_test_drop])
combined_encoded = pd.get_dummies(combined)
train_encoded = combined_encoded[:len(x_train_drop)]
test_encoded = combined_encoded[len(x_train_drop):]
xtr , xt , ytr, yt = train_test_split(train_encoded , y_train_target)
rf = RandomForestRegressor(random_state =42)
rf.fit(xtr,ytr)
y_validation_pred = rf.predict(xt)
# model evaluation
# mse : mean_squared_error / mae : mean_absolute_error / mape : mean_absolute_percentage_error
# rmse : root mean squared error -> no dedicated function imported here; compute np.sqrt(mean_squared_error(...))
# check the (y_true, y_pred) argument order with help() before using these
#mse
print('validation mse' ,mean_squared_error(yt,y_validation_pred))
#mae
print('validation mae' ,mean_absolute_error(yt,y_validation_pred))
#mape
print('validation mape' ,mean_absolute_percentage_error(yt,y_validation_pred))
#rmse
print('validation rmse' ,np.sqrt(mean_squared_error(yt,y_validation_pred)))
#r2
print('validation r2 score' ,r2_score(yt,y_validation_pred))
validation mse 14447219.441083966
validation mae 1990.275561392729
validation mape 0.09895579384297525
validation rmse 3800.9498
validation r2 score 0.9385284823923195
Medical Cost Prediction Dataset#
Attention
Data description : predicting medical costs (target : charges)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/y_test.csv
Data source : https://www.kaggle.com/mirichoi0218/insurance/code (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | age | sex | bmi | children | smoker | region | |
---|---|---|---|---|---|---|---|
0 | 2 | 35 | female | 35.860 | 2 | no | southeast |
1 | 3 | 28 | female | 23.845 | 2 | no | northwest |
2 | 4 | 23 | female | 32.780 | 2 | yes | southeast |
3 | 6 | 52 | female | 25.300 | 2 | yes | southeast |
4 | 7 | 63 | male | 39.800 | 3 | no | southwest |
ID | charges | |
---|---|---|
0 | 2 | 5836.52040 |
1 | 3 | 4719.73655 |
2 | 4 | 36021.01120 |
3 | 6 | 24667.41900 |
4 | 7 | 15170.06900 |
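No worked solution is shown for this regression problem. The student-grade pipeline above transfers directly; as a minimal sketch, the frame below reuses the five sample rows from the head() output instead of reading the CSV URLs:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# the five rows displayed above; a real solution reads the CSVs
x = pd.DataFrame({
    'age': [35, 28, 23, 52, 63],
    'sex': ['female', 'female', 'female', 'female', 'male'],
    'bmi': [35.860, 23.845, 32.780, 25.300, 39.800],
    'children': [2, 2, 2, 2, 3],
    'smoker': ['no', 'no', 'yes', 'yes', 'no'],
    'region': ['southeast', 'northwest', 'southeast', 'southeast', 'southwest'],
})
y = pd.Series([5836.52040, 4719.73655, 36021.01120, 24667.41900, 15170.06900], name='charges')

x_dum = pd.get_dummies(x)            # encode sex / smoker / region
rf = RandomForestRegressor(random_state=42).fit(x_dum, y)
pred = rf.predict(x_dum)
print('rmse :', np.sqrt(mean_squared_error(y, pred)))   # rmse = sqrt(mse)
```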
King County House Price Prediction Dataset#
Attention
Data description : predicting house prices in King County (target : price)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/y_test.csv
Data source : https://www.kaggle.com/harlfoxem/housesalesprediction (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | id | date | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 8651400730 | 20150428T000000 | 3 | 1.00 | 840 | 5525 | 1.0 | 0 | 0 | ... | 6 | 840 | 0 | 1969 | 0 | 98042 | 47.3607 | -122.085 | 920 | 5330 |
1 | 3 | 3163600130 | 20150317T000000 | 3 | 1.00 | 1250 | 8000 | 1.0 | 0 | 0 | ... | 7 | 1250 | 0 | 1956 | 0 | 98146 | 47.5065 | -122.337 | 1040 | 6973 |
2 | 4 | 5045700330 | 20140725T000000 | 4 | 2.50 | 2200 | 6400 | 2.0 | 0 | 0 | ... | 8 | 2200 | 0 | 2010 | 0 | 98059 | 47.4856 | -122.156 | 2600 | 5870 |
3 | 5 | 1036100130 | 20140808T000000 | 3 | 2.50 | 1980 | 39932 | 2.0 | 0 | 0 | ... | 8 | 1980 | 0 | 1994 | 0 | 98011 | 47.7433 | -122.196 | 2610 | 12769 |
4 | 6 | 7696630080 | 20140506T000000 | 3 | 1.75 | 1690 | 7735 | 1.0 | 0 | 0 | ... | 7 | 1060 | 630 | 1976 | 0 | 98001 | 47.3324 | -122.280 | 1580 | 7503 |
5 rows × 21 columns
ID | price | |
---|---|---|
0 | 2 | 191000.0 |
1 | 3 | 234900.0 |
2 | 4 | 460000.0 |
3 | 5 | 442000.0 |
4 | 6 | 197000.0 |
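The date column here is a string like '20150428T000000', which tree models cannot use directly; a common step is to parse it and extract numeric year/month features. A small sketch on the date values shown above:

```python
import pandas as pd

# date strings from the rows displayed above
dates = pd.Series(['20150428T000000', '20150317T000000', '20140725T000000'], name='date')
parsed = pd.to_datetime(dates, format='%Y%m%dT%H%M%S')
year = parsed.dt.year
month = parsed.dt.month
print(year.tolist(), month.tolist())  # [2015, 2015, 2014] [4, 3, 7]
```

The original string column should then be dropped before fitting, alongside ID-like columns.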
Graduate Admission Dataset#
Attention
Data description : predicting the probability of graduate admission (target : Chance of Admit)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/y_test.csv
Data source : https://www.kaggle.com/mohansacharya/graduate-admissions (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 67 | 327 | 114 | 3 | 3.0 | 3.0 | 9.02 | 0 |
1 | 1 | 112 | 321 | 109 | 4 | 4.0 | 4.0 | 8.68 | 1 |
2 | 2 | 495 | 301 | 99 | 3 | 2.5 | 2.0 | 8.45 | 1 |
3 | 3 | 356 | 317 | 106 | 2 | 2.0 | 3.5 | 8.12 | 0 |
4 | 4 | 250 | 321 | 111 | 3 | 3.5 | 4.0 | 8.83 | 1 |
ID | Chance of Admit | |
---|---|---|
0 | 0 | 0.61 |
1 | 1 | 0.69 |
2 | 2 | 0.68 |
3 | 3 | 0.73 |
4 | 4 | 0.77 |
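Chance of Admit is bounded in [0, 1]. A tree ensemble averages training targets and stays in range, but a linear model can predict outside it, so clipping before submission is a safe habit. A tiny sketch (the raw prediction values below are hypothetical):

```python
import numpy as np

raw_preds = np.array([0.55, 1.07, -0.02, 0.81])  # hypothetical raw model outputs
clipped = np.clip(raw_preds, 0, 1)               # force back into the valid [0, 1] range
print(clipped.tolist())  # [0.55, 1.0, 0.0, 0.81]
```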
Red Wine Quality Prediction Dataset#
Attention
Data description : predicting red wine quality (target : quality)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/y_test.csv
Data source : https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009 (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 10.6 | 0.44 | 0.68 | 4.1 | 0.114 | 6.0 | 24.0 | 0.99700 | 3.06 | 0.66 | 13.4 |
1 | 2 | 7.0 | 0.60 | 0.30 | 4.5 | 0.068 | 20.0 | 110.0 | 0.99914 | 3.30 | 1.17 | 10.2 |
2 | 3 | 8.0 | 0.43 | 0.36 | 2.3 | 0.075 | 10.0 | 48.0 | 0.99760 | 3.34 | 0.46 | 9.4 |
3 | 4 | 7.9 | 0.53 | 0.24 | 2.0 | 0.072 | 15.0 | 105.0 | 0.99600 | 3.27 | 0.54 | 9.4 |
4 | 5 | 8.0 | 0.45 | 0.23 | 2.2 | 0.094 | 16.0 | 29.0 | 0.99620 | 3.21 | 0.49 | 10.2 |
ID | quality | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 5 |
2 | 3 | 5 |
3 | 4 | 6 |
4 | 5 | 6 |
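quality is an ordered integer grade, so besides treating it as a plain classification target, one common approach is to fit a regressor and round its output back to integer grades. A tiny sketch (the raw predictions below are hypothetical):

```python
import numpy as np

raw = np.array([5.4, 6.1, 5.9])      # hypothetical regressor outputs
rounded = np.rint(raw).astype(int)   # snap back to integer quality grades
print(rounded.tolist())  # [5, 6, 6]
```

Which framing scores better depends on the metric the question asks for, so compare both on a validation split.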
Hyundai Used Car Price Prediction Dataset#
Attention
Data description : predicting Hyundai used-car prices (target : price)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/x_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/y_test.csv
Data source : https://www.kaggle.com/mysarahmadbhat/hyundai-used-car-listing (modified from the original)
import pandas as pd
# load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/x_test.csv")
display(x_train.head())
display(y_train.head())
ID | model | year | transmission | mileage | fuelType | tax(£) | mpg | engineSize | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | I30 | 2019 | Manual | 21 | Petrol | 150 | 34.0 | 2.0 |
1 | 1 | Santa Fe | 2018 | Semi-Auto | 10500 | Diesel | 145 | 39.8 | 2.2 |
2 | 2 | Tucson | 2017 | Manual | 29968 | Diesel | 30 | 61.7 | 1.7 |
3 | 3 | Kona | 2018 | Manual | 27317 | Petrol | 145 | 52.3 | 1.0 |
4 | 4 | Tucson | 2018 | Semi-Auto | 31459 | Diesel | 145 | 57.7 | 1.7 |
ID | price | |
---|---|---|
0 | 0 | 23995 |
1 | 1 | 28490 |
2 | 2 | 13251 |
3 | 3 | 14990 |
4 | 4 | 17591 |
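Like the earlier used-car problem, the model column may contain categories that appear in only one of train/test. The "textbook" alignment described there can be sketched with reindex (the two tiny frames below are hypothetical; 'Ioniq' stands in for a test-only model):

```python
import pandas as pd

# hypothetical frames: 'Kona' appears only in train, 'Ioniq' only in test
train = pd.DataFrame({'model': ['I30', 'Kona'], 'year': [2019, 2018]})
test = pd.DataFrame({'model': ['I30', 'Ioniq'], 'year': [2017, 2020]})

train_dum = pd.get_dummies(train)
# align test to train's dummy columns: dummies missing from test ('model_Kona')
# are filled with 0, and test-only dummies ('model_Ioniq') are dropped
test_dum = pd.get_dummies(test).reindex(columns=train_dum.columns, fill_value=0)
print(sorted(test_dum.columns))  # ['model_I30', 'model_Kona', 'year']
```

This avoids the data-leakage concern of concatenating train and test before encoding.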