모의고사 1회차#
광고 한번 눌러주시겠습니까
캐글 링크
작업 1유형#
Attention
데이터 출처 : https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (후처리 작업)
데이터 설명 : 은행의 전화 마케팅에 대해 고객의 반응 여부
dataurl : https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv')
df.head()
ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | pdays | previous | poutcome | y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13829 | 29 | technician | single | tertiary | no | 18254 | no | no | cellular | 11 | may | 2 | -1 | 0 | unknown | no |
1 | 22677 | 26 | services | single | secondary | no | 512 | yes | yes | unknown | 5 | jun | 3 | -1 | 0 | unknown | no |
2 | 10541 | 30 | management | single | secondary | no | 135 | no | no | cellular | 14 | aug | 2 | -1 | 0 | unknown | no |
3 | 13689 | 41 | technician | married | unknown | no | 30 | yes | no | cellular | 10 | jul | 1 | -1 | 0 | unknown | no |
4 | 11304 | 27 | admin. | single | secondary | no | 321 | no | yes | unknown | 2 | sep | 1 | -1 | 0 | unknown | no |
Question1
마케팅 응답 고객들의 나이를 10살 단위로 변환 했을 때, 가장 많은 인원을 가진 나이대는? (0~9 : 0 , 10~19 : 10)
Show code cell source
result = (df.age//10 *10).value_counts().index[0]
print(result)
30
Question2
마케팅 응답 고객들의 나이를 10살 단위로 변환 했을 때, 가장 많은 나이대 구간의 인원은 몇명인가?
Show code cell source
result = (df.age//10 *10).value_counts().values[0]
print(result)
5056
Question3
나이가 25살 이상 29살 미만인 응답 고객들중 housing컬럼의 값이 yes인 고객의 수는?
Show code cell source
result = df[(df.age >=25) & (df.age<29) & (df.housing =='yes')].shape[0]
print(result)
504
Question4
numeric한 값을 가지지 않은 컬럼들중 unique한 값을 가장 많이 가지는 컬럼은?
pd.DataFrame(lst).sort_values(1,ascending=False)
0 | 1 | |
---|---|---|
0 | job | 12 |
7 | month | 12 |
2 | education | 4 |
8 | poutcome | 4 |
1 | marital | 3 |
6 | contact | 3 |
3 | default | 2 |
4 | housing | 2 |
5 | loan | 2 |
9 | y | 2 |
Show code cell source
lst= []
for col in df.select_dtypes(exclude='int'):
target = df[col]
lst.append([col,target.nunique()])
dfs = pd.DataFrame(lst).sort_values(1,ascending=False)
result = dfs[dfs[1] ==dfs[1].max()][0].values
print(result)
['job' 'month']
Question5
balance 컬럼값들의 평균값 이상을 가지는 데이터를 ID값을 기준으로 내림차순 정렬했을때 상위 100개 데이터의 balance값의 평균은?
Show code cell source
result = df[df.balance >= df.balance.mean()].sort_values('ID',ascending=False).head(100).balance.mean()
print(result)
3473.73
Question6
가장 많은 광고를 집행했던 날짜는 언제인가? (데이터 그대로 일(숫자),달(영문)으로 표기)
Show code cell source
result = df[['day','month']].value_counts().index[0]
print(result)
(15, 'may')
Question7
데이터의 job이 unknown 상태인 고객들의 age 컬럼 값의 정규성을 검정하고자 한다. 샤피로 검정의 p-value값을 구하여라
Show code cell source
from scipy.stats import shapiro
result = shapiro(df[df.job =='unknown'].age)[1]
print(result)
0.1961131989955902
Question8
age와 balance의 상관계수를 구하여라
Show code cell source
result = df[['age','balance']].corr().iloc[0,1]
print(result)
0.10198734763981472
Question9
y 변수와 education 변수는 독립인지 카이제곱검정을 통해 확인하려한다. p-value값을 출력하라
df
ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | pdays | previous | poutcome | y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13829 | 29 | technician | single | tertiary | no | 18254 | no | no | cellular | 11 | may | 2 | -1 | 0 | unknown | no |
1 | 22677 | 26 | services | single | secondary | no | 512 | yes | yes | unknown | 5 | jun | 3 | -1 | 0 | unknown | no |
2 | 10541 | 30 | management | single | secondary | no | 135 | no | no | cellular | 14 | aug | 2 | -1 | 0 | unknown | no |
3 | 13689 | 41 | technician | married | unknown | no | 30 | yes | no | cellular | 10 | jul | 1 | -1 | 0 | unknown | no |
4 | 11304 | 27 | admin. | single | secondary | no | 321 | no | yes | unknown | 2 | sep | 1 | -1 | 0 | unknown | no |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12865 | 14023 | 47 | technician | married | secondary | no | 1167 | yes | no | cellular | 30 | apr | 1 | 87 | 5 | failure | yes |
12866 | 17259 | 31 | unknown | married | secondary | no | 111 | no | no | cellular | 21 | nov | 2 | 93 | 2 | failure | yes |
12867 | 15200 | 37 | unemployed | single | tertiary | no | 1316 | yes | no | cellular | 18 | nov | 1 | 172 | 2 | failure | no |
12868 | 13775 | 42 | management | married | tertiary | no | 479 | yes | no | unknown | 28 | may | 2 | -1 | 0 | unknown | no |
12869 | 20137 | 24 | services | single | secondary | no | 0 | no | no | unknown | 16 | may | 1 | -1 | 0 | unknown | no |
12870 rows × 17 columns
Show code cell source
v = pd.crosstab(df.y,df.education)
from scipy.stats import chi2_contingency
chi2 , p ,dof, expected = chi2_contingency(v)
display(v)
print(p)
education | primary | secondary | tertiary | unknown |
---|---|---|---|---|
y | ||||
no | 1424 | 4555 | 2559 | 365 |
yes | 456 | 1813 | 1516 | 182 |
7.901201277473551e-29
Question10
각 job에 따라 divorced/married 인원의 비율을 확인 했을 때 그 값이 가장 높은 값은?
Show code cell source
t = df.groupby(['job','marital']).size().reset_index()
pivotdf = t.pivot_table(index='job',columns='marital')[0]
pivotdf = pivotdf.fillna(0)
pivotdf['ratio'] = pivotdf['divorced'] / pivotdf['married']
result = pivotdf.sort_values('ratio').ratio.values[-1]
print(result)
0.2831050228310502
작업 2유형#
Attention
데이터 출처 : https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (후처리 작업)
데이터 설명 : 은행의 전화 마케팅에 대해 고객의 반응 여부
train : https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv
test : https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/test.csv
submission : https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/submission.csv
import pandas as pd
train= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv')
test= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/test.csv')
submission= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/submission.csv')
display(train.head())
display(test.head())
display(submission.head())
ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | pdays | previous | poutcome | y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13829 | 29 | technician | single | tertiary | no | 18254 | no | no | cellular | 11 | may | 2 | -1 | 0 | unknown | no |
1 | 22677 | 26 | services | single | secondary | no | 512 | yes | yes | unknown | 5 | jun | 3 | -1 | 0 | unknown | no |
2 | 10541 | 30 | management | single | secondary | no | 135 | no | no | cellular | 14 | aug | 2 | -1 | 0 | unknown | no |
3 | 13689 | 41 | technician | married | unknown | no | 30 | yes | no | cellular | 10 | jul | 1 | -1 | 0 | unknown | no |
4 | 11304 | 27 | admin. | single | secondary | no | 321 | no | yes | unknown | 2 | sep | 1 | -1 | 0 | unknown | no |
ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | pdays | previous | poutcome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 53608 | 32 | management | single | tertiary | no | 12569 | no | no | cellular | 1 | jul | 2 | 295 | 2 | success |
1 | 51055 | 25 | services | single | secondary | no | 801 | no | no | cellular | 5 | jun | 2 | -1 | 0 | unknown |
2 | 52573 | 46 | blue-collar | married | secondary | no | 1728 | yes | no | unknown | 26 | may | 2 | -1 | 0 | unknown |
3 | 50458 | 39 | management | divorced | secondary | no | 51 | no | no | unknown | 17 | jun | 2 | -1 | 0 | unknown |
4 | 52272 | 31 | services | single | tertiary | no | 1626 | no | no | unknown | 31 | jul | 1 | -1 | 0 | unknown |
ID | predict | |
---|---|---|
0 | 53608 | 0.0 |
1 | 51055 | 0.0 |
2 | 52573 | 0.0 |
3 | 50458 | 0.0 |
4 | 52272 | 0.0 |
간단한 모델링 작업
모델링 및 submission파일 생성까지
Show code cell source
from sklearn.model_selection import train_test_split
x = train.drop(columns =['ID','y'])
xd = pd.get_dummies(x)
y = train['y']
x_train,x_test,y_train,y_test = train_test_split(xd,y,stratify =y ,random_state=1)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)
pred = rf.predict_proba(x_test)
from sklearn.metrics import roc_auc_score,classification_report
print('test roc score : ',roc_auc_score(y_test,pred[:,1]))
test_pred = rf.predict_proba(pd.get_dummies(test.drop(columns=['ID'])))
submission['predict'] = test_pred[:,1]
print('submission file')
display(submission.head())
submission.to_csv('00000000000000.csv',index=False)
test roc score : 0.77612408703591
submission file
ID | predict | |
---|---|---|
0 | 53608 | 0.73 |
1 | 51055 | 0.80 |
2 | 52573 | 0.01 |
3 | 50458 | 0.23 |
4 | 52272 | 0.33 |