작업 1유형#
Attention
DataUrl = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_1.csv 월마트 판매량 데이터 : https://www.kaggle.com/datasets/asahu40/walmart-data-analysis-and-forcasting
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_1.csv')
df.head(5)
Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 05-02-2010 | 1643690.90 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 |
1 | 1 | 12-02-2010 | 1641957.44 | 1 | 38.51 | 2.548 | 211.242170 | 8.106 |
2 | 1 | 19-02-2010 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 |
3 | 1 | 26-02-2010 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 |
4 | 1 | 05-03-2010 | 1554806.68 | 0 | 46.5 | 2.625 | 211.350143 | 8.106 |
1-1
Temperature컬럼에서 숫자가 아닌 문자들을 제거후 숫자 타입으로 바꾸고 3분위수에서 1분위수의 차이를 소숫점 이하 2자리까지 구하여라
Show code cell source
df.loc[:,'Temperature_f'] = df['Temperature'].str.replace('*','',regex=True).astype('float')
result = round(df['Temperature_f'].quantile(0.75) -df['Temperature_f'].quantile(0.25),3)
print(result)
27.48
Attention
DataUrl = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_2.csv 유튜브 영상 통계량 : https://www.kaggle.com/datasets/advaypatil/youtube-statistics?select=videos-stats.csv
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_2.csv')
df.head(5)
Title | Video ID | Published At | Keyword | Likes | Comments | Views | |
---|---|---|---|---|---|---|---|
0 | Apple Pay Is Killing the Physical Wallet After... | wAZZ-UWGVHI | 2022-08-23 | tech | 3407.0 | 672.0 | 135612.0 |
1 | The most EXPENSIVE thing I own. | b3x28s61q3c | 2022-08-24 | tech | 76779.0 | 4306.0 | 1758063.0 |
2 | My New House Gaming Setup is SICK! | 4mgePWWCAmA | 2022-08-23 | tech | 63825.0 | 3338.0 | 1564007.0 |
3 | Petrol Vs Liquid Nitrogen | Freezing Experimen... | kXiYSI7H2b0 | 2022-08-23 | tech | 71566.0 | 1426.0 | 922918.0 |
4 | Best Back to School Tech 2022! | ErMwWXQxHp0 | 2022-08-08 | tech | 96513.0 | 5155.0 | 1855644.0 |
1-2
Likes를 Comments로 나눈 비율이 20이상이면서 Keyword값이 minecraft인 영상들의 Views값의 평균을 정수로 구하여라
Show code cell source
result = int(df[((df['Likes'] / df['Comments']) >=20) & (df.Keyword =='minecraft')].Views.mean())
print(result)
1789084
Attention
DataUrl = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_3.csv 넷플릭스 영상 메타정보 : https://www.kaggle.com/datasets/akashguna/netflix-prize-shows-information
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_3.csv')
df.head(5)
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
1-3
date_added가 2018년 1월 이면서 country가 United Kingdom 단독 제작인 데이터의 갯수
Show code cell source
result = df[(df.date_added_dt.dt.strftime("%Y-%m") =='2018-01') & (df.country =='United Kingdom')].shape[0]
print(result)
6
작업 2유형#
Attention
유저 분류 : https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation train = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv test = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv
예측 변수 Segmentation, test.csv에 대해 ID별로 Segmentation의 클래스를 예측해서 저장후 제출, 제출 데이터 컬럼은 ID와 Segmentation 두개만 존재해야함. 평가지표는 macro f1 score
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv')
display(train.head(2))
test.head(2)
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 464357 | Male | No | 40 | Yes | Artist | 7.0 | Low | 1.0 | Cat_6 | A |
1 | 459624 | Male | No | 18 | No | Healthcare | NaN | Low | 5.0 | Cat_4 | D |
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 460406 | Male | Yes | 36 | Yes | Healthcare | 3.0 | Low | 2.0 | Cat_6 |
1 | 466890 | Male | Yes | 47 | Yes | Artist | 0.0 | Average | 6.0 | Cat_7 |
Show code cell source
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
# 결측치 채우기
MAX_EVER_MARRIED = train['Ever_Married'].value_counts().index[0]
MAX_GRADUATED = train['Graduated'].value_counts().index[0]
MAX_PROFESSION = train['Profession'].value_counts().index[0]
MEAD_Work_Experience = train['Work_Experience'].mean()
MAX_Family_Size = train['Family_Size'].value_counts().index[0]
MAX_Var_1 = train['Var_1'].value_counts().index[0]
fillna_dic = {
'Ever_Married':MAX_EVER_MARRIED,
'Graduated':MAX_GRADUATED,
'Profession':MAX_PROFESSION,
'Work_Experience':MEAD_Work_Experience,
'Family_Size':MAX_Family_Size,
'Var_1':MAX_Var_1
}
for col in fillna_dic.keys():
train.loc[:,col] = train.loc[:,col].fillna(fillna_dic[col])
test.loc[:,col] = test.loc[:,col].fillna(fillna_dic[col])
# Family_Size 범주형 변경
train['Family_Size'] = train['Family_Size'].astype('str')
test['Family_Size'] = test['Family_Size'].astype('str')
# ID 컬럼제거
IDS = test['ID']
train_drop = train.drop(columns =['ID','Segmentation'])
train_y = train['Segmentation']
test_drop = test.drop(columns =['ID'])
train_dum = pd.get_dummies(train_drop)
test_dum = pd.get_dummies(test_drop)[train_dum.columns]
# validtaion - test 셋 구분
x_train, x_test, y_train,y_test = train_test_split(train_dum , train_y,random_state =1,stratify=train_y)
# 모델링
rf = RandomForestClassifier(random_state =1)
rf.fit(x_train,y_train)
pred = rf.predict(x_test)
print('valiation test f1score',f1_score(y_test,pred,average='macro'))
pred_test = rf.predict(test_dum)
sumission = pd.DataFrame()
sumission['ID'] = IDS
sumission['Segmentation'] = pred_test
sumission.to_csv('000000000.csv',index=False)
valiation test f1score 0.4107784704089142
작업 3유형#
Attention
어느 대학교의 신입생의 학과별 성별에 대한 데이터이다. DataUrl = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p3_1_.csv 이 데이터를 바탕으로, 학생들의 학과와 성별이 서로 독립적인지 여부를 확인하기 위해 카이제곱 독립성 검정을 실시 하려한다.
import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p3_1_.csv')
df.head()
학번 | 학과 | 성별 | |
---|---|---|---|
0 | DFSKJK_1 | 경영학과 | 여 |
1 | DFSKJK_2 | 사회학과 | 여 |
2 | DFSKJK_3 | 기계공학과 | 여 |
3 | DFSKJK_4 | 경제학과 | 남 |
4 | DFSKJK_5 | 기계공학과 | 여 |
3-1-a
학과 평균 인원에 대한 값을 소숫점 이하 3자리까지 구하여라
Show code cell source
result= round(df.groupby(['학과']).size().mean(),3)
print(result)
170.333
3-1-b
카이제곱검정 독립성 검정 통계량을 소숫점 이하 3자리까지 구하여라
Show code cell source
from scipy.stats import chi2_contingency
crosstab = pd.crosstab(df['학과'], df['성별'])
s, p, _, _ = chi2_contingency(crosstab)
round_s = round(s,3)
print(round_s)
5.646
3-1-c
카이제곱검정 독립성 검정의 pvalue를 소숫점 이하 3자리까지 구하여라. 유의수준 0.05하에서 귀무가설과 대립가설중 유의한 것을 출력하라
Show code cell source
round_p = round(p,3)
print(p)
print('귀무')
0.3421500573497989
귀무
Attention
어느 학교에서 수학 시험을 본 학생 100명 중 60명이 60점 이상을 받았다. 이 학교의 수학 시험의 평균 점수가 50점 이상인지 95%의 신뢰 수준에서 검정하려한다.
3-1-a
검정 통계량을 소숫점 이하 3자리에서 구하시오
Show code cell source
import numpy as np
n = 100
p_hat = 0.6
p = 0.5
alpha = 0.05
# 검정 통계량 계산
z = round((p_hat - p) / np.sqrt(p * (1 - p) / n),5)
print(z)
2.0
3-1-b
pvalue를 소숫점 이하 3자리까지 구하고 귀무가설과 대립가설중 유의한 것을 출력하라
Show code cell source
from scipy.stats import norm
p_value = round(1 - norm.cdf(z),3)
print(p_value)
print('대립')
0.023
대립