์์ 1์ ํ#
9ํ ๋น ๋ถ๊ธฐ ์ค๊ธฐ ๋๋น ๊ฐ์, ๋ธ๋ก๊ทธ๋ง์ผ๋ก๋ ๋์ ํ ์๋๊ฒ ๋คใ ํ์๋๋ถ๋ค์๊ฒ ์ถ์ฒํฉ๋๋ค.
๋ณธ๊ฒฉ์ ์ผ๋ก ์์ํ์๊ธฐ ์ ์ ์น๊ด๊ณ ํ๋ฒ ๋๋ฌ์ฃผ์ธ์
๋ฌธ์ ํ์ด ์์ : https://www.youtube.com/watch?v=hzHPpybySJE
๋น ๋ถ๊ธฐ, adp ์ ๋ณด๊ณต์ ์คํ์นดํก๋ฐฉ1
@@@์ฐธ์ฌ ๋งํฌ@@@
์ฐธ์ฌ ์ฝ๋ : dbscan (์์ ์ ๋ฐ์ดํธ, ์นดํก ํ๋ฉด ํ๋จ์ ๋ฌธ์ ํ์ธ ํด์ฃผ์ธ์)๋น ๋ถ๊ธฐ, adp ์ ๋ณด๊ณต์ ์คํ์นดํก๋ฐฉ2
@@@์ฐธ์ฌ ๋งํฌ@@@
๋ค์ํ ๋ฐ์ดํฐ#
์ ํ๋ธ ์ธ๊ธฐ๋์์ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/rsrishav/youtube-trending-video-dataset?select=KR_youtube_trending_data.csv
๋ฐ์ดํฐ ์ค๋ช
: ์ ํ๋ธ ๋ฐ์ผ๋ฆฌ ์ธ๊ธฐ๋์์ (ํ๊ตญ)
dataurl : https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/youtube.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/youtube.csv",index_col=0)
df.head()
title | channelTitle | categoryId | view_count | likes | dislikes | comment_count | channelId | trending_date2 | |
---|---|---|---|---|---|---|---|---|---|
0 | [์ ๋ณ] ๋ฌผ์์ฐฝ๊ณ | ์ฅ์์ญ | 23 | 1893473 | 38249 | 730 | 8595 | UChbE5OZQ6dRHECsX0tEPEZQ | 2021-01-01 |
1 | RAIN(๋น) - ๋๋ก ๋ฐ๊พธ์ Switch to me (duet with JYP) MV | RAIN's Official Channel | 10 | 2600864 | 0 | 0 | 20129 | UCxXgIeE5hxWxHG6dz9Scg2w | 2021-01-01 |
2 | 2020๋ ์ ์ผ์ ์ข ์จ๋ผ์ธ ํ์ข ํ์ฌ | ๋ณด์ ๊ฐ ํ์ฅ ํ์ฌ๋ ์งํํ์ง ์์ต๋๋ค. | ์์ธ์ ยท Seoul | 29 | 347049 | 3564 | 120 | 178 | UCZUPZW5idAxYp-Asj__lVAA | 2021-01-01 |
3 | ๊ณ ๊ธฐ๋จ์์ ์น ๋ฉด์กฐ ํํฐ | ๊ณ ๊ธฐ๋จ์ MeatMan | 26 | 528458 | 15372 | 280 | 3470 | UCT3CumbFIJiW33uq0UI3zlg | 2021-01-01 |
4 | ๊ณจ๋ชฉ 3mc๋ฅผ ๋ถ๋ ธํ๊ฒ ๋ง๋ ๋ง์๊ณ ์์ด ์ฌํ์ ๊ณต๋ฆ ๋ฐฑ๋ฐ์ง ์ฌ์ฅ๋์ ํธ๋ ?! [์๋ฅ... | ์ค๋ธ์ค๋ฐฅ์ง | 24 | 494904 | 3918 | 111 | 3142 | UCdWgRSfttvDucq4ApcCg5Mw | 2021-01-01 |
Question 1
์ธ๊ธฐ๋์์ ์ ์ํ์๊ฐ ๋ง์ ์ฑ๋ ์์ 10๊ฐ๋ช ์ ์ถ๋ ฅํ๋ผ (๋ ์ง๊ธฐ์ค, ์ค๋ณตํฌํจ)
Show code cell source
answer =list(df.loc[df.channelId.isin(df.channelId.value_counts().head(10).index)].channelTitle.unique())
print(answer)
['์ฅ์์ญ', '์ด๋ช๋ช
', 'ํ๋ฟ๋ฆฌ', '์งคํฐ', '๋ฐ๋๋งจ - ์ค๋ธ์ค ๊ณต์ ์ฑ๋', '์ ๋๋ฃจ๋ง๋๋ฃจ : MBC ๊ณต์ ์ข
ํฉ ์ฑ๋', 'SPOTV', '์ฑ๋ ์ญ์ค์ผ', '์ด๊ณผ์ฅ', 'BANGTANTV']
Question 2
๋ ผ๋์ผ๋ก ์ธ๊ธฐ๋์์์ด ๋ ์ผ์ด์ค๋ฅผ ํ์ธํ๊ณ ์ถ๋ค. dislikes์๊ฐ like ์๋ณด๋ค ๋์ ๋์์์ ์ ์ํ ์ฑ๋์ ๋ชจ๋ ์ถ๋ ฅํ๋ผ
Show code cell source
answer =list(df.loc[df.likes < df.dislikes].channelTitle.unique())
print(answer)
['ํซ๋๊ทธTV', 'ASMR ์ ์ TV', 'ํ์ํธ๋ฆฌHayanTree', '์ํก YangPang', '์ฒ ๊ตฌํ (CHULTUBE)', '์๋๋งจํ์ฐ์ค', '(MUTUBE)์๊พธ๋์ฅ๋ด์ค', '์ค๋ฉํดTV', '์ก์ง๋ด', 'MapleStory_KR', 'ROAD FIGHTING CHAMPIONSHIP', '์ฌ๋์ด ๊น๊ธฐํ', '๋ํผ์์ฐ๋ค STUDIO', 'Gen.G esports']
Question 3
์ฑ๋๋ช ์ ๋ฐ๊พผ ์ผ์ด์ค๊ฐ ์๋์ง ํ์ธํ๊ณ ์ถ๋ค. channelId์ ๊ฒฝ์ฐ ๊ณ ์ ๊ฐ์ด๋ฏ๋ก ์ด๋ฅผ ํตํด ์ฑ๋๋ช ์ ํ๋ฒ์ด๋ผ๋ ๋ฐ๊พผ ์ฑ๋์ ๊ฐฏ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
change = df[['channelTitle','channelId']].drop_duplicates().channelId.value_counts()
target = change[change>1]
print(len(target))
71
Question 4
์ผ์์ผ์ ์ธ๊ธฐ์์๋ ์์๋ค์ค ๊ฐ์ฅ๋ง์ ์์ ์ข ๋ฅ(categoryId)๋ ๋ฌด์์ธ๊ฐ?
Show code cell source
df['trending_date2'] = pd.to_datetime(df['trending_date2'])
answer =df.loc[df['trending_date2'].dt.day_name() =='Sunday'].categoryId.value_counts().index[0]
print(answer)
24
Question 5
๊ฐ ์์ผ๋ณ ์ธ๊ธฐ ์์๋ค์ categoryId๋ ๊ฐ๊ฐ ๋ช๊ฐ ์ฉ์ธ์ง ํ๋์ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ํํํ๋ผ
Show code cell source
group = df.groupby([df['trending_date2'].dt.day_name(),'categoryId'],as_index=False).size()
answer= group.pivot(index='categoryId',columns='trending_date2')
display(answer)
size | |||||||
---|---|---|---|---|---|---|---|
trending_date2 | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday |
categoryId | |||||||
1 | 243 | 263 | 255 | 274 | 246 | 257 | 234 |
2 | 120 | 105 | 119 | 99 | 128 | 119 | 129 |
10 | 833 | 837 | 776 | 830 | 890 | 894 | 917 |
15 | 187 | 215 | 198 | 217 | 207 | 208 | 207 |
17 | 633 | 668 | 592 | 636 | 682 | 708 | 706 |
19 | 90 | 92 | 87 | 91 | 92 | 89 | 85 |
20 | 283 | 298 | 296 | 289 | 282 | 285 | 291 |
22 | 1288 | 1373 | 1289 | 1337 | 1341 | 1375 | 1333 |
23 | 568 | 594 | 570 | 556 | 560 | 569 | 566 |
24 | 2976 | 3148 | 3066 | 3096 | 2954 | 3084 | 3090 |
25 | 444 | 453 | 422 | 437 | 470 | 452 | 468 |
26 | 369 | 378 | 364 | 363 | 375 | 394 | 385 |
27 | 183 | 205 | 183 | 199 | 194 | 194 | 212 |
28 | 171 | 160 | 173 | 167 | 166 | 161 | 165 |
29 | 12 | 10 | 10 | 9 | 13 | 11 | 12 |
Question 6
๋๊ธ์ ์๋ก (comment_count) ์์ ๋ฐ์์ ๋ํ ํ๋จ์ ํ ์ ์๋ค. viewcount๋๋น ๋๊ธ์๊ฐ ๊ฐ์ฅ ๋์ ์์์ ํ์ธํ๋ผ (view_count๊ฐ์ด 0์ธ ๊ฒฝ์ฐ๋ ์ ์ธํ๋ค)
Show code cell source
target2= df.loc[df.view_count!=0]
t = target2.copy()
t['ratio'] = (target2['comment_count']/target2['view_count']).dropna()
result = t.sort_values(by='ratio', ascending=False).iloc[0].title
print(result)
60๋ถ ๋์ ๋๊ธ์ด ๋ฌ๋ฆฌ์ง ์์ผ๋ฉด, ์์์ด ์ญ์ ๋ฉ๋๋ค. (์ฑ๋ฆฐ์ง)
Question 7
๋๊ธ์ ์๋ก (comment_count) ์์ ๋ฐ์์ ๋ํ ํ๋จ์ ํ ์ ์๋ค.viewcount๋๋น ๋๊ธ์๊ฐ ๊ฐ์ฅ ๋ฎ์ ์์์ ํ์ธํ๋ผ (view_counts, ratio๊ฐ์ด 0์ธ๊ฒฝ์ฐ๋ ์ ์ธํ๋ค.)
Show code cell source
ratio = (df['comment_count'] / df['view_count']).dropna().sort_values()
ratio[ratio!=0].index[0]
result= df.iloc[ratio[ratio!=0].index[0]].title
print(result)
Join the BTS #PermissiontoDance Challenge only on YouTube #Shorts
Question 8
like ๋๋น dislike์ ์๊ฐ ๊ฐ์ฅ ์ ์ ์์์ ๋ฌด์์ธ๊ฐ? (like, dislike ๊ฐ์ด 0์ธ๊ฒฝ์ฐ๋ ์ ์ธํ๋ค)
Show code cell source
target = df.loc[(df.likes !=0) & (df.dislikes !=0)]
num = (target['dislikes']/target['likes']).sort_values().index[0]
answer = df.iloc[num].title
print(answer)
[์คํฐ๋ทฐ] *์ต์ด๊ณต๊ฐ* ์ฌ๋ถ์์ฆ๐ฐ๐ถ์ ๋น๊ณต์ ์ด๋ฆ์ ์์ด๋ผ์ธ์ฆ? ๊ฟ์กฐํฉ ํฐํคํ์นด ๊ฐ๋ํ NCT 127 ๋์&์ ์ฐ์ ์คํฐ๋ทฐ
Question 9
๊ฐ์ฅ๋ง์ ํธ๋ ๋ ์์์ ์ ์ํ ์ฑ๋์ ์ด๋ฆ์ ๋ฌด์์ธ๊ฐ? (๋ ์ง๊ธฐ์ค, ์ค๋ณตํฌํจ)
Show code cell source
answer = df.loc[df.channelId ==df.channelId.value_counts().index[0]].channelTitle.unique()[0]
print(answer)
์งคํฐ
Question 10
20ํ(20์ผ)์ด์ ์ธ๊ธฐ๋์์ ๋ฆฌ์คํธ์ ํฌํจ๋ ๋์์์ ์ซ์๋?
Show code cell source
answer= (df[['title','channelId']].value_counts()>=20).sum()
print(answer)
40
์ ํ๋ธ ๊ณต๋ฒ์ปจํ ์ธ ๋์์ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/kukuroo3/youtube-episodic-contents-kr(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์ ํ๋ธ โ๊ณต๋ฒโ ์ปจํ
์ธ ๋์์ ์ ๋ณด ( 10๋ถ ๊ฐ๊ฒฉ ์์ง)
dataurl1 (๋น๋์ค ์ ๋ณด) = https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/videoInfo.csv
dataurl2 (์ฐธ๊ฐ์ ์ฑ๋ ์ ๋ณด)= https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/channelInfo.csv
import pandas as pd
channel =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/channelInfo.csv')
video =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/youtube/videoInfo.csv')
display(channel.head())
display(video.head())
channelid | subcnt | viewcnt | videocnt | ct | channelname | |
---|---|---|---|---|---|---|
0 | UCkQCwnkQfgSuPTTnw_Y7v7w | 1310000 | 410238653 | 736 | 2021-09-30 03:01:03 | ๊ฝ๋๋ฃน |
1 | UCkQCwnkQfgSuPTTnw_Y7v7w | 1310000 | 412531322 | 736 | 2021-09-30 09:01:03 | ๊ฝ๋๋ฃน |
2 | UCkQCwnkQfgSuPTTnw_Y7v7w | 1310000 | 412531322 | 735 | 2021-09-30 15:01:03 | ๊ฝ๋๋ฃน |
3 | UCkQCwnkQfgSuPTTnw_Y7v7w | 1310000 | 412531322 | 737 | 2021-09-30 21:01:03 | ๊ฝ๋๋ฃน |
4 | UCkQCwnkQfgSuPTTnw_Y7v7w | 1320000 | 412531322 | 737 | 2021-10-01 03:01:04 | ๊ฝ๋๋ฃน |
videopk | viewcnt | likecnt | dislikecnt | favoritecnt | cmcnt | ct | videoname | |
---|---|---|---|---|---|---|---|---|
0 | c5JQp6xafqc | 1667010 | 30474 | 706 | 0 | 6587 | 2021-10-10 15:20:03 | ๊ณต๋ฒ EP1 |
1 | c5JQp6xafqc | 1669089 | 30495 | 707 | 0 | 6589 | 2021-10-10 15:30:03 | ๊ณต๋ฒ EP1 |
2 | c5JQp6xafqc | 1674759 | 30522 | 711 | 0 | 6596 | 2021-10-10 15:40:02 | ๊ณต๋ฒ EP1 |
3 | c5JQp6xafqc | 1677026 | 30555 | 712 | 0 | 6604 | 2021-10-10 15:50:03 | ๊ณต๋ฒ EP1 |
4 | c5JQp6xafqc | 1681824 | 30585 | 713 | 0 | 6600 | 2021-10-10 16:00:03 | ๊ณต๋ฒ EP1 |
Question 11
๊ฐ ๋ฐ์ดํฐ์ โctโ์ปฌ๋ผ์ ์๊ฐ์ผ๋ก ์ธ์ํ ์ ์๊ฒ datatype์ ๋ณ๊ฒฝํ๊ณ video ๋ฐ์ดํฐ์ videoname์ ๊ฐ value ๋ง๋ค ๋ช๊ฐ์ ๋ฐ์ดํฐ์ฉ ๊ฐ์ง๊ณ ์๋์ง ํ์ธํ๋ผ
Show code cell source
video['ct'] = pd.to_datetime(video['ct'])
answer = video.videoname.value_counts()
print(answer)
๊ณต๋ฒ EP1 3492
๊ณต๋ฒ EP2 3204
๊ณต๋ฒ EP3 2568
๊ณต๋ฒ EP4 2280
๊ณต๋ฒ EP5 1562
๊ณต๋ฒ EP6 1274
๊ณต๋ฒ EP7 555
๊ณต๋ฒ EP8 266
Name: videoname, dtype: int64
Question 12
์์ง๋ ๊ฐ video์ ๊ฐ์ฅ ์ต์ ํ ๋ ๋ ์ง์ viewcount๊ฐ์ ์ถ๋ ฅํ๋ผ
Show code cell source
answer = video.sort_values(['videoname','ct']).drop_duplicates('videoname',keep='last')[['viewcnt','videoname','ct']].reset_index(drop=True)
display(answer)
viewcnt | videoname | ct | |
---|---|---|---|
0 | 3180532 | ๊ณต๋ฒ EP1 | 2021-11-01 15:30:03 |
1 | 2199328 | ๊ณต๋ฒ EP2 | 2021-11-01 15:30:03 |
2 | 1671294 | ๊ณต๋ฒ EP3 | 2021-11-01 15:30:03 |
3 | 1818493 | ๊ณต๋ฒ EP4 | 2021-11-01 15:30:03 |
4 | 1503435 | ๊ณต๋ฒ EP5 | 2021-11-01 15:30:04 |
5 | 1750222 | ๊ณต๋ฒ EP6 | 2021-11-01 15:30:04 |
6 | 1630200 | ๊ณต๋ฒ EP7 | 2021-11-01 15:30:05 |
7 | 1289088 | ๊ณต๋ฒ EP8 | 2021-11-01 15:30:05 |
Question 13
Channel ๋ฐ์ดํฐ์ค 2021-10-03์ผ ์ดํ ๊ฐ ์ฑ๋์ ์ฒ์ ๊ธฐ๋ก ๋๋ ๊ตฌ๋ ์ ์(subcnt)๋ฅผ ์ถ๋ ฅํ๋ผ
Show code cell source
channel.ct = pd.to_datetime(channel.ct)
target = channel[channel.ct >= pd.to_datetime('2021-10-03')].sort_values(['ct','channelname']).drop_duplicates('channelname')
answer = target[['channelname','subcnt']].reset_index(drop=True)
print(answer)
channelname subcnt
0 ๋
ผ๋ฆฌ์ ์ ๊ธฐ 922000
1 ์ผ์ ์ฝ์งฑ์ฌ 257000
2 ์ธ์ ChuJeong 322000
3 ๊น๋๋ฐ์ ๋๋ฐํ ์ถ 7520
4 ๋ฆดํ Lilpearl 10100
5 ์๊ธ์๊ธ WagleWagle 55000
6 ์กฐ๋๋จ 215000
7 ํ์ฌ!ํ์ ๋๋ค 14900
8 Balming Tiger 54300
9 ๊ณฝํ ๋ฆฌ kwak tori 471000
10 ๊ฝ๋๋ฃน 1330000
Question 14
๊ฐ์ฑ๋์ 2021-10-03 03:00:00 ~ 2021-11-01 15:00:00 ๊น์ง ๊ตฌ๋ ์์ (subcnt) ์ ์ฆ๊ฐ๋์ ๊ตฌํ์ฌ๋ผ
Show code cell source
end = channel.loc[channel.ct.dt.strftime('%Y-%m-%d %H') =='2021-11-01 15']
start = channel.loc[channel.ct.dt.strftime('%Y-%m-%d %H') =='2021-10-03 03']
end_df = end[['channelname','subcnt']].reset_index(drop=True)
start_df = start[['channelname','subcnt']].reset_index(drop=True)
end_df.columns = ['channelname','end_sub']
start_df.columns = ['channelname','start_sub']
tt = pd.merge(start_df,end_df)
tt['del'] = tt['end_sub'] - tt['start_sub']
result = tt[['channelname','del']]
display(result)
channelname | del | |
---|---|---|
0 | ๊ฝ๋๋ฃน | 70000 |
1 | ์ผ์ ์ฝ์งฑ์ฌ | 11000 |
2 | ์ธ์ ChuJeong | 1000 |
3 | ๋ ผ๋ฆฌ์ ์ ๊ธฐ | -11000 |
4 | ์๊ธ์๊ธ WagleWagle | 0 |
5 | ์กฐ๋๋จ | 12000 |
6 | ํ์ฌ!ํ์ ๋๋ค | 10300 |
7 | ๊น๋๋ฐ์ ๋๋ฐํ ์ถ | 1540 |
8 | ๋ฆดํ Lilpearl | 11000 |
9 | ๊ณฝํ ๋ฆฌ kwak tori | -2000 |
10 | Balming Tiger | 2500 |
Question 15
๊ฐ ๋น๋์ค๋ 10๋ถ ๊ฐ๊ฒฉ์ผ๋ก ๊ตฌ๋ ์์, ์ข์์, ์ซ์ด์์, ๋๊ธ์๊ฐ ์์ง๋๊ฒ์ผ๋ก ์๋ ค์ก๋ค. ๊ณต๋ฒ EP1์ ๋น๋์ค์ ๋ณด ๋ฐ์ดํฐ์ค ์์ง๊ฐ๊ฒฉ์ด 5๋ถ ์ดํ, 20๋ถ์ด์์ธ ๋ฐ์ดํฐ ๊ตฌ๊ฐ( ํด๋น ์์ ์ ,ํ) ์ ์๊ฐ์ ๋ชจ๋ ์ถ๋ ฅํ๋ผ
Show code cell source
import datetime
ep_one = video.loc[video.videoname.str.contains('1')].sort_values('ct').reset_index(drop=True)
ep_one[
(ep_one.ct.diff(1) >=datetime.timedelta(minutes=20)) | \
(ep_one.ct.diff(1) <=datetime.timedelta(minutes=5))
]
answer = ep_one[ep_one.index.isin([720,721,722,723,1635,1636,1637])]
display(answer)
videopk | viewcnt | likecnt | dislikecnt | favoritecnt | cmcnt | ct | videoname | |
---|---|---|---|---|---|---|---|---|
720 | c5JQp6xafqc | 2153676 | 34104 | 830 | 0 | 7120 | 2021-10-12 19:20:03 | ๊ณต๋ฒ EP1 |
721 | c5JQp6xafqc | 2228250 | 34559 | 849 | 0 | 7191 | 2021-10-13 09:41:37 | ๊ณต๋ฒ EP1 |
722 | c5JQp6xafqc | 2228250 | 34559 | 849 | 0 | 7191 | 2021-10-13 09:41:37 | ๊ณต๋ฒ EP1 |
723 | c5JQp6xafqc | 2228669 | 34566 | 849 | 0 | 7194 | 2021-10-13 09:50:03 | ๊ณต๋ฒ EP1 |
1635 | c5JQp6xafqc | 2706691 | 37893 | 987 | 0 | 7410 | 2021-10-19 17:50:02 | ๊ณต๋ฒ EP1 |
1636 | c5JQp6xafqc | 2707933 | 37901 | 988 | 0 | 7410 | 2021-10-19 18:20:03 | ๊ณต๋ฒ EP1 |
1637 | c5JQp6xafqc | 2708448 | 37903 | 988 | 0 | 7410 | 2021-10-19 18:30:03 | ๊ณต๋ฒ EP1 |
Question 16
๊ฐ ์ํผ์๋์ ์์๋ ์ง(๋ -์-์ผ)๋ฅผ ์ํผ์๋ ์ด๋ฆ๊ณผ ๋ฌถ์ด ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ๋ง๋ค๊ณ ์ถ๋ ฅํ๋ผ
Show code cell source
start_date = video.sort_values(['ct','videoname']).drop_duplicates('videoname')[['ct','videoname']]
start_date['date'] = start_date.ct.dt.date
answer = start_date[['date','videoname']]
display(answer)
date | videoname | |
---|---|---|
10 | 2021-10-07 | ๊ณต๋ฒ EP1 |
3496 | 2021-10-09 | ๊ณต๋ฒ EP2 |
6696 | 2021-10-14 | ๊ณต๋ฒ EP3 |
9264 | 2021-10-16 | ๊ณต๋ฒ EP4 |
11544 | 2021-10-21 | ๊ณต๋ฒ EP5 |
13661 | 2021-10-23 | ๊ณต๋ฒ EP6 |
13106 | 2021-10-28 | ๊ณต๋ฒ EP7 |
14935 | 2021-10-30 | ๊ณต๋ฒ EP8 |
Question 17
โ๊ณต๋ฒโ ์ปจํ ์ธ ์ ๊ฒฝ์ฐ 19:00์์ ๊ณต๊ฐ ๋๋๊ฒ์ผ๋ก ์๋ ค์ ธ์๋ค. ๊ณต๊ฐ๋ ๋ ์ 21์์ viewcnt, ct, videoname ์ผ๋ก ๊ตฌ์ฑ๋ ๋ฐ์ดํฐ ํ๋ ์์ viewcnt๋ฅผ ๋ด๋ฆผ์ฐจ์์ผ๋ก ์ ๋ ฌํ์ฌ ์ถ๋ ฅํ๋ผ
Show code cell source
video['time']= video.ct.dt.hour
answer = video.loc[video['time'] ==21] \
.sort_values(['videoname','ct'])\
.drop_duplicates('videoname') \
.sort_values('viewcnt',ascending=False)[['videoname','viewcnt','ct']]\
.reset_index(drop=True)
display(answer)
videoname | viewcnt | ct | |
---|---|---|---|
0 | ๊ณต๋ฒ EP8 | 264029 | 2021-10-30 21:00:08 |
1 | ๊ณต๋ฒ EP7 | 252032 | 2021-10-28 21:00:04 |
2 | ๊ณต๋ฒ EP4 | 217674 | 2021-10-16 21:00:04 |
3 | ๊ณต๋ฒ EP6 | 213899 | 2021-10-23 21:00:06 |
4 | ๊ณต๋ฒ EP5 | 201179 | 2021-10-21 21:00:04 |
5 | ๊ณต๋ฒ EP2 | 148144 | 2021-10-09 21:00:03 |
6 | ๊ณต๋ฒ EP3 | 147183 | 2021-10-14 21:00:04 |
7 | ๊ณต๋ฒ EP1 | 117340 | 2021-10-07 21:00:03 |
Question 18
video ์ ๋ณด์ ๊ฐ์ฅ ์ต๊ทผ ๋ฐ์ดํฐ๋ค์์ ๊ฐ ์ํผ์๋์ ์ซ์ด์/์ข์์ ๋น์จ์ ratio ์ปฌ๋ผ์ผ๋ก ๋ง๋ค๊ณ videoname, ratio๋ก ๊ตฌ์ฑ๋ ๋ฐ์ดํฐ ํ๋ ์์ ratio๋ฅผ ์ค๋ฆ์ฐจ์์ผ๋ก ์ ๋ ฌํ๋ผ
Show code cell source
target = video.sort_values('ct').drop_duplicates('videoname',keep='last')
target['ratio'] =target['dislikecnt'] / target['likecnt']
answer = target.sort_values('ratio')[['videoname','ratio']].reset_index(drop=True)
answer
videoname | ratio | |
---|---|---|
0 | ๊ณต๋ฒ EP7 | 0.013636 |
1 | ๊ณต๋ฒ EP6 | 0.015881 |
2 | ๊ณต๋ฒ EP4 | 0.017971 |
3 | ๊ณต๋ฒ EP5 | 0.021128 |
4 | ๊ณต๋ฒ EP3 | 0.022360 |
5 | ๊ณต๋ฒ EP1 | 0.027118 |
6 | ๊ณต๋ฒ EP2 | 0.029844 |
7 | ๊ณต๋ฒ EP8 | 0.141357 |
Question 19
2021-11-01 00:00:00 ~ 15:00:00๊น์ง ๊ฐ ์ํผ์๋๋ณ viewcnt์ ์ฆ๊ฐ๋์ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ๋ง๋์์ค
Show code cell source
start = pd.to_datetime("2021-11-01 00:00:00")
end = pd.to_datetime("2021-11-01 15:00:00")
target = video.loc[(video["ct"] >= start) & (video['ct'] <= end)].reset_index(drop=True)
def check(x):
result = max(x) - min(x)
return result
answer = target[['videoname','viewcnt']].groupby("videoname").agg(check)
answer
viewcnt | |
---|---|
videoname | |
๊ณต๋ฒ EP1 | 13298 |
๊ณต๋ฒ EP2 | 10300 |
๊ณต๋ฒ EP3 | 9927 |
๊ณต๋ฒ EP4 | 9824 |
๊ณต๋ฒ EP5 | 10824 |
๊ณต๋ฒ EP6 | 14141 |
๊ณต๋ฒ EP7 | 26949 |
๊ณต๋ฒ EP8 | 89147 |
Question 20
video ๋ฐ์ดํฐ ์ค์์ ์ค๋ณต๋๋ ๋ฐ์ดํฐ๊ฐ ์กด์ฌํ๋ค. ์ค๋ณต๋๋ ๊ฐ ๋ฐ์ดํฐ์ ์๊ฐ๋์ videoname ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer = video[video.index.isin(set(video.index) - set(video.drop_duplicates().index))]
result = answer[['videoname','ct']]
display(result)
videoname | ct | |
---|---|---|
722 | ๊ณต๋ฒ EP1 | 2021-10-13 09:41:37 |
3927 | ๊ณต๋ฒ EP2 | 2021-10-13 09:41:37 |
์๋์ปต ์ถ์ ์ ์ ๊ณจ๊ธฐ๋ก ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/darinhawley/fifa-world-cup-goalscorers-19302018(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: 1930 ~2018๋
๋ ์๋์ปต ์ถ์ ์ ์ ๊ณจ๊ธฐ๋ก
data url = https://raw.githubusercontent.com/Datamanim/datarepo/main/worldcup/worldcupgoals.csv
import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/worldcup/worldcupgoals.csv')
df.head()
Player | Goals | Years | Country | |
---|---|---|---|---|
0 | Miroslav Klose | 16 | 2002-2006-2010-2014 | Germany |
1 | Ronaldo | 15 | 1998-2002-2006 | Brazil |
2 | Gerd Muller | 14 | 1970-1974 | Germany |
3 | Just Fontaine | 13 | 1958 | France |
4 | Pele | 12 | 1958-1962-1966-1970 | Brazil |
Question 21
์ฃผ์ด์ง ์ ์ฒด ๊ธฐ๊ฐ์ ๊ฐ ๋๋ผ๋ณ ๊ณจ๋์ ์ ์์ 5๊ฐ ๊ตญ๊ฐ์ ๊ทธ ๋์ ์๋ฅผ ๋ฐ์ดํฐํ๋ ์ํํ๋ก ์ถ๋ ฅํ๋ผ
Show code cell source
result = df.groupby('Country').sum().sort_values('Goals',ascending=False).head(5)
display(result)
Goals | |
---|---|
Country | |
Brazil | 228 |
Germany | 226 |
Argentina | 135 |
Italy | 123 |
France | 115 |
Question 22
์ฃผ์ด์ง ์ ์ฒด๊ธฐ๊ฐ๋์ ๊ณจ๋์ ์ ํ ์ ์๊ฐ ๊ฐ์ฅ ๋ง์ ๋๋ผ ์์ 5๊ฐ ๊ตญ๊ฐ์ ๊ทธ ์ ์ ์ซ์๋ฅผ ๋ฐ์ดํฐ ํ๋ ์ ํ์์ผ๋ก ์ถ๋ ฅํ๋ผ
Show code cell source
result = df.groupby('Country').size().sort_values(ascending=False).head(5)
print(result)
Country
Brazil 81
Germany 78
Italy 60
Argentina 59
France 58
dtype: int64
Question 23
Years ์ปฌ๋ผ์ ๋ ๋ -๋ ๋ ํ์์ผ๋ก ๊ตฌ์ฑ๋์ด์๊ณ , ๊ฐ ๋ ๋๋ 4์๋ฆฌ ์ซ์์ด๋ค. ๋ ๋ ํ๊ธฐ๊ฐ 4์๋ฆฌ ์ซ์๋ก ์๋ ์ผ์ด์ค๊ฐ ์กด์ฌํ๋ค. ํด๋น ๊ฑด์ ๋ช๊ฑด์ธ์ง ์ถ๋ ฅํ๋ผ
Show code cell source
df['yearLst'] = df.Years.str.split('-')
def checkFour(x):
for value in x:
if len(str(value)) != 4:
return False
return True
df['check'] = df['yearLst'].apply(checkFour)
result = len(df[df.check ==False])
result
45
Question 24
**Q3์์ ๋ฐ์ํ ์์ธ ์ผ์ด์ค๋ฅผ ์ ์ธํ ๋ฐ์ดํฐํ๋ ์์ df2๋ผ๊ณ ์ ์ํ๊ณ ๋ฐ์ดํฐ์ ํ์ ์ซ์๋ฅผ ์ถ๋ ฅํ๋ผ (์๋ ๋ฌธ์ ๋ถํฐ๋ df2๋ก ํ์ดํ๊ฒ ์ต๋๋ค) **
Show code cell source
df2 = df[df.check ==True].reset_index(drop=True)
print(df2.shape[0])
1250
Question 25
์๋์ปต ์ถ์ ํ์๋ฅผ ๋ํ๋ด๋ โLenCupโ ์ปฌ๋ผ์ ์ถ๊ฐํ๊ณ 4ํ ์ถ์ ํ ์ ์์ ์ซ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
df2['LenCup'] =df2['yearLst'].str.len()
result = df2['LenCup'].value_counts()[4]
print(result)
16
Question 26
Yugoslavia ๊ตญ๊ฐ์ ์๋์ปต ์ถ์ ํ์๊ฐ 2ํ์ธ ์ ์๋ค์ ์ซ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = len(df2[(df2.LenCup==2) & (df2.Country =='Yugoslavia')])
print(result)
7
Question 27
2002๋ ๋์ ์ถ์ ํ ์ ์ฒด ์ ์๋ ๋ช๋ช ์ธ๊ฐ?
Show code cell source
result =len(df2[df2.Years.str.contains('2002')])
print(result)
156
Question 28
์ด๋ฆ์ โcarlosโ ๋จ์ด๊ฐ ๋ค์ด๊ฐ๋ ์ ์์ ์ซ์๋ ๋ช ๋ช ์ธ๊ฐ? (๋, ์๋ฌธ์ ๊ตฌ๋ถ x)
Show code cell source
result = len(df2[df2.Player.str.lower().str.contains('carlos')])
print(result)
13
Question 29
์๋์ปต ์ถ์ ํ์๊ฐ 1ํ๋ฟ์ธ ์ ์๋ค ์ค์์ ๊ฐ์ฅ ๋ง์ ๋์ ์ ์ฌ๋ ธ๋ ์ ์๋ ๋๊ตฌ์ธ๊ฐ?
Show code cell source
result = df2[df2.LenCup==1].sort_values('Goals',ascending=False).Player.values[0]
print(result)
Just Fontaine
Question 30
์๋์ปต ์ถ์ ํ์๊ฐ 1ํ ๋ฟ์ธ ์ ์๋ค์ด ๊ฐ์ฅ ๋ง์ ๊ตญ๊ฐ๋ ์ด๋์ธ๊ฐ?
Show code cell source
result= df2[df2.LenCup==1].Country.value_counts().index[0]
print(result)
Brazil
์์ธ์ ๋ฐ๋ฆ์ด ์ด์ฉ์ ๋ณด ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.data.go.kr/data/15051872/fileData.do(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์์ธํน๋ณ์_๊ณต๊ณต์์ ๊ฑฐ ์๊ฐ๋๋ณ ์ด์ฉ์ ๋ณด
data url = https://raw.githubusercontent.com/Datamanim/datarepo/main/bicycle/seoul_bi.csv
import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bicycle/seoul_bi.csv')
df.head()
๋์ฌ์ผ์ | ๋์ฌ์๊ฐ | ๋์ฌ์๋ฒํธ | ๋์ฌ๊ตฌ๋ถ์ฝ๋ | ์ฑ๋ณ | ์ฐ๋ น๋์ฝ๋ | ์ด์ฉ๊ฑด์ | ์ด๋๋ | ํ์๋ | ์ด๋๊ฑฐ๋ฆฌ | ์ฌ์ฉ์๊ฐ | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-06-01 | 0 | 3541 | ์ ๊ธฐ๊ถ | F | ~10๋ | 1 | 0.00 | 0.00 | 0.00 | 8 |
1 | 2021-06-01 | 0 | 765 | ์ ๊ธฐ๊ถ | F | ~10๋ | 1 | 27.21 | 0.35 | 1526.81 | 19 |
2 | 2021-06-01 | 0 | 2637 | ์ ๊ธฐ๊ถ | F | ~10๋ | 1 | 41.40 | 0.37 | 1608.56 | 18 |
3 | 2021-06-01 | 0 | 2919 | ์ ๊ธฐ๊ถ | F | ~10๋ | 1 | 0.00 | 0.00 | 0.00 | 75 |
4 | 2021-06-01 | 0 | 549 | ์ ๊ธฐ๊ถ | F | ~10๋ | 1 | 13.04 | 0.17 | 731.55 | 6 |
Question 31
๋์ฌ์ผ์๋ณ ๋ฐ์ดํฐ์ ์๋ฅผ ๋ฐ์ดํฐํ๋ ์์ผ๋ก ์ถ๋ ฅํ๊ณ , ๊ฐ์ฅ ๋ง์ ๋ฐ์ดํฐ๊ฐ ์๋ ๋ ์ง๋ฅผ ์ถ๋ ฅํ๋ผ
Show code cell source
result = df['๋์ฌ์ผ์'].value_counts().sort_index().to_frame()
answer = result[result.๋์ฌ์ผ์ == result.๋์ฌ์ผ์.max()].index[0]
display(result)
print(answer)
๋์ฌ์ผ์ | |
---|---|
2021-06-01 | 48081 |
2021-06-02 | 56100 |
2021-06-03 | 18985 |
2021-06-04 | 56788 |
2021-06-05 | 52730 |
2021-06-06 | 47787 |
2021-06-07 | 52623 |
2021-06-08 | 55623 |
2021-06-09 | 54507 |
2021-06-10 | 36992 |
2021-06-11 | 52089 |
2021-06-12 | 48569 |
2021-06-13 | 43131 |
2021-06-14 | 54945 |
2021-06-04
Question 32
๊ฐ ์ผ์์ ์์ผ์ ํ๊ธฐํ๊ณ (โMondayโ ~โSundayโ) โday_nameโ์ปฌ๋ผ์ ์ถ๊ฐํ๊ณ ์ด๋ฅผ ์ด์ฉํ์ฌ ๊ฐ ์์ผ๋ณ ์ด์ฉ ํ์์ ์ดํฉ์ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ์ถ๋ ฅํ๋ผ
Show code cell source
df['๋์ฌ์ผ์'] = pd.to_datetime(df['๋์ฌ์ผ์'])
df['day_name'] = df['๋์ฌ์ผ์'].dt.day_name()
result = df.day_name.value_counts().to_frame()
print(result)
day_name
Wednesday 110607
Friday 108877
Monday 107568
Tuesday 103704
Saturday 101299
Sunday 90918
Thursday 55977
Question 33
๊ฐ ์์ผ๋ณ ๊ฐ์ฅ ๋ง์ด ์ด์ฉํ ๋์ฌ์์ ์ด์ฉํ์์ ๋์ฌ์ ๋ฒํธ๋ฅผ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ์ถ๋ ฅํ๋ผ
Show code cell source
result = df.groupby(['day_name','๋์ฌ์๋ฒํธ']).size().to_frame('size').sort_values(['day_name','size'],ascending=False).reset_index()
answer = result.drop_duplicates('day_name',keep='first').reset_index(drop=True)
display(answer)
day_name | ๋์ฌ์๋ฒํธ | size | |
---|---|---|---|
0 | Wednesday | 502 | 282 |
1 | Tuesday | 502 | 267 |
2 | Thursday | 2715 | 137 |
3 | Sunday | 502 | 372 |
4 | Saturday | 502 | 378 |
5 | Monday | 502 | 242 |
6 | Friday | 502 | 277 |
Question 34
๋์ด๋๋ณ ๋์ฌ๊ตฌ๋ถ ์ฝ๋์ (์ผ์ผ๊ถ/์ ์ฒดํ์) ๋น์จ์ ๊ตฌํ ํ ๊ฐ์ฅ ๋์ ๋น์จ์ ๊ฐ์ง๋ ๋์ด๋๋ฅผ ํ์ธํ๋ผ. ์ผ์ผ๊ถ์ ๊ฒฝ์ฐ ์ผ์ผ๊ถ ๊ณผ ์ผ์ผ๊ถ(๋นํ์)์ ๋ชจ๋ ํฌํจํ๋ผ
Show code cell source
daily = df[df.๋์ฌ๊ตฌ๋ถ์ฝ๋.isin(['์ผ์ผ๊ถ','์ผ์ผ๊ถ(๋นํ์)'])].์ฐ๋ น๋์ฝ๋.value_counts().sort_index()
total = df.์ฐ๋ น๋์ฝ๋.value_counts().sort_index()
ratio = (daily /total).sort_values(ascending=False)
print(ratio)
print('max ratio age ',ratio.index[0])
~10๋ 0.312525
20๋ 0.239913
30๋ 0.166702
40๋ 0.110585
50๋ 0.065414
70๋~ 0.048962
60๋ 0.042462
Name: ์ฐ๋ น๋์ฝ๋, dtype: float64
max ratio age ~10๋
Question 35
์ฐ๋ น๋๋ณ ํ๊ท ์ด๋๊ฑฐ๋ฆฌ๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df[['์ฐ๋ น๋์ฝ๋','์ด๋๊ฑฐ๋ฆฌ']].groupby(['์ฐ๋ น๋์ฝ๋']).mean()
print(result)
์ด๋๊ฑฐ๋ฆฌ
์ฐ๋ น๋์ฝ๋
20๋ 3211.890552
30๋ 3341.443859
40๋ 3514.857416
50๋ 3593.668100
60๋ 3538.145737
70๋~ 3085.039641
~10๋ 2634.426279
Question 36
์ฐ๋ น๋ ์ฝ๋๊ฐ 20๋์ธ ๋ฐ์ดํฐ๋ฅผ ์ถ์ถํ๊ณ ,์ด๋๊ฑฐ๋ฆฌ๊ฐ์ด ์ถ์ถํ ๋ฐ์ดํฐ์ ์ด๋๊ฑฐ๋ฆฌ๊ฐ์ ํ๊ท ์ด์์ธ ๋ฐ์ดํฐ๋ฅผ ์ถ์ถํ๋ค.์ต์ข ์ถ์ถ๋ ๋ฐ์ดํฐ๋ฅผ ๋์ฌ์ผ์, ๋์ฌ์ ๋ฒํธ ์์๋ก ๋ด๋ฆผ์ฐจ์ ์ ๋ ฌ ํ 1ํ๋ถํฐ 200ํ๊น์ง์ ํ์๋์ ํ๊ท ์ ์์ซ์ 3์งธ ์๋ฆฌ๊น์ง ๊ตฌํ์ฌ๋ผ
Show code cell source
tw = df[df.์ฐ๋ น๋์ฝ๋ =='20๋'].reset_index(drop=True)
tw_mean = tw[tw.์ด๋๊ฑฐ๋ฆฌ >= tw.์ด๋๊ฑฐ๋ฆฌ.mean()].reset_index(drop=True)
tw_mean['ํ์๋'] =tw_mean['ํ์๋'].astype('float')
target =tw_mean.sort_values(['๋์ฌ์ผ์','๋์ฌ์๋ฒํธ'], ascending=False).reset_index(drop=True).iloc[:200].ํ์๋
result = round(target.sum()/len(target),3)
print(result)
1.613
Question 37
6์ 7์ผ ~10๋์ โ์ด์ฉ๊ฑด์โ์ ์ค์๊ฐ์?
Show code cell source
df['๋์ฌ์ผ์'] =pd.to_datetime(df['๋์ฌ์ผ์'])
result = df[(df.์ฐ๋ น๋์ฝ๋ =='~10๋') & (df.๋์ฌ์ผ์ ==pd.to_datetime('2021-06-07'))].์ด์ฉ๊ฑด์.median()
print(result)
1.0
Question 38
ํ์ผ (์~๊ธ) ์ถ๊ทผ ์๊ฐ๋(์ค์ 6,7,8์)์ ๋์ฌ์๋ณ ์ด์ฉ ํ์๋ฅผ ๊ตฌํด์ ๋ฐ์ดํฐ ํ๋ ์ ํํ๋ก ํํํ ํ ๊ฐ ๋์ฌ์๊ฐ๋ณ ์ด์ฉ ํ์์ ์์ 3๊ฐ ๋์ฌ์์ ์ด์ฉํ์๋ฅผ ์ถ๋ ฅํ๋ผ
Show code cell source
target = df[(df.day_name.isin(['Tuesday', 'Wednesday', 'Thursday', 'Friday','Monday'])) & (df.๋์ฌ์๊ฐ.isin([6,7,8]))]
result = target.groupby(['๋์ฌ์๊ฐ','๋์ฌ์๋ฒํธ']).size().to_frame('์ด์ฉ ํ์')
answer = result.sort_values(['๋์ฌ์๊ฐ','์ด์ฉ ํ์'],ascending=False).groupby('๋์ฌ์๊ฐ').head(3)
display(answer)
์ด์ฉ ํ์ | ||
---|---|---|
๋์ฌ์๊ฐ | ๋์ฌ์๋ฒํธ | |
8 | 2701 | 119 |
646 | 115 | |
1152 | 92 | |
7 | 259 | 104 |
230 | 77 | |
726 | 77 | |
6 | 2744 | 45 |
1125 | 40 | |
1028 | 36 |
Question 39
์ด๋๊ฑฐ๋ฆฌ์ ํ๊ท ์ด์์ ์ด๋๊ฑฐ๋ฆฌ ๊ฐ์ ๊ฐ์ง๋ ๋ฐ์ดํฐ๋ฅผ ์ถ์ถํ์ฌ ์ถ์ถ๋ฐ์ดํฐ์ ์ด๋๊ฑฐ๋ฆฌ์ ํ๋ณธํ์คํธ์ฐจ ๊ฐ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df[df.์ด๋๊ฑฐ๋ฆฌ >= df.์ด๋๊ฑฐ๋ฆฌ.mean()].reset_index(drop=True).์ด๋๊ฑฐ๋ฆฌ.std()
print(result)
5092.139707505305
Question 40
๋จ์ฑ(โMโ or โmโ)๊ณผ ์ฌ์ฑ(โFโ or โfโ)์ ์ด๋๊ฑฐ๋ฆฌ๊ฐ์ ํ๊ท ๊ฐ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
df['sex'] = df['์ฑ๋ณ'].map(lambda x: '๋จ' if x in ['M','m'] else '์ฌ')
answer = df[['sex','์ด๋๊ฑฐ๋ฆฌ']].groupby('sex').mean()
display(answer)
์ด๋๊ฑฐ๋ฆฌ | |
---|---|
sex | |
๋จ | 3209.110871 |
์ฌ | 3468.575025 |
์ ์ธ๊ณ ํ๋ณต๋ ์งํ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/unsdsn/world-happiness(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์ ์ธ๊ณ ํ๋ณต๋ ์งํ ์กฐ์ฌ
data url = https://raw.githubusercontent.com/Datamanim/datarepo/main/happy2/happiness.csv
import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/happy2/happiness.csv',encoding='utf-8')
df.head()
ํ๋ณต๋ญํน | ๋๋ผ๋ช | ์ ์ | ์๋GDP | ์ฌํ์ ์ง์ | ํ๋ณต๊ธฐ๋์น | ์ ํ์ ์์ ๋ | ๊ด๋ํจ | ๋ถํจ์ ๋ํ์ธ์ | ๋ ๋ | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Finland | 7.769 | 1.340 | 1.587 | 0.986 | 0.596 | 0.153 | 0.393 | 2019 |
1 | 2 | Denmark | 7.600 | 1.383 | 1.573 | 0.996 | 0.592 | 0.252 | 0.410 | 2019 |
2 | 3 | Norway | 7.554 | 1.488 | 1.582 | 1.028 | 0.603 | 0.271 | 0.341 | 2019 |
3 | 4 | Iceland | 7.494 | 1.380 | 1.624 | 1.026 | 0.591 | 0.354 | 0.118 | 2019 |
4 | 5 | Netherlands | 7.488 | 1.396 | 1.522 | 0.999 | 0.557 | 0.322 | 0.298 | 2019 |
Question 41
๋ฐ์ดํฐ๋ 2018๋ ๋์ 2019๋ ๋์ ์ ์ธ๊ณ ํ๋ณต ์ง์๋ฅผ ํํํ๋ค. ๊ฐ๋ ๋์ ํ๋ณต๋ญํน 10์๋ฅผ ์ฐจ์งํ ๋๋ผ์ ํ๋ณต์ ์์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df[df.ํ๋ณต๋ญํน ==10]['์ ์'].mean()
print(result)
7.259
Question 42
๋ฐ์ดํฐ๋ 2018๋ ๋์ 2019๋ ๋์ ์ ์ธ๊ณ ํ๋ณต ์ง์๋ฅผ ํํํ๋ค. ๊ฐ๋ ๋์ ํ๋ณต๋ญํน 50์์ด๋ด์ ๋๋ผ๋ค์ ๊ฐ๊ฐ์ ํ๋ณต์ ์ ํ๊ท ์ ๋ฐ์ดํฐํ๋ ์์ผ๋ก ํ์ํ๋ผ
Show code cell source
result = df[df.ํ๋ณต๋ญํน<=50][['๋
๋','์ ์']].groupby('๋
๋').mean()
print(result)
์ ์
๋
๋
2018 6.64678
2019 6.67002
Question 43
2018๋ ๋ ๋ฐ์ดํฐ๋ค๋ง ์ถ์ถํ์ฌ ํ๋ณต์ ์์ ๋ถํจ์ ๋ํ ์ธ์์ ๋ํ ์๊ด๊ณ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df[df.๋
๋ ==2018][['์ ์','๋ถํจ์ ๋ํ์ธ์']].corr().iloc[0,1]
print(result)
0.40529152271510027
Question 44
2018๋ ๋์ 2019๋ ๋์ ํ๋ณต๋ญํน์ด ๋ณํํ์ง ์์ ๋๋ผ๋ช ์ ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = len(df[['ํ๋ณต๋ญํน','๋๋ผ๋ช
']]) - len(df[['ํ๋ณต๋ญํน','๋๋ผ๋ช
']].drop_duplicates())
print(result)
15
Question 45
2019๋ ๋ ๋ฐ์ดํฐ๋ค๋ง ์ถ์ถํ์ฌ ๊ฐ๋ณ์๊ฐ ์๊ด๊ณ์๋ฅผ ๊ตฌํ๊ณ ๋ด๋ฆผ์ฐจ์์ผ๋ก ์ ๋ ฌํ ํ ์์ 5๊ฐ๋ฅผ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ์ถ๋ ฅํ๋ผ. ์ปฌ๋ผ๋ช ์ v1,v2,corr์ผ๋ก ํ์ํ๋ผ
Show code cell source
zz = df[df.๋
๋ ==2019].corr().unstack().to_frame().reset_index().dropna()
result = zz[zz[0] !=1].sort_values(0,ascending=False).drop_duplicates(0)
answer = result.head(5).reset_index(drop=True)
answer.columns = ['v1','v2','corr']
display(answer)
v1 | v2 | corr | |
---|---|---|---|
0 | ํ๋ณต๊ธฐ๋์น | ์๋GDP | 0.835462 |
1 | ์๋GDP | ์ ์ | 0.793883 |
2 | ํ๋ณต๊ธฐ๋์น | ์ ์ | 0.779883 |
3 | ์ฌํ์ ์ง์ | ์ ์ | 0.777058 |
4 | ์ฌํ์ ์ง์ | ์๋GDP | 0.754906 |
Question 46
๊ฐ ๋ ๋๋ณ ํ์ ํ๋ณต์ ์์ ํ์ 5๊ฐ ๊ตญ๊ฐ์ ํ๊ท ํ๋ณต์ ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df.groupby('๋
๋').tail(5).groupby('๋
๋').mean()[['์ ์']]
print(result)
์ ์
๋
๋
2018 3.1800
2019 3.1408
Question 47
2019๋ ๋ฐ์ดํฐ๋ฅผ ์ถ์ถํ๊ณ ํด๋น๋ฐ์ดํฐ์ ์๋ GDP ํ๊ท ์ด์์ ๋๋ผ๋ค๊ณผ ํ๊ท ์ดํ์ ๋๋ผ๋ค์ ํ๋ณต์ ์ ํ๊ท ์ ๊ฐ๊ฐ ๊ตฌํ๊ณ ๊ทธ ์ฐจ์ด๊ฐ์ ์ถ๋ ฅํ๋ผ
Show code cell source
over = df[df.์๋GDP >= df.์๋GDP.mean()]['์ ์'].mean()
under = df[df.์๋GDP <= df.์๋GDP.mean()]['์ ์'].mean()
result= over - under
print(result)
1.4212148925537225
Question 48
๊ฐ๋ ๋์ ๋ถํจ์ ๋ํ์ธ์์ ๋ด๋ฆผ์ฐจ์ ์ ๋ ฌํ์๋ ์์ 20๊ฐ ๊ตญ๊ฐ์ ๋ถํจ์ ๋ํ์ธ์์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df.sort_values(['๋
๋','๋ถํจ์ ๋ํ์ธ์'],ascending=False).groupby('๋
๋').head(20).groupby(['๋
๋']).mean()[['๋ถํจ์ ๋ํ์ธ์']]
print(result)
๋ถํจ์ ๋ํ์ธ์
๋
๋
2018 0.3267
2019 0.3201
Question 49
2018๋ ๋ ํ๋ณต๋ญํน 50์ ์ด๋ด์ ํฌํจ๋๋ค๊ฐ 2019๋ 50์ ๋ฐ์ผ๋ก ๋ฐ๋ ค๋ ๊ตญ๊ฐ์ ์ซ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = set(df[(df.๋
๋==2018) & (df.ํ๋ณต๋ญํน <=50)].๋๋ผ๋ช
) -set(df[(df.๋
๋==2019) & (df.ํ๋ณต๋ญํน <=50)].๋๋ผ๋ช
)
answer = len(result)
print(answer)
4
Question 50
2018๋ ,2019๋ ๋ชจ๋ ๊ธฐ๋ก์ด ์๋ ๋๋ผ๋ค ์ค ๋ ๋๋ณ ํ๋ณต์ ์๊ฐ ๊ฐ์ฅ ์ฆ๊ฐํ ๋๋ผ์ ๊ทธ ์ฆ๊ฐ ์์น๋?
Show code cell source
count = df.๋๋ผ๋ช
.value_counts()
target = count[count>=2].index
df2 =df.copy()
multiple = df2[df2.๋๋ผ๋ช
.isin(target)].reset_index(drop=True)
multiple.loc[multiple['๋
๋']==2018,'์ ์'] = multiple[multiple.๋
๋ ==2018]['์ ์'].values * (-1)
result = multiple.groupby('๋๋ผ๋ช
').sum()['์ ์'].sort_values().to_frame().iloc[-1]
result
์ ์ 0.87
Name: Burundi, dtype: float64
์ง์ญ๊ตฌ ์๋์ง ์๋น๋ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ : https://archive.ics.uci.edu/ml/datasets/Power+consumption+of+Tetouan+city (์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๊ธฐ์จ, ์ต๋,๋ฐ๋ํ์์ ๋ฐ๋ฅธ ๋์์ 3๊ฐ ์ง์ญ๊ตฌ์ ์๋์ง ์๋น๋
data url = https://raw.githubusercontent.com/Datamanim/datarepo/main/consum/Tetuan City power consumption.csv
import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/consum/Tetuan%20City%20power%20consumption.csv')
df.head()
DateTime | Temperature | Humidity | Wind Speed | general diffuse flows | diffuse flows | Zone 1 Power Consumption | Zone 2 Power Consumption | Zone 3 Power Consumption | |
---|---|---|---|---|---|---|---|---|---|
0 | 1/1/2017 0:00 | 6.559 | 73.8 | 0.083 | 0.051 | 0.119 | 34055.69620 | 16128.87538 | 20240.96386 |
1 | 1/1/2017 0:10 | 6.414 | 74.5 | 0.083 | 0.070 | 0.085 | 29814.68354 | 19375.07599 | 20131.08434 |
2 | 1/1/2017 0:20 | 6.313 | 74.5 | 0.080 | 0.062 | 0.100 | 29128.10127 | 19006.68693 | 19668.43373 |
3 | 1/1/2017 0:30 | 6.121 | 75.0 | 0.083 | 0.091 | 0.096 | 28228.86076 | 18361.09422 | 18899.27711 |
4 | 1/1/2017 0:40 | 5.921 | 75.7 | 0.081 | 0.048 | 0.085 | 27335.69620 | 17872.34043 | 18442.40964 |
Question 51
DateTime์ปฌ๋ผ์ ํตํด ๊ฐ ์๋ณ๋ก ๋ช๊ฐ์ ๋ฐ์ดํฐ๊ฐ ์๋์ง ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ๊ตฌํ์ฌ๋ผ
Show code cell source
df['DateTime'] = pd.to_datetime(df['DateTime'])
result = df['DateTime'].dt.month.value_counts().sort_index().to_frame()
print(result)
DateTime
1 4464
2 4032
3 4464
4 4320
5 4464
6 4320
7 4464
8 4464
9 4320
10 4464
11 4320
12 4320
Question 52
3์๋ฌ์ ๊ฐ ์๊ฐ๋๋ณ ์จ๋์ ํ๊ท ๋ค ์ค ๊ฐ์ฅ ๋ฎ์ ์๊ฐ๋์ ์จ๋๋ฅผ ์ถ๋ ฅํ๋ผ
Show code cell source
target = df[df.DateTime.dt.month ==3]
result = target.groupby(target.DateTime.dt.hour)['Temperature'].mean().sort_values().values[0]
print(result)
11.506612903225806
Question 53
3์๋ฌ์ ๊ฐ ์๊ฐ๋๋ณ ์จ๋์ ํ๊ท ๋ค ์ค ๊ฐ์ฅ ๋์ ์๊ฐ๋์ ์จ๋๋ฅผ ์ถ๋ ฅํ๋ผ
Show code cell source
target = df[df.DateTime.dt.month ==3]
result = target.groupby(target.DateTime.dt.hour)['Temperature'].mean().sort_values().values[-1]
print(result)
18.393602150537635
Question 54
Zone 1 Power Consumption ์ปฌ๋ผ์ value๊ฐ์ ํฌ๊ธฐ๊ฐ Zone 2 Power Consumption ์ปฌ๋ผ์ value๊ฐ์ ํฌ๊ธฐ๋ณด๋ค ํฐ ๋ฐ์ดํฐ๋ค์ Humidity์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df[df['Zone 1 Power Consumption'] > df['Zone 2 Power Consumption']].Humidity.mean()
print(result)
68.23624448055094
Question 55
๊ฐ zone์ ์๋์ง ์๋น๋์ ์๊ด๊ด๊ณ๋ฅผ ๊ตฌํด์ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ํ๊ธฐํ๋ผ
Show code cell source
result = df.iloc[:,-3:].corr()
display(result)
Zone 1 Power Consumption | Zone 2 Power Consumption | Zone 3 Power Consumption | |
---|---|---|---|
Zone 1 Power Consumption | 1.000000 | 0.834519 | 0.750733 |
Zone 2 Power Consumption | 0.834519 | 1.000000 | 0.570932 |
Zone 3 Power Consumption | 0.750733 | 0.570932 | 1.000000 |
Question 56
Temperature์ ๊ฐ์ด 10๋ฏธ๋ง์ ๊ฒฝ์ฐ A, 10์ด์ 20๋ฏธ๋ง์ ๊ฒฝ์ฐ B,20์ด์ 30๋ฏธ๋ง์ ๊ฒฝ์ฐ C, ๊ทธ ์ธ์ ๊ฒฝ์ฐ D๋ผ๊ณ ํ ๋ ๊ฐ ๋จ๊ณ์ ๋ฐ์ดํฐ ์ซ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
def split_data(x):
if x<10:
return "A"
elif x<20:
return 'B'
elif x<30:
return 'C'
else:
return 'D'
df['sp'] = df.Temperature.map(split_data)
result = df['sp'].value_counts()
display(result)
B 26993
C 21105
A 2874
D 1444
Name: sp, dtype: int64
Question 57
6์ ๋ฐ์ดํฐ์ค 12์์ Temperature์ ํ์คํธ์ฐจ๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result =df[(df.DateTime.dt.month ==6) & (df.DateTime.dt.hour ==12)].Temperature.std()
print(result)
2.049941782795103
Question 58
6์ ๋ฐ์ดํฐ์ค 12์์ Temperature์ ๋ถ์ฐ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result =df[(df.DateTime.dt.month ==6) & (df.DateTime.dt.hour ==12)].Temperature.var()
print(result)
4.202261312849164
Question 59
Temperature์ ํ๊ท ์ด์์ Temperature์ ๊ฐ์ ๊ฐ์ง๋ ๋ฐ์ดํฐ๋ฅผ Temperature๋ฅผ ๊ธฐ์ค์ผ๋ก ์ ๋ ฌ ํ์๋ 4๋ฒ์งธ ํ์ Humidity ๊ฐ์?
Show code cell source
result = df[df.Temperature >= df.Temperature.mean()].sort_values('Temperature').Humidity.values[3]
print(result)
87.9
Question 60
**Temperature์ ์ค๊ฐ๊ฐ ์ด์์ Temperature์ ๊ฐ์ ๊ฐ์ง๋ ๋ฐ์ดํฐ๋ฅผTemperature๋ฅผ ๊ธฐ์ค์ผ๋ก ์ ๋ ฌ ํ์๋ 4๋ฒ์งธ ํ์ Humidity ๊ฐ์?
**
Show code cell source
result = df[df.Temperature >= df.Temperature.median()].sort_values('Temperature').Humidity.values[3]
print(result)
80.3
ํฌ์ผ๋ชฌ ์ ๋ณด ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ : https://www.kaggle.com/abcsds/pokemon (์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ํฌ์ผ๋ชฌ ์ ๋ณด
data url = https://raw.githubusercontent.com/Datamanim/datarepo/main/pok/Pokemon.csv
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/pok/Pokemon.csv')
df.head()
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Question 61
Legendary ์ปฌ๋ผ์ ์ ์คํฌ์ผ๋ชฌ ์ ๋ฌด๋ฅผ ๋ํ๋ธ๋ค.์ ์คํฌ์ผ๋ชฌ๊ณผ ๊ทธ๋ ์ง ์์ ํฌ์ผ๋ชฌ๋ค์ HPํ๊ท ์ ์ฐจ์ด๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
target = df.groupby('Legendary').mean()['HP']
result = target.values[1] -target.values[0]
print(result)
25.55614861329147
Question 62
Type 1์ ์ฃผ์์ฑ Type 2 ๋ ๋ถ์์ฑ์ ๋ํ๋ธ๋ค. ๊ฐ์ฅ ๋ง์ ๋ถ์์ฑ ์ข ๋ฅ๋ ๋ฌด์์ธ๊ฐ?
Show code cell source
result = df['Type 2'].value_counts().index[0]
print(result)
Flying
Question 63
๊ฐ์ฅ ๋ง์ Type 1 ์ ์ข ์ ํ๊ท Attack ์ ํ๊ท Defense๋ก ๋๋๊ฐ์?
Show code cell source
Max = df['Type 1'].value_counts().index[0]
result = df[df['Type 1']== Max].Attack.mean() /df[df['Type 1']== Max].Defense.mean()
print(result)
1.0165238678090576
Question 64
ํฌ์ผ๋ชฌ ์ธ๋(Generation) ์ค ๊ฐ์ฅ๋ง์ Legendary๋ฅผ ๋ณด์ ํ ์ธ๋๋ ๋ช์ธ๋์ธ๊ฐ?
Show code cell source
result =df[df.Legendary==True].Generation.value_counts().index[0]
result
3
Question 65
โHPโ, โAttackโ, โDefenseโ, โSp. Atkโ, โSp. Defโ, โSpeedโ ๊ฐ์ ์๊ด ๊ณ์์ค ๊ฐ์ฅ ์ ๋๊ฐ์ด ํฐ ๋ ๋ณ์์ ๊ทธ ๊ฐ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
target = df[[ 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].corr().unstack().reset_index().rename(columns={0: "corr"})
result = target[target['corr']!=1].sort_values('corr',ascending=False).iloc[0]
print(result)
level_0 Sp. Def
level_1 Defense
corr 0.510747
Name: 26, dtype: object
Question 66
๊ฐ Generation์ Attack์ผ๋ก ์ค๋ฆ์ฐจ์ ์ ๋ ฌ์ ์์ 3๊ฐ ๋ฐ์ดํฐ๋ค(18๊ฐ)์ Attack์ ์ ์ฒด ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df.sort_values(['Generation','Attack']).groupby('Generation').head(3).Attack.mean()
print(result)
19.5
Question 67
๊ฐ Generation์ Attack์ผ๋ก ๋ด๋ฆผ์ฐจ์ ์ ๋ ฌ์ ์์ 5๊ฐ ๋ฐ์ดํฐ๋ค(30๊ฐ)์ Attack์ ์ ์ฒด ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
result = df.sort_values(['Generation','Attack'],ascending=False).groupby('Generation').head(5).Attack.mean()
print(result)
157.23333333333332
Question 68
๊ฐ์ฅ ํํ๊ฒ ๋ฐ๊ฒฌ๋๋ (Type1 , Type2) ์ ์์ ๋ฌด์์ธ๊ฐ?
Show code cell source
result = df[['Type 1','Type 2']].value_counts().head(1)
print(result)
Type 1 Type 2
Normal Flying 24
dtype: int64
Question 69
ํ๋ฒ์ฉ๋ง ์กด์ฌํ๋ (Type1 , Type2)์ ์์ ๊ฐฏ์๋ ๋ช๊ฐ์ธ๊ฐ?
Show code cell source
target = df[['Type 1','Type 2']].value_counts()
result = len(target[target==1])
print(result)
39
Question 70
ํ๋ฒ์ฉ๋ง ์กด์ฌํ๋ (Type1 , Type2)์ ์์ ๊ฐ ์ธ๋(Generation)์ ๊ฐ๊ฐ ๋ช๊ฐ์ฉ ๊ฐ์ง๊ณ ์๋๊ฐ?
Show code cell source
target = df[['Type 1','Type 2']].value_counts()
target2 =target[target==1]
lst = []
for value in target2.reset_index().values:
t1 = value[0]
t2 = value[1]
sp = df[(df['Type 1']==t1) & (df['Type 2']==t2)]
lst.append(sp)
result = pd.concat(lst).reset_index(drop=True).Generation.value_counts().sort_index()
print(result)
1 1
2 4
3 5
4 13
5 7
6 9
Name: Generation, dtype: int64
๋ํ๋ฏผ๊ตญ ์ฒด๋ ฅ์ฅ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ : ๊ตญ๋ฏผ์ฒด์ก์งํฅ๊ณต๋จ (๋ฌธํ ๋น
๋ฐ์ดํฐํ๋ซํผ) (์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๋ํ๋ฏผ๊ตญ ๊ตญ๋ฏผ ์ฒด๋ ฅ์ฅ ํ๊ฐ
data url = https://raw.githubusercontent.com/Datamanim/datarepo/main/body/body.csv
import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/body/body.csv')
df.head()
์ธก์ ๋์ด | ์ธก์ ํ์์ฑ๋ณ | ์ ์ฅ : cm | ์ฒด์ค : kg | ์ฒด์ง๋ฐฉ์จ : % | ์ด์๊ธฐํ์(์ต์ ) : mmHg | ์์ถ๊ธฐํ์(์ต๊ณ ) : mmHg | ์ ๋ ฅD : kg | ์์์๋ชธ์์ผ๋ก๊ตฝํ๊ธฐ : cm | ๊ต์ฐจ์๋ชธ์ผ์ผํค๊ธฐ : ํ | ์ ์๋ฆฌ ๋ฉ๋ฆฌ๋ฐ๊ธฐ : cm | ๋ฑ๊ธ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 59.0 | M | 175.0 | 70.6 | 19.2 | 91.0 | 150.0 | 40.6 | 12.2 | 30.0 | 179.0 | C |
1 | 40.0 | F | 161.4 | 45.8 | 21.7 | 62.0 | 119.0 | 22.1 | 16.0 | 32.0 | 165.0 | B |
2 | 27.0 | M | 176.6 | 84.8 | 19.2 | 87.0 | 153.0 | 45.9 | 13.2 | 61.0 | 216.0 | B |
3 | 38.0 | M | 167.6 | 76.7 | 24.7 | 63.0 | 132.0 | 43.6 | 16.0 | 45.0 | 231.0 | A |
4 | 21.0 | M | 165.2 | 66.2 | 21.5 | 83.0 | 106.0 | 33.5 | 10.6 | 46.0 | 198.0 | C |
Question 71
์ ์ฒด๋ฐ์ดํฐ์ ์์ถ๊ธฐํ์(์ต๊ณ ) - ์ด์๊ธฐํ์(์ต์ )์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = (df['์์ถ๊ธฐํ์(์ต๊ณ ) : mmHg'] - df['์ด์๊ธฐํ์(์ต์ ) : mmHg']).mean()
print(result)
52.19539414750672
Question 72
50~59์ธ์ ์ ์ฅํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df[(df.์ธก์ ๋์ด <60) & (df.์ธก์ ๋์ด>=50)].iloc[:,2].mean()
print(result)
164.07490107405295
Question 73
์ฐ๋ น๋ (20~29 : 20๋ โฆ) ๋ณ ์ธ์์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
df['์ฐ๋ น๋'] =df.์ธก์ ๋์ด //10 *10
result = df['์ฐ๋ น๋'].value_counts()
print(result)
20.0 5831
30.0 2660
40.0 1801
50.0 1769
60.0 1335
Name: ์ฐ๋ น๋, dtype: int64
Question 74
์ฐ๋ น๋ (20~29 : 20๋ โฆ) ๋ณ ๋ฑ๊ธ์ ์ซ์๋ฅผ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ํํํ๋ผ
Show code cell source
result = df.groupby(['์ฐ๋ น๋','๋ฑ๊ธ'],as_index=False).size()
print(result)
์ฐ๋ น๋ ๋ฑ๊ธ size
0 20.0 A 1585
1 20.0 B 1443
2 20.0 C 1455
3 20.0 D 1348
4 30.0 A 743
5 30.0 B 697
6 30.0 C 626
7 30.0 D 594
8 40.0 A 386
9 40.0 B 428
10 40.0 C 455
11 40.0 D 532
12 50.0 A 321
13 50.0 B 410
14 50.0 C 474
15 50.0 D 564
16 60.0 A 314
17 60.0 B 371
18 60.0 C 339
19 60.0 D 311
Question 75
๋จ์ฑ ์ค A๋ฑ๊ธ๊ณผ D๋ฑ๊ธ์ ์ฒด์ง๋ฐฉ๋ฅ ํ๊ท ์ ์ฐจ์ด(ํฐ ๊ฐ์์ ์์ ๊ฐ์ ์ฐจ)๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = abs(df[(df.์ธก์ ํ์์ฑ๋ณ =='M') &(df.๋ฑ๊ธ =='A')].iloc[:,4].mean() -df[(df.์ธก์ ํ์์ฑ๋ณ =='M') &(df.๋ฑ๊ธ =='D')].iloc[:,4].mean())
print(result)
7.932086486137457
Question 76
์ฌ์ฑ ์ค A๋ฑ๊ธ๊ณผ D๋ฑ๊ธ์ ์ฒด์ค์ ํ๊ท ์ ์ฐจ์ด(ํฐ ๊ฐ์์ ์์ ๊ฐ์ ์ฐจ)๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = abs(df[(df.์ธก์ ํ์์ฑ๋ณ =='F') &(df.๋ฑ๊ธ =='A')].iloc[:,3].mean() -df[(df.์ธก์ ํ์์ฑ๋ณ =='F') &(df.๋ฑ๊ธ =='D')].iloc[:,3].mean())
print(result)
5.176211590296511
Question 77
bmi๋ ์์ ์ ๋ชธ๋ฌด๊ฒ(kg)๋ฅผ ํค์ ์ ๊ณฑ(m)์ผ๋ก ๋๋๊ฐ์ด๋ค. ๋ฐ์ดํฐ์ bmi ๋ฅผ ๊ตฌํ ์๋ก์ด ์ปฌ๋ผ์ ๋ง๋ค๊ณ ๋จ์ฑ์ bmi ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
df['bmi'] = df['์ฒด์ค : kg'] / (df['์ ์ฅ : cm']/100) **2
result = df[df.์ธก์ ํ์์ฑ๋ณ =='M'].bmi.mean()
print(result)
24.461344098193027
Question 78
bmi๋ณด๋ค ์ฒด์ง๋ฐฉ์จ์ด ๋์ ์ฌ๋๋ค์ ์ฒด์คํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result =df[df.bmi <df['์ฒด์ง๋ฐฉ์จ : %']]['์ฒด์ค : kg'].mean()
print(result)
61.740880639254314
Question 79
๋จ์ฑ๊ณผ ์ฌ์ฑ์ ์ ๋ ฅ ํ๊ท ์ ์ฐจ์ด๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
target= df.groupby('์ธก์ ํ์์ฑ๋ณ')['์
๋ ฅD : kg'].mean()
result = target.M - target.F
print(result)
17.55954185047465
Question 80
๋จ์ฑ๊ณผ ์ฌ์ฑ์ ๊ต์ฐจ์๋ชธ์ผ์ผํค๊ธฐ ํ์์ ํ๊ท ์ ์ฐจ์ด๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
target= df.groupby('์ธก์ ํ์์ฑ๋ณ')['๊ต์ฐจ์๋ชธ์ผ์ผํค๊ธฐ : ํ'].mean()
result = target.M - target.F
print(result)
14.243156833157627
๊ธฐ์จ ๊ฐ์๋ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ : https://data.kma.go.kr/cmmn/static/staticPage.do?page=intro
๋ฐ์ดํฐ ์ค๋ช
: 2020๋
๋ ์ดํ๋(์์ธ) , ์์๋(๋ถ์ฐ)์ ์๊ฐ๋จ์์ ๊ธฐ์จ๊ณผ ๊ฐ์๋
๋ฐ์ดํฐ url : https://raw.githubusercontent.com/Datamanim/datarepo/main/weather/weather2.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/weather/weather2.csv")
df.head()
time | ์ดํ๋๊ฐ์ | ์ดํ๋๊ธฐ์จ | ์์๋๊ฐ์ | ์์๋๊ธฐ์จ | |
---|---|---|---|---|---|
0 | 2020-01-01 00:00:00 | 0.1 | -3.9 | 0.0 | 2.5 |
1 | 2020-01-01 01:00:00 | 0.0 | -3.1 | 0.0 | 3.4 |
2 | 2020-01-01 02:00:00 | 0.0 | -1.8 | 0.0 | 3.8 |
3 | 2020-01-01 03:00:00 | 0.0 | -0.8 | 0.0 | 4.6 |
4 | 2020-01-01 04:00:00 | 0.0 | -0.1 | 0.0 | 5.1 |
Question 81
์ฌ๋ฆ์ฒ (6์,7์,8์) ์ดํ๋์ด ์์๋๋ณด๋ค ๋์ ๊ธฐ์จ์ ๊ฐ์ง ์๊ฐ๋๋ ๋ช๊ฐ์ธ๊ฐ?
Show code cell source
df.time = pd.to_datetime(df.time)
summer = df.loc[df.time.dt.month.isin([6,7,8])].reset_index(drop=True)
answer = len(summer.loc[summer['์ดํ๋๊ธฐ์จ'] > summer['์์๋๊ธฐ์จ']])
print(answer)
1415
Question 82
์ดํ๋๊ณผ ์์๋์ ์ต๋๊ฐ์๋์ ์๊ฐ๋๋ฅผ ๊ฐ๊ฐ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer1 = df.loc[df['์ดํ๋๊ฐ์']==df['์ดํ๋๊ฐ์'].max()].time.values
answer2 = df.loc[df['์์๋๊ฐ์']==df['์์๋๊ฐ์'].max()].time.values
print(answer1,answer2)
['2020-09-30T09:00:00.000000000'] ['2020-07-23T12:00:00.000000000']
์๋น์ค ์ดํ์์ธก ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ : https://www.kaggle.com/shubh0799/churn-modelling ์์ ๋ณํ
๋ฐ์ดํฐ ์ค๋ช
: ๊ณ ๊ฐ์ ์ ์์ ๋ณด ๋ฐ์ดํฐ๋ฅผ ํตํ ํ์ฌ ์๋น์ค ์ดํ ์์ธก (์ข
์๋ณ์ : Exited)
Data url : https://raw.githubusercontent.com/Datamanim/datarepo/main/churn/train.csv
import pandas as pd
#๋ฐ์ดํฐ ๋ก๋
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churn/train.csv")
df.head()
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6842 | 15793491 | Cherkasova | 714 | Germany | Male | 26 | 3 | 119545.48 | 2 | 1 | 0 | 65482.94 | 0 |
1 | 8963 | 15607874 | Keane | 687 | France | Male | 38 | 0 | 144450.58 | 1 | 0 | 1 | 137276.83 | 0 |
2 | 7047 | 15737627 | Rivero | 589 | Germany | Female | 20 | 2 | 121093.29 | 2 | 1 | 0 | 3529.72 | 0 |
3 | 7503 | 15697844 | Whitehouse | 721 | Spain | Female | 32 | 10 | 0.00 | 1 | 1 | 0 | 136119.96 | 1 |
4 | 3439 | 15722404 | Carpenter | 445 | France | Female | 30 | 3 | 0.00 | 2 | 1 | 1 | 127939.19 | 0 |
Question 83
๋จ์ฑ ์ดํ(Exited)์ด ๊ฐ์ฅ ๋ง์ ๊ตญ๊ฐ(Geography)๋ ์ด๋์ด๊ณ ์ดํ ์ธ์์ ๋ช๋ช ์ธ๊ฐ?
Show code cell source
answer = df.loc[df.Gender=='Male'].groupby(['Geography']).sum().Exited.sort_values(ascending=False).head(1)
print(answer)
Geography
Germany 287
Name: Exited, dtype: int64
Question 84
**์นด๋๋ฅผ ์์ (HasCrCard ==1)ํ๊ณ ์์ผ๋ฉด์ ํ์ฑ๋ฉค๋ฒ(IsActiveMember ==1) ์ธ ๊ณ ๊ฐ๋ค์ ํ๊ท ๋์ด๋ฅผ ์์ซ์ ์ดํ 4์๋ฆฌ๊น์ง ๊ตฌํ์ฌ๋ผ? **
Show code cell source
answer = df.loc[(df.HasCrCard==1) &(df.IsActiveMember==1)].Age.mean()
print(round(answer,4))
39.6102
Question 85
Balance ๊ฐ์ด ์ค๊ฐ๊ฐ ์ด์์ ๊ฐ์ง๋ ๊ณ ๊ฐ๋ค์ CreditScore์ ํ์คํธ์ฐจ๋ฅผ ์์ซ์ ์ดํ 3์๋ฆฌ๊น์ง ๊ตฌํ์ฌ๋ผ
Show code cell source
answer =df.loc[df.Balance >= df.Balance.median()].CreditScore.std()
print(round(answer,3))
97.295
์ฑ์ธ ๊ฑด๊ฐ๊ฒ์ง ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.data.go.kr/data/15007122/fileData.do (์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: 2018๋
๋ ์ฑ์ธ์ ๊ฑด๊ฐ๊ฒ ์ง๋ฐ์ดํฐ (ํก์ฐ์ํ 1- ํก์ฐ, 0-๋นํก์ฐ )
๋ฐ์ดํฐ url : https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/train.csv")
df.head()
์ฑ๋ณ์ฝ๋ | ์ฐ๋ น๋์ฝ๋(5์ธ๋จ์) | ์ ์ฅ(5Cm๋จ์) | ์ฒด์ค(5Kg๋จ์) | ํ๋ฆฌ๋๋ | ์๋ ฅ(์ข) | ์๋ ฅ(์ฐ) | ์ฒญ๋ ฅ(์ข) | ์ฒญ๋ ฅ(์ฐ) | ์์ถ๊ธฐํ์ | ... | ํ์์ | ์๋จ๋ฐฑ | ํ์ฒญํฌ๋ ์ํฐ๋ | (ํ์ฒญ์ง์คํฐ)AST | (ํ์ฒญ์ง์คํฐ)ALT | ๊ฐ๋ง์งํฐํผ | ํก์ฐ์ํ | ๊ตฌ๊ฐ๊ฒ์ง์๊ฒ์ฌ๋ถ | ์น์์ฐ์์ฆ์ ๋ฌด | ์น์ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | 55 | 145 | 55 | 73.0 | 0.7 | 0.5 | 1.0 | 1.0 | 129.0 | ... | 12.9 | 1.0 | 0.9 | 172.0 | 209.0 | 15.0 | 0 | Y | 0.0 | Y |
1 | M | 40 | 180 | 55 | 74.0 | 1.5 | 1.5 | 1.0 | 1.0 | 102.0 | ... | 15.6 | 1.0 | 0.8 | 30.0 | 19.0 | 23.0 | 1 | Y | 0.0 | Y |
2 | F | 55 | 150 | 50 | 72.0 | 1.0 | 0.2 | 1.0 | 1.0 | 116.0 | ... | 14.1 | 1.0 | 1.0 | 31.0 | 19.0 | 10.0 | 0 | Y | 0.0 | N |
3 | M | 40 | 170 | 85 | 89.0 | 1.0 | 0.8 | 1.0 | 1.0 | 124.0 | ... | 16.7 | 1.0 | 1.0 | 26.0 | 38.0 | 108.0 | 1 | Y | 1.0 | Y |
4 | F | 40 | 155 | 45 | 62.0 | 0.5 | 1.0 | 1.0 | 1.0 | 127.0 | ... | 12.9 | 1.0 | 0.7 | 20.0 | 11.0 | 13.0 | 0 | Y | 0.0 | N |
5 rows ร 26 columns
Question 86
์์ถ๊ธฐํ์๊ณผ ์ด์๊ธฐ ํ์๊ธฐ ์์น์ ์ฐจ์ด๋ฅผ ์๋ก์ด ์ปฌ๋ผ(โํ์์ฐจโ) ์ผ๋ก ์์ฑํ๊ณ , ์ฐ๋ น๋ ์ฝ๋๋ณ ๊ฐ ๊ทธ๋ฃน ์ค โํ์์ฐจโ ์ ๋ถ์ฐ์ด 5๋ฒ์งธ๋ก ํฐ ์ฐ๋ น๋ ์ฝ๋๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
df['ํ์์ฐจ'] =df['์์ถ๊ธฐํ์'] -df['์ด์๊ธฐํ์']
answer = df.groupby('์ฐ๋ น๋์ฝ๋(5์ธ๋จ์)')['ํ์์ฐจ'].var().sort_values()
print(answer.index[-5])
60
Question 87
๋น๋ง๋๋ฅผ ๋ํ๋ด๋ ์งํ์ธ WHtR๋ ํ๋ฆฌ๋๋ / ํค๋ก ํํํ๋ค. ์ผ๋ฐ์ ์ผ๋ก 0.58์ด์์ด๋ฉด ๋น๋ง์ผ๋ก ๋ถ๋ฅํ๋ค. ๋ฐ์ดํฐ์ค WHtR ์งํ์ ๋น๋ง์ธ ์ธ์์ ๋จ/์ฌ ๋น์จ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
df['๋น๋ง']=df['ํ๋ฆฌ๋๋ ']/df['์ ์ฅ(5Cm๋จ์)']
data = df.loc[df['๋น๋ง']>=0.58].์ฑ๋ณ์ฝ๋.value_counts()
answer = data['M']/data['F']
print(answer)
1.1693877551020408
์๋์ฐจ ๋ณดํ๊ฐ์ ์์ธก๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์๋์ฐจ ๋ณดํ ๊ฐ์
์์ธก
Data url : https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/train.csv")
df.head()
id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 188957 | Female | 25 | 1 | 17.0 | 1 | < 1 Year | No | 38785.0 | 152.0 | 59 | 0 |
1 | 275631 | Male | 20 | 1 | 22.0 | 0 | < 1 Year | No | 2630.0 | 160.0 | 113 | 0 |
2 | 329036 | Female | 40 | 1 | 41.0 | 0 | 1-2 Year | Yes | 33165.0 | 124.0 | 188 | 0 |
3 | 227288 | Female | 28 | 1 | 30.0 | 1 | < 1 Year | No | 29038.0 | 152.0 | 233 | 0 |
4 | 300441 | Female | 50 | 1 | 28.0 | 0 | 1-2 Year | Yes | 31325.0 | 124.0 | 258 | 0 |
Question 88
Vehicle_Age ๊ฐ์ด 2๋
์ด์์ธ ์ฌ๋๋ค๋ง ํํฐ๋ง ํ๊ณ ๊ทธ์ค์์
Annual_Premium ๊ฐ์ด ์ ์ฒด ๋ฐ์ดํฐ์ ์ค๊ฐ๊ฐ ์ด์์ธ ์ฌ๋๋ค์ ์ฐพ๊ณ , ๊ทธ๋ค์ Vintage๊ฐ์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer = df[(df['Vehicle_Age']=='> 2 Years') & (df['Annual_Premium'] >= df['Annual_Premium'].median())]['Vintage'].mean()
print(answer)
154.43647182359118
Question 89
vehicle_age์ ๋ฐ๋ฅธ ๊ฐ ์ฑ๋ณ(gender)๊ทธ๋ฃน์ Annual_Premium๊ฐ์ ํ๊ท ์ ๊ตฌํ์ฌ ์๋ ํ ์ด๋ธ๊ณผ ๋์ผํ๊ฒ ๊ตฌํํ๋ผ
Show code cell source
meandf = df.groupby(['Gender','Vehicle_Age'],as_index = False)['Annual_Premium'].mean()
meandf.pivot(index='Vehicle_Age',columns='Gender',values='Annual_Premium')
Gender | Female | Male |
---|---|---|
Vehicle_Age | ||
1-2 Year | 30762.245001 | 30413.088469 |
< 1 Year | 29972.286702 | 30310.982212 |
> 2 Years | 36108.366374 | 35303.870627 |
ํธ๋ํฐ ๊ฐ๊ฒฉ ์์ธก๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/iabhishekofficial/mobile-price-classification?select=train.csv(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ํธ๋ํฐ ๊ฐ๊ฒฉ์์ธก (price_range์ปฌ๋ผ 0(์ ๋ ด) ~3(๋งค์ฐ๋น์) ๋ฒ์ )
Data url https://raw.githubusercontent.com/Datamanim/datarepo/main/mobile/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/mobile/train.csv")
df.head()
battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |
3 | 615 | 1 | 2.5 | 0 | 0 | 0 | 10 | 0.8 | 131 | 6 | ... | 1216 | 1786 | 2769 | 16 | 8 | 11 | 1 | 0 | 0 | 2 |
4 | 1821 | 1 | 1.2 | 0 | 13 | 1 | 44 | 0.6 | 141 | 2 | ... | 1208 | 1212 | 1411 | 8 | 2 | 15 | 1 | 1 | 0 | 1 |
5 rows ร 21 columns
Question 90
price_range ์ ๊ฐ value๋ฅผ ๊ทธ๋ฃนํํ์ฌ ๊ฐ ๊ทธ๋ฃน์ n_cores ์ ๋น๋๊ฐ ๊ฐ์ฅ๋์ value์ ๊ทธ ๋น๋์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer =df[['price_range','n_cores']].groupby(['price_range','n_cores']).size().sort_values(0).groupby(level=0).tail(1)
answer
/var/folders/0n/sq2gtx6s5qldt4nt05wqc3380000gn/T/ipykernel_3974/413588440.py:1: FutureWarning: In a future version of pandas all arguments of Series.sort_values will be keyword-only
answer =df[['price_range','n_cores']].groupby(['price_range','n_cores']).size().sort_values(0).groupby(level=0).tail(1)
price_range n_cores
0 2 69
3 5 70
2 4 73
1 1 76
dtype: int64
Question 91
price_range ๊ฐ์ด 3์ธ ๊ทธ๋ฃน์์ ์๊ด๊ด๊ณ๊ฐ 2๋ฒ์งธ๋ก ๋์ ๋ ์ปฌ๋ผ๊ณผ ๊ทธ ์๊ด๊ณ์๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
cordf = df.loc[df['price_range']==3].corr().unstack().sort_values(ascending=False)
answer = cordf.loc[cordf!=1].reset_index().iloc[1]
print(answer)
level_0 pc
level_1 fc
0 0.635166
Name: 1, dtype: object
๋นํํ์น ๊ฒฝํ ๋ง์กฑ๋ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv (์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๋นํํ์น ๊ฒฝํ ๋ง์กฑ๋ (satisfaction ์ปฌ๋ผ : โneutral or dissatisfiedโ or satisfied ) (83123, 24) shape
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/train.csv")
df.head()
id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 55882 | Male | Loyal Customer | 42 | Business travel | Business | 473 | 1 | 1 | 3 | ... | 5 | 5 | 5 | 5 | 3 | 5 | 3 | 3 | 0.0 | satisfied |
1 | 31906 | Female | Loyal Customer | 31 | Business travel | Business | 3271 | 4 | 4 | 4 | ... | 3 | 3 | 5 | 5 | 3 | 4 | 3 | 1 | 5.0 | satisfied |
2 | 85481 | Male | Loyal Customer | 51 | Personal Travel | Eco | 214 | 5 | 0 | 5 | ... | 3 | 4 | 4 | 4 | 4 | 4 | 3 | 2 | 0.0 | satisfied |
3 | 102134 | Male | Loyal Customer | 44 | Business travel | Business | 3097 | 1 | 1 | 3 | ... | 4 | 4 | 4 | 4 | 3 | 4 | 5 | 0 | 0.0 | satisfied |
4 | 31580 | Male | Loyal Customer | 33 | Business travel | Business | 3286 | 1 | 1 | 1 | ... | 4 | 2 | 5 | 2 | 5 | 1 | 4 | 0 | 0.0 | satisfied |
5 rows ร 24 columns
Question 92
Arrival Delay in Minutes ์ปฌ๋ผ์ด ๊ฒฐ์ธก์น์ธ ๋ฐ์ดํฐ๋ค ์ค โneutral or dissatisfiedโ ๋ณด๋ค โsatisfiedโ์ ์๊ฐ ๋ ๋์ Class๋ ์ด๋ ์ธ๊ฐ?
Show code cell source
answer =df.loc[df['Arrival Delay in Minutes'].isnull()].groupby(['Class','satisfaction'],as_index=False).size().pivot(index='Class',columns='satisfaction')
result =answer[answer['size']['neutral or dissatisfied'] < answer['size']['satisfied']]
result
size | ||
---|---|---|
satisfaction | neutral or dissatisfied | satisfied |
Class | ||
Business | 36 | 76 |
์์ง ์์ฉ์ฑ ์ฌ๋ถ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/adityakadiwal/water-potability
๋ฐ์ดํฐ ์ค๋ช
: ์์ง ์์ฉ์ฑ ์ฌ๋ถ (Potablillity ์ปฌ๋ผ : 0 ,1 )
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/train.csv")
df.head()
ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7.918150 | 214.186611 | 23823.492888 | 7.290878 | 341.173322 | 411.424483 | 19.585002 | 25.057375 | 4.028958 | 0 |
1 | 5.422446 | 205.266080 | 18542.957451 | 5.491963 | 306.702227 | 382.080129 | 10.504023 | 67.493450 | 2.911751 | 1 |
2 | 7.341547 | 187.672402 | 21273.457066 | 7.784003 | NaN | 332.084293 | 16.842334 | 55.019151 | 4.025644 | 0 |
3 | 9.056245 | 197.666301 | 17403.532167 | 7.688917 | 337.460176 | 414.766631 | 15.349869 | 63.696746 | 3.319354 | 0 |
4 | 5.039374 | 142.860598 | 40829.353167 | 7.271543 | NaN | 386.803057 | 16.823773 | 52.297113 | 4.957420 | 0 |
Question 93
ph๊ฐ์ ์๋นํ ๋ง์ ๊ฒฐ์ธก์น๋ฅผ ํฌํจํ๋ค. ๊ฒฐ์ธก์น๋ฅผ ์ ์ธํ ๋๋จธ์ง ๋ฐ์ดํฐ๋ค ์ค ์ฌ๋ถ์๊ฐ ๊ธฐ์ค ํ์ 25%์ ๊ฐ๋ค์ ํ๊ท ๊ฐ์?
Show code cell source
target = df['ph'].dropna()
answer =target.loc[target <= target.quantile(0.25)].mean()
print(answer)
5.057093462441732
์๋ฃ ๋น์ฉ ์์ธก ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/mirichoi0218/insurance/code(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์๋ฃ๋น์ฉ ์์ธก๋ฌธ์
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/train.csv
import pandas as pd
train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/MedicalCost/train.csv")
train.head()
age | sex | bmi | children | smoker | region | charges | |
---|---|---|---|---|---|---|---|
0 | 64 | female | 39.330 | 0 | no | northeast | 14901.51670 |
1 | 47 | female | 27.830 | 0 | yes | southeast | 23065.42070 |
2 | 52 | female | 33.300 | 2 | no | southwest | 10806.83900 |
3 | 33 | female | 22.135 | 1 | no | northeast | 5354.07465 |
4 | 30 | male | 31.400 | 1 | no | southwest | 3659.34600 |
Question 94
ํก์ฐ์์ ๋นํก์ฐ์ ๊ฐ๊ฐ charges์ ์์ 10% ๊ทธ๋ฃน์ ํ๊ท ์ ์ฐจ์ด๋?
Show code cell source
high = train.loc[train.smoker =='yes'].charges.quantile(0.9)
high2 = train.loc[train.smoker =='no'].charges.quantile(0.9)
mean_yes = train.loc[(train.smoker =='yes') &(train.charges >=high)].charges.mean()
mean_no = train.loc[(train.smoker =='no') &(train.charges >=high2)].charges.mean()
answer = mean_yes - mean_no
print(answer)
29297.954548156158
ํน์นด์ดํฐ ์ฃผ๊ฑฐ์ง ๊ฐ๊ฒฉ์์ธก๋ฌธ์ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/harlfoxem/housesalesprediction (์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ํน์นด์ดํฐ ์ฃผ๊ฑฐ์ง ๊ฐ๊ฒฉ ์์ธก๋ฌธ์
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice//train.csv")
df.head()
id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8961990160 | 20150413T000000 | 567500.0 | 3 | 2.5 | 2080 | 4556 | 2.0 | 0 | 0 | ... | 8 | 2080 | 0 | 1999 | 0 | 98074 | 47.6036 | -122.014 | 1530 | 5606 |
1 | 9455200205 | 20140604T000000 | 525000.0 | 3 | 2.0 | 1540 | 7800 | 1.0 | 0 | 0 | ... | 8 | 1540 | 0 | 2004 | 0 | 98125 | 47.7041 | -122.288 | 1510 | 7800 |
2 | 7853220670 | 20140918T000000 | 540000.0 | 3 | 2.5 | 2860 | 8935 | 2.0 | 0 | 0 | ... | 8 | 2860 | 0 | 2004 | 0 | 98065 | 47.5336 | -121.855 | 2650 | 6167 |
3 | 3298201170 | 20141110T000000 | 350000.0 | 3 | 1.0 | 940 | 7811 | 1.0 | 0 | 0 | ... | 6 | 940 | 0 | 1959 | 0 | 98008 | 47.6195 | -122.118 | 1180 | 7490 |
4 | 7972604355 | 20140521T000000 | 218000.0 | 3 | 1.0 | 1020 | 7874 | 1.0 | 0 | 0 | ... | 7 | 1020 | 0 | 1956 | 0 | 98106 | 47.5175 | -122.346 | 1290 | 7320 |
5 rows ร 21 columns
Question 95
bedrooms ์ ๋น๋๊ฐ ๊ฐ์ฅ ๋์ ๊ฐ์ ๊ฐ์ง๋ ๋ฐ์ดํฐ๋ค์ price์ ์์ 10%์ ํ์ 10%๊ฐ์ ์ฐจ์ด๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer = df.loc[df.bedrooms ==df.bedrooms.value_counts().index[0]].price.quantile(0.9) \
-\
df.loc[df.bedrooms ==df.bedrooms.value_counts().index[0]].price.quantile(0.1)
print(answer)
505500.0
๋ํ์ ์ ํ๊ฐ๋ฅ์ฑ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/mohansacharya/graduate-admissions(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๋ํ์ ์
ํ ๊ฐ๋ฅ์ฑ ์์ธก
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/admission/train.csv")
df.head()
Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit | |
---|---|---|---|---|---|---|---|---|---|
0 | 246 | 328 | 110 | 4 | 4.0 | 2.5 | 9.02 | 1 | 0.81 |
1 | 100 | 323 | 113 | 3 | 4.0 | 4.0 | 8.88 | 1 | 0.79 |
2 | 79 | 296 | 95 | 2 | 3.0 | 2.0 | 7.54 | 1 | 0.44 |
3 | 53 | 334 | 116 | 4 | 4.0 | 3.0 | 8.00 | 1 | 0.78 |
4 | 444 | 321 | 114 | 5 | 4.5 | 4.5 | 9.16 | 1 | 0.87 |
Question 96
Serial No. ์ปฌ๋ผ์ ์ ์ธํ๊ณ โChance of Admitโ์ ์ข ์๋ณ์, ๋๋จธ์ง ๋ณ์๋ฅผ ๋ ๋ฆฝ๋ณ์๋ผ ํ ๋, ๋๋คํฌ๋ ์คํธ๋ฅผ ํตํด ํ๊ท ์์ธก์ ํ ๋ ๋ณ์์ค์๋ ๊ฐ์ ์ถ๋ ฅํ๋ผ (์๋๊ฐ์ ๋ฐ๋ผ ์์๋ ๋ฌ๋ผ์ง์ ์์)
Show code cell source
from sklearn.ensemble import RandomForestRegressor
df_t = df.drop([df.columns[0]],axis=1)
x = df_t.drop([df.columns[-1]],axis=1)
y = df_t[df.columns[-1]]
ml = RandomForestRegressor()
ml.fit(x,y)
result=pd.DataFrame({'importance':ml.feature_importances_},x.columns).sort_values('importance',ascending=False)
display(result)
importance | |
---|---|
CGPA | 0.820192 |
GRE Score | 0.080724 |
TOEFL Score | 0.032414 |
SOP | 0.023649 |
LOR | 0.022733 |
University Rating | 0.013037 |
Research | 0.007252 |
๋ ๋ ์์ธ ํ๋ฆฌํฐ ์์ธก ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๋ ๋ ์์ธ ํ๋ฆฌํฐ ์์ธก๋ฌธ์
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/redwine/train.csv")
df.head()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.0 | 0.38 | 0.01 | 1.6 | 0.048 | 26.0 | 60.0 | 0.99084 | 3.70 | 0.75 | 14.0 | 6 |
1 | 5.0 | 0.42 | 0.24 | 2.0 | 0.060 | 19.0 | 50.0 | 0.99170 | 3.72 | 0.74 | 14.0 | 8 |
2 | 7.1 | 0.36 | 0.30 | 1.6 | 0.080 | 35.0 | 70.0 | 0.99693 | 3.44 | 0.50 | 9.4 | 5 |
3 | 7.6 | 0.29 | 0.49 | 2.7 | 0.092 | 25.0 | 60.0 | 0.99710 | 3.31 | 0.61 | 10.1 | 6 |
4 | 7.7 | 0.51 | 0.28 | 2.1 | 0.087 | 23.0 | 54.0 | 0.99800 | 3.42 | 0.74 | 9.2 | 5 |
Question 97
quality ๊ฐ์ด 3์ธ ๊ทธ๋ฃน๊ณผ 8์ธ ๋ฐ์ดํฐ๊ทธ๋ฃน์ ๊ฐ ์ปฌ๋ผ๋ณ ๋ ๋ฆฝ๋ณ์์ ํ์คํธ์ฐจ ๊ฐ์ ์ฐจ์ด๋ฅผ ๊ตฌํ ๋ ๊ทธ๊ฐ์ด ๊ฐ์ฅ ํฐ ์ปฌ๋ผ๋ช ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer = (df.loc[df.quality ==8].std() -df.loc[df.quality ==3].std()).sort_values().index[-1]
print(answer)
total sulfur dioxide
์ฝ๋ฌผ ๋ถ๋ฅ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/prathamtripathi/drug-classification(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ํฌ์ฝํ๋ ์ฝ์ ๋ถ๋ฅ
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/train.csv")
df.head()
Age | Sex | BP | Cholesterol | Na_to_K | Drug | |
---|---|---|---|---|---|---|
0 | 20 | F | NORMAL | NORMAL | 9.281 | 4 |
1 | 24 | M | HIGH | NORMAL | 9.475 | 1 |
2 | 34 | M | NORMAL | HIGH | 22.456 | 0 |
3 | 17 | M | NORMAL | NORMAL | 10.832 | 4 |
4 | 72 | M | LOW | HIGH | 16.310 | 0 |
Question 98
๋จ์ฑ๋ค์ ์ฐ๋ น๋๋ณ (10์ด์ฉ ๊ตฌ๋ถ 0~9์ธ 10~19์ธ โฆ) Na_to_K๊ฐ์ ํ๊ท ๊ฐ์ ๊ตฌํด์ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ํํํ์ฌ๋ผ
Show code cell source
pre = df.loc[df.Sex=='M']
pre2= pre.copy()
pre2['Age2'] = pre.Age//10 *10
answer =pre2.groupby('Age2').mean()['Na_to_K'].to_frame()
display(answer)
Na_to_K | |
---|---|
Age2 | |
10 | 13.627000 |
20 | 18.260769 |
30 | 15.244143 |
40 | 13.897273 |
50 | 14.811273 |
60 | 14.284308 |
70 | 11.363857 |
์ฌ๊ธฐํ์ฌ ๋ถ๋ฅ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/sid321axn/audit-data(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์ฌ๊ธฐํ์ฌ ๋ถ๋ฅ
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/train.csv")
df.head()
Sector_score | LOCATION_ID | PARA_A | Score_A | Risk_A | PARA_B | Score_B | Risk_B | TOTAL | numbers | ... | RiSk_E | History | Prob | Risk_F | Score | Inherent_Risk | CONTROL_RISK | Detection_Risk | Audit_Risk | Risk | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.72 | 14 | 4.53 | 0.6 | 2.718 | 87.93 | 0.6 | 52.758 | 92.46 | 5.0 | ... | 0.4 | 0 | 0.2 | 0.0 | 4.0 | 108.362 | 0.4 | 0.5 | 21.6724 | 1 |
1 | 3.41 | 16 | 12.68 | 0.6 | 7.608 | 41.00 | 0.6 | 24.600 | 53.68 | 5.5 | ... | 0.4 | 0 | 0.2 | 0.0 | 4.4 | 41.936 | 0.4 | 0.5 | 8.3872 | 1 |
2 | 2.72 | 11 | 3.11 | 0.6 | 1.866 | 113.97 | 0.6 | 68.382 | 117.08 | 5.5 | ... | 0.4 | 0 | 0.2 | 0.0 | 4.4 | 88.832 | 0.4 | 0.5 | 17.7664 | 1 |
3 | 3.41 | 8 | 1.12 | 0.4 | 0.448 | 0.00 | 0.2 | 0.000 | 1.12 | 5.0 | ... | 0.4 | 0 | 0.2 | 0.0 | 2.2 | 1.848 | 0.4 | 0.5 | 0.3696 | 0 |
4 | 55.57 | 9 | 1.06 | 0.4 | 0.424 | 0.00 | 0.2 | 0.000 | 1.06 | 5.0 | ... | 0.4 | 0 | 0.2 | 0.0 | 2.2 | 1.824 | 0.4 | 0.5 | 0.3648 | 0 |
5 rows ร 27 columns
Question 99
๋ฐ์ดํฐ์ Risk ๊ฐ์ ๋ฐ๋ฅธ score_a์ score_b์ ํ๊ท ๊ฐ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer =df.groupby('Risk').mean()[['Score_A','Score_B']]
display(answer)
Score_A | Score_B | |
---|---|---|
Risk | ||
0 | 0.262234 | 0.225532 |
1 | 0.490164 | 0.444262 |
์ผ์๋ฐ์ดํฐ ๋์์ ํ ๋ถ๋ฅ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/kyr7plus/emg-4(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ์ผ์๋ฐ์ดํฐ๋ก ๋์ ์ ํ ๋ถ๋ฅ
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/train.csv")
df.head()
motion_0 | motion_1 | motion_2 | motion_3 | motion_4 | motion_5 | motion_6 | motion_7 | motion_8 | motion_9 | ... | motion_55 | motion_56 | motion_57 | motion_58 | motion_59 | motion_60 | motion_61 | motion_62 | motion_63 | pose | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -6.0 | 8.0 | 5.0 | 6.0 | -12.0 | -27.0 | 5.0 | 24.0 | -8.0 | -5.0 | ... | -1.0 | -3.0 | -6.0 | -9.0 | -7.0 | 36.0 | 68.0 | 3.0 | -3.0 | 1 |
1 | -32.0 | -3.0 | -4.0 | -5.0 | -1.0 | 16.0 | 0.0 | -5.0 | 42.0 | 2.0 | ... | -17.0 | -12.0 | -1.0 | -1.0 | 0.0 | -26.0 | -11.0 | -4.0 | -9.0 | 1 |
2 | 35.0 | -2.0 | -1.0 | 1.0 | -25.0 | -16.0 | -3.0 | -20.0 | 10.0 | 1.0 | ... | 6.0 | -65.0 | 6.0 | -10.0 | -31.0 | -19.0 | -3.0 | -47.0 | -17.0 | 0 |
3 | -15.0 | -6.0 | -3.0 | -5.0 | 27.0 | 37.0 | -2.0 | -2.0 | 19.0 | 5.0 | ... | 2.0 | 11.0 | 0.0 | -2.0 | -4.0 | -7.0 | 15.0 | 3.0 | 14.0 | 1 |
4 | 60.0 | 3.0 | 0.0 | -7.0 | -6.0 | -10.0 | -6.0 | 7.0 | -11.0 | -4.0 | ... | -12.0 | 17.0 | -3.0 | 1.0 | 2.0 | -2.0 | 20.0 | 18.0 | 10.0 | 0 |
5 rows ร 65 columns
Question 100
pose๊ฐ์ ๋ฐ๋ฅธ ๊ฐ motion์ปฌ๋ผ์ ์ค๊ฐ๊ฐ์ ๊ฐ์ฅ ํฐ ์ฐจ์ด๋ฅผ ๋ณด์ด๋ motion์ปฌ๋ผ์ ์ด๋์ด๋ฉฐ ๊ทธ๊ฐ์?
Show code cell source
t= df.groupby('pose').median().T
dfs = abs(t[0] - t[1]).sort_values().reset_index()
dfs[dfs[0] ==dfs[0].max()]['index'].values
array(['motion_54', 'motion_62'], dtype=object)
ํ๋ ์ฐจ๋ ๊ฐ๊ฒฉ ๋ถ๋ฅ๋ฌธ์ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/mysarahmadbhat/hyundai-used-car-listing(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ํ๋ ์ฐจ๋๊ฐ๊ฒฉ ๋ถ๋ฅ๋ฌธ์
Data url :hhttps://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/hyundai/train.csv")
df.head()
model | year | price | transmission | mileage | fuelType | tax(ยฃ) | mpg | engineSize | |
---|---|---|---|---|---|---|---|---|---|
0 | Tucson | 2016 | 12795 | Manual | 36263 | Diesel | 30 | 61.7 | 1.7 |
1 | I10 | 2012 | 3995 | Manual | 22089 | Petrol | 20 | 61.4 | 1.2 |
2 | I30 | 2019 | 24000 | Manual | 555 | Petrol | 145 | 34.0 | 2.0 |
3 | Tucson | 2017 | 12995 | Manual | 32147 | Diesel | 30 | 61.7 | 1.7 |
4 | Tucson | 2018 | 14991 | Semi-Auto | 32217 | Diesel | 150 | 57.6 | 1.7 |
Question 101
์ ๋ณด(row์)๊ฐ ๊ฐ์ฅ ๋ง์ ์์ 3์ฐจ์ข ์ price๊ฐ์ ๊ฐ ํ๊ท ๊ฐ์?
Show code cell source
answer =df.loc[df.model.isin(df.model.value_counts().index[:3])].groupby('model').mean()['price'].to_frame()
display(answer)
price | |
---|---|
model | |
I10 | 7646.137891 |
I30 | 11651.821759 |
Tucson | 15805.441373 |
๋น๋จ์ฌ๋ถํ๋จ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/pritsheta/diabetes-dataset(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๋น๋จ์ฌ๋ถ ํ๋จํ๊ธฐ
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/train.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/train.csv")
df.head()
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | 102 | 74 | 0 | 0 | 29.5 | 0.121 | 32 | 0 |
1 | 2 | 144 | 58 | 33 | 135 | 31.6 | 0.422 | 25 | 1 |
2 | 5 | 136 | 82 | 0 | 0 | 0.0 | 0.640 | 69 | 0 |
3 | 13 | 145 | 82 | 19 | 110 | 22.2 | 0.245 | 57 | 0 |
4 | 1 | 117 | 60 | 23 | 106 | 33.8 | 0.466 | 27 | 0 |
Question 102
Outcome ๊ฐ์ ๋ฐ๋ฅธ ๊ฐ ๊ทธ๋ฃน์ ๊ฐ ์ปฌ๋ผ์ ํ๊ท ์ฐจ์ด๋ฅผ ๊ตฌํ์ฌ๋ผ
Show code cell source
answer = df.groupby('Outcome').mean().diff().iloc[1,:]
print(answer)
Pregnancies 1.574159
Glucose 29.943995
BloodPressure 3.202079
SkinThickness 2.952033
Insulin 33.398645
BMI 4.541437
DiabetesPedigreeFunction 0.143646
Age 5.866939
Name: 1, dtype: float64
๋ทํ๋ฆญ์ค ์ฃผ์ ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/pritsheta/netflix-stock-data-from-2002-to-2021(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
๋ฐ์ดํฐ ์ค๋ช
: ๋ทํ๋ฆญ์ค ์ฃผ์๋ฐ์ดํฐ
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/nflx/NFLX.csv
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/nflx/NFLX.csv")
df.head()
Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|
0 | 2002-05-23 | 1.156429 | 1.242857 | 1.145714 | 1.196429 | 1.196429 | 104790000 |
1 | 2002-05-24 | 1.214286 | 1.225000 | 1.197143 | 1.210000 | 1.210000 | 11104800 |
2 | 2002-05-28 | 1.213571 | 1.232143 | 1.157143 | 1.157143 | 1.157143 | 6609400 |
3 | 2002-05-29 | 1.164286 | 1.164286 | 1.085714 | 1.103571 | 1.103571 | 6757800 |
4 | 2002-05-30 | 1.107857 | 1.107857 | 1.071429 | 1.071429 | 1.071429 | 10154200 |
Question 103
๋งค๋ 5์๋ฌ์ open๊ฐ๊ฒฉ์ ํ๊ท ๊ฐ์ ๋ฐ์ดํฐ ํ๋ ์์ผ๋ก ํํํ๋ผ
Show code cell source
df['Date'] =pd.to_datetime(df['Date'])
target = df.groupby(df['Date'].dt.strftime('%Y-%m')).mean()
answer = target.loc[target.index.str.contains('-05')].Open
print(answer)
Date
2002-05 1.155833
2003-05 1.641497
2004-05 4.261143
2005-05 1.951905
2006-05 4.163571
2007-05 3.159351
2008-05 4.435034
2009-05 5.790571
2010-05 14.417071
2011-05 34.650272
2012-05 10.523247
2013-05 31.936429
2014-05 51.121292
2015-05 85.057429
2016-05 92.705715
2017-05 158.255455
2018-05 329.779541
2019-05 359.664548
2020-05 433.880499
2021-05 496.923996
Name: Open, dtype: float64
220510์ถ๊ฐ#
nba ์ ์ ๋ฅ๋ ฅ์น ๋ฐ์ดํฐ#
Attention
๋ฐ์ดํฐ ์ถ์ฒ :https://www.kaggle.com/datasets/vivovinco/nba-player-stats(์ฐธ๊ณ , ๋ฐ์ดํฐ ์์ )
Data url :https://raw.githubusercontent.com/Datamanim/datarepo/main/nba/nba.csv
pd.set_option('display.max_columns',50)
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/nba/nba.csv",encoding='latin',sep=';')
df.head()
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | eFG% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Precious Achiuwa | C | 22 | TOR | 73 | 28 | 23.6 | 3.6 | 8.3 | 0.439 | 0.8 | 2.1 | 0.359 | 2.9 | 6.1 | 0.468 | 0.486 | 1.1 | 1.8 | 0.595 | 2.0 | 4.5 | 6.5 | 1.1 | 0.5 | 0.6 | 1.2 | 2.1 | 9.1 |
1 | 2 | Steven Adams | C | 28 | MEM | 76 | 75 | 26.3 | 2.8 | 5.1 | 0.547 | 0.0 | 0.0 | 0.000 | 2.8 | 5.0 | 0.548 | 0.547 | 1.4 | 2.6 | 0.543 | 4.6 | 5.4 | 10.0 | 3.4 | 0.9 | 0.8 | 1.5 | 2.0 | 6.9 |
2 | 3 | Bam Adebayo | C | 24 | MIA | 56 | 56 | 32.6 | 7.3 | 13.0 | 0.557 | 0.0 | 0.1 | 0.000 | 7.3 | 12.9 | 0.562 | 0.557 | 4.6 | 6.1 | 0.753 | 2.4 | 7.6 | 10.1 | 3.4 | 1.4 | 0.8 | 2.6 | 3.1 | 19.1 |
3 | 4 | Santi Aldama | PF | 21 | MEM | 32 | 0 | 11.3 | 1.7 | 4.1 | 0.402 | 0.2 | 1.5 | 0.125 | 1.5 | 2.6 | 0.560 | 0.424 | 0.6 | 1.0 | 0.625 | 1.0 | 1.7 | 2.7 | 0.7 | 0.2 | 0.3 | 0.5 | 1.1 | 4.1 |
4 | 5 | LaMarcus Aldridge | C | 36 | BRK | 47 | 12 | 22.3 | 5.4 | 9.7 | 0.550 | 0.3 | 1.0 | 0.304 | 5.1 | 8.8 | 0.578 | 0.566 | 1.9 | 2.2 | 0.873 | 1.6 | 3.9 | 5.5 | 0.9 | 0.3 | 1.0 | 0.9 | 1.7 | 12.9 |
Question 104
Tm ์ปฌ๋ผ์ ๊ฐ ํ์ ์ด๋ฆ์ ์๋ฏธํ๋ค. TORํ์ ํ๊ท ๋์ด๋ฅผ ์์ 4์งธ ์๋ฆฌ๊น์ง ๊ตฌํ์ฌ๋ผ
Show code cell source
result = round(df[df.Tm =='TOR'].Age.mean(),4)
print(result)
24.8696
Question 105
Pos ์ปฌ๋ผ์ ํฌ์ง์ ์ ์๋ฏธํ๋ค. ์ ์ฒด ์ ์ ์ค ์ต์๋์ด๋์ ์ ์๋ค์ ํํฐํ๊ณ ๊ทธ๋ค ์ค ๊ฐ์ฅ ๋ง์ ํฌ์ง์ ์ ๋ฌด์์ธ์ง ํ์ธํ๋ผ
Show code cell source
result = df[df.Age==df.Age.min()].Pos.value_counts().index[0]
print(result)
SG
Question 106
์ ์๋ค์ ์ด๋ฆ์ first_name+ ๊ณต๋ฐฑ + last_name์ผ๋ก ์ด๋ฃจ์ด์ ธ ์๋ค. ๊ฐ์ฅ ๋ง์ first_name์ ๋ฌด์์ด๋ฉฐ ๋ช ํ ๋ฐ์ํ๋์ง ํ์ธํ๋ผ
Show code cell source
result= df['Player'].str.split().str[0].str.lower().value_counts().head(1)
print(result)
justin 14
Name: Player, dtype: int64
Question 107
PTS์ปฌ๋ผ์ ๊ฒฝ๊ธฐ๋น ํ๊ท ๋์ ์ ์ด๋ค. ๊ฐํฌ์ง์ ๋ณ๋ก ๊ฒฝ๊ธฐ๋น ํ๊ท ๋์ ์์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df.groupby('Pos')['PTS'].mean().sort_values()
print(result)
Pos
SF 7.270253
SG-SF 7.360000
C 7.690769
PF 7.737500
SG 7.810553
PG 8.325161
PF-SF 8.500000
SF-SG 8.660000
SG-PG 9.525000
C-PF 12.850000
PG-SG 22.000000
Name: PTS, dtype: float64
Question 108
PTS์ปฌ๋ผ์ ๊ฒฝ๊ธฐ๋น ํ๊ท ๋์ ์ ์ด๋ค. ๊ฐํฌ์ง์ ๋ณ๋ก ๊ฒฝ๊ธฐ๋น ํ๊ท ๋์ ์์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Question 109
G์ปฌ๋ผ์ ์ฐธ์ฌํ ๊ฒฝ๊ธฐ์ ์ซ์์ด๋ค. ๊ฐ ํ๋ณ๋ก ๊ฐ์ฅ ๋์ ๊ฒฝ๊ธฐ์ฐธ์ฌ ์๋ฅผ ๊ฐ์ง ์ ์๋ค์ ๊ฒฝ๊ธฐ ์ฐธ์ฌ ์ซ์์ ํ๊ท ์ ๊ตฌํ์ฌ๋ผ
Show code cell source
result = df.sort_values(['Tm','G']).groupby('Tm').tail(1).G.mean()
print(result)
77.51612903225806
Question 110
Tm์ ๊ฐ์ด MIA์ด๋ฉฐ Pos๋ C๋๋ PF์ธ ์ ์์ MP๊ฐ์ ํ๊ท ์?
Show code cell source
result = df[(df.Tm =='MIA') & (df.Pos.isin(['C','PF']))].MP.mean()
print(result)
16.7875
Question 111
์ ์ฒด ๋ฐ์ดํฐ์ค G์ ํ๊ท ๊ฐ์ 1.5๋ฐฐ ์ด์์ธ ๋ฐ์ดํฐ๋ค๋ง ์ถ์ถํ์๋ 3P๊ฐ์ ํ๊ท ์?
Show code cell source
result = df[df.G >= df.G.mean()*1.5]['3P'].mean()
print(result)
1.3853658536585365
Question 112
Age์ ํ๊ท ์ด์์ธ ๊ทธ๋ฃน๊ณผ ํ๊ท ๋ฏธ๋ง์ธ ๊ทธ๋ฃน๊ฐ์ G๊ฐ์ ํ๊ท ์ ์ฐจ์ด๋?
Show code cell source
result = df[df.Age >= df.Age.mean()].G.mean() - df[df.Age < df.Age.mean()].G.mean()
print(result)
3.787674551781862
Question 113
ํ๊ท ๋์ด๊ฐ ๊ฐ์ฅ ์ ์ ํ์ ์ด๋์ธ๊ฐ
result = df.groupby('Tm')['Age'].mean().sort_values().index[0]
print(result)
MEM
Question 114
Pos๊ทธ๋ฃน๋ณ ํ๊ท MP๊ฐ์ ๊ตฌํ์ฌ๋ผ
result = df.groupby('Pos')['MP'].mean()
print(result)
Pos
C 16.990000
C-PF 25.350000
PF 17.937500
PF-SF 27.300000
PG 19.547742
PG-SG 37.200000
SF 17.514557
SF-SG 20.340000
SG 18.554271
SG-PG 22.950000
SG-SF 19.620000
Name: MP, dtype: float64