Mock Exam Round 3#
Live problem-solving sessions: Saturday, June 4, 9:00-11:00 AM and Sunday, June 5, 7:00-9:00 PM.
Task Type 1#
Attention
Data source : link (post-processed)
Data description : Spotify Top 100 songs, 2010-2019
dataurl : https://raw.githubusercontent.com/Datamanim/datarepo/main/spotify/spotify.csv
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/spotify/spotify.csv')
df.head()
| | title | artist | top genre | year released | added | bpm | nrgy | dnce | dB | live | val | dur | acous | spch | pop | top year | artist type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | STARSTRUKK (feat. Katy Perry) | 3OH!3 | dance pop | 2009.0 | 2022-02-17 | 140.0 | 81.0 | 61.0 | -6.0 | 23.0 | 23.0 | 203.0 | 0.0 | 6.0 | 70.0 | 2010.0 | Duo |
| 1 | My First Kiss (feat. Ke$ha) | 3OH!3 | dance pop | 2010.0 | 2022-02-17 | 138.0 | 89.0 | 68.0 | -4.0 | 36.0 | 83.0 | 192.0 | 1.0 | 8.0 | 68.0 | 2010.0 | Duo |
| 2 | I Need A Dollar | Aloe Blacc | pop soul | 2010.0 | 2022-02-17 | 95.0 | 48.0 | 84.0 | -7.0 | 9.0 | 96.0 | 243.0 | 20.0 | 3.0 | 72.0 | 2010.0 | Solo |
| 3 | Airplanes (feat. Hayley Williams of Paramore) | B.o.B | atl hip hop | 2010.0 | 2022-02-17 | 93.0 | 87.0 | 66.0 | -4.0 | 4.0 | 38.0 | 180.0 | 11.0 | 12.0 | 80.0 | 2010.0 | Solo |
| 4 | Nothin' on You (feat. Bruno Mars) | B.o.B | atl hip hop | 2010.0 | 2022-02-17 | 104.0 | 85.0 | 69.0 | -6.0 | 9.0 | 74.0 | 268.0 | 39.0 | 5.0 | 79.0 | 2010.0 | Solo |
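Before starting, it can help to look at the shape of the loaded frame and its missing values, since the Question 1 answer begins with dropna(). A minimal check (not part of the official answers), assuming df is the DataFrame loaded above:
print(df.shape)              # number of rows and columns
print(df.isnull().sum())     # missing values per column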
Question1
The data contains 100 songs per year, already sorted by popularity. Create a rank column giving each song's 1-100 ranking within its year, then compute the mean of the bpm column over each year's #1 song.
df = df.dropna()                                # drop rows with missing values first
df.loc[:, 'rank'] = list(range(1, 101)) * 10    # ranks 1-100, repeated for each of the 10 years
result = df[df['rank'] == 1].bpm.mean()         # mean bpm of each year's #1 song
print(result)
125.6
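An alternative sketch (not the official answer) that derives the rank from the row order within each top year group instead of hard-coding 1..100 ten times; it assumes the rows are already sorted by popularity within each year, as the question states:
tmp = df.dropna().copy()                                 # same cleaning step as above
tmp['rank'] = tmp.groupby('top year').cumcount() + 1     # 1-based position within each year
print(tmp[tmp['rank'] == 1].bpm.mean())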
Question2
Which artist placed the most songs in the 2015 Top 100?
result = df[df['top year'] ==2015].artist.value_counts().index[0]
print(result)
The Weeknd
Question3
Among the songs ranked 1-10 in each year, what is the second most common top genre?
result = df[df['rank'].isin(range(1,11))]['top genre'].value_counts().index[2]
print(result)
british soul
Question4
Featured artists are indicated in the title. Which artist appears most often as a featured artist?
result = df.title.str.split('feat.').str[1].dropna().str[:-1].str.strip().value_counts().index[0]
print(result)
Bruno Mars
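An alternative sketch (not the official answer) that pulls the featured name with a regular expression instead of chained splits and slicing; it assumes the credit is written as "feat. <name>" inside parentheses:
feat = df.title.str.extract(r'feat\.\s*([^)]+)')[0].dropna().str.strip()   # text after 'feat.' up to ')'
print(feat.value_counts().head(3))                                         # most frequently featured artists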
Question5
Counting, by top year, the songs whose release year (year released) differs from the year they entered the Top 100 (top year), which top year has the most such songs?
year = int(df[df['year released'] != df['top year']]['top year'].value_counts().index[0])
print(year)
2016
Question6
How many distinct artists contain the letter "q" in the artist column, ignoring case?
result = df[df.artist.str.lower().str.contains('q')].artist.nunique()
print(result)
6
Question7
Across all years combined, what is the difference between the mean dur of songs ranked 1-50 and songs ranked 51-100?
result = df[df['rank'].isin(range(1,51))].dur.mean() - df[df['rank'].isin(range(51,101))].dur.mean()
print(result)
0.896000000000015
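The same comparison can also be written as a single groupby on a "top half" flag; a sketch (not the official answer) that reuses the rank column created in Question 1:
means = df.groupby(df['rank'] <= 50).dur.mean()   # True = ranks 1-50, False = ranks 51-100
print(means.loc[True] - means.loc[False])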
Question8
When the titles are split into words on whitespace, which word appears most often? (case-insensitive)
result = df.title.str.split(r'\(feat').str[0].str.split().explode().str.lower().value_counts().index[0]
print(result)
the
df
| | title | artist | top genre | year released | added | bpm | nrgy | dnce | dB | live | val | dur | acous | spch | pop | top year | artist type | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | STARSTRUKK (feat. Katy Perry) | 3OH!3 | dance pop | 2009.0 | 2022-02-17 | 140.0 | 81.0 | 61.0 | -6.0 | 23.0 | 23.0 | 203.0 | 0.0 | 6.0 | 70.0 | 2010.0 | Duo | 1 |
| 1 | My First Kiss (feat. Ke$ha) | 3OH!3 | dance pop | 2010.0 | 2022-02-17 | 138.0 | 89.0 | 68.0 | -4.0 | 36.0 | 83.0 | 192.0 | 1.0 | 8.0 | 68.0 | 2010.0 | Duo | 2 |
| 2 | I Need A Dollar | Aloe Blacc | pop soul | 2010.0 | 2022-02-17 | 95.0 | 48.0 | 84.0 | -7.0 | 9.0 | 96.0 | 243.0 | 20.0 | 3.0 | 72.0 | 2010.0 | Solo | 3 |
| 3 | Airplanes (feat. Hayley Williams of Paramore) | B.o.B | atl hip hop | 2010.0 | 2022-02-17 | 93.0 | 87.0 | 66.0 | -4.0 | 4.0 | 38.0 | 180.0 | 11.0 | 12.0 | 80.0 | 2010.0 | Solo | 4 |
| 4 | Nothin' on You (feat. Bruno Mars) | B.o.B | atl hip hop | 2010.0 | 2022-02-17 | 104.0 | 85.0 | 69.0 | -6.0 | 9.0 | 74.0 | 268.0 | 39.0 | 5.0 | 79.0 | 2010.0 | Solo | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | SICKO MODE | Travis Scott | hip hop | 2018.0 | 2020-06-22 | 155.0 | 73.0 | 83.0 | -4.0 | 12.0 | 45.0 | 313.0 | 1.0 | 22.0 | 86.0 | 2019.0 | Solo | 96 |
| 996 | EARFQUAKE | Tyler, The Creator | hip hop | 2019.0 | 2020-06-22 | 80.0 | 50.0 | 55.0 | -9.0 | 80.0 | 41.0 | 190.0 | 23.0 | 7.0 | 85.0 | 2019.0 | Solo | 97 |
| 997 | Boasty (feat. Idris Elba) | Wiley | grime | 2019.0 | 2020-06-22 | 103.0 | 77.0 | 89.0 | -5.0 | 9.0 | 46.0 | 177.0 | 1.0 | 7.0 | 68.0 | 2019.0 | Solo | 98 |
| 998 | Strike a Pose (feat. Aitch) | Young T & Bugsey | afroswing | 2019.0 | 2020-08-20 | 138.0 | 58.0 | 53.0 | -6.0 | 10.0 | 59.0 | 214.0 | 1.0 | 10.0 | 67.0 | 2019.0 | Duo | 99 |
| 999 | The London (feat. J. Cole & Travis Scott) | Young Thug | atl hip hop | 2019.0 | 2020-06-22 | 98.0 | 59.0 | 80.0 | -7.0 | 13.0 | 18.0 | 200.0 | 2.0 | 15.0 | 75.0 | 2019.0 | Solo | 100 |
1000 rows × 18 columns
Question9
Compute the mean nrgy for each year and find the difference between the largest and smallest of these yearly means.
m = df.groupby(['top year']).nrgy.mean().sort_values().values
result = m[-1] - m[0]
print(result)
13.860000000000007
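An equivalent sketch (not the official answer) that takes max minus min of the per-year means directly:
yearly = df.groupby('top year').nrgy.mean()   # mean nrgy per top year
print(yearly.max() - yearly.min())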
Question10
Which artist appears with more than one artist type?
result = df[['artist','artist type']].value_counts().reset_index().artist.value_counts().index[0]
print(result)
Rudimental
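An alternative sketch (not the official answer) that counts distinct artist types per artist, which also shows whether more than one artist qualifies:
types_per_artist = df.groupby('artist')['artist type'].nunique()   # distinct types per artist
print(types_per_artist[types_per_artist > 1])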
Task Type 2#
Attention
Data description : motion-type classification from sensor data (target variable pose: 0 or 1)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv
Source (reference; data modified)
import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv")
test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv")
display(x_train.head())
display(y_train.head())
| | ID | motion_0 | motion_1 | motion_2 | motion_3 | motion_4 | motion_5 | motion_6 | motion_7 | motion_8 | ... | motion_54 | motion_55 | motion_56 | motion_57 | motion_58 | motion_59 | motion_60 | motion_61 | motion_62 | motion_63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | -2.0 | -1.0 | 4.0 | -5.0 | -4.0 | 1.0 | 0.0 | -15.0 | ... | 0.0 | -1.0 | -13.0 | -3.0 | 1.0 | -1.0 | -32.0 | -22.0 | -2.0 | -3.0 |
| 1 | 2 | 20.0 | 0.0 | 0.0 | 1.0 | 5.0 | 6.0 | -52.0 | 18.0 | 15.0 | ... | -70.0 | -55.0 | -38.0 | -14.0 | -12.0 | -8.0 | -34.0 | -63.0 | -87.0 | -77.0 |
| 2 | 4 | 1.0 | -1.0 | 1.0 | 4.0 | -5.0 | -8.0 | 1.0 | -3.0 | -14.0 | ... | 1.0 | 12.0 | -25.0 | 0.0 | 0.0 | 3.0 | 2.0 | -27.0 | 1.0 | 0.0 |
| 3 | 5 | 13.0 | 2.0 | 1.0 | -3.0 | 1.0 | 3.0 | 28.0 | 3.0 | 12.0 | ... | 0.0 | -21.0 | -17.0 | -2.0 | 0.0 | -4.0 | -17.0 | -21.0 | -21.0 | 25.0 |
| 4 | 6 | -2.0 | -7.0 | -4.0 | -8.0 | 16.0 | 44.0 | 1.0 | 3.0 | -16.0 | ... | -1.0 | 2.0 | -1.0 | 1.0 | 4.0 | 4.0 | -17.0 | -38.0 | -3.0 | 3.0 |
5 rows × 65 columns
| | ID | pose |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 0 |
| 2 | 4 | 1 |
| 3 | 5 | 0 |
| 4 | 6 | 1 |
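Before modeling, it can be worth checking the class balance of the target, which makes the validation AUC easier to interpret later. A minimal check (not part of the official answer), assuming y_train is the frame loaded above:
print(y_train['pose'].value_counts(normalize=True))   # share of each pose class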
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
# Drop the identifier column; it carries no signal for the model
x = x_train.drop(columns=['ID'])
test_drop = test.drop(columns=['ID'])

# Standardise the features with statistics fitted on the training set only
sc = StandardScaler()
sc.fit(x)
xs = sc.transform(x)
x_test_scaler = sc.transform(test_drop)

# Hold out a validation split to estimate performance
X_train, X_test, y_train, y_test = train_test_split(xs, y_train['pose'], test_size=0.33, random_state=42)

lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict_proba(X_test)
print('validation_auc : ', roc_auc_score(y_test, pred[:, 1]))

# # Change the prediction column and the candidate number below for your own submission
# # pd.DataFrame({'id': test.id, 'stroke': pred}).to_csv('003000000.csv', index=False)
pd.DataFrame({'id': test.ID, 'pose': lr.predict_proba(x_test_scaler)[:, 1]}).to_csv('003000000.csv', index=False)
validation_auc : 0.5633199559973924
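The logistic-regression baseline only reaches a validation AUC of about 0.56, so a non-linear model may be worth trying. A minimal sketch (not the official answer) with RandomForestClassifier, reusing the train/validation split and scaled test matrix from the cell above; the model choice and n_estimators value are my own assumptions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)                        # split variables from the previous cell
val_pred = rf.predict_proba(X_test)[:, 1]
print('validation_auc : ', roc_auc_score(y_test, val_pred))

# Scaling is not required for tree models, but reusing the scaled matrices is harmless.
# Submission in the same format as above:
# pd.DataFrame({'id': test.ID, 'pose': rf.predict_proba(x_test_scaler)[:, 1]}).to_csv('003000000.csv', index=False)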