아파치 스파크(Apache Spark)

빅데이터를 처리하기 위한 하나의 시스템이다.
데이터가 늘어나면 늘어날수록 속도, 시간, 비용 여러면에서 효율적으로 처리해야한다.
다양한 프로그래밍 언어의 API를 제공한다.
머신러닝 등 다양한 패키지가 존재한다.

맵 리듀스(Map Reduce)

데이터가 방대한 양으로 늘어날 때 처리하는 방식에 이슈가 생길 수 있다.
이슈들을 보완하기 위해 데이터가 여러군데 분산 처리 되어있는 형태로 저장되어 있는데, S3 버킷에 저장한 방식처럼 파티션으로 구분된 데이터를 함수나 어떠한 방식에 의해서 맵핑을 하여 필요한 부분만을 줄이는 리듀스 과정을 거치게 된다.

예제

구글 같이 다양한 웹 페이지를 크롤링해서 각 페이지의 노출 랭킹을 분석해야 하는 페이지 랭크(Page Rank)라는 알고리즘을 사용할 때, HTML 안에 들어가는 태그 등의 문법적인 요소들과 컨텐츠를 한 곳에 몰아서 분석하기 보다 아래 그림과 같이 인풋을 병렬적으로 나누어 진행하고 그 다음 어떠한 셔플링 프로세스를 통해서 리듀스하여 결과를 낸다.

AWS EMR

키 페어 생성

pem 생성 후 프로젝트 디렉토리에 위치시킨다.

chmod로 권한을 부여한다.

1	`$ chmod og-rwx 6mini.pem`

마스터 보안 그룹 접속

인바운드 룰 SSH 추가

SSH 이용 웹 연결 활성화

$ ssh -i 6mini.pem hadoop@ec2-15-165-205-235.ap-northeast-2.compute.amazonaws.com

On-cluster application user interfaces에서 제플린 접속

하… 깜빡하고 인스턴스 종료를 안하고 하루종일 켜놨더니 24000원 바로 청구되어버렸다..ㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋ
오늘의 교훈 인스턴스 중지 잘하기!

제플린 파이스파크(Zeppelin Pyspark)

RDD

2번째 셀까지는 파이썬에서 진행하던 방식이다.
3번째 셀부터 스파크의 방식이다.
스파크는 RDD를 기반으로 방대한 데이터를 분산시켜서 맵핑한 후 어플라이해서 얻은 값을 통합하여 받는 구조이다.
스파크 컨텍스트(Spark Context)를 통해 패럴렐라이즈(parallelize)하게 3개의 데이터를 쪼개서 RDD로 나누어 준 결과이다.

%pyspark

row = '{"name": "6mini"}'
import json
print(json.loads(row)['name'])
'''
6mini
'''

%pyspark
raw = ['{"name": "6mini"}', '{"name": "7mini"}', '{"name": "8mini"}']
for entry in raw:
    print(json.loads(entry)["name"])
'''
6mini
7mini
8mini
'''

%pyspark

rdd = sc.parallelize(raw)
rdd.count()
'''
3
'''

%pyspark

rdd.map(json.loads).map(lambda entry: entry['name']).collect()
'''
['6mini', '7mini', '8mini']
'''

데이터프레임(dataFrame)

rdd로 쪼개서 map함수를 통해 쪼개놓은 데이터에 각각 맵핑 시켜주고 선택한다.
sqlContext를 통해 S3에 이전에 저장해 놓았던 데이터를 데이터프레임 형태로 불러와 작업한다.
printSchema()함수를 통해 각각의 값들이 어떤 형태로 들어있는지에 주의를 갖고 살펴봐야 추후에 작업에 어려움이 없다.

%pyspark

raw = sqlContext.read.format('parquet').load('s3://6mini-spotify/top-tracks/dt=2021-10-05/top-tracks.parquet')
print(type(raw))
'''
<class 'pyspark.sql.dataframe.DataFrame'>
'''

%pyspark

raw.printSchema()
'''
root
 |-- id: string (nullable = true)
 |-- artist_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- popularity: long (nullable = true)
 |-- external_url: string (nullable = true)
'''

%pyspark

df = raw.toDF('id', 'artist_id', 'name', 'popularity', 'external_url')
df.show()
'''
+--------------------+--------------------+--------------------+----------+--------------------+
|                  id|           artist_id|                name|popularity|        external_url|
+--------------------+--------------------+--------------------+----------+--------------------+
|3FeVmId7tL5YN8B7R...|3Nrfpe0tUJi4K4DXY...|         My Universe|        92|https://open.spot...|
|3FeVmId7tL5YN8B7R...|3Nrfpe0tUJi4K4DXY...|         My Universe|        92|https://open.spot...|
|3FeVmId7tL5YN8B7R...|3Nrfpe0tUJi4K4DXY...|         My Universe|        92|https://open.spot...|
|3FeVmId7tL5YN8B7R...|3Nrfpe0tUJi4K4DXY...|         My Universe|        92|https://open.spot...|
|2bgTY4UwhfBYhGT4H...|3Nrfpe0tUJi4K4DXY...|              Butter|        91|https://open.spot...|
|2bgTY4UwhfBYhGT4H...|3Nrfpe0tUJi4K4DXY...|              Butter|        91|https://open.spot...|
|2bgTY4UwhfBYhGT4H...|3Nrfpe0tUJi4K4DXY...|              Butter|        91|https://open.spot...|
|2bgTY4UwhfBYhGT4H...|3Nrfpe0tUJi4K4DXY...|              Butter|        91|https://open.spot...|
|5eXBXreN3d1zdj6Sa...|3Nrfpe0tUJi4K4DXY...| Permission to Dance|        81|https://open.spot...|
|5eXBXreN3d1zdj6Sa...|3Nrfpe0tUJi4K4DXY...| Permission to Dance|        81|https://open.spot...|
|5eXBXreN3d1zdj6Sa...|3Nrfpe0tUJi4K4DXY...| Permission to Dance|        81|https://open.spot...|
|5eXBXreN3d1zdj6Sa...|3Nrfpe0tUJi4K4DXY...| Permission to Dance|        81|https://open.spot...|
|3JR0ducCexKbeK47g...|3Nrfpe0tUJi4K4DXY...|Butter (feat. Meg...|        87|https://open.spot...|
|3JR0ducCexKbeK47g...|3Nrfpe0tUJi4K4DXY...|Butter (feat. Meg...|        87|https://open.spot...|
|3JR0ducCexKbeK47g...|3Nrfpe0tUJi4K4DXY...|Butter (feat. Meg...|        87|https://open.spot...|
|3JR0ducCexKbeK47g...|3Nrfpe0tUJi4K4DXY...|Butter (feat. Meg...|        87|https://open.spot...|
|4saklk6nie3yiGePp...|3Nrfpe0tUJi4K4DXY...|            Dynamite|        87|https://open.spot...|
|4saklk6nie3yiGePp...|3Nrfpe0tUJi4K4DXY...|            Dynamite|        87|https://open.spot...|
|4saklk6nie3yiGePp...|3Nrfpe0tUJi4K4DXY...|            Dynamite|        87|https://open.spot...|
|4saklk6nie3yiGePp...|3Nrfpe0tUJi4K4DXY...|            Dynamite|        87|https://open.spot...|
+--------------------+--------------------+--------------------+----------+--------------------+
only showing top 20 rows
'''

컬럼 추출

%pyspark

raw = sqlContext.read.format('parquet').load('s3://6mini-spotify/audio-features/dt=2021-10-05/top-tracks.parquet')
raw.printSchema()
'''
root
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: long (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: long (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- uri: string (nullable = true)
 |-- track_href: string (nullable = true)
 |-- analysis_url: string (nullable = true)
 |-- duration_ms: long (nullable = true)
 |-- time_signature: long (nullable = true)
'''

%pyspark

df1 = raw.toDF('danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature')
df1.show()
'''
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|  tempo|          type|                  id|                 uri|          track_href|        analysis_url|duration_ms|time_signature|
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
only showing top 20 rows
'''

%pyspark

df2 = df1.select(df1['danceability'], df1['id'], df1['acousticness'])
df2.show()
'''
+------------+--------------------+------------+
|danceability|                  id|acousticness|
+------------+--------------------+------------+
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
+------------+--------------------+------------+
only showing top 20 rows
'''

컬럼 필터링

이렇게 불러온 데이터를 내장 함수를 활용해 기본적인 통계량값을 계산할 수 있다.
필요에 따라 사용자 정의 함수(UDF: User Definition Function)를 사용하여 전처리를 할 수 있다.
UDF를 통해 정의한 함수를 불리언 값으로 처리하여 다음과 같이 필터 함수에 적용시켜 컨디션을 줄 수도 있다.

%pyspark

df3 = df2.filter(df2['danceability'] >= 0.30)
df3.show()
'''
+------------+--------------------+------------+
|danceability|                  id|acousticness|
+------------+--------------------+------------+
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.588|3FeVmId7tL5YN8B7R...|     0.00813|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.759|2bgTY4UwhfBYhGT4H...|     0.00323|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.702|5eXBXreN3d1zdj6Sa...|     0.00544|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.739|3JR0ducCexKbeK47g...|     0.00308|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
|       0.746|4saklk6nie3yiGePp...|      0.0112|
+------------+--------------------+------------+
only showing top 20 rows
'''

%pyspark

df3 = df2.filter((df2['danceability'] >= 0.30) & (df2['acousticness'] >= 0.1)).distinct()
df3.show()
'''
+------------+--------------------+------------+
|danceability|                  id|acousticness|
+------------+--------------------+------------+
|       0.711|3ocm1Cf1Dk1ODrdBd...|       0.434|
|       0.543|0GjEhVFGZW8afUYGC...|       0.358|
|       0.724|15gU2OepIEbeNTBbe...|       0.141|
|       0.707|15XyYC19Xm12H6Ng3...|       0.561|
|       0.571|2IY559smG7SXYk229...|       0.151|
|       0.522|5LNkpie03tsGvZgFx...|       0.412|
|       0.728|3zC02nmo5OrDDMW3U...|       0.269|
|       0.499|3UHPGOkUcE4hE7sqB...|       0.296|
|       0.713|07KYRDFf8Q6sqj4PW...|        0.22|
|       0.772|2zoobJFEB9h15fjYj...|       0.158|
|        0.73|5RuCWf4bJC0rqA05b...|       0.259|
|       0.608|1b2IHNs8Rzo2ICibU...|        0.45|
|        0.54|6gQUbFwwdYXlKdmqR...|       0.435|
|       0.736|7abGerACQx9G1OBqH...|       0.182|
|       0.764|0J2p4KYdr6Mg4ET6J...|       0.457|
|       0.677|1nnhOD9S7wG8FPDMR...|       0.508|
|       0.482|49i5HlNydX8qTJQOy...|       0.298|
|       0.892|7JeKXMQKm6GoLGTkN...|       0.244|
|       0.719|1l3fRzoUngJr6hSMI...|       0.128|
|       0.609|0vQ1iCCAYNCy92g9Q...|       0.123|
+------------+--------------------+------------+
only showing top 20 rows
'''

UDF 생성

%pyspark

import pyspark.sql.functions as F

df_new = df1.select(df1['danceability'], df1['acousticness']).agg(F.avg(df1['danceability']).alias('avg_danability'), F.max(df1['acousticness']).alias('max_acousticness'))
df_new.show()
'''
+------------------+----------------+
|    avg_danability|max_acousticness|
+------------------+----------------+
|0.6186399999999985|           0.734|
+------------------+----------------+
'''

%pyspark

from pyspark.sql.functions import udf
from pyspark.sql.types import *

udf1 = udf(lambda e: e.upper())

@udf(returnType=BooleanType())
def udf2(e):
    if e >= 0.06:
        return True
    else:
        return False
        
df_filtered = df1.filter(udf2(df1['danceability']))
df_filtered.show()
'''
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|  tempo|          type|                  id|                 uri|          track_href|        analysis_url|duration_ms|time_signature|
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.588| 0.701|  9|   -6.39|   1|     0.0402|     0.00813|             0.0|     0.2|  0.443|104.988|audio_features|3FeVmId7tL5YN8B7R...|spotify:track:3Fe...|https://api.spoti...|https://api.spoti...|     228000|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.759| 0.459|  8|  -5.187|   1|     0.0948|     0.00323|             0.0|  0.0906|  0.695|109.997|audio_features|2bgTY4UwhfBYhGT4H...|spotify:track:2bg...|https://api.spoti...|https://api.spoti...|     164442|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.702| 0.741|  9|   -5.33|   1|     0.0427|     0.00544|             0.0|   0.337|  0.646|124.925|audio_features|5eXBXreN3d1zdj6Sa...|spotify:track:5eX...|https://api.spoti...|https://api.spoti...|     187585|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.739| 0.588|  8|  -4.971|   1|      0.136|     0.00308|             0.0|   0.109|  0.774|109.958|audio_features|3JR0ducCexKbeK47g...|spotify:track:3JR...|https://api.spoti...|https://api.spoti...|     164457|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
|       0.746| 0.765|  6|   -4.41|   0|     0.0993|      0.0112|             0.0|  0.0936|  0.737|114.044|audio_features|4saklk6nie3yiGePp...|spotify:track:4sa...|https://api.spoti...|https://api.spoti...|     199054|             4|
+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
only showing top 20 rows
'''

JOIN

S3에 저장해 놓은 모든 데이터들을 병합한 마스터 테이블을 만들기 위해 아래와 같은 작업을 한다.
artists parquet 데이터를 불러와 아래와 같이 데이터프레임으로 만들 수 있다.

# AWS-S3.py
def main():
    try:
        conn = psycopg2.connect(
            host=host,
            database=database,
            user=username,
            password=password)
        cursor = conn.cursor()
    except:
        logging.error("could not connect to rds")
        sys.exit(1)

    # artists parquet
    cursor.execute('SELECT * FROM artists')
    colnames = [d[0] for d in cursor.description]
    artists = [dict(zip(colnames, row)) for row in cursor.fetchall()]
    artists = pd.DataFrame(artists)

    artists.to_parquet('artists.parquet', engine='pyarrow', compression='snappy')

    dt = datetime.utcnow().strftime('%Y-%m-%d')
    s3 = boto3.resource('s3')
    object = s3.Object('6mini-spotify', 'artists/dt={}/artists.parquet'.format(dt))
    data = open('artists.parquet', 'rb')
    object.put(Body=data)

    sys.exit()

%pyspark

artists = sqlContext.read.format('parquet').load('s3://6mini-spotify/artists/dt=2021-10-15/artists.parquet')
artists = artists.toDF('id', 'name', 'followers', 'popularity', 'url', 'image_url')
artists.show()
'''
+--------------------+--------------------+---------+----------+--------------------+--------------------+
|                  id|                name|followers|popularity|                 url|           image_url|
+--------------------+--------------------+---------+----------+--------------------+--------------------+
|1ZwdS5xdxEREPySFr...|                2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|
|3q7HBObVc0L8jNeTe...|             50 Cent|  7820442|        83|https://open.spot...|https://i.scdn.co...|
|55RI2GNCfyXr0f14u...|   A Thousand Horses|    69243|        45|https://open.spot...|https://i.scdn.co...|
|0LcJLqbBmaGUft1e9...|                ABBA|  6595341|        83|https://open.spot...|https://i.scdn.co...|
|2s79xe5F6eUQkjwjw...|                 ABC|   276038|        54|https://open.spot...|https://i.scdn.co...|
|7Ey4PD4MYsKc5I2do...|           Aerosmith| 10977177|        78|https://open.spot...|https://i.scdn.co...|
|7fUtt9kVZOyn9LWy0...|    Agnetha Fältskog|    50533|        48|https://open.spot...|https://i.scdn.co...|
|4mxWe1mtYIYfP040G...|        Alan Jackson|  1690446|        75|https://open.spot...|https://i.scdn.co...|
|5aygfDCEaX5KTZOxS...|         Albert King|   347862|        53|https://open.spot...|https://i.scdn.co...|
|3EhbVgyfGd7Hkpsag...|        Alice Cooper|  2709110|        66|https://open.spot...|https://i.scdn.co...|
|5J6L7N6B4nI1M5cwa...|       Alison Krauss|   275520|        64|https://open.spot...|https://i.scdn.co...|
|3vAaWhdBR38Q02ohX...|The All-American ...|  2180618|        70|https://open.spot...|https://i.scdn.co...|
|6Q192DXotxtaysaqN...|       Amy Winehouse|  7441082|        77|https://open.spot...|https://i.scdn.co...|
|3EA9hVIzKfFiQI0Ki...|      Andrea Bocelli|  1921448|        70|https://open.spot...|https://i.scdn.co...|
|4YJR4xviDKHoelt9W...|         Andrew W.K.|   130409|        57|https://open.spot...|https://i.scdn.co...|
|3JysSUOyfVs1UQ0Ua...|             Anthrax|  1207275|        59|https://open.spot...|https://i.scdn.co...|
|4sLklwvfOYr12fMGw...|       Apache Indian|    12210|        40|https://open.spot...|https://i.scdn.co...|
|3kjuyTCjPG1WMFCiy...|         Arcade Fire|  1986628|        68|https://open.spot...|https://i.scdn.co...|
|66CXWjxzNUsdJxJ2J...|       Ariana Grande| 70012839|        93|https://open.spot...|https://i.scdn.co...|
|5Va9LuEmaZxnbk1gM...|Arrested Development|   190409|        55|https://open.spot...|https://i.scdn.co...|
+--------------------+--------------------+---------+----------+--------------------+--------------------+
only showing top 20 rows
'''

top_tracks = sqlContext.read.format('parquet').load('s3://6mini-spotify/top-tracks/dt=2021-10-05/top-tracks.parquet')
top_tracks = top_tracks.toDF('id', 'artist_id', 'name', 'popularity', 'external_url')
top_tracks = top_tracks.withColumnRenamed('id', 'track_id').withColumnRenamed('name', 'track_name')

joined = artists.join(top_tracks, top_tracks['artist_id'] == artists['id'])
joined.registerTempTable('joined')
joined.show()
'''
+--------------------+----+---------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+
|                  id|name|followers|popularity|                 url|           image_url|            track_id|           artist_id|          track_name|popularity|        external_url|
+--------------------+----+---------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...|        I Get Around|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...|        I Get Around|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...|        I Get Around|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...|        I Get Around|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|       Ghetto Gospel|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|       Ghetto Gospel|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|       Ghetto Gospel|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|       Ghetto Gospel|        68|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|0XRbYXQUymj9SJkrr...|1ZwdS5xdxEREPySFr...|     Keep Ya Head Up|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|0XRbYXQUymj9SJkrr...|1ZwdS5xdxEREPySFr...|     Keep Ya Head Up|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|0XRbYXQUymj9SJkrr...|1ZwdS5xdxEREPySFr...|     Keep Ya Head Up|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|0XRbYXQUymj9SJkrr...|1ZwdS5xdxEREPySFr...|     Keep Ya Head Up|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|3EFPGmCt5svQx5LAV...|1ZwdS5xdxEREPySFr...|Ambitionz Az A Ridah|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|3EFPGmCt5svQx5LAV...|1ZwdS5xdxEREPySFr...|Ambitionz Az A Ridah|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|3EFPGmCt5svQx5LAV...|1ZwdS5xdxEREPySFr...|Ambitionz Az A Ridah|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|3EFPGmCt5svQx5LAV...|1ZwdS5xdxEREPySFr...|Ambitionz Az A Ridah|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|1ofhfV90EnYhEr7Un...|1ZwdS5xdxEREPySFr...|             Changes|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|1ofhfV90EnYhEr7Un...|1ZwdS5xdxEREPySFr...|             Changes|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|1ofhfV90EnYhEr7Un...|1ZwdS5xdxEREPySFr...|             Changes|        69|https://open.spot...|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|1ofhfV90EnYhEr7Un...|1ZwdS5xdxEREPySFr...|             Changes|        69|https://open.spot...|
+--------------------+----+---------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+
only showing top 20 rows
'''

SQL

최종적으로 artists, top-tracks, 그리고 audio_features 모두 병합한 테이블을 만들 것 이다.
제플린의 장점 중 하나는 바로 SQL 테이블로 지정하여 쿼리문으로 작업할 수 있는 것이다.
해당 값을 바로 시각화할 수 있는 점도 장점 중의 하나이다.
옵션에서 없는 그래프를 그리는 것은 파이썬이나 다른 라이브러리를 활용하여 보완할 수 있다.

%pyspark

audio = sqlContext.read.format('parquet').load('s3://6mini-spotify/audio-features/dt=2021-10-05/top-tracks.parquet')
audio = audio.toDF('danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature')
audio = audio.withColumnRenamed('id', 'audio_id')

master = joined.join(audio, joined['track_id'] == audio['audio_id'])
master.registerTempTable('master')
master.show()
'''
+--------------------+----+---------+----------+--------------------+--------------------+--------------------+--------------------+-------------+----------------+--------------------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
|                  id|name|followers|popularity|                 url|           image_url|            track_id|           artist_id|   track_name|track_popularity|        external_url|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence| tempo|          type|            audio_id|                 uri|          track_href|        analysis_url|duration_ms|time_signature|
+--------------------+----+---------+----------+--------------------+--------------------+--------------------+--------------------+-------------+----------------+--------------------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|4voEoczU7Ijborps9...|1ZwdS5xdxEREPySFr...| I Get Around|              68|https://open.spot...|       0.846| 0.416| 10| -13.767|   0|      0.118|       0.156|         9.34E-6|   0.238|  0.645|96.188|audio_features|4voEoczU7Ijborps9...|spotify:track:4vo...|https://api.spoti...|https://api.spoti...|     259000|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|Ghetto Gospel|              68|https://open.spot...|       0.794| 0.614|  5|  -5.352|   0|     0.0467|      0.0964|             0.0|  0.0788|  0.663|80.569|audio_features|7jLbTp3qZzah9kMId...|spotify:track:7jL...|https://api.spoti...|https://api.spoti...|     238053|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|Ghetto Gospel|              68|https://open.spot...|       0.794| 0.614|  5|  -5.352|   0|     0.0467|      0.0964|             0.0|  0.0788|  0.663|80.569|audio_features|7jLbTp3qZzah9kMId...|spotify:track:7jL...|https://api.spoti...|https://api.spoti...|     238053|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|Ghetto Gospel|              68|https://open.spot...|       0.794| 0.614|  5|  -5.352|   0|     0.0467|      0.0964|             0.0|  0.0788|  0.663|80.569|audio_features|7jLbTp3qZzah9kMId...|spotify:track:7jL...|https://api.spoti...|https://api.spoti...|     238053|             4|
|1ZwdS5xdxEREPySFr...|2Pac| 11320693|        81|https://open.spot...|https://i.scdn.co...|7jLbTp3qZzah9kMId...|1ZwdS5xdxEREPySFr...|Ghetto Gospel|              68|https://open.spot...|       0.794| 0.614|  5|  -5.352|   0|     0.0467|      0.0964|             0.0|  0.0788|  0.663|80.569|audio_features|7jLbTp3qZzah9kMId...|spotify:track:7jL...|https://api.spoti...|https://api.spoti...|     238053|             4|
+--------------------+----+---------+----------+--------------------+--------------------+--------------------+--------------------+-------------+----------------+--------------------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------+
only showing top 20 rows
'''

데이터 분석

%sql

SELECT name, popularity, AVG(abs(popularity-track_popularity)) AS diff
FROM master
GROUP BY 1, 2
ORDER BY 3 ASC
LIMIT 20

%sql

SELECT track_popularity, count(*)
FROM master
GROUP BY 1
ORDER BY 1 ASC

%sql

SELECT AVG(acousticness), AVG(liveness), AVG(speechiness), AVG(tempo)
FROM master
WHERE popularity > 80

데이터의 audio feature의 분포를 통해, 예를 들어 가수의 인기도와 트랙의 인기도의 차이가 거의 없는 해당 가수의 대표적인 트랙을 알고 싶다면 아래와 같은 EDA를 먼저 실행하여 audio feature의 특징을 파악하는 것이 중요하다.
아래 그림처럼 acousticness는 전체적으로 0쪽으로 치우쳐있어 중심을 대표하는 값으로는 median을 사용해야 될 것이라고 판단할 수 있으며, danceability는 정규분포 꼴을 띄고 있어 mean을 사용해도 무방할 것으로 판단 할 수 있다.

%sql

SELECT AVG(acousticness), AVG(liveness), AVG(speechiness), AVG(tempo)
FROM master
WHERE popularity > ${popularity=80}
AND track_popularity > ${track_popularity=80}

SELECT ROUND(acousticness, 2), count(*)
FROM master
GROUP BY 1
ORDER BY 1 ASC

SELECT ROUND(danceability, 2), count(*)
FROM master
GROUP BY 1
ORDER BY 1 ASC