
Data Lake & Big Data

Schema-on-Read, zone structure, Hadoop/Spark, Data Lakehouse


📚 Series navigation

Previous: DW System | Current: Data Lake | Next: ETL & Pipeline


🎯 What is a Data Lake?

Data Lake: a central repository that stores raw, unrefined data in its original form.

Key characteristics:

  • Stores source data as-is (Schema-on-Read)
  • Accepts structured, semi-structured, and unstructured data
  • Holds large volumes on low-cost storage
  • Defers transformation and analysis until the data is needed

🆚 DW vs Data Lake

| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data type | Structured | Structured / semi-structured / unstructured |
| Schema | Schema-on-Write | Schema-on-Read |
| When data is transformed | Before loading (ETL) | At time of use (ELT) |
| Storage cost | High | Low |
| Typical users | Business analysts | Data engineers, data scientists |
| Query performance | Fast (optimized) | Relatively slow |
| Data quality | High (cleansed) | Varies (raw) |

📊 Schema-on-Write vs Schema-on-Read

| Aspect | Schema-on-Write (DW) | Schema-on-Read (Data Lake) |
|---|---|---|
| Schema defined | Before storing | At query time |
| Flexibility | Low | High |
| Data quality | High | Varies |
| Write speed | Slow (data is transformed first) | Fast (raw data is stored as-is) |
| Query speed | Fast | Slow |
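
The contrast between the two approaches can be sketched in plain Python. This is a minimal illustration, not how any particular warehouse or lake engine works: the `SCHEMA` dictionary, the sample records, and the helper names (`apply_schema`, `schema_on_write`, `schema_on_read`) are all hypothetical. Schema-on-Write rejects a malformed record at load time, while Schema-on-Read keeps every raw line and lets each reader pick its own schema later.

```python
import json

# Hypothetical schema: field name -> type used to cast the raw value.
SCHEMA = {"user_id": int, "age": int, "department": str}

# Hypothetical raw events; the second record has no "age" field.
raw_lines = [
    '{"user_id": "1", "age": "34", "department": "sales"}',
    '{"user_id": "2", "department": "ops", "note": "age missing"}',
    '{"user_id": "3", "age": "29", "department": "sales", "extra": true}',
]


def apply_schema(record, schema):
    """Cast the fields named in the schema; raises if one is missing or invalid."""
    return {name: cast(record[name]) for name, cast in schema.items()}


def schema_on_write(lines, schema):
    """Validate before storing: records that do not fit are rejected at load time."""
    stored, rejected = [], []
    for line in lines:
        try:
            stored.append(apply_schema(json.loads(line), schema))
        except (KeyError, ValueError):
            rejected.append(line)
    return stored, rejected


def schema_on_read(lake, schema):
    """Apply a schema only while reading; records that do not fit are skipped."""
    rows = []
    for line in lake:
        try:
            rows.append(apply_schema(json.loads(line), schema))
        except (KeyError, ValueError):
            continue
    return rows


# Warehouse style: the schema is enforced once, when data is written.
warehouse_rows, rejected = schema_on_write(raw_lines, SCHEMA)

# Lake style: the "write" is just a copy of the raw lines.
lake = list(raw_lines)
lake_rows = schema_on_read(lake, SCHEMA)

# A later reader can use a different schema over the same raw data.
dept_rows = schema_on_read(lake, {"user_id": int, "department": str})
```

The lake keeps all three raw lines, so the second reader still sees the record that the stricter schema skipped. The price is that every read repeats the parsing and casting work, which is one reason the query-speed row favors Schema-on-Write.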

🏗️ Data Lake architecture

Zone structure

A Data Lake is usually divided into several zones.
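
As a rough illustration of zoning, the sketch below builds storage paths per zone. The zone names `raw`, `cleansed`, and `curated` are an assumption: the text here does not name the zones, and this is only one common convention. The `s3://bucket` base, the `dt=` partition key, and both helper functions are likewise hypothetical.

```python
from datetime import date

# Assumed zone names, ordered from least to most processed.
# The source text does not list them; this is one common convention.
ZONES = ("raw", "cleansed", "curated")


def zone_path(zone, dataset, day, base="s3://bucket"):
    """Build the storage prefix for one daily partition of a dataset in a zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{base}/{zone}/{dataset}/dt={day.isoformat()}/"


def next_zone(zone):
    """Return the zone that data is promoted to after processing."""
    index = ZONES.index(zone)
    if index == len(ZONES) - 1:
        raise ValueError(f"{zone} is the final zone")
    return ZONES[index + 1]


day = date(2024, 1, 15)
source = zone_path("raw", "orders", day)
target = zone_path(next_zone("raw"), "orders", day)
```

Keeping the zone as the first path segment makes it straightforward to attach different access policies and retention rules to each zone prefix.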


🔧 Big Data technology stack

Storage

| Technology | Description |
|---|---|
| HDFS | Hadoop Distributed File System, a distributed file system |
| Amazon S3 | Cloud object storage |
| Azure Data Lake Storage | Azure's data lake storage |
| Google Cloud Storage | GCP object storage |

์ฒ˜๋ฆฌ (Processing)โ€‹

๊ธฐ์ˆ ์œ ํ˜•์„ค๋ช…
Hadoop MapReduce๋ฐฐ์น˜๋Œ€์šฉ๋Ÿ‰ ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ (๋ ˆ๊ฑฐ์‹œ)
Apache Spark๋ฐฐ์น˜/์ŠคํŠธ๋ฆผ์ธ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ, ๋น ๋ฆ„
Apache Flink์ŠคํŠธ๋ฆผ์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆผ ์ฒ˜๋ฆฌ
Apache Kafka์ŠคํŠธ๋ฆผ๋ฉ”์‹œ์ง€ ํ, ์ด๋ฒคํŠธ ์ŠคํŠธ๋ฆฌ๋ฐ
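
To make the MapReduce entry concrete, here is a single-process sketch of the map, shuffle, and reduce phases using a word count. It only illustrates the programming model; it is not Hadoop's API, and the function names and sample input are hypothetical. In a real cluster each phase runs in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count([
    "spark reads data",
    "hadoop stores data",
    "Spark processes data",
])
```

Classic MapReduce writes intermediate results to disk between phases, whereas Spark can keep them in memory across steps, which accounts for much of the speed difference noted in the table.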

Query

| Technology | Description |
|---|---|
| Apache Hive | SQL on Hadoop, batch queries |
| Presto/Trino | Distributed SQL engine, fast queries |
| Apache Drill | Schema-free SQL queries |
| Amazon Athena | Serverless SQL over S3 |

🐘 Hadoop ecosystem


⚡ Apache Spark

Spark is currently among the most widely used big data processing engines.

Features:

  • In-memory processing: up to 100x faster than MapReduce on workloads that fit in memory
  • Unified batch and stream processing
  • Supports Python, Scala, Java, R, and SQL

Components:

| Component | Purpose |
|---|---|
| Spark Core | Base engine, RDDs |
| Spark SQL | Structured data processing |
| Spark Streaming | Real-time stream processing |
| MLlib | Machine learning library |
| GraphX | Graph processing |
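
```python
# PySpark example
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Read the CSV; inferSchema parses "age" as a number instead of a string
df = spark.read.csv("s3://bucket/data.csv", header=True, inferSchema=True)

# Transform: keep rows with age > 30, then count rows per department
result = df.filter(F.col("age") > 30).groupBy("department").count()

# Write the result as Parquet
result.write.parquet("s3://bucket/output/")
```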

🏢 Data Lakehouse

Data Lakehouse: an architecture that combines the strengths of a Data Lake and a Data Warehouse.

| Feature | Description |
|---|---|
| ACID support | Transactions on top of the Data Lake |
| Schema management | Metadata layer |
| Performance optimization | Indexing, caching |
| Unified access | Direct queries from BI tools |
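
One way to picture ACID transactions and a metadata layer on top of plain files is an append-only commit log. The sketch below is a toy model under stated assumptions, not the actual protocol of Delta Lake, Iceberg, or Hudi. The class `ToyTableLog`, the `_log` directory, and the file names are hypothetical. Each commit is one numbered JSON file, and a reader rebuilds the table's file list by replaying the log.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy append-only commit log that tracks which data files form a table."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        names = os.listdir(self.log_dir)
        return sorted(int(n.split(".")[0]) for n in names if n.endswith(".json"))

    def commit(self, add=(), remove=()):
        """Record one atomic change: files removed from and added to the table."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        path = os.path.join(self.log_dir, f"{version:020d}.json")
        # Mode "x" fails if the file already exists, so two writers on a local
        # filesystem cannot both claim the same version number.
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return version

    def snapshot(self, version=None):
        """Replay the log up to a version to get the table's current files."""
        files = set()
        for v in self._versions():
            if version is not None and v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                entry = json.load(f)
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return sorted(files)


with tempfile.TemporaryDirectory() as root:
    log = ToyTableLog(root)
    v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
    v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])
    latest = log.snapshot()
    as_of_v0 = log.snapshot(version=v0)
```

Because readers only trust files listed in committed log entries, a half-written data file is never visible, and replaying the log up to an older version gives a consistent earlier snapshot. Object stores do not all offer the same create-if-absent guarantee as a local filesystem, so real table formats need additional coordination that this sketch leaves out.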

Key technologies:

| Technology | Description |
|---|---|
| Delta Lake | Originated at Databricks, Spark-based |
| Apache Iceberg | Open-sourced by Netflix |
| Apache Hudi | Open-sourced by Uber |

☁️ Cloud Data Lake services

| Cloud | Services | Description |
|---|---|---|
| AWS | S3 + Athena + Glue | S3-based, serverless queries |
| Azure | ADLS + Synapse | Integrated analytics service |
| GCP | GCS + BigQuery | Serverless DW |
| Databricks | Multi-cloud | Spark-based unified platform |
