Published 2021-02-15 | Updated 2021-11-04 | Developer Study / Kaggle

[Kaggle Study] Bike Sharing Demand

A Kaggle problem walkthrough. The goal is to predict bike rental demand from various information about a city's bike sharing system.

Problem link

1. Dataset

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - one of the following categories:
    1: Clear, Few clouds, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - “feels like” temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals

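The training target y is the count column.
The train dataset contains casual and registered, but the test dataset does not. I was a little unsure whether simply dropping them is enough; the baseline in section 2 does exactly that.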
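
One way to look at the question above, as a minimal sketch: in this dataset count is the total of casual and registered, so keeping them as features would leak the target, and they are not available at prediction time anyway. The small DataFrame below is made-up illustration data (the name toy_train and its values are assumptions, not rows taken from train.csv).

```python
import pandas as pd

# Hypothetical toy rows that mimic the train.csv layout (not real data).
toy_train = pd.DataFrame({
    "temp": [9.84, 9.02, 14.76],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# Check whether count is simply casual + registered for every row.
is_sum = (toy_train["casual"] + toy_train["registered"] == toy_train["count"]).all()
print(is_sum)

# Keep only the columns that also exist in the test set.
features = toy_train.drop(columns=["count", "casual", "registered"])
target = toy_train["count"]
```

Running the same check on the real train frame is a quick way to confirm the relationship before dropping the two columns.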

2. Setting a baseline

import pandas as pd

# Read the CSV files.
# Parse the date/time column automatically while reading: https://rfriend.tistory.com/536
train = pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv", parse_dates=["datetime"])
test = pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv", parse_dates=["datetime"])

display(train, test)

# Check the info of the loaded datasets.
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    6493 non-null   datetime64[ns]
 1   season      6493 non-null   int64         
 2   holiday     6493 non-null   int64         
 3   workingday  6493 non-null   int64         
 4   weather     6493 non-null   int64         
 5   temp        6493 non-null   float64       
 6   atemp       6493 non-null   float64       
 7   humidity    6493 non-null   int64         
 8   windspeed   6493 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(5)
memory usage: 456.7 KB

Use the pandas.Series.dt accessor to split the datetime information into separate columns.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.date.html

# Preprocess the date data
train["year"] = train["datetime"].dt.year
test["year"] = test["datetime"].dt.year

train["month"] = train["datetime"].dt.month
test["month"] = test["datetime"].dt.month

# Note: "day" holds the day of the week (0 = Monday ... 6 = Sunday), not the day of the month
train["day"] = train["datetime"].dt.weekday
test["day"] = test["datetime"].dt.weekday

train["hour"] = train["datetime"].dt.hour
test["hour"] = test["datetime"].dt.hour
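
Since the same four assignments are applied to both train and test, they could be wrapped in one helper. This is only a sketch: the function name add_datetime_features and the two-row sample frame are hypothetical, not part of the original notebook.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with year, month, day (weekday) and hour columns."""
    out = df.copy()
    out["year"] = out["datetime"].dt.year
    out["month"] = out["datetime"].dt.month
    out["day"] = out["datetime"].dt.weekday  # 0 = Monday ... 6 = Sunday
    out["hour"] = out["datetime"].dt.hour
    return out

# Hypothetical two-row sample with an already parsed datetime column.
sample = pd.DataFrame(
    {"datetime": pd.to_datetime(["2011-01-01 05:00:00", "2012-12-19 23:00:00"])}
)
sample = add_datetime_features(sample)
print(sample)
```

With such a helper, train = add_datetime_features(train) and test = add_datetime_features(test) would keep the two frames in sync.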

Drop the original datetime column, then train a RandomForestRegressor on count.

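# Drop the target and the columns that do not exist in the test set.
train2 = train.drop(["count", "datetime", "registered", "casual"], axis=1)
test2 = test.drop(["datetime"], axis=1)

from sklearn.ensemble import RandomForestRegressor

# Train a random forest with default hyperparameters, using all CPU cores.
rf = RandomForestRegressor(n_jobs=-1)
rf.fit(train2, train["count"])
# Predict count for every row of the test set.
result = rf.predict(test2)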
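
To get a leaderboard score, the predictions have to be saved in a submission file. The sketch below assumes the submission needs a datetime column and a count column; that layout is an assumption based on the competition's sample submission and is not shown in the original post. The names test_toy and result_toy are hypothetical stand-ins for test and result.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test frame and the rf.predict output.
test_toy = pd.DataFrame(
    {"datetime": pd.to_datetime(["2011-01-20 00:00:00", "2011-01-20 01:00:00"])}
)
result_toy = np.array([11.3, 5.1])

# Assumed submission layout: one row per test datetime with its predicted count.
submission = pd.DataFrame({"datetime": test_toy["datetime"], "count": result_toy})

# Written to an in-memory buffer here; in a notebook, pass a file path instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```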

This gives a score of roughly 0.46 (RMSLE, the competition's metric, where lower is better), which is around 1000th place on the leaderboard.
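
The competition is scored with RMSLE, sqrt(mean((log(p + 1) - log(a + 1))^2)) for predictions p and actual values a. A minimal sketch for computing it locally, for example on a held-out part of the train set, could look like this. The function name rmsle and the sample values are illustrative assumptions, not code from the original notebook.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error; lower is better."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) computes log(1 + x), which keeps zero counts well-defined.
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up actual and predicted counts, only to show the call.
print(rmsle([16, 40, 32], [14, 45, 30]))
```

Because the error is taken on a log scale, being off by 10 rentals matters far more at a quiet hour than at a busy one.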
