Kaggle competition practice: a study of the M5 baseline

 

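The baseline uses a LightGBM model.

Data preparation and training

Loading the calendar.csv dataset.

This dataset contains the dates on which products are sold, along with the events and SNAP flags that apply to each date.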

  • date: The date in a “y-m-d” format.
  • wm_yr_wk: The id of the week the date belongs to.
  • weekday: The type of the day (Saturday, Sunday, …, Friday).
  • wday: The id of the weekday, starting from Saturday.
  • month: The month of the date.
  • year: The year of the date.
  • event_name_1: If the date includes an event, the name of this event.
  • event_type_1: If the date includes an event, the type of this event.
  • event_name_2: If the date includes a second event, the name of this event.
  • event_type_2: If the date includes a second event, the type of this event.
  • snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.
import pandas as pd

# Correct data types for "calendar.csv"
calendarDTypes = {"event_name_1": "category", 
                  "event_name_2": "category", 
                  "event_type_1": "category", 
                  "event_type_2": "category", 
                  "weekday": "category", 
                  'wm_yr_wk': 'int16', 
                  "wday": "int16",
                  "month": "int16", 
                  "year": "int16", 
                  "snap_CA": "float32", 
                  'snap_TX': 'float32', 
                  'snap_WI': 'float32' }

# Read csv file
calendar = pd.read_csv("./calendar.csv", 
                       dtype = calendarDTypes)
calendar["date"] = pd.to_datetime(calendar["date"])
calendar.head(10)
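To illustrate why the dtype mapping is passed to read_csv, here is a minimal sketch on a made-up three-row stand-in for calendar.csv. The rows, the toy_dtypes name and the in-memory CSV are assumptions for illustration only; the point is that int16 needs a quarter of the memory of the default int64.

```python
import io
import pandas as pd

# Toy stand-in for calendar.csv; the rows are made up for illustration.
csv_text = """date,wm_yr_wk,weekday,wday,event_name_1,snap_CA
2011-01-29,11101,Saturday,1,,0
2011-01-30,11101,Sunday,2,,0
2011-02-06,11102,Sunday,2,SuperBowl,1
"""

# Hypothetical subset of the mapping used above.
toy_dtypes = {"wm_yr_wk": "int16", "weekday": "category", "wday": "int16",
              "event_name_1": "category", "snap_CA": "float32"}

default_df = pd.read_csv(io.StringIO(csv_text))
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

# Compare the memory taken by one integer column under each dtype.
default_bytes = default_df["wm_yr_wk"].memory_usage(index=False)
typed_bytes = typed_df["wm_yr_wk"].memory_usage(index=False)
print(default_df["wm_yr_wk"].dtype, default_bytes)
print(typed_df["wm_yr_wk"].dtype, typed_bytes)
```

On the full M5 tables, which run to tens of millions of rows after reshaping, this difference is what keeps the data in memory.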

# Transform categorical features into integers
for col, colDType in calendarDTypes.items():
    if colDType == "category":
        calendar[col] = calendar[col].cat.codes.astype("int16")
        calendar[col] -= calendar[col].min()

calendar.head(10)
  • calendar[col].cat.codes.astype("int16") is a simple label encoding of the categories. Later we will try switching to one-hot encoding.
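The difference between the two encodings can be sketched on a small made-up series of event names (the values and variable names here are illustrative, not taken from calendar.csv). Note that cat.codes assigns -1 to missing values, which is why subtracting the minimum shifts "no event" to 0; for a column without missing values the minimum is already 0 and the shift does nothing.

```python
import pandas as pd

# Made-up event names; None marks a day without an event.
events = pd.Series(["SuperBowl", None, "Easter", None, "SuperBowl"],
                   dtype="category")

# Label encoding: categories are sorted, missing values get code -1.
codes = events.cat.codes.astype("int16")
print(codes.tolist())      # [1, -1, 0, -1, 1]

# Shifting by the minimum moves "no event" to 0.
shifted = codes - codes.min()
print(shifted.tolist())    # [2, 0, 1, 0, 2]

# One-hot encoding: one indicator column per category. Rows with a
# missing value are all zeros unless dummy_na=True is passed.
one_hot = pd.get_dummies(events, prefix="event", dtype="int8")
print(one_hot.columns.tolist())   # ['event_Easter', 'event_SuperBowl']
```

Label encoding keeps a single integer column, which tree models such as LightGBM can treat as categorical, while one-hot encoding adds one column per category and grows quickly for high-cardinality features such as item_id.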

sell_prices.csv

File 2: “sell_prices.csv”

This dataset contains the per-unit selling price of each product, given per store and per week.

  • store_id: The id of the store where the product is sold.
  • item_id: The id of the product.
  • wm_yr_wk: The id of the week.
  • sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change over time (in both the training and test sets).
# Correct data types for "sell_prices.csv"
priceDTypes = {"store_id": "category", 
               "item_id": "category", 
               "wm_yr_wk": "int16",
               "sell_price":"float32"}

# Read csv file
prices = pd.read_csv("./sell_prices.csv", 
                     dtype = priceDTypes)

prices.head()

# Transform categorical features into integers
for col, colDType in priceDTypes.items():
    if colDType == "category":
        prices[col] = prices[col].cat.codes.astype("int16")
        prices[col] -= prices[col].min()
        
prices.head()

sales_train_validation.csv

File 3: “sales_train.csv” (the file read below is sales_train_validation.csv)

Contains the historical daily unit sales data per product and store.

  • item_id: The id of the product.
  • dept_id: The id of the department the product belongs to.
  • cat_id: The id of the category the product belongs to.
  • store_id: The id of the store where the product is sold.
  • state_id: The State where the store is located.
  • d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.
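Since d_1 corresponds to 2011-01-29, any d_i column maps to a calendar date by adding i - 1 days. A minimal sketch using only the standard library (the helper name day_to_date is hypothetical, introduced here for illustration):

```python
from datetime import date, timedelta

FIRST_DATE = date(2011, 1, 29)  # date of d_1, per the description above


def day_to_date(day_index):
    """Return the calendar date for column d_<day_index> (1-based)."""
    return FIRST_DATE + timedelta(days=day_index - 1)


print(day_to_date(1))     # 2011-01-29
print(day_to_date(250))   # 2011-10-05
print(day_to_date(1913))  # 2016-04-24
```

With the firstDay and lastDay values used below, the training window therefore runs from 2011-10-05 to 2016-04-24.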
firstDay = 250
lastDay = 1913

# Use x sales days (columns) for training
numCols = [f"d_{day}" for day in range(firstDay, lastDay+1)]

# Define all categorical columns
catCols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']

# Define the correct data types for "sales_train_validation.csv"
dtype = {numCol: "float32" for numCol in numCols} 
dtype.update({catCol: "category" for catCol in catCols if catCol != "id"})

[(k,v)  for k,v in dtype.items()][:10]

# Read csv file
ds = pd.read_csv("./sales_train_validation.csv", 
                 usecols = catCols + numCols, dtype = dtype)

ds.head()

# Transform categorical features into integers
for col in catCols:
    if col != "id":
        ds[col] = ds[col].cat.codes.astype("int16")
        ds[col] -= ds[col].min()
        
ds = pd.melt(ds,
             id_vars = catCols,
             value_vars = [col for col in ds.columns if col.startswith("d_")],
             var_name = "d",
             value_name = "sales")

# Merge "ds" with "calendar" and "prices" dataframe
ds = ds.merge(calendar, on = "d", copy = False)
ds = ds.merge(prices, on = ["store_id", "item_id", "wm_yr_wk"], copy = False)

ds.head()
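The reshape and merge steps above can be traced on a toy example. All frames and values below are made up for illustration. The sketch shows that melt turns one column per day into one row per (product, day), and that the default inner merge with the price table drops rows for weeks in which a product has no price.

```python
import pandas as pd

# Toy wide-format sales table (values are made up).
wide = pd.DataFrame({
    "id": ["A_CA_1", "B_CA_1"],
    "item_id": [0, 1],
    "store_id": [0, 0],
    "d_1": [3.0, 0.0],
    "d_2": [1.0, 2.0],
})

# Wide -> long: one row per (product, day).
long = pd.melt(wide, id_vars=["id", "item_id", "store_id"],
               value_vars=["d_1", "d_2"], var_name="d", value_name="sales")

toy_calendar = pd.DataFrame({"d": ["d_1", "d_2"], "wm_yr_wk": [11101, 11101]})
# Item 1 has no price for week 11101, i.e. it was not on sale that week.
toy_prices = pd.DataFrame({"store_id": [0], "item_id": [0],
                           "wm_yr_wk": [11101], "sell_price": [2.5]})

merged = long.merge(toy_calendar, on="d")
merged = merged.merge(toy_prices, on=["store_id", "item_id", "wm_yr_wk"])
print(len(long), len(merged))  # 4 2
```

Passing how="left" to the second merge would instead keep those rows with a missing sell_price, which may or may not be what a given model wants.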
