[TensorFlow] Tutorial 5. Large-scale Linear Models with TensorFlow

Field of Study/Deep Learning 2016. 11. 9. 10:39

An example of training a linear model on a variety of features.

#-*- coding: utf-8 -*-

# Library for creating and using temporary files
import tempfile
# Library for fetching documents and files from the web
import urllib
# Create the temporary files.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()

# urlretrieve(URL, filename): downloads the file at the given URL and saves it under the given name
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

# Module for data analysis
import pandas as pd
# List of column names
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)

LABEL_COLUMN = "label"
df_train[LABEL_COLUMN] = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

# Columns that can be treated as categorical
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
# Columns that hold continuous values
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]

import tensorflow as tf

def input_fn(df):
  # Continuous and categorical values behave differently, so process them separately and then merge them

  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  # A continuous feature has exactly one value per example, so it is stored as a constant
  continuous_cols = {k: tf.constant(df[k].values)
                     for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  # For categorical data only the matching category is 1 and all others are 0, so it is sparse
  # e.g. 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}

  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols.items() + categorical_cols.items())
  # Converts the label column into a constant Tensor.
  label = tf.constant(df[LABEL_COLUMN].values)
  # Returns the feature columns and the label.
  return feature_cols, label

def train_input_fn():
  return input_fn(df_train)

def eval_input_fn():
  return input_fn(df_test)

# When there are only a few categories, or they are known in advance, use sparse_column_with_keys
# The keys are assigned IDs 0, 1, ... in the order given.
gender = tf.contrib.layers.sparse_column_with_keys(
  column_name="gender", keys=["Female", "Male"])

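# When the number of categories is not known exactly, use a hash bucket column
# Each value is hashed to one of hash_bucket_size bucket IDs, so the values need not be listed up front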
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)

race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=[
  "Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
marital_status = tf.contrib.layers.sparse_column_with_hash_bucket("marital_status", hash_bucket_size=100)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

age = tf.contrib.layers.real_valued_column("age")
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

education_x_occupation = tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4))

age_buckets_x_race_x_occupation = tf.contrib.layers.crossed_column(
  [age_buckets, race, occupation], hash_bucket_size=int(1e6))


model_dir = tempfile.mkdtemp()
# Build a LinearClassifier from the feature columns defined above
m = tf.contrib.learn.LinearClassifier(feature_columns=[
  gender, native_country, education, occupation, workclass, marital_status, race,
  age_buckets, education_x_occupation, age_buckets_x_race_x_occupation],
  model_dir=model_dir)

# Train with fit.
m.fit(input_fn=train_input_fn, steps=200)

# Check the trained model on the test set.
results = m.evaluate(input_fn=eval_input_fn, steps=1)

# Print the results
for key in sorted(results):
    print "%s: %s" % (key, results[key])
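
Each feature column in the code above (sparse_column_with_keys, sparse_column_with_hash_bucket, bucketized_column, crossed_column) turns a raw value into an integer ID. The following is only a rough pure-Python sketch of that idea, not TensorFlow's implementation. The helper names, the md5 hash, and the "_X_" separator are assumptions made for illustration; TensorFlow uses its own fingerprint function, so the real bucket numbers will differ. "Bachelors" and "Exec-managerial" are sample values from the education and occupation columns.

```python
from __future__ import print_function

import bisect
import hashlib


def keys_id(value, keys):
    """Vocabulary lookup: position of the value in the key list, -1 here if unseen."""
    return keys.index(value) if value in keys else -1


def hash_bucket_id(value, hash_bucket_size):
    """Map a string to a bucket in [0, hash_bucket_size) with a stable hash.

    md5 is only a stand-in; it is not the hash TensorFlow uses.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def bucketize(value, boundaries):
    """Index of the interval containing value; len(boundaries) + 1 buckets in total."""
    return bisect.bisect_right(boundaries, value)


def crossed_id(values, hash_bucket_size):
    """Hash the combination of several feature values into a single bucket."""
    return hash_bucket_id("_X_".join(str(v) for v in values), hash_bucket_size)


AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

gender_id = keys_id("Male", ["Female", "Male"])
education_id = hash_bucket_id("Bachelors", 1000)
age_bucket = bucketize(37, AGE_BOUNDARIES)
cross_id = crossed_id(["Bachelors", "Exec-managerial"], int(1e4))

print(gender_id)
print(age_bucket)
print(education_id, cross_id)
```

Two different strings can land in the same hash bucket, which is why the columns with more distinct values in the code above are given a larger hash_bucket_size.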

Running the example produces results like the following.

accuracy: 0.834224

accuracy/baseline_target_mean: 0.236226

accuracy/threshold_0.500000_mean: 0.834224

auc: 0.879887

global_step: 200

labels/actual_target_mean: 0.236226

labels/prediction_mean: 0.240396

loss: 0.357844

precision/positive_threshold_0.500000_mean: 0.711234

recall/positive_threshold_0.500000_mean: 0.50208
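
To put these numbers in context, it can help to compare the accuracy with a trivial majority-class baseline and to combine precision and recall into an F1 score. The short calculation below is a sketch that uses only the values printed above; the variable names are my own, and reading labels/actual_target_mean as the fraction of >50K examples is an assumption based on how the label column is built.

```python
from __future__ import print_function

# Values copied from the evaluation output above.
accuracy = 0.834224
positive_rate = 0.236226  # labels/actual_target_mean, assumed to be the share of >50K rows
precision = 0.711234
recall = 0.50208

# Accuracy of a trivial model that always predicts the majority class (<=50K).
majority_baseline = 1.0 - positive_rate

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("majority baseline: %.4f" % majority_baseline)
print("gain over baseline: %.4f" % (accuracy - majority_baseline))
print("F1: %.4f" % f1)
```

The model is about seven percentage points more accurate than always predicting <=50K, while the recall of roughly 0.50 shows that about half of the >50K examples are still missed at the 0.5 threshold.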

License: Attribution, NonCommercial
