Reading CSV with TensorFlow - what's the best approach?
So I tested different ways of reading a CSV file with 97K lines, each line containing 500 features (about 100 MB in total).
My first approach was to read all data in memory using numpy:
raw_data = numpy.genfromtxt(filename, dtype=numpy.int32, delimiter=',')
This command took so long to run that I needed to find a better way to read my file.
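(For comparison, a plain in-memory load via pandas.read_csv, which is usually considerably faster than numpy.genfromtxt for a file of this size, would look roughly like the sketch below; it assumes the file has no header row and all columns are integers.)
import numpy
import pandas
# read the whole CSV into memory as an int32 array
raw_data = pandas.read_csv(filename, header=None, dtype=numpy.int32).values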
The second approach was to follow this guide: https://www.tensorflow.org/programmers_guide/reading_data
The first thing I noticed is that each epoch takes much longer to run. Since I am using stochastic gradient descent, this can be explained by the fact that each batch has to be read from the file.
Is there a way to optimize this second approach?
My code (second approach):
reader = tf.TextLineReader()
filename_queue = tf.train.string_input_producer([filename])
_, csv_row = reader.read(filename_queue)  # read one line at a time
# use the record defaults for this line (in case of missing data)
data = tf.decode_csv(csv_row, record_defaults=rDefaults)
labels = data[0]
features = data[labelsSize:labelsSize + featuresSize]

# minimum number of elements in the queue after a dequeue, used to ensure
# that the samples are sufficiently mixed;
# I think 10 times batch_size is sufficient
min_after_dequeue = 10 * batch_size
# the maximum number of elements in the queue
capacity = 20 * batch_size

# shuffle the data to generate batches of batch_size sample pairs
features_batch, labels_batch = tf.train.shuffle_batch(
    [features, labels],
    batch_size=batch_size,
    num_threads=10,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)
* * * *
coordinator = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)
try:
    # And then after everything is built, start the training loop.
    for step in xrange(max_steps):
        global_step = step + offset_step
        start_time = time.time()
        # Run one step of the model. The return values are the activations
        # from the `train_op` (which is discarded) and the `loss` Op. To
        # inspect the values of your Ops or variables, you may include them
        # in the list passed to sess.run() and the value tensors will be
        # returned in the tuple from the call.
        _, __, loss_value, summary_str = sess.run(
            [eval_op_train, train_op, loss_op, summary_op])
except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    coordinator.request_stop()

# Wait for threads to finish.
coordinator.join(threads)
sess.close()
The solution might be to convert the data to TensorFlow's binary format using TFRecords.
See TensorFlow Data Input (Part 1): Placeholders, Protobufs and Queues, and to convert a CSV file to TFRecords have a look at this snippet:
import pandas
import tensorflow as tf

csv = pandas.read_csv("your.csv").values
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for row in csv:
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
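To feed the resulting file back into the same queue-based pipeline as in the question, a minimal reading sketch could look like the following (this assumes the "features"/"label" keys and per-row feature count used when writing, and reuses the batch_size and featuresSize names from the question):
filename_queue = tf.train.string_input_producer(["csv.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# parse one serialized Example back into tensors;
# featuresSize must match the number of features written per row
parsed = tf.parse_single_example(
    serialized_example,
    features={
        "features": tf.FixedLenFeature([featuresSize], tf.float32),
        "label": tf.FixedLenFeature([], tf.int64),
    })

features_batch, labels_batch = tf.train.shuffle_batch(
    [parsed["features"], parsed["label"]],
    batch_size=batch_size,
    num_threads=10,
    capacity=20 * batch_size,
    min_after_dequeue=10 * batch_size)
Since the records are already binary and fixed-length, this avoids re-parsing CSV text on every epoch, which is typically where the per-batch overhead comes from.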
For streaming (very) large files, either from the local filesystem or, in a more realistic use case, from remote storage like AWS S3, HDFS, etc., the smart_open Python library (from the Gensim authors) might be useful:
import smart_open

# stream lines from an S3 object
for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print line
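Putting the two together, a rough sketch (the S3 path is hypothetical, and the "features"/"label" layout matches the snippet above) that streams the CSV from S3 and writes TFRecords without loading everything into memory:
import smart_open
import tensorflow as tf

# hypothetical S3 path - replace with your own bucket/key
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for line in smart_open.smart_open('s3://mybucket/data.csv'):
        values = [float(v) for v in line.strip().split(',')]
        features, label = values[:-1], int(values[-1])
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())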