API - Database

This is the alpha version of the database management system. If you run into trouble, you can ask for help at fangde.liu@imperial.ac.uk.

Note

The documentation is still being written; please be patient.

Why TensorDB

TensorLayer is designed for production and aims to be applied to large-scale machine learning applications. TensorDB introduces database infrastructure to address many of the challenges in large-scale machine learning projects, such as:

  1. How to manage the training data and load the training datasets.
  2. How to handle a dataset so large that it exceeds the storage capacity of one computer.
  3. How to manage different models and versions, and how to compare different models.
  4. How to automate the whole process of training, evaluating and deploying machine learning models.

In the TensorLayer system, we apply database technology to the issues above.

TensorDB is designed by following three principles.

Everything is Data

TensorDB is a data warehouse that captures the whole machine learning development process. The data inside TensorDB can be categorized as:

  1. Data and Labels: All the data for training, validation and prediction. The labels can be created manually or generated by a machine.
  2. Model Architecture: This group stores the different model architectures from which the user can select.
  3. Model Parameters: This table stores the model parameters of each epoch during training.
  4. Jobs: All computation is cut into several jobs. Each job contains some computing workload. For training, a job includes the training data, the model parameters, the model architecture and the number of epochs to train. Validation jobs and inference jobs are similar.
  5. Logs: The logs store the step time, accuracy and other metrics of each training step, together with the time stamps.

TensorDB is in principle a keyword-based search engine. Each model, parameter set or piece of training data is assigned many tags. The data are stored in two layers. On top is the index layer, which stores the blob storage reference together with all the tags assigned to the data; it is implemented on a NoSQL document database such as MongoDB. The second layer stores big chunks of data, such as videos, medical images or image masks, and is usually implemented as a file system. Our open source implementation is based on MongoDB: the blob data is stored in GridFS while the tag index is stored in the documents.
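The two-layer layout can be illustrated with a minimal in-memory sketch. This is not TensorDB's implementation: the class `TwoLayerStore` and its methods `put` and `find` are hypothetical names, and plain Python containers stand in for the MongoDB documents (index layer) and GridFS (blob layer).

```python
import itertools


class TwoLayerStore:
    """Toy stand-in for an index layer plus a blob layer (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.index = []   # index layer: small documents holding tags and a blob reference
        self.blobs = {}   # blob layer: large binary payloads keyed by reference

    def put(self, payload, tags):
        """Store the payload in the blob layer and its tags in the index layer."""
        blob_id = next(self._ids)
        self.blobs[blob_id] = payload
        self.index.append(dict(tags, blob_id=blob_id))
        return blob_id

    def find(self, query):
        """Return blob references whose tags match every key/value in the query."""
        return [doc["blob_id"] for doc in self.index
                if all(doc.get(k) == v for k, v in query.items())]


store = TwoLayerStore()
store.put(b"image-bytes-0", {"type": "train", "label": 3})
store.put(b"image-bytes-1", {"type": "test", "label": 3})
train_refs = store.find({"type": "train"})            # search touches only the index
train_blobs = [store.blobs[ref] for ref in train_refs]  # blobs are fetched afterwards
```

A search only touches the small tag documents; the large payloads are fetched afterwards by reference, which is the point of separating the two layers.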

Everything is identified by Query

Within the TensorDB framework, any entity within the data warehouse, such as data, a model or a job, is specified by a database query. The first advantage is that a query is more efficient in space and can specify multiple objects in a concise way. The second advantage is that such a design enables a highly flexible software system: data, model architectures and training are interchangeable, and much work can be done by simply rewiring different components. This enables us to develop many new applications just by changing the query, without changing any application code.
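As a rough illustration of this idea, the sketch below keeps the application function fixed and changes only the query passed to it. The record fields and the functions `select` and `compare_models` are hypothetical and exist only for this example; a real query would be sent to the database.

```python
# Hypothetical tagged records; in TensorDB these would live in the database.
RECORDS = [
    {"kind": "model", "name": "mlp", "version": 1},
    {"kind": "model", "name": "mlp", "version": 2},
    {"kind": "model", "name": "cnn", "version": 1},
    {"kind": "data", "type": "train"},
]


def select(query):
    """Return every record whose fields match all key/value pairs in the query."""
    return [r for r in RECORDS if all(r.get(k) == v for k, v in query.items())]


def compare_models(query):
    """Application code stays fixed; only the query decides which models it sees."""
    return sorted((m["name"], m["version"]) for m in select(query))


all_mlps = compare_models({"kind": "model", "name": "mlp"})
first_versions = compare_models({"kind": "model", "version": 1})
```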

A Pull-Based Stream Processing Pipeline

With a large dataset, we can also assume that the data is unlimited. TensorDB provides a streaming interface, implemented in Python as generators, which keeps returning new data during training. The training system also has no notion of epochs; instead, it knows the batch size and the number of steps after which to store the parameters.
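A minimal sketch of such a step-driven loop is shown below. The names `stream_batches` and `train` are hypothetical, and random sampling from a list stands in for pulling data from the database; the loop counts steps only and never sees an epoch boundary.

```python
import itertools
import random


def stream_batches(samples, batch_size, seed=0):
    """Yield batches forever by sampling; the consumer never sees an epoch boundary."""
    rng = random.Random(seed)
    while True:
        yield [rng.choice(samples) for _ in range(batch_size)]


def train(stream, total_steps, save_every):
    """Count steps only; record a checkpoint after every `save_every` steps."""
    checkpoints = []
    for step, batch in enumerate(itertools.islice(stream, total_steps), start=1):
        # A real training step on `batch` would run here.
        if step % save_every == 0:
            checkpoints.append(step)
    return checkpoints


saved_at = train(stream_batches(list(range(100)), batch_size=8),
                 total_steps=10, save_every=4)
```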

Many techniques are used behind the streaming interface. The stream is built on database cursors, so every search returns only cursors, not the actual data. The actual data is loaded only when the generator is evaluated. The data loading is further optimised:

  1. Data are compressed for storage and decompressed when loaded.
  2. The data are loaded in bulk mode to optimise the IO traffic.
  3. Augmentation and random sampling are computed on the fly after the data are loaded onto the local computer.
  4. To optimise space, there will also be a cache system that stores only the recent blob data.
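The lazy loading, compression and caching ideas can be sketched together as follows. Everything here is an assumption made for illustration: `REMOTE` is a dictionary standing in for the blob store, `RecentCache` and `lazy_stream` are hypothetical names, and the standard-library `zlib` module is used in place of whichever compressor TensorDB actually applies.

```python
import zlib
from collections import OrderedDict

# Hypothetical remote blob store: reference -> compressed bytes.
REMOTE = {i: zlib.compress(("sample-%d" % i).encode()) for i in range(5)}


class RecentCache:
    """Keep only the most recently used blobs (a small LRU cache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.fetches = 0                     # number of simulated database reads

    def get(self, ref):
        if ref in self.items:
            self.items.move_to_end(ref)      # mark as recently used
            return self.items[ref]
        self.fetches += 1                    # cache miss: read from the blob store
        data = zlib.decompress(REMOTE[ref])
        self.items[ref] = data
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used blob
        return data


def lazy_stream(refs, cache):
    """Hold only references; load and decompress a blob when it is consumed."""
    for ref in refs:
        yield cache.get(ref)


cache = RecentCache(capacity=2)
stream = lazy_stream([0, 1, 0, 2, 1], cache)
fetches_before = cache.fetches   # still zero: creating the generator loads nothing
loaded = list(stream)
```

Creating the generator costs nothing; the reads happen only as items are consumed, and a repeated reference that is still in the cache does not trigger another read.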

Based on the streaming interface, TensorLayer can implement continuous machine learning. On a distributed system, model training, validation and deployment can run on different computers, all running continuously. The trainer keeps optimising the models, the evaluator keeps evaluating the recently added models, and the deployment system keeps pulling the best models from the TensorDB warehouse.
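The division of labour between the three roles might look like the sketch below. It runs in one process for simplicity, and the shared list `MODEL_TABLE` and the three functions are hypothetical stand-ins for a database collection and for the separate trainer, evaluator and deployment machines.

```python
# Hypothetical shared model table; in TensorDB this would be a database collection.
MODEL_TABLE = []


def trainer_publish(step, params):
    """Trainer side: publish a new model snapshot without an accuracy yet."""
    MODEL_TABLE.append({"step": step, "params": params, "accuracy": None})


def evaluator_pass(score_fn):
    """Evaluator side: score every snapshot that has not been evaluated."""
    for record in MODEL_TABLE:
        if record["accuracy"] is None:
            record["accuracy"] = score_fn(record["params"])


def deployer_pull():
    """Deployment side: pull the best evaluated snapshot, or None if there is none."""
    scored = [r for r in MODEL_TABLE if r["accuracy"] is not None]
    return max(scored, key=lambda r: r["accuracy"]) if scored else None


trainer_publish(100, params=0.2)
trainer_publish(200, params=0.9)
trainer_publish(300, params=0.6)
evaluator_pass(score_fn=lambda p: p)   # toy score: the parameter itself
best = deployer_pull()
```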

Preparation

In principle, TensorDB can be implemented on any NoSQL document database system. The existing implementation is based on MongoDB. Implementations on other databases will be released depending on progress. It should be straightforward to port the TensorDB system to Google Cloud, AWS and Azure.

The following tutorials are based on the MongoDB implementation.

Install MongoDB

The installation instructions for MongoDB can be found in the MongoDB Docs. There are also managed MongoDB services from Amazon or GCP, and MongoDB Atlas from MongoDB.

Users can also use Docker, which is a powerful tool for deploying software.

After installing MongoDB, a MongoDB management tool with a graphical user interface will be extremely valuable.

Users can install Studio 3T (MongoChef), which is free for non-commercial use.

Start MongoDB service

After MongoDB is installed, you should start the database.

mongod

You can specify the path to the database files with the --dbpath flag.

Quick Start

A fully working example with the MNIST training set is in TensorLabDemo.ipynb.

Connect to database

To use the MongoDB implementation of TensorDB, you need the pymongo client.

You can install it with

pip install pymongo
pip install lz4

It is very straightforward to connect to the TensorDB system. You can try the following code:

from tensorlayer.db import TensorDB
db = TensorDB(ip='127.0.0.1', port=27017, db_name='your_db', user_name=None, password=None, studyID='ministMLP')

The ip is the IP address of the database, and port is the port number of MongoDB. You may need to specify the database name and study ID. The study ID is a unique identifier for an experiment.

TensorDB stores different studies in one data warehouse. This has pros and cons. One benefit is that, if each study tries a different model architecture, it is very easy to evaluate the different model architectures.

Log and Parameters

The basic application is to use TensorDB to save the model parameters and the training/evaluation/testing logs. With TensorDB, this can easily be done by replacing the print function with the db.log function.

To save the training log and the parameters, we have the db.train_log and db.save_parameter methods.

Suppose we save the log at each step and save the parameters at each epoch; the code can look like this:

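for epoch in range(epoch_count):
    # One training step; x and y_ are the input placeholders,
    # X_batch and y_batch are the current batch of data and labels
    _, ac = sess.run([train_op, loss], feed_dict={x: X_batch, y_: y_batch})
    db.train_log({'accuracy': ac})
    # Save the parameters once per epoch, tagged with the latest metric
    db.save_parameter(sess.run(network.all_parameters), {'acc': ac})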

The code for saving the validation log and test log is similar.
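To make the idea of replacing print with a structured log concrete, here is a small stand-in that needs no database. `InMemoryLog` is a hypothetical class, and the record layout (metrics plus study ID and time stamp) is an assumption based on the description of the Logs table, not TensorDB's actual schema.

```python
import time


class InMemoryLog:
    """Hypothetical stand-in for the logging half of TensorDB (no database needed)."""

    def __init__(self, studyID):
        self.studyID = studyID
        self.train_records = []

    def train_log(self, metrics):
        """Store the metrics together with the study id and a time stamp."""
        record = dict(metrics, studyID=self.studyID, time=time.time())
        self.train_records.append(record)
        return record


log = InMemoryLog(studyID="demo")
for step, accuracy in enumerate([0.5, 0.7, 0.8]):
    # Where a script would call print(...), it records a structured entry instead.
    log.train_log({"step": step, "accuracy": accuracy})

# Structured records can be queried later, which printed text cannot.
best = max(log.train_records, key=lambda r: r["accuracy"])
```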

Model Architecture and Jobs

TensorDB also supports a model architecture and job system. In the current version, both the model architecture and the job are simply strings. It is up to the user to specify how to convert the string back to a model or job. For example, in many of our cases, we simply specify the Python code.

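code = '''
print("hello")
'''
# Save the architecture (here just a code string) with a tag
db.save_model_architecture(code, {'name': 'print'})

# Find it again by its tag and execute it
c, fid = db.find_model_architecture({'name': 'print'})
exec(c)

db.push_job(code, {'type': 'train'})

## worker
code = db.pop_job()
exec(code)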
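The push/pop pattern can be tried without a database using the sketch below. `JobQueue` is a hypothetical in-memory class that only mirrors the method names used in this section; it is not the TensorDB implementation.

```python
from collections import deque


class JobQueue:
    """Hypothetical in-memory job queue; jobs are plain strings plus tags."""

    def __init__(self):
        self._jobs = deque()

    def push_job(self, code, tags):
        self._jobs.append((code, tags))

    def pop_job(self):
        """Return the oldest job's code string, or None when the queue is empty."""
        return self._jobs.popleft()[0] if self._jobs else None


queue = JobQueue()
queue.push_job("result = sum(range(5))", {"type": "train"})

# Worker side: run the code in its own namespace so it cannot clobber local names.
namespace = {}
code = queue.pop_job()
if code is not None:
    exec(code, namespace)
result = namespace.get("result")
```

Because exec runs arbitrary code, this pattern only makes sense when every process that can write jobs to the database is trusted.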

Database Interface

The training set is managed by a separate database. Each application has its own database. However, every database interface should support two methods: 1. find_data, 2. data_generator.

An example for the MNIST dataset is included in the TensorLabDemo code.
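A minimal sketch of a dataset class exposing these two methods is given below. Only the method names `find_data` and `data_generator` come from this page; the class `ListDataset`, the `batch_size` parameter and the tuple layout are assumptions for illustration, with a list standing in for the database.

```python
class ListDataset:
    """Hypothetical dataset backed by a list of (sample, label, tags) tuples."""

    def __init__(self, rows):
        self.rows = rows

    def find_data(self, query):
        """Return all (sample, label) pairs whose tags match the query."""
        return [(x, y) for x, y, tags in self.rows
                if all(tags.get(k) == v for k, v in query.items())]

    def data_generator(self, query, batch_size=2):
        """Yield matching pairs in batches, cycling over the matches indefinitely."""
        matches = self.find_data(query)
        while matches:
            for start in range(0, len(matches), batch_size):
                yield matches[start:start + batch_size]


dataset = ListDataset([
    ([0.1], 0, {"type": "train"}),
    ([0.2], 1, {"type": "train"}),
    ([0.3], 1, {"type": "test"}),
])
first_batch = next(dataset.data_generator({"type": "train"}))
```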

Data Importing

With a database, the development workflow is very flexible. As long as the content in the database is the same, users can use whatever tools they like to write into the database.

The TensorLabDemo has a data import interface, which allows the user to inject data in the future.

Users can import data with the following code:

db.import_data(X,y,{'type':'train'})

Application Framework

In real applications, we rarely code everything from scratch using the TensorDB interface directly, as demonstrated in the TensorLabDemo.

We implemented four classes, each with a well-defined interface:

  1. The dataset.
  2. The TensorDB.
  3. The Model, which is logically a full component that can be trained, evaluated and deployed. It has properties such as parameters.
  4. The DBLogger, which is the connector from the model to TensorDB. It is implemented as callback functions that are called automatically at each batch step and each epoch.

Users can build on the TensorLabDemo code and override the interfaces to suit their own applications' needs.
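The way these classes fit together might look like the following sketch. It is not the TensorLabDemo code: `RecordingDB` and `ToyModel` are hypothetical stand-ins, and the callback names `on_batch_end` and `on_epoch_end` as well as the `steps_per_epoch` parameter are assumptions chosen for the example.

```python
class RecordingDB:
    """Hypothetical stand-in for TensorDB that records calls in memory."""

    def __init__(self):
        self.logs, self.saved = [], []

    def train_log(self, metrics):
        self.logs.append(metrics)

    def save_parameter(self, params, tags):
        self.saved.append((params, tags))


class DBLogger:
    """Connect a model to the database through two callbacks (names are assumptions)."""

    def __init__(self, db, model):
        self.db, self.model = db, model

    def on_batch_end(self, step, metrics):
        self.db.train_log(dict(metrics, step=step))

    def on_epoch_end(self, epoch, metrics):
        self.db.save_parameter(self.model.parameters, dict(metrics, epoch=epoch))


class ToyModel:
    """Minimal model exposing `parameters` and a generator-driven fit loop."""

    def __init__(self):
        self.parameters = [0.0]

    def fit_generator(self, generator, logger, batch_size, epochs, steps_per_epoch=2):
        step = 0
        for epoch in range(epochs):
            for _ in range(steps_per_epoch):
                batch = next(generator)[:batch_size]
                self.parameters = [self.parameters[0] + len(batch)]  # pretend update
                step += 1
                logger.on_batch_end(step, {"seen": len(batch)})
            logger.on_epoch_end(epoch, {"total": self.parameters[0]})


def endless(samples):
    """Yield the same samples forever, like an unbounded data stream."""
    while True:
        yield samples


db, model = RecordingDB(), ToyModel()
model.fit_generator(endless([1, 2, 3, 4]), DBLogger(db, model), batch_size=3, epochs=2)
```

The model never talks to the database directly; swapping the logger is enough to change where logs and parameters go.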

When training, the overall architecture is as follows. First, get a data generator from the dataset module:

g = dataset.data_generator({"type": [your type]})

Then initialize a model with a name:

m=model('mytes')

During training, connect the DB logger and TensorDB together:

m.fit_generator(g,dblogger(tensordb,m),1000,100)

If the work is distributed, we have to save the model architecture, then reload and execute it:

db.save_model_architecture(code,{'name':'mlp'})
db.push_job({'name':'mlp'},{'type':XXXX},{'batch':1000,'epoch':100})

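The worker will run the job with the following code:

j = job.pop()                                    # take the next job
g = dataset.data_generator(j.filter)             # data selected by the job's query
c = tensordb.load_model_architecture(j.march)    # architecture code as a string
exec(c)                                          # defines the model class
m = model()
m.fit_generator(g, dblogger(tensordb, m), j.batch_size, j.epoch)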