API - Database (alpha)

This is the alpha version of the database management system. If you have trouble, you can ask for help at fangde.liu@outlook.com.

Note

We are still writing up the documentation; please be patient.

Why TensorDB

TensorLayer is designed for real-world production and is capable of supporting large-scale machine learning applications. TensorDB is introduced to address the data management challenges that arise in large-scale machine learning projects, such as:

  1. How to find the training data in the enterprise data warehouse.
  2. How to load datasets that are so large they exceed the storage capacity of a single computer.
  3. How to manage different models with version control and compare them easily.
  4. How to automate the whole process of training, evaluating and deploying machine learning models.

In the TensorLayer system, we introduce database technology to address the challenges above.

TensorDB is designed following three principles.

Everything is Data

TensorDB is a data warehouse that stores and captures the whole machine learning development process. The data inside TensorDB can be categorised as:

  1. Data and Labels: This includes all the data for training, validation and prediction. The labels can be specified manually or generated by model prediction.
  2. Model Architecture: TensorDB includes a repository that stores different model architectures, enabling users to reuse previous model development work.
  3. Model Parameters: This database stores all the model parameters of each epoch in the training step.
  4. Jobs: All the computation tasks are divided into several small jobs. Each job contains the necessary information, such as the hyper-parameters, for training or validation. For a training job, typical information includes the training data, the model parameters, the model architecture and how many epochs the job runs for. Validation, testing and inference are also supported by the job system.
  5. Logs: The logs store all the metrics of each machine learning model, such as the time stamp, step time and accuracy of each batch or epoch.

TensorDB is in principle a keyword-based search engine. Each model, parameter set, or piece of training data is assigned many tags. The storage system organises data in two layers. The index layer stores all the tags and references to the blob storage; it is implemented on a NoSQL document database such as MongoDB. The blob layer stores videos, medical images or label masks in large chunks, and is usually implemented on top of a file system. The open-source implementation of TensorDB is based on MongoDB: the blob system is based on GridFS, while the indexes are stored as documents.
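
As a rough illustration, the following sketch shows how a blob and its tags could be kept in the two layers, using pymongo and gridfs directly (the index collection and the tag fields here are invented for this example; the real TensorDB collections are listed in the reference at the end of this page):

from pymongo import MongoClient
import gridfs

client = MongoClient('127.0.0.1', 27017)
db = client['tensordb_demo']
fs = gridfs.GridFS(db, collection='datafs')  # blob layer

# Store a large blob in GridFS, then index it with tags in a document.
blob_id = fs.put(b'...raw image bytes...')
db.index.insert_one({'type': 'train', 'label': 'cat', 'blob_id': blob_id})

# Later, search the index layer by tags to locate and load the blob.
doc = db.index.find_one({'type': 'train', 'label': 'cat'})
data = fs.get(doc['blob_id']).read()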

Everything is Identified by Query

Within the TensorDB framework, any entity in the data warehouse, such as data, models or jobs, is specified by the database query language. Compared with a direct reference, a query is more space-efficient to store and can specify multiple objects in a concise way. Another advantage of this design is that it enables a highly flexible software system: data, model architectures and training managers become interchangeable. Many systems can be built by simply rewiring the components, and many new applications can be implemented just by updating the query, without modifying any application code.
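
For instance (a hypothetical example; Params is one of the TensorDB collections documented in the reference below, and mongo_db stands for the underlying pymongo database exposed as the db attribute of TensorDB):

# A query is a small document, yet it can denote many stored objects.
best_mlp_params = {'studyID': 'mnistMLP', 'acc': {'$gt': 0.95}}

# Rewiring the system is just a matter of changing the query.
for doc in mongo_db.Params.find(best_mlp_params):
    print(doc['_id'], doc['acc'])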

A Pull-based Stream Processing Pipeline

For large training datasets, TensorDB provides a stream interface which in theory supports datasets of unlimited size. The stream interface, implemented as a Python generator, keeps generating new data during training. When the stream interface is used, the idea of an epoch no longer applies; instead, we specify the batch size and imagine that an epoch consists of a fixed, large number of steps.

Many techniques are used behind the stream interface for performance optimisation. The stream interface is based on database cursor technology: for every data query, only a cursor is returned immediately, not the actual query results. The actual data are loaded later, when the generator is evaluated.
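
A minimal sketch of this idea, assuming the two-layer layout from the example above (db is the pymongo database holding the index documents, fs the GridFS instance holding the blobs):

def data_generator(db, fs, query):
    # find() returns a cursor immediately; no documents are fetched yet.
    cursor = db.index.find(query)
    for doc in cursor:
        # The actual IO happens lazily, one record at a time,
        # as the generator is consumed by the training loop.
        yield fs.get(doc['blob_id']).read()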

The data loading is further optimised in many ways:

  1. Data are compressed before storage and decompressed on loading (see the sketch after this list).
  2. Data are loaded in bulk mode to further optimise the IO traffic.
  3. Data augmentation and random sampling are computed on the fly, only after the data are loaded into local memory.
  4. A simple cache system stores recently used blob data.
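
As a sketch of optimisation (1), using the lz4 package (installed in the Quick Start section below; fs is a GridFS handle as in the earlier examples and raw_bytes is any blob to store):

import lz4.frame

# Compress the blob before insertion: trade some CPU for IO bandwidth.
blob_id = fs.put(lz4.frame.compress(raw_bytes))

# Decompress on loading, after the compressed blob reaches local memory.
raw_bytes_back = lz4.frame.decompress(fs.get(blob_id).read())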

Based on the stream interface, a continuous machine learning system can be implemented easily. On a distributed system, model training, validation and deployment can be run by different computing nodes that all run continuously. The trainer keeps optimising the models with newly added data, the evaluation node keeps evaluating the most recently generated models, and the deployment system keeps pulling the best models from the TensorDB warehouse for application.

Preparation

In principle, TensorDB can be implemented on top of any document-oriented NoSQL database system. The existing implementation is based on MongoDB. Further implementations on other databases will be released depending on progress. It would be straightforward to port the TensorDB system to Google Cloud, AWS and Azure.

The following tutorials are based on the MongoDB implementation.

Install MongoDB

The installation instructions for MongoDB can be found in the MongoDB Docs. There are also managed MongoDB services on Amazon and GCP, such as MongoDB Atlas from MongoDB. Users can also use Docker, which is a powerful tool for deploying software. After installing MongoDB, a management tool with a graphical user interface will be extremely valuable. Users can install Studio 3T (formerly MongoChef), a powerful user interface tool for MongoDB that is free for non-commercial use.

Start MongoDB service

After MongoDB is installed, you should start the database daemon:

mongod

You can specify the path to the database files with the --dbpath flag.

Quick Start

A fully working example using the MNIST training set is provided in _TensorLabDemo.ipnb_.

Connect to database

To use the TensorDB MongoDB implementation, you need the pymongo client.

You can install it by:

pip install pymongo
pip install lz4

It is very straightforward to connect to the TensorDB system, as shown in the following code:

from tensorlayer.db import TensorDB
db = TensorDB(ip='127.0.0.1', port=27017, db_name='your_db', user_name=None, password=None, studyID='mnistMLP')

The ip is the IP address of the database server, and port is the port number of MongoDB. You may need to specify the database name and study ID. The study ID is a unique identifier for each experiment.

TensorDB stores different studies in one data warehouse. This design decision has pros and cons. An obvious benefit is that it is more convenient to compare different studies. Suppose each study uses a different model architecture: we can then evaluate the different architectures by visiting just one database.
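
For example (a hypothetical query; mongo_db stands for the underlying pymongo database, which TensorDB exposes as its db attribute, and the study names are made up), comparing the best results of two studies is a single query:

# Rank parameter snapshots from two studies by accuracy.
cursor = mongo_db.Params.find({'studyID': {'$in': ['mnistMLP', 'mnistCNN']}})
for doc in cursor.sort('acc', -1).limit(3):
    print(doc['studyID'], doc['acc'])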

Logs and Parameters

The basic application of TensorDB is to save the model parameters and the training/evaluation/testing logs. This can easily be done by replacing the print function with the db.log functions.

To save the training logs and model parameters, we have the db.train_log and db.save_params methods.

The following code shows how we log the model accuracy after each step and save the parameters after each epoch. The validation and test code is very similar.

for epoch in range(epoch_count):
    # X_train and y_train stand for the data fed into the placeholders.
    _, ac = sess.run([train_op, loss], feed_dict={x: X_train, y_: y_train})
    db.train_log({'accuracy': ac})
    db.save_params(sess.run(network.all_params), {'acc': ac})

Model Architecture and Jobs

TensorDB also implements a system for managing model architectures and working jobs. In the current version, both model architectures and jobs are stored as strings; it is up to the user to convert the strings back into models or jobs. For example, in many of our cases, we simply store the Python code.

code = '''
print("hello")
'''
db.save_model_architecture(code, {'name': 'print'})

c, fid = db.find_model_architecture({'name': 'print'})
exec(c)

db.push_job(code, {'type': 'train'})

# worker
code = db.pop_job()
exec(code)

Database Interface

The training and validation datasets are managed by a separate database class object. Each application can implement its own database class. However, every database class should support two interfaces: 1. find_data, 2. data_generator. A minimal sketch of such a class follows below.

An example for the MNIST dataset is included in the TensorLabDemo code.
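
The sketch below is only illustrative; the collection layout and field names are assumptions carried over from the earlier examples, not the TensorLabDemo implementation:

class MnistDataset(object):
    def __init__(self, db, fs):
        self.db = db  # pymongo database (index layer)
        self.fs = fs  # gridfs.GridFS instance (blob layer)

    def find_data(self, query):
        # Return the matching index documents only; no blob IO yet.
        return self.db.index.find(query)

    def data_generator(self, query):
        # Lazily yield (data, label) pairs as the generator is consumed.
        for doc in self.find_data(query):
            yield self.fs.get(doc['blob_id']).read(), doc['label']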

Data Importing

With a database, the development workflow is very flexible. As long as the content in the database is well defined, users can use whatever tools they like to import data into the database.

The TensorLabDemo implements an API for the database class which allows users to inject data.

Users can import data as the following code shows:

db.import_data(X, y, {'type': 'train'})
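
For instance, importing MNIST with TensorLayer's loader and tagging each split might look like this (a sketch; import_data is the TensorLabDemo API mentioned above):

import tensorlayer as tl

X_train, y_train, X_val, y_val, X_test, y_test = \
    tl.files.load_mnist_dataset(shape=(-1, 784))

db.import_data(X_train, y_train, {'type': 'train'})
db.import_data(X_val, y_val, {'type': 'valid'})
db.import_data(X_test, y_test, {'type': 'test'})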

Application Framework

In fact, in real applications we rarely use the TensorDB interfaces directly; instead, we develop applications with the TensorLayer APIs.

As demonstrated in the TensorLabDemo, we implemented four classes, each of which has a group of well-defined application interfaces:

  1. The Dataset.
  2. The TensorDB.
  3. The Model: logically a machine learning model that can be trained, evaluated and deployed. It has properties such as parameters and a model architecture.
  4. The DBLogger: connects the model and TensorDB. It is implemented as a group of callback functions that are executed automatically at each batch step and after each epoch.

Users can override the interfaces to suit their own applications; please refer to the TensorLabDemo code for implementation details.
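
The DBLogger idea can be sketched as follows (the method names and callback signatures here are assumptions; see the TensorLabDemo for the real implementation):

class DBLogger(object):
    def __init__(self, db, model):
        self.db = db
        self.model = model

    def on_batch_end(self, step, acc):
        # Executed at each batch step: push the metric to TensorDB.
        self.db.train_log({'step': step, 'accuracy': acc})

    def on_epoch_end(self, epoch, acc):
        # Executed after each epoch: snapshot the model parameters.
        self.db.save_params(self.model.params, {'epoch': epoch, 'acc': acc})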

To train a machine learning model, the whole workflow using TensorDB is as follows.

First, find the training data from the dataset object:

g = dataset.data_generator({'type': [your type]})

Then initialise a model with a study name:

m = model('mytest')

During training, a DBLogger object connects the model and TensorDB together to save the logs:

m.fit_generator(g, dblogger(tensordb, m), 1000, 100)

To distribute the job, we first save the model architecture; a worker can then reload and execute it:

db.save_model_architecture(code, {'name': 'mlp'})
db.push_job({'name': 'mlp'}, {'type': XXXX}, {'batch': 1000, 'epoch': 100})

The worker runs the job, as demonstrated by the following code:

j = job.pop()
g = dataset.data_generator(j.filter)
c = tensordb.load_model_architecture(j.march)
exec(c)
m = model()
m.fit_generator(g, dblogger(tensordb, m), j.batch_size, j.epoch)
class tensorlayer.db.TensorDB(ip='localhost', port=27017, db_name='db_name', user_name=None, password='password', studyID=None)

TensorDB is a MongoDB-based manager that helps you manage data, network topology, parameters and logging.

Parameters:
  • ip (str) – Localhost or IP address.
  • port (int) – Port number.
  • db_name (str) – Database name.
  • user_name (str) – User name. Set to None if no authentication is needed.
  • password (str) – Password.
db

pymongo.MongoClient[db_name]

datafs

gridfs.GridFS(self.db, collection="datafs")

modelfs

gridfs.GridFS(self.db, collection="modelfs")

paramsfs

gridfs.GridFS(self.db, collection="paramsfs")

db.Params

Collection for model parameters.

db.TrainLog

Collection for training logs.

db.ValidLog

Collection for validation logs.

db.TestLog

Collection for testing logs.

studyID

string, a unique ID; if None, one is generated randomly.

Notes

  • MongoDB: as TensorDB is based on MongoDB, you need to install it on your local or a remote machine.
  • pip install pymongo, for the MongoDB Python API.
  • You may like to install MongoChef or the Mongo Management Studio app for visualising or testing your MongoDB.
save_params(params=None, args=None)

Save parameters into MongoDB buckets, and save the file ID into the Params collection.

Parameters:
  • params (a list of parameters) –
  • args (dictionary, item meta data.) –
Returns:

f_id

Return type:

the bucket ID of the parameters.
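
A typical call looks like this (a sketch; the session and network come from the Quick Start example above):

f_id = db.save_params(sess.run(network.all_params), {'acc': 0.97, 'epoch': 10})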