
data partition within group #1

Merged
asfgit merged 3 commits into apache:master from nudles:master
May 17, 2015

Conversation

@nudles nudles (Member) commented May 16, 2015

This training scheme partitions one batch of data into sub-batches, where each worker in the group processes one sub-batch. It is implemented by partitioning the layers of the original neural network (except the data and parser layers) into sub-layers, where each sub-layer holds a sub-batch of the features. These sub-layers share the same set of parameter objects, and each parameter object has one worker as its owner. Each worker thus owns a partition of the neural network and computes the parameter gradients over it.
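The partitioning of a batch into per-worker sub-batches can be sketched as follows. This is a hypothetical illustration, not SINGA's actual API; `partition_batch` and the array shapes are assumptions for the example.

```python
import numpy as np

def partition_batch(batch, num_workers):
    """Split a (batch_size, feature_dim) array into num_workers equal
    sub-batches along the batch axis; one sub-batch per worker."""
    assert batch.shape[0] % num_workers == 0, "batch size must divide evenly"
    return np.split(batch, num_workers, axis=0)

# A batch of 6 examples with 2 features each, split across 3 workers.
batch = np.arange(12, dtype=np.float32).reshape(6, 2)
sub_batches = partition_batch(batch, num_workers=3)
# Each worker's sub-layer processes 2 examples; only the features are
# partitioned -- the parameter objects themselves are shared.
```

Note that only the feature data is split; the parameters stay shared, which is why the stub must later aggregate the gradients computed on each sub-batch.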

The workflow is:

  • Each parameter object is initialized by its owner worker, which then sends a put request to the server.
  • Each worker waits for fresh parameters and runs the back-propagation algorithm over its own partition (i.e., its layers). Once it has the gradients, it sends an update message to the main thread (i.e., the stub).
  • The main thread averages the gradients from all workers for each shared parameter and sends the update request to the server.
  • The main thread handles the responses to the update requests and updates the parameter's version and data field.
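The stub's aggregation step above can be sketched as a simple gradient average. This is an illustrative sketch only; `aggregate_gradients` and the dict-of-arrays representation are assumptions, not SINGA's implementation.

```python
import numpy as np

def aggregate_gradients(worker_grads):
    """Average per-parameter gradients reported by all workers.

    worker_grads: list of {param_id: gradient array}, one dict per
    worker. Returns one averaged gradient per parameter, which the
    stub would send to the server in a single update request."""
    averaged = {}
    for pid in worker_grads[0]:
        grads = [g[pid] for g in worker_grads]
        averaged[pid] = sum(grads) / len(grads)
    return averaged

# Two workers report gradients for the same shared parameter "w0".
grads = [{"w0": np.array([1.0, 2.0])}, {"w0": np.array([3.0, 4.0])}]
avg = aggregate_gradients(grads)  # avg["w0"] is [2.0, 3.0]
```

Averaging before sending means the server sees one update per parameter per batch, rather than one per worker.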

TODO:

  • Optimize the single-node case, where memory copies can be avoided by sharing memory between servers and workers.
  • Consider the multi-node case.
  • Update the autoconf files to remove files that have been merged into other files.

nudles and others added 3 commits May 12, 2015 09:50
TODO
1. update the performance collection by reporting performance to the stub.
2. let workers pass requests to the stub without copying data (passing an address or param id). Messages to servers are then generated by the stub, which can aggregate gradients of shared parameters from all workers and collect the updated parameters for them.
…implify the logic. Workers now send simple messages to the stub thread, which constructs the real update/get/put requests.

The stub thread also handles the responses from servers; e.g., the get/update responses are now handled by the stub. The workers then wait in the collect function until their param's version is updated.
avoid deadlocks for param_dealer_ and layer_dealer_
2. tested data partition in a single group in one process.
3. generate a json file under workspace/visualization representing the neural net structure. Users can create an image using the python script (script/graph.py), which reads the json file.
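The wait-on-version behavior described in the commits above (workers block in collect until the stub applies the server's update response) can be sketched with a condition variable. The `Param` class, `collect`, and `apply_update` names here are hypothetical, chosen for the example.

```python
import threading

class Param:
    """Minimal sketch of a parameter object with a version counter."""

    def __init__(self):
        self.version = 0
        self._cv = threading.Condition()

    def collect(self, expected_version):
        # Worker side: block until the parameter reaches the
        # expected version (i.e., the stub has applied the update).
        with self._cv:
            self._cv.wait_for(lambda: self.version >= expected_version)

    def apply_update(self):
        # Stub side: after handling the server's update response,
        # bump the version and wake any waiting workers.
        with self._cv:
            self.version += 1
            self._cv.notify_all()

p = Param()
stub = threading.Thread(target=p.apply_update)
stub.start()
p.collect(1)  # returns once the stub has applied the update
stub.join()
```

This also illustrates why care is needed to avoid deadlocks (as in the param_dealer_ / layer_dealer_ commit): a worker waiting on a version that the stub never bumps would block forever.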
@asfgit asfgit merged commit 0d47ec5 into apache:master May 17, 2015
asfgit pushed a commit that referenced this pull request Aug 30, 2016
check build python package in mac
nudles pushed a commit that referenced this pull request Aug 9, 2019
nudles pushed a commit that referenced this pull request Nov 27, 2019
changes EXPECT_EQ to EXPECT_NEAR
joddiy referenced this pull request in joddiy/incubator-singa Jan 14, 2020
nudles pushed a commit that referenced this pull request Aug 12, 2020
nudles pushed a commit that referenced this pull request Aug 24, 2020
Update from apache:dev branch
