Datasets

vp-add-dataset - Register a dataset

description

In order to be used with other tools a dataset must first be registered. A set of files is given a name to be grouped by and metadata describing it. In order to transfer data it must first be tagged.

command line

Usage: --name=cluster --tag-name=name [options] file_1 .. file_n

Options:
  -h, --help            show this help message and exit
  --host=HOST           Host of web services to connect to, defaults to local
                        host
  --name=NAME           Name of cluster, defaults to local
  --tag-name=TAG_NAME   Name of tag
  --tag-base-dir=TAG_BASE_DIR
                        Base directory of the tag
  -r, --recursive       Recursively include directories
  -e, --expand          Expand archives
  -c COMPRESS, --compress=COMPRESS
                        Make a tarball of the tagged results.  This should be
                        the directory to put the tarball. This is not mutually
                        exclusive with --expand
  -a, --append          Append listed files to tag name, ignoring duplicate
                        files
  -o, --overwrite       Overwrite file list if it exists
  -b, --block           Block on the tagging
  -t, --print-task-name
                        Print the name of the task at the end
  -m METADATA           Add metadata in a key=value notation.  Multiple
                        options are valid.  Ex: -m filetype=fasta -m
                        usage=referencedb

command line examples

Example 1

Add a file called /foo/bar/baz under the tag name example1:

vp-add-dataset --tag-name=example1 /foo/bar/baz

Output:

Task: tagData-1291407449.22 Type: tagData       State: completed     Num: 1/1 (100%) LastUpdated: 2010/12/03 20:17:30 UTC
Notification - 2010/12/03 20:17:29 UTC: Tagging example1
Notification - 2010/12/03 20:17:30 UTC: Tagging complete

The output for the following examples will all be similar to the above output so it is not included

Example 2

Add a file called /foo/bar/baz under the tag name example2 but specify a tag-base-dir. This states what portion of the directory to consider the root directory. When the tag is uploaded this portion of the file name will be stripped off.

vp-add-dataset --tag-name=example2 --tag-base-dir=/foo/bar /foo/bar/baz

Example 3

Add a dataset called example3 that is every file in the directory /path/to/boom:

vp-add-dataset --tag-name=example3 -r /path/to/boom

Example 4

Add a dataset called example4 that is every file in the directory /path/to/boom and all compressed files in it are expanded:

vp-add-dataset --tag-name=example4 -r -e /path/to/boom

Note: the -e works if you are just tagging a file like in the first example.

Example 5

Append the files in /path/to/boom to the tag example5:

vp-add-dataset --tag-name=example5 -a -r /path/to/boom

Example 6

Tag the files in /path/to/boom and create a compressed copy of the contents of the tag to /path/to/compressed:

vp-add-dataset --tag-name=example6 -r --compress=/path/to/compressed /path/to/boom

Note: This last example will create a file called /path/to/compressed/boom.tar.gz which are the contents of the tag.

Example 7

Take the tag example6 from above and append the the file /foo/bar/baz to it and compress the contents of the tag again:

vp-add-dataset --tag-name=example6 -a --compress=/path/to/compressed /foo/bar/baz

Example 8

Overwrite the tag example6 with the file /foo/bar/baz and the contents the directory /path/to/boom:

vp-add-dataset --tag-name=example6 -o -r /foo/bar/baz /path/to/boom

Example 9

Make a tag called example9 that has all the files that start with foo in /path/to/boom in it and add two metadata keys to it, filetype and author:

vp-add-dataset --tag-name=example9 -m filetype=foos -m author=me_of_course /path/to/boom/foo*

There are a number of ways to use vp-add-dataset, it provides a lot of functionality. Almost all of the combinations of options work together. Play around with the options to get comfortable with it.

webservice url

/vappio/tagData_ws.py

webservice parameters

Parameter Required Type Meaning
cluster Yes String The name of the cluster to tag data on.
tag_name Yes String The name of the tag.
tag_base_dir Yes String or null If present, the base directory of the tag.
files Yes String list A list of filenames to tag, an empty list is acceptable.
recursive Yes Boolean Recursively tag the elements in files.
expand Yes Boolean Example any compressed files found in the tagging process.
compress Yes String or null If the contents of the tag should be compressed give the base directory to compress into.
append Yes Boolean Append to the curent tag.
overwrite Yes Boolean Overwrite the tag or not. If the tag is already present and append is not specified the operation becomes a noop.
tag_metadata Yes Dictionary Key value pairs of metadata for the tag.

webservice response

The return is the name of the task associated with tagging the data.

vp-transfer-dataset - Upload a dataset

description

This uploads a dataset. This is being expanded to support upload and download from any cluster to another cluster.

command line

Usage: vp-transfer-dataset [options]

Options:
  -h, --help            show this help message and exit
  --host=HOST           Host of web services to connect to, defaults to local
                        host
  --tag-name=TAG_NAME   Name of tag to upload
  --src-cluster=SRC_CLUSTER
                        Name of source cluster, hardcoded to local for now
  --dst-cluster=DST_CLUSTER
                        Name of dest cluster
  --transfer-type=TRANSFER_TYPE
                        Type of transfer to do (cluster, s3) default is
                        cluster
  -b, --block           Block until cluster is up (no longer used)
  --expand              Expand files
  --compress            Compress files
  -t, --print-task-name
                        Print the name of the task at the end

command line examples

Example 1

Uploaded a tag named example_tag to cluster my_ec2_cluster:

vp-transfer-dataset --tag-name=example_tag --dst-cluster=my_ec2_cluster

webservice url

/vappio/uploadTag_ws.py

webservice parameters

Parameter Required Type Meaning
tag_name Yes String The name of the tag to transfer.
src_cluster Yes String The name of the source cluster, should be local for now.
dst_cluster Yes String Name of the destination cluster.
expand Yes Boolean Should the files be expanded after upload.
compress Yes Boolean Should the files be compressed after upload.

webservice response

The name of the task associated with the upload

vp-download-dataset - Download a dataset

description

This downloads a dataset. This will be removed in the future, vp-transfer-dataset will be used for both upload and download

command line

Usage: vp-download-dataset [options]

Options:
  -h, --help            show this help message and exit
  --host=HOST           Host of web services to connect to, defaults to local
                        host
  --tag-name=TAG_NAME   Name of tag to upload
  --src-cluster=SRC_CLUSTER
                        Name of source cluster
  --dst-cluster=DST_CLUSTER
                        Name of dest cluster, hardcoded to local for now
  --output-dir=OUTPUT_DIR
                        Name of directory to download to
  -b, --block           Block until download is complete
  --expand              Expand files
  --compress            Compress files
  -t, --print-task-name
                        Print the name of the task at the end

command line examples

Example 1

Downloads a tag named example_tag from cluster my_ec2_cluster:

vp-download-dataset --tag-name=example_tag --src-cluster=my_ec2_cluster

webservice url

/vappio/downloadTag_ws.py

webservice parameters

Parameter Required Type Meaning
tag_name Yes String The name of the tag to transfer.
src_cluster Yes String The name of the source cluster.
dst_cluster Yes String Name of the destination cluster, should be local for now.
expand Yes Boolean Should the files be expanded after download.
compress Yes Boolean Should the files be compressed after download.

webservice response

The name of the task associated with the upload

vp-describe-dataset - Get information about datasets

description

With a dataset registered with the system the files and metadata can be queried.

Note: Datasets are being expanded upon and redefined, this data will change.

command line

Usage: vp-describe-dataset [options]

Options:
  -h, --help           show this help message and exit
  --host=HOST          Host of web services to connect to, defaults to local
                       host
  --name=NAME          Name of cluster
  --tag-name=TAG_NAME  Name of tag

command line examples

Example 1

List all registered datasets:

vp-describe-dataset

Output:

TAG     clovr-core-set-aligned-imputed-fasta
TAG     diag-2-iozone-test
TAG     clovr_search_11-29-2010-15:01:57_blastall_raw
TAG     clovr-prok-db
TAG     test-iozone-test
TAG     ncbi-nr
TAG     clovr-cogdb
Example 2

List files and metadata about a particular dataset:

vp-describe-dataset --tag-name=clovr_search_12-01-2010-15:07:00_blastall_raw

Output:

FILE    /mnt/output/clovr_search_12-01-2010-15:07:00/ncbi-blastall/6_default/i1/g3/NC_000964_1.ncbi-blastall.raw
FILE    /mnt/output/clovr_search_12-01-2010-15:07:00/ncbi-blastall/6_default/i1/g4/NC_000964_4.ncbi-blastall.raw
FILE    /mnt/output/clovr_search_12-01-2010-15:07:00/ncbi-blastall/6_default/i1/g1/NC_000964_2.ncbi-blastall.raw
FILE    /mnt/output/clovr_search_12-01-2010-15:07:00/ncbi-blastall/6_default/i1/g2/NC_000964_3.ncbi-blastall.raw
METADATA        pipeline_configs.clovr_search_12-01-2010-15:07:00.env.METHOD    dhcp
METADATA        pipeline_configs.clovr_search_12-01-2010-15:07:00.VAPPIO_CLI    /opt/vappio-py/vappio/cli/
METADATA        tag_base_dir    /mnt/output/clovr_search_12-01-2010-15:07:00
METADATA        pipeline_configs.clovr_search_12-01-2010-15:07:00.NODE_TYPE     MASTER
METADATA        pipeline_configs.clovr_search_12-01-2010-15:07:00.dirs.clovr_project    /mnt/projects/clovr
METADATA        pipeline_configs.clovr_search_12-01-2010-15:07:00.cluster.CLUSTER_NAME  local
METADATA        pipeline_configs.clovr_search_12-01-2010-15:07:00.cluster.EXEC_NODES    0

webservice url

/vappio/queryTag_ws.py

webservice parameters

Parameter Required Type Meaning
cluster Yes String Name of cluster to query.
tag_name Yes String List List of tags to get info for, an empty list of all tags.

webservice response

A list of datasets is returned where each entry is a dictionary containing the following values:

Parameter Name Meaning
name String Name of the dataset.
files String List A list of all the files in the dataset.
metadata.??? String All metadata is stored with the string ‘metadata.’ infront of it.