The Direct Data API is a new class of API that provides high-speed read-only data access to Vault. Direct Data is a reliable, easy-to-use, timely, and consistent API for extracting Vault data.
It is designed for organizations that wish to replicate large amounts of Vault data to an external database, data warehouse, or data lake. Common use cases include:
The Direct Data API is not designed for real-time application integration.
In v24.1+, this API supports the extraction of Vault objects, document versions, picklists, workflows, and audit logs.
The Direct Data API extracts the following data from your Vault:
The Direct Data API is not configurable, and all of the above data is always made available. You can use or ignore the data in the files.
The Direct Data API provides the following:
There are several benefits of using the Direct Data API to extract data from your Vault:
A Direct Data file is produced in a fixed, well-defined, and easy-to-understand format. This simplifies integrations, as the user doesn’t need to know the data model of the components or make multiple calls to different endpoints to build the dataset. However, the Direct Data API does include the data model for all the objects and documents in a single metadata.csv file so that tables can be created in the external system based on the data provided.
The Direct Data API continuously collects and stages the data in the background and publishes it as a single file for the interval specified. This is significantly faster than extracting the data via traditional APIs, which may require multiple calls based on the number of records being extracted.
Files are always published at fixed times and at a regular cadence. The Direct Data API provides Incremental files every 15 minutes, tracking changes in a 15-minute interval, which makes it possible to update the data warehouse on a more timely basis.
The Direct Data API provides a transactionally consistent view of data across Vault at a given time, called the stop_time.
To retrieve Direct Data files, you can use the following endpoints:
As an alternative to using the Vault Postman Collection, you can use the example Shell script below to download Direct Data files from your Vault. The script uses your provided credentials and the filters specified to download all file parts of the latest available Direct Data file.
The script requires the following variables:

- vault_dns: The DNS of your Vault
- session_id: A valid Vault session ID
- extract_type: full_directdata, incremental_directdata, or log_directdata
- start_time: The start of the extraction window, in the format YYYY-MM-DDTHH:MMZ. Always use 2000-01-01T00:00Z if extract_type=full_directdata.
- stop_time: The end of the extraction window, in the format YYYY-MM-DDTHH:MMZ.

This script runs natively on macOS/UNIX systems. On Windows operating systems, the script requires Bash. If you have Git installed, you can use Git Bash.
Run this script from the directory where you would like to download the Direct Data file. If there are multiple file parts, the script combines them into a single .tar.gz file.
# Add the vault_dns of your Vault
vault_dns="your-vault.veevavault.com"
# Add in your session ID
session_id="YOUR_SESSION_ID"
# Add "full_directdata", "incremental_directdata", or "log_directdata"
extract_type="full_directdata"
# For "full_directdata" always use 2000-01-01T00:00Z as the start_time
start_time="2000-01-01T00:00Z"
# Add the stop_time
stop_time="2024-06-01T15:15Z"
# Will place the files in the current folder where the script runs
target_directory="$(pwd)"
# Perform the API call to retrieve the list of Direct Data files
direct_data_file_list_response=$(curl -s -X GET -H "Authorization: $session_id" \
-H "Accept: application/json" \
"https://$vault_dns/api/v24.1/services/directdata/files?extract_type=$extract_type&start_time=$start_time&stop_time=$stop_time")
# Extract the response status from the API response
response_status=$(echo "$direct_data_file_list_response" | grep -o '"responseStatus":"[^"]*' | sed 's/"responseStatus":"//')
# Check if the API call was successful
if [ "$response_status" != "SUCCESS" ]; then
error_message=$(echo "$direct_data_file_list_response" | grep -o '"message":"[^"]*' | sed 's/"message":"//' | tr -d '"')
if [ -z "$error_message" ]; then
printf "Retrieve Available Direct Data Files call failed. Exiting script.\n"
else
printf "Retrieve Available Direct Data Files call failed with error: %s\n" "$error_message"
fi
exit 1
else
printf "Retrieve Available Direct Data Files call succeeded.\n"
# Extract data array
data=$(echo "$direct_data_file_list_response" | grep -o '"data":\[[^]]*\]' | sed 's/"data":\[//' | tr -d ']')
# Count file parts
fileparts=$(echo "$data" | grep -o '"fileparts":[0-9]*' | sed 's/"fileparts"://')
# Check if fileparts is null or empty
if [ -z "$fileparts" ]; then
printf "No Direct Data Extract Files found for '$extract_type' with start_time = '$start_time' and stop_time = '$stop_time'.\n"
exit 0
fi
if [ "$fileparts" -gt 1 ]; then
printf "Multiple file parts.\n"
# Handling multiple file parts
filepart_details=$(echo "$data" | grep -o '"filepart_details":\[{"[^]]*' | sed 's/"filepart_details":\[//' | tr -d ']')
filepart_details=$(echo "$filepart_details" | sed 's/},{/}\n{/g')
filename=$(echo "$data" | grep -o '"filename":"[^"]*' | sed 's/"filename":"//' | tr -d '"' | head -n 1)
while IFS= read -r filepart_detail; do
filepart_url=$(echo "$filepart_detail" | grep -o '"url":"[^"]*' | sed 's/"url":"//' | tr -d '"')
output_filepart_name=$(echo "$filepart_detail" | grep -o '"filename":"[^"]*' | sed 's/"filename":"//' | tr -d '"')
curl -o "$output_filepart_name" -X GET -H "Authorization: $session_id" \
-H "Accept: application/json" \
"$filepart_url"
done <<< "$filepart_details"
# Combine file parts
name=$(echo "$data" | grep -o '"name":"[^"]*' | sed 's/"name":"//' | tr -d '"' | head -n 1)
cat "$name."* > "$filename"
full_path="$target_directory/$name"
if [ ! -d "$full_path" ]; then
# Directory does not exist, create it
mkdir -p "$full_path"
printf "Directory '%s' created.\n" "$full_path"
else
printf "Directory '%s' already exists.\n" "$full_path"
fi
tar -xzvf "$filename" -C "$full_path"
else
printf "Only one file part.\n"
# Handling single file part
filepart_detail=$(echo "$data" | grep -o '"filepart_details":\[{"[^]]*' | sed 's/"filepart_details":\[//' | tr -d '{}')
filepart_url=$(echo "$filepart_detail" | grep -o '"url":"[^"]*' | sed 's/"url":"//' | tr -d '"')
filename=$(echo "$data" | grep -o '"filename":"[^"]*' | sed 's/"filename":"//' | tr -d '"' | head -n 1)
curl -o "$filename" -X GET -H "Authorization: $session_id" \
-H "Accept: application/json" "$filepart_url"
name=$(echo "$data" | grep -o '"name":"[^"]*' | sed 's/"name":"//' | tr -d '"' | head -n 1)
full_path="$target_directory/$name"
if [ ! -d "$full_path" ]; then
# Directory does not exist, create it
mkdir -p "$full_path"
printf "Directory '%s' created.\n" "$full_path"
else
printf "Directory '%s' already exists.\n" "$full_path"
fi
tar -xzvf "$filename" -C "$full_path"
fi
fi
A Direct Data file is a .gzip file that includes a set of data entities as CSV files called extracts. You cannot directly create or modify Direct Data extracts. The available extracts may vary depending on the Vault application.
Direct Data files are categorized under the following types: Full (full_directdata), Incremental (incremental_directdata), and Log (log_directdata). Incremental files capture the changes committed within a 15-minute window; they are published 15 minutes after the window's stop_time and are available for ten (10) days. For example, a window of 02:00-02:15 UTC will result in an Incremental file published at 02:30 UTC.

The following image shows the folder structure for a Full Direct Data file:
Every file is named according to the following format: {vaultid}-{date}-{stop_time}-{type}.tar.gz.{filepart}. The file name consists of the following variables:

- vaultid: Refers to the Vault’s ID
- date: Refers to the date that the file was created (in YYYYMMDD format)
- stop_time: Refers to the stop time of the interval (in HHMM format)
- type: Refers to the type of the file (N: Incremental, F: Full, L: Log)
- filepart: Refers to the part number of the file. Files greater than 1 GB in size are split into parts to keep downloads manageable (in NNN format).

For example, 143462-20240123-0000-F.tar.gz.001 indicates the first file part of a Full Direct Data file from a Vault with ID 143462 that contains data from the time the Vault was created to January 23, 2024, 00:00 UTC.
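As a minimal illustration of this naming convention, the sketch below parses a Direct Data file part name into its components; it assumes names follow the format above exactly:

```python
import re

# Pattern for {vaultid}-{date}-{stop_time}-{type}.tar.gz.{filepart}
FILENAME_PATTERN = re.compile(
    r"^(?P<vaultid>\d+)-(?P<date>\d{8})-(?P<stop_time>\d{4})-(?P<type>[NFL])"
    r"\.tar\.gz\.(?P<filepart>\d{3})$"
)

def parse_direct_data_filename(filename: str) -> dict:
    """Split a Direct Data file part name into its variables."""
    match = FILENAME_PATTERN.match(filename)
    if not match:
        raise ValueError(f"Unexpected Direct Data file name: {filename}")
    return match.groupdict()

print(parse_direct_data_filename("143462-20240123-0000-F.tar.gz.001"))
# {'vaultid': '143462', 'date': '20240123', 'stop_time': '0000', 'type': 'F', 'filepart': '001'}
```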
The manifest.csv file provides definitive information about what is included in the file, as well as the record count for each extract. This file is always present under the root folder.
The manifest CSV file includes the following columns to describe each extract:
Column Name | Description |
---|---|
extract | The extract name, in the format {component}.{extract_name}. For example, Object.user__sys. |
extract_label | The extract label. For example, if the extract name is Object.user__sys, the extract label is User. |
type | The type of extract: updates or deletes. This column only appears if the extract_type is incremental_directdata. |
records | The number of records for a given extract. This may show as zero records if there is no data for the given time period. |
file | Relative path to the CSV file within the Direct Data .gzip file. This column may not show a file if there are zero records for a given extract. |
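For example, a consumer can read the manifest to decide which extract CSVs actually need to be loaded. The sketch below assumes the Direct Data file has already been downloaded and unpacked into a local folder:

```python
import csv
from pathlib import Path

def list_non_empty_extracts(extract_folder: str) -> list[dict]:
    """Return manifest rows for extracts that contain at least one record."""
    manifest_path = Path(extract_folder) / "manifest.csv"
    with open(manifest_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Rows with zero records may have no file to load, so skip them
    return [row for row in rows if int(row["records"] or 0) > 0]

# Example usage against a locally unpacked Full file (folder name is a placeholder)
for row in list_non_empty_extracts("./143462-20240123-0000-F"):
    print(row["extract"], row["records"], row["file"])
```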
The metadata.csv file defines the structure of each extract so that consumers can understand the structure of the extract CSV.

Incremental files include the metadata that has changed in the interval. The metadata.csv is available in the Metadata folder in the .gzip Direct Data file. There is also a metadata_full.csv under the root folder which includes the metadata of all Vault data. This file is identical to the metadata.csv file in a Full file and helps consumers look at all metadata of the Vault regardless of the changes that are captured in an Incremental file. This file is not included in the manifest.csv.
The metadata CSV file includes the following standard columns in the following order:
Column Name | Description |
---|---|
modified_date | The date the configuration of the field was last updated. |
extract | The extract name, in the format {component}.{extract_name}. For example, Object.user__sys or Document.document_version__sys. |
extract_label | The extract label. For example, if the extract name is Object.user__sys, the extract label is User. |
column_name | Name of the column in the extract. For example, description__c. |
column_label | The column label in the extract. For example, if the column name is description__c, the column label is Description. |
type | The indicated data type of the column: String, LongText, Number, Date, DateTime, Relationship, Picklist, MultiPicklist, or Boolean. |
length | For columns where the type value is String or LongText, this provides the length of the field. |
related_extract | For columns where the type value is Relationship, Picklist, or MultiPicklist, this indicates the name of the related extract. |
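Because metadata.csv describes every column of every extract, it can drive table creation in the target system. The sketch below derives simple CREATE TABLE statements from it; the mapping of Direct Data types to SQL types is illustrative only and should be adapted to your database:

```python
import csv
from collections import defaultdict

# Illustrative mapping of Direct Data column types to SQL types (an assumption, not prescribed by Vault)
SQL_TYPES = {
    "String": "VARCHAR", "LongText": "TEXT", "Number": "NUMERIC",
    "Date": "DATE", "DateTime": "TIMESTAMP", "Boolean": "BOOLEAN",
    "Relationship": "VARCHAR", "Picklist": "VARCHAR", "MultiPicklist": "VARCHAR",
}

def build_ddl(metadata_csv_path: str) -> list[str]:
    """Build one CREATE TABLE statement per extract listed in metadata.csv."""
    columns_by_extract = defaultdict(list)
    with open(metadata_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            sql_type = SQL_TYPES.get(row["type"], "VARCHAR")
            # Use the provided length for String and LongText columns when available
            if row["type"] in ("String", "LongText") and row.get("length"):
                sql_type = f"VARCHAR({row['length']})"
            columns_by_extract[row["extract"]].append(f'"{row["column_name"]}" {sql_type}')
    statements = []
    for extract, columns in columns_by_extract.items():
        # Derive a table name from the extract name, e.g. Object.user__sys -> user__sys
        table_name = extract.split(".")[-1]
        statements.append(f'CREATE TABLE "{table_name}" ({", ".join(columns)});')
    return statements
```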
Extracts contain the data for Vault components: Documents, Objects, Picklists, Workflows, and Logs. The Direct Data API names extract CSV files according to their extract_name, for example, product__v.csv. If a user deletes object records or document versions, the API stores them in a separate file by appending _deletes.csv to the extract name. The CSV files include a column referencing the record ID of related objects (which can be identified using the metadata.csv). The columns available in each extract vary depending on the component.
Document version data is available in the document_version__sys.csv file. Deleted document versions are tracked in a separate file.
All document extracts have a set of standard fields in addition to all the defined document fields in Vault.
The following standard columns are available in the document version extract:
Column Name | Description |
---|---|
id | The document version ID, in the format {doc_id}_{major_version_number}_{minor_version_number}. For example, 101_0_1 represents version 0.1 of document ID 101. This value is the same as version_id. |
modified_date__v | The date the document version was last modified. |
doc_id | The document id field value. |
version_id | The document version ID, in the format {doc_id}_{major_version_number}_{minor_version_number}. For example, 101_0_1 represents version 0.1 of document ID 101. This value is the same as id. |
major_version_number | The major version of the document. |
minor_version_number | The minor version of the document. |
type | The document type. |
subtype | The document subtype. |
classification | The document classification. |
source_file | The Vault REST API request to download the source file using the Download Document Version File endpoint. |
rendition_file | The Vault REST API request to download the rendition file using the Download Document Version Rendition File endpoint. |
The Direct Data API includes document metadata in the document_version__sys extract. This file includes the additional attributes source_file and rendition_file, which have generated URLs to download the content for that particular version of a document.
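As a sketch of how these attributes might be consumed, the example below assumes source_file holds a request path relative to your Vault's base URL and downloads the source content for each row of the document version extract; vault_dns and session_id are placeholders:

```python
import csv
import requests

vault_dns = "your-vault.veevavault.com"   # placeholder
session_id = "YOUR_SESSION_ID"            # placeholder

def download_source_files(extract_path: str, out_dir: str = ".") -> None:
    """Download the source file for each row of document_version__sys.csv."""
    with open(extract_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            source_request = row.get("source_file")
            if not source_request:
                continue  # some versions may have no source content
            # Assumption: source_file is a path relative to the Vault API base URL
            url = f"https://{vault_dns}{source_request}"
            response = requests.get(url, headers={"Authorization": session_id}, stream=True)
            response.raise_for_status()
            with open(f"{out_dir}/{row['version_id']}_source", "wb") as out:
                for chunk in response.iter_content(chunk_size=8192):
                    out.write(chunk)
```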
If your organization needs to make the source content for all documents available for further processing or data mining, use the Export Document Versions endpoint to export documents to your Vault’s file staging server in bulk. This endpoint allows up to 10,000 document versions per request.
Each object has its own extract file. Extracts are named according to their object name; for example, the extract CSV file for the Activity object is named activity__v.csv. If there are deleted records for an object, they are tracked in a separate {objectname}_deletes.csv file.
Both custom and standard objects from your Vault are included. All objects visible on the Admin > Configuration page of your Vault are available for extraction.
All object extracts have a set of standard fields in addition to all of the defined fields included on the object. The following standard columns are available in Vault object extracts:
Column Name | Description |
---|---|
id | The object record ID. |
modified_date__v | The date the object record was last modified. |
name__v | The name of the object record. |
status__v | The status of the object record. |
created_by__v | The ID of the user who created the object record. |
created_date__v | The date the object record was created. |
modified_by__v | The ID of the user who last modified the object record. |
global_id__sys | The global ID of the object record. |
link__sys | The object record ID across all Vaults where the record exists. |
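When applying an Incremental file to a target, the object extract supplies inserted or updated records and the corresponding _deletes file supplies removals. The sketch below is a minimal in-memory illustration; it assumes you have located the two CSVs (for example via the file column of manifest.csv) and that the deletes file carries an id column:

```python
import csv

def apply_incremental(table: dict, updates_csv: str, deletes_csv: str | None = None) -> dict:
    """Upsert changed records and remove deleted ones for a single object extract."""
    # Upsert every row from the object extract, keyed by record ID
    with open(updates_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            table[row["id"]] = row

    # Remove records listed in the corresponding _deletes extract, if present
    if deletes_csv:
        with open(deletes_csv, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                table.pop(row["id"], None)

    return table

# Example: apply product__v changes from an Incremental file to an in-memory table
products: dict = {}
apply_incremental(products, "product__v.csv", "product__v_deletes.csv")
```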
All picklist data is available in the picklist__sys.csv file. This does not include picklists that are not referenced by any objects or documents. Learn more about picklist references.
The following standard columns are available in picklist extracts:
Column Name | Description |
---|---|
modified_date__v | The date the picklist was last modified. |
object | The name of the object on which the picklist is defined. |
object_field | The name of the object picklist field. |
picklist_value_name | The picklist value name. |
picklist_value_label | The picklist value label. |
status__v | The status of the picklist value. |
Workflow data, including workflow instances, items, user tasks, task items, and legacy workflow information, is available in the following extracts:

- workflow__sys.csv: Provides workflow-level information about each workflow instance.
- workflow_item__sys.csv: Provides item-level information about each document or object record associated with a workflow.
- workflow_task__sys.csv: Provides task-level information about each user task associated with a workflow.
- workflow_task_item__sys.csv: Provides item-level information about each user task associated with a workflow.

A Direct Data file may include additional extracts for legacy workflows. All workflow data in extracts includes active and inactive workflows for both objects and documents. Incremental files exclude inactive legacy workflow data; however, this information is accessible from a Full file. The extract for inactive legacy workflows only includes data from the previous day.
The workflow__sys.csv extract provides workflow-level information about each workflow instance, including the workflow ID, workflow label, owner, type, and relevant dates.
The workflow_item__sys.csv provides item-level information about each document or object record associated with a workflow, including the workflow instance ID, item type, and IDs of the related Vault object record or document. If a workflow includes a document, the Document Version ID (doc_version_id) column in the CSV file references the document version it’s related to. If the workflow instance is not related to a specific document version, this column displays the latest version ID of that document. If the extract_type is incremental_directdata, the Incremental file captures new document versions associated with the workflow.

The metadata CSV file assigns the workflow item extract a type of String.
The workflow_task__sys.csv provides task-level information about each user task associated with a workflow. This extract includes information such as the workflow ID, task label, task owner, task instructions, and relevant dates. This extract excludes participant group details.
The workflow_task_item__sys.csv provides item-level information about each user task associated with a workflow, such as the workflow task item ID, any captured verdicts, and the type of task item.
The following recurring jobs are visible in the UI when the Direct Data API is enabled:
The integration component used to build the Target System has the following functions:
Let’s walk through each of the steps to help achieve the above functions of the integration.
Access the source system data with the following steps:
Stage the source system data with the following steps:

- If a Direct Data file has multiple file parts, concatenate them into a valid .gzip file before use. Learn more in Handling File Parts.

Once the files are extracted, you are ready to start loading your Target System.
- The metadata.csv file provides the schema for all the components that are present in the file. Any referenced data is defined with type=Relationship, and the related extract is specified in the metadata.csv. This makes it easy to define tables in your database.
- The manifest.csv file provides an inventory of all extracts and gives the records captured in each CSV.
- Keep track of the stop_time of the file.
- In Redshift, you can use the COPY command to load the table from the extract. Below is a code example that provides a way to do this with data stored in AWS S3:
f"COPY {dbname}.{schema_name}.{table_name} ({csv_headers}) FROM '{s3_uri}' " \
f"IAM_ROLE '{settings.config.get('redshift', 'iam_redshift_s3_read')}' " \
f"FORMAT AS CSV " \
f"QUOTE '\"' " \
f"IGNOREHEADER 1 " \
f"TIMEFORMAT 'auto'" \
f"ACCEPTINVCHARS " \
f"FILLRECORD"
When applying Incremental files, process only the files published after the stop_time of the Full file.

Our Vault Developer Support team has built a sample connector which you can use as is with your Vault, or which may help you build your custom integration. The connector is hosted on AWS and uses different AWS services to do the following:
You can access the sample code here.
The following best practices should be kept in mind when incorporating Direct Data into new or existing integrations.
The Direct Data API publishes Direct Data files at fixed times in a day. For example, for a data extraction window of 13:00 to 13:15 UTC, the corresponding Incremental file will be published at 13:30 UTC.
This means an integration can check and request the file at 13:30 UTC or after. This reduces the number of requests to the API to check if a file is available and makes your code simpler to manage.
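For example, a scheduler can compute the earliest time to request a given window's Incremental file directly from its stop_time, since the file is published 15 minutes later:

```python
from datetime import datetime, timedelta, timezone

def earliest_check_time(stop_time: str) -> datetime:
    """Return the publish time (stop_time + 15 minutes) for an Incremental file."""
    window_end = datetime.strptime(stop_time, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)
    return window_end + timedelta(minutes=15)

print(earliest_check_time("2024-06-01T13:15Z"))  # 2024-06-01 13:30:00+00:00
```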
A Direct Data file name always includes a file part number. If the compressed file is over 1 GB, then the file is broken into 1 GB parts to simplify the downloading of very large files. After downloading all the file parts, you should concatenate the files into a valid .gzip file before use. Each part in itself is not readable.
Below is a code example to handle multiple file parts:
try:
    # Download each file part from Vault and upload it as one part of an S3 multipart upload
    for file_part in directDataItem.filepart_details:
        file_part_number = file_part.filepart
        response: VaultResponse = request.download_item(file_part.name, file_part_number)
        response = s3.upload_part(
            Bucket=bucket_name,
            Key=object_key,
            UploadId=upload_id,
            PartNumber=file_part_number,
            Body=response.binary_content
        )
        part_info = {'PartNumber': file_part_number, 'ETag': response['ETag']}
        parts.append(part_info)

    # Complete the multipart upload once all file parts have been uploaded
    s3.complete_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id,
        MultipartUpload={'Parts': parts}
    )
except Exception as e:
    # Abort the multipart upload in case of an error
    s3.abort_multipart_upload(Bucket=bucket_name, Key=object_key, UploadId=upload_id)
    log_message(log_level='Error',
                message=f'Multi-file upload aborted',
                exception=e,
                context=None)
    raise e
When the Direct Data API is turned on for the first time, the first Full file Vault prepares may take a long time to complete, depending on when the feature was enabled and how much data is present in your Vault. Veeva can help provide an estimate for when the first Full file will be available after the feature has been enabled for your Vault.
The Direct Data API provides the following options to help filter the Direct Data files available for download according to the data you want to retrieve:

- The extract_type query parameter allows you to filter based on file type and to list only Incremental, Full, or Log files in the response to the Retrieve Available Direct Data Files request.
- The start_time and stop_time of the window for which the data is captured in the Direct Data file are also available as query parameters for the Retrieve Available Direct Data Files request. All Full files have a start time of 00:00 Jan 1, 2000.
- The record_count attribute for a Direct Data file, included in the response to the Retrieve Available Direct Data Files request, can help locate empty files. This attribute provides the total number of records for a given extract. If this count is zero (0), it means no data changes were captured in the time interval for which the file was produced.

During a Vault release, a production Vault is typically unavailable for up to 10 minutes in a 2-hour timeframe. Each Vault release may introduce configuration updates and new components. You should expect to see a large number of updates within a short period of time in your Vault’s Direct Data files.
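Putting the filtering options above together, here is a minimal sketch that lists the available files for a window and skips any whose record_count is zero. It reuses the files endpoint from the shell script earlier, treats record_count as a per-file total (an assumption), and uses placeholder vault_dns and session_id values:

```python
import requests

vault_dns = "your-vault.veevavault.com"   # placeholder
session_id = "YOUR_SESSION_ID"            # placeholder

def list_files_with_data(extract_type: str, start_time: str, stop_time: str) -> list[dict]:
    """Return Direct Data file entries that contain at least one record."""
    url = f"https://{vault_dns}/api/v24.1/services/directdata/files"
    params = {"extract_type": extract_type, "start_time": start_time, "stop_time": stop_time}
    response = requests.get(url, params=params,
                            headers={"Authorization": session_id, "Accept": "application/json"})
    body = response.json()
    if body.get("responseStatus") != "SUCCESS":
        raise RuntimeError(f"Retrieve Available Direct Data Files failed: {body}")
    # Skip files whose record_count is zero -- no data changes were captured in that interval
    return [item for item in body.get("data", []) if item.get("record_count", 0) > 0]

files = list_files_with_data("incremental_directdata", "2024-06-01T13:00Z", "2024-06-01T13:15Z")
```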
Object or document fields that reference picklist values are classified with a type of Picklist or MultiPicklist in the metadata.csv file.
The picklist extract allows you to retrieve the picklist value labels corresponding to the picklist value names referenced in other extracts. Picklist extracts should be handled in the following ways:

- Look up a picklist value label using the combination of (object, object_field, picklist_value_name). The extract and extract field metadata provide the object and object_field values, respectively.

Below is an example using the masking__v picklist field on the study__v object from a Safety Vault:
id | modified_date__v | masking__v | name__v | organization__v | study_name__v |
---|---|---|---|---|---|
V17000000001001 | 2023-12-06T17:57:05.000Z | open_label__v | Study 1 | V0Z000000000201 | Study for Evaluation of Study Product |
The corresponding entry in the metadata.csv for the picklist field would be:
extract | extract_label | column_name | column_label | type | length | related_extract |
---|---|---|---|---|---|---|
Object.study__v | Study | masking__v | Masking | Picklist | 46 | Picklist.picklist__sys |
Below is an example of the Picklist.csv row for the open_label__v value of the masking__v picklist field on the study__v object:
modified_date__v | object | object_field | picklist_value_name | picklist_value_label | status__v |
---|---|---|---|---|---|
2023-12-14T00:06:28.867Z | study__v | masking__v | open_label__v | Open Label | active__v |
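To resolve these names to labels in an external system, a consumer can key the picklist extract on (object, object_field, picklist_value_name). Below is a minimal sketch assuming the study__v and picklist__sys CSVs have been unpacked locally:

```python
import csv

def load_picklist_labels(picklist_csv: str) -> dict:
    """Map (object, object_field, picklist_value_name) -> picklist_value_label."""
    labels = {}
    with open(picklist_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row["object"], row["object_field"], row["picklist_value_name"])
            labels[key] = row["picklist_value_label"]
    return labels

labels = load_picklist_labels("picklist__sys.csv")

with open("study__v.csv", newline="", encoding="utf-8") as f:
    for study in csv.DictReader(f):
        # Resolve the masking__v picklist value name to its label, e.g. "Open Label"
        masking_label = labels.get(("study__v", "masking__v", study["masking__v"]))
        print(study["name__v"], masking_label)
```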
The workflow item extract (workflow_item__sys.csv) provides information about the document version or Vault object record that relates to the item. There may be instances where the referenced document version or Vault object record does not have a corresponding extract in the Direct Data file. For example, when you retrieve an Incremental file, its extracts only contain data updated within the specified 15-minute interval. Therefore, if the document version or object record was not modified within this interval, it will not have its own extract.
As best practice, your external data warehouse should allow for a polymorphic relationship between the workflow item extract and each of the tables representing object extracts.
With the availability of Incremental files via the Direct Data API, Vault has a consistent way to capture and report on data changes that are committed to the database in 15-minute increments. This ensures that the Direct Data API provides a fully consistent data state based on the committed data.
However, Incremental files only include Vault events that are committed to the database within a given time frame. The database commit time may differ from the last modified time. The last modified time may be updated by long-running transactions such as jobs, cascading state changes, or even triggers that create additional entities as part of an SDK job.
Direct Data only evaluates formula fields during initial staging (for Full files) or when a record change occurs (for an Incremental file). Other endpoints of the Vault REST API evaluate formula fields every time a record is requested.
If a formula field contains Today(), the only time the formula field value will be the same between the Direct Data API and the Vault REST API is on a day when the record has changed.
The Direct Data API captures user logins as events within the user__sys.csv extract file even if no additional changes were made to the User object record, including if the record’s modified_date__v value remains unchanged.