Direct Data API

The Direct Data API is a new class of API that provides high-speed read-only data access to Vault. Direct Data is a reliable, easy-to-use, timely, and consistent API for extracting Vault data.

It is designed for organizations that wish to replicate large amounts of Vault data to an external database, data warehouse, or data lake. Common use cases include:

The Direct Data API is not designed for real-time application integration.

In v24.1+, this API supports the extraction of Vault objects, document versions, picklists, and audit logs.

Included Data

The Direct Data API extracts the following data from your Vault:

The Direct Data API is not configurable, and all of the above data is always made available. You can use or ignore the data in the files.

What Does the Direct Data API Provide

The Direct Data API provides the following:

Benefits of Using the Direct Data API

There are several benefits of using the Direct Data API to extract data from your Vault:

Simpler Integrations

A Direct Data file is produced in a fixed, well-defined, and easy-to-understand format. This simplifies integrations, as the user doesn’t need to know the data model of the components or make multiple calls to different endpoints to build the dataset. However, the Direct Data API does include the data model for all the objects and documents in a single metadata.csv file so that tables can be created in the external system based on the data provided.

Faster Integrations

The Direct Data API continuously collects and stages the data in the background and publishes it as a single file for the interval specified. This is significantly faster than extracting the data via traditional APIs, which may require multiple calls based on the number of records being extracted.

More Timely Integrations

Files are always published at fixed times and at a regular cadence. The Direct Data API provides Incremental files every 15 minutes, tracking changes in a 15-minute interval, which makes it possible to update the data warehouse on a more timely basis.

Consistent

The Direct Data API provides a transactionally consistent view of data across Vault at a given time, called stop_time.

Understanding Direct Data Files

A Direct Data file is a .gzip file that includes a set of data entities as CSV files called extracts. You cannot directly create or modify Direct Data extracts. The available extracts may vary depending on the Vault application.

Direct Data files are categorized under the following types:

The following image shows the folder structure for a Direct Data file:

Every file is named according to the following format: {vaultid}-{date}-{stop_time}-{type}.tar.gz.{filepart}. The file name consists of the following variables:

For example, 143462-20240123-0000-F.tar.gz.001 indicates the first file part of a Full Direct Data file from a Vault with ID 143462 that contains data from the time the Vault was created to January 23, 2024, 00:00 UTC.
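As a minimal sketch, a file name in the format above can be split into its components with a regular expression. The helper name parse_direct_data_filename and the returned keys are illustrative and not part of the API. The sketch assumes the date is YYYYMMDD and the stop_time is HHMM in UTC, as the example suggests.

```python
import re
from datetime import datetime, timezone

# Pattern for {vaultid}-{date}-{stop_time}-{type}.tar.gz.{filepart}
FILENAME_PATTERN = re.compile(
    r"^(?P<vault_id>\d+)-(?P<date>\d{8})-(?P<stop_time>\d{4})-(?P<type>[A-Za-z]+)"
    r"\.tar\.gz\.(?P<filepart>\d+)$"
)


def parse_direct_data_filename(name: str) -> dict:
    """Split a Direct Data file name into its components (hypothetical helper)."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized Direct Data file name: {name}")
    # Assumption: date is YYYYMMDD and stop_time is HHMM, both in UTC.
    stop = datetime.strptime(match["date"] + match["stop_time"], "%Y%m%d%H%M")
    return {
        "vault_id": match["vault_id"],
        "stop_time": stop.replace(tzinfo=timezone.utc),
        "type": match["type"],
        "filepart": int(match["filepart"]),
    }


info = parse_direct_data_filename("143462-20240123-0000-F.tar.gz.001")
print(info["vault_id"], info["type"], info["filepart"], info["stop_time"].isoformat())
```

Parsing the name up front lets an integration order files by stop_time and group file parts before downloading anything.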

Manifest CSV File

The manifest.csv file provides the record count for each extract. This provides definitive information about what is included in the file. This file is always present under the root folder.

The manifest CSV file includes the following columns to describe each extract:

Column Name | Description
extract | The extract name, in the format {component}.{extract_name}. For example, Object.user__sys.
extract_label | The extract label. For example, if the extract name is Object.user__sys, the extract label is User.
type | The type of extract: updates or deletes.
records | The number of records for a given extract. This may show as zero records if there is no data for the given time period.
file | Relative path to the CSV file within the Direct Data .gzip file.
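The manifest can be used to decide which extracts are worth loading. The sketch below reads a manifest with the columns described above and keeps only the extracts that contain records. The sample rows, including the file paths and the Document Version label, are hypothetical stand-ins for a real manifest.csv.

```python
import csv
import io

# Hypothetical manifest.csv content, using the columns described above.
SAMPLE_MANIFEST = """extract,extract_label,type,records,file
Object.user__sys,User,updates,12,Object/user__sys.csv
Object.user__sys,User,deletes,0,Object/user__sys_deletes.csv
Document.document_version__sys,Document Version,updates,3,Document/document_version__sys.csv
"""


def extracts_with_data(manifest_file) -> list[dict]:
    """Return manifest rows whose record count is greater than zero."""
    reader = csv.DictReader(manifest_file)
    return [row for row in reader if int(row["records"]) > 0]


rows = extracts_with_data(io.StringIO(SAMPLE_MANIFEST))
for row in rows:
    print(row["extract"], row["type"], row["records"], row["file"])
```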

Metadata CSV File

The metadata.csv file defines the structure of each extract so that consumers can understand the structure of the extract CSV.

Incremental files include the metadata that has changed in the interval. The metadata.csv is available in the Metadata folder in the .gzip Direct Data file. There is also a metadata_full.csv under the root folder which includes the metadata of all Vault data. This file is identical to the metadata.csv file in a Full file and helps consumers look at all metadata of the Vault regardless of the changes that are captured in an Incremental file. This file is not included in the manifest.csv.

The metadata CSV file includes the following standard columns in the following order:

Column Name | Description
modified_date | The date the configuration of the field was last updated.
extract | The extract name, in the format {component}.{extract_name}. For example, Object.user__sys or Document.document_version__sys.
extract_label | The extract label. For example, if the extract name is Object.user__sys, the extract label is User.
column_name | Name of the column in the extract. For example, description__c.
column_label | The column label in the extract. For example, if the column name is description__c, the column label is Description.
type | The indicated data type of the column: String, Long Text, Number, Date, DateTime, Relationship, Picklist, MultiPicklist, or Boolean.
length | For columns where the type value is String or Long Text, this provides the length of the field.
related_extract | For columns where the type value is Relationship, Picklist, or MultiPicklist, this indicates the name of the related extract.

Extract Naming & Attributes

Extracts contain the data for Vault components: Documents, Objects, Picklists, and Logs. The Direct Data API names extract CSV files according to their extract_name, for example, product__v.csv. If a user deletes object records or document versions, the API stores them in a separate file named by appending _deletes.csv to the extract name. The CSV files include a column referencing the record ID of related objects (which can be identified using the metadata.csv). The columns available in each extract vary depending on the component.

Document Extract

Document version data is available in the document_version__sys.csv file. Deleted document versions are tracked in a separate file.

All document extracts have a set of standard fields in addition to all the defined document fields in Vault.

The following standard columns are available in the document version extract:

Column Name | Description
id | The document version ID, in the format {doc_id}_{major_version_number}_{minor_version_number}. For example, 101_0_1 represents version 0.1 of document ID 101. This value is the same as version_id.
modified_date__v | The date the document version was last modified.
doc_id | The document id field value.
version_id | The document version ID, in the format {doc_id}_{major_version_number}_{minor_version_number}. For example, 101_0_1 represents version 0.1 of document ID 101. This value is the same as id.
major_version_number | The major version of the document.
minor_version_number | The minor version of the document.
type | The document type.
subtype | The document subtype.
classification | The document classification.
source_file | The Vault REST API request to download the source file using the Download Document Version File endpoint.
rendition_file | The Vault REST API request to download the rendition file using the Download Document Version Rendition File endpoint.
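For illustration, the id format described above can be split back into its parts when the target system needs them as separate values. The DocumentVersion type and parse_version_id helper are hypothetical names, and the sketch assumes all three components are integers.

```python
from typing import NamedTuple


class DocumentVersion(NamedTuple):
    doc_id: int
    major_version_number: int
    minor_version_number: int


def parse_version_id(version_id: str) -> DocumentVersion:
    """Split '{doc_id}_{major_version_number}_{minor_version_number}' into integers."""
    doc_id, major, minor = version_id.split("_")
    return DocumentVersion(int(doc_id), int(major), int(minor))


version = parse_version_id("101_0_1")
print(version)
```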

Accessing Source Content

The Direct Data API includes document metadata in the document_version__sys extract. This extract also includes the source_file and rendition_file attributes, which contain generated URLs for downloading the content of that particular document version.

If your organization needs to make the source content for all documents available for further processing or data mining, use the Export Document Versions endpoint to export documents to your Vault’s file staging server in bulk. This endpoint allows up to 10,000 document versions per request.
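Because the export endpoint accepts a limited number of document versions per request, a large list of version IDs has to be split into batches first. The following is a small sketch of that batching step only; it does not call the endpoint, and the helper name batch_version_ids and the generated IDs are hypothetical.

```python
from typing import Iterator

MAX_VERSIONS_PER_REQUEST = 10_000  # limit stated above for Export Document Versions


def batch_version_ids(
    version_ids: list[str], size: int = MAX_VERSIONS_PER_REQUEST
) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` document version IDs."""
    for start in range(0, len(version_ids), size):
        yield version_ids[start:start + size]


# Hypothetical IDs in the {doc_id}_{major}_{minor} format
ids = [f"{doc_id}_0_1" for doc_id in range(1, 25_001)]
batches = list(batch_version_ids(ids))
print([len(batch) for batch in batches])
```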

Vault Objects Extract

Each object has its own extract file. Extracts are named according to their object name. For example, the extract CSV file for the Activity object is named activity__v.csv. If there are deleted records for an object, they are tracked in a separate {objectname}_deletes.csv.

Both custom and standard objects from your Vault are included. All objects visible on the Admin > Configuration page of your Vault are available for extraction.

All object extracts have a set of standard fields in addition to all of the defined fields included on the object. The following standard columns are available in Vault object extracts:

Column Name | Description
id | The object record ID.
modified_date__v | The date the object record was last modified.
name__v | The name of the object record.
status__v | The status of the object record.
created_by__v | The ID of the user who created the object record.
created_date__v | The date the object record was created.
modified_by__v | The ID of the user who last modified the object record.
global_id__sys | The global ID of the object record.
link__sys | The object record ID across all Vaults where the record exists.

Picklist Extract

All picklist data is available in the picklist__sys.csv. This does not include picklists that are not referenced by any objects or documents. Learn more about picklist references.

The following standard columns are available in picklist extracts:

Column Name | Description
modified_date__v | The date the picklist was last modified.
object | The name of the object on which the picklist is defined.
object_field | The name of the object picklist field.
picklist_value_name | The picklist value name.
picklist_value_label | The picklist value label.
status__v | The status of the picklist value.

Direct Data Jobs

The following recurring jobs are visible in the UI when the Direct Data API is enabled:

Building Your Direct Data API Integration

Terms Used

Source System: System from where data is being extracted. In this case, it is your Vault instance.

Target System: External system where the extracted data is being loaded. It could be a database, data lake, or data warehouse, depending on the use case. This document assumes a data warehouse.

Staging: An intermediate location to store data downloaded from the Source System before loading it into the Target System.

The integration component used to build the Target System has five functions:

  1. Building the Target System Tables
  2. Initial Load of the Data
  3. Incremental Data Load
  4. Handling Schema Changes
  5. Verifying Data in the Target System

Let’s walk through each of the steps to help achieve the above functions of the integration.

Access to Source System Data

  1. Vault produces a set of extracts which are accessible via Vault’s REST API. The Download a Direct Data File endpoint allows a Vault user with Vault Owner permissions to access the data.

Staging the Source Data

  1. You will need a separate location to download Direct Data files.
  2. A Direct Data file can have multiple parts if it is greater than 1 GB in size. Handling File Parts explains more about this.
  3. Once the files are extracted, you are ready to start loading your Target System.

Building the Target System Tables

  1. The metadata.csv file provides the schema for all the components that are present in the file. Any referenced data is defined with type=Relationship and the related extract is specified in the metadata.csv. This makes it easy to define tables in your database.
  2. The manifest.csv file provides an inventory of all extracts and gives the number of records captured in each CSV.
  3. Every object has its own extract CSV file in Direct Data. As such, every object should have a table in the data warehouse.
  4. Document data should have its own table.
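As a rough sketch of the table-building step, the metadata.csv rows can be grouped by extract and turned into CREATE TABLE statements. The type mapping below is an assumption chosen for a generic SQL database, not something the API prescribes, and the sample rows are hypothetical. Adjust both for your warehouse.

```python
import csv
import io

# Assumed mapping from Direct Data types to generic SQL types.
SQL_TYPES = {
    "String": "VARCHAR",
    "Long Text": "VARCHAR",
    "Number": "NUMERIC",
    "Date": "DATE",
    "DateTime": "TIMESTAMP",
    "Relationship": "VARCHAR(255)",
    "Picklist": "VARCHAR(255)",
    "MultiPicklist": "VARCHAR(1024)",
    "Boolean": "BOOLEAN",
}

# Hypothetical metadata.csv rows using the standard columns.
SAMPLE_METADATA = """modified_date,extract,extract_label,column_name,column_label,type,length,related_extract
2024-01-23T00:00:00.000Z,Object.product__v,Product,id,ID,String,20,
2024-01-23T00:00:00.000Z,Object.product__v,Product,name__v,Name,String,128,
2024-01-23T00:00:00.000Z,Object.product__v,Product,modified_date__v,Last Modified Date,DateTime,,
2024-01-23T00:00:00.000Z,Object.product__v,Product,status__v,Status,Picklist,,Picklist.picklist__sys
"""


def build_create_statements(metadata_file) -> dict[str, str]:
    """Return one CREATE TABLE statement per extract, keyed by table name."""
    columns: dict[str, list[str]] = {}
    for row in csv.DictReader(metadata_file):
        # Object.product__v -> product__v
        table = row["extract"].split(".", 1)[1]
        sql_type = SQL_TYPES[row["type"]]
        # Only String and Long Text columns carry a length
        if row["type"] in ("String", "Long Text") and row["length"]:
            sql_type = f"{sql_type}({row['length']})"
        columns.setdefault(table, []).append(f'"{row["column_name"]}" {sql_type}')
    return {
        table: f'CREATE TABLE "{table}" ({", ".join(cols)})'
        for table, cols in columns.items()
    }


statements = build_create_statements(io.StringIO(SAMPLE_METADATA))
print(statements["product__v"])
```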

Initial Load of the Data

  1. For the Initial Load, you will always work with a Full file. This has all Vault data captured from the time the Vault was created until the stop_time of the file.
  2. Loading an extract into the database is simply loading a CSV as a table with the schema already defined.
  3. For example, if you are loading Vault data into AWS Redshift, once the extracts are stored in an S3 bucket, you can use the COPY command to load each extract into its Redshift table. Below is a code example that builds the COPY statement for data stored in AWS S3:

    # Build the Redshift COPY statement for one extract. Names such as
    # dbname, table_name, csv_headers, and s3_uri are defined by the caller.
    copy_query = (
        f"COPY {dbname}.{schema_name}.{table_name} ({csv_headers}) FROM '{s3_uri}' "
        f"IAM_ROLE '{settings.config.get('redshift', 'iam_redshift_s3_read')}' "
        f"FORMAT AS CSV "
        f"QUOTE '\"' "
        f"IGNOREHEADER 1 "
        f"TIMEFORMAT 'auto' "
        f"ACCEPTINVCHARS "
        f"FILLRECORD"
    )
    
  4. Once all the tables are loaded with the data in the extracts, you have replicated the Vault in your Target System with a consistent data set up to the stop_time of the Full file.

  5. You may now verify the data in the Target System to check for any gaps.

Incremental Data Load

  1. To keep the Target System up to date with the latest data in your Vault, you need to consume Incremental Direct Data files and load the changes.
  2. Incremental files capture creates, updates and deletes made to the data in your Vault.
  3. If the same piece of data has undergone multiple changes within the same time window in an incremental file, it will only appear once in the extract.
  4. A general best practice is to load the extract with creates and updates before loading the extract with deletes, which are captured in a separate file.
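The load order described in this list can be sketched as follows, using an in-memory SQLite database as a stand-in for the Target System. The helper name apply_incremental and the sample data are hypothetical, and the sketch assumes that the deletes extract carries the record id and that table and column names come from trusted metadata.

```python
import csv
import io
import sqlite3

# Hypothetical extract contents for one Incremental file
UPDATES_CSV = "id,name__v\nV01,Product A (renamed)\nV03,Product C\n"
DELETES_CSV = "id\nV02\n"  # assumption: the deletes extract carries the record id


def apply_incremental(conn, table, updates_file, deletes_file):
    """Upsert created and updated rows, then remove deleted rows, in one transaction."""
    with conn:  # commits on success, rolls back on error
        for row in csv.DictReader(updates_file):
            cols = ", ".join(row)
            marks = ", ".join("?" for _ in row)
            conn.execute(
                f"INSERT OR REPLACE INTO {table} ({cols}) VALUES ({marks})",
                list(row.values()),
            )
        for row in csv.DictReader(deletes_file):
            conn.execute(f"DELETE FROM {table} WHERE id = ?", (row["id"],))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product__v (id TEXT PRIMARY KEY, name__v TEXT)")
conn.executemany(
    "INSERT INTO product__v VALUES (?, ?)",
    [("V01", "Product A"), ("V02", "Product B")],
)
conn.commit()

apply_incremental(conn, "product__v", io.StringIO(UPDATES_CSV), io.StringIO(DELETES_CSV))
result = conn.execute("SELECT id, name__v FROM product__v ORDER BY id").fetchall()
print(result)
```

Wrapping both steps in one transaction keeps the target table consistent if the load fails partway through.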

Handling Schema Changes

  1. Veeva Vault makes it easy for a Vault user to make configuration changes to a Vault using the UI, API, Vault Loader, or a Vault Package.
  2. Vault releases, which happen three times a year, also introduce a number of schema changes.
  3. Schema changes are captured in the Incremental metadata file. A good practice is for the integration to check for schema changes before making any data updates in the tables.
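One way to check for schema changes before loading data is to compare the columns listed in the Incremental metadata.csv with the columns that already exist in the target table. The sketch below uses SQLite as a stand-in for the Target System. The sample metadata is hypothetical and trimmed to the columns the check needs, and adding every new column as TEXT is a simplifying assumption.

```python
import csv
import io
import sqlite3

# Hypothetical rows from an Incremental metadata.csv (trimmed to a few columns)
INCREMENTAL_METADATA = """extract,column_name,type,length
Object.product__v,name__v,String,128
Object.product__v,description__c,String,255
"""


def missing_columns(conn, metadata_file):
    """Return (table, column) pairs present in the metadata but not in the target table."""
    missing = []
    for row in csv.DictReader(metadata_file):
        table = row["extract"].split(".", 1)[1]
        # PRAGMA table_info returns one row per column; index 1 is the column name
        existing = {info[1] for info in conn.execute(f"PRAGMA table_info({table})")}
        if row["column_name"] not in existing:
            missing.append((table, row["column_name"]))
    return missing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product__v (id TEXT PRIMARY KEY, name__v TEXT)")

to_add = missing_columns(conn, io.StringIO(INCREMENTAL_METADATA))
for table, column in to_add:
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT")

columns_after = [info[1] for info in conn.execute("PRAGMA table_info(product__v)")]
print(to_add, columns_after)
```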

Verifying Data in the Target System

  1. Once the data is loaded in the data warehouse, you will need a reliable way to test that data in Vault is now replicated in the data warehouse.
  2. A simple way to achieve this is to query Vault for a filtered set of data, run the equivalent query against the data warehouse, and compare the results.
  3. It is important to keep in mind that data in Vault may have changed since it was extracted via Direct Data, so results may vary slightly.
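A simple form of this comparison is a per-table record count check. The sketch below assumes the expected counts have already been obtained, for example from a Vault query or from the records column of the manifest.csv, and uses SQLite as a stand-in for the data warehouse. The helper name count_mismatches and all sample values are hypothetical.

```python
import sqlite3


def count_mismatches(conn, expected_counts):
    """Return {table: (expected, actual)} for tables whose row count differs."""
    mismatches = {}
    for table, expected in expected_counts.items():
        actual = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        if actual != expected:
            mismatches[table] = (expected, actual)
    return mismatches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product__v (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE study__v (id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO product__v VALUES (?)", [("V01",), ("V02",)])
conn.execute("INSERT INTO study__v VALUES (?)", ("V17",))

# Hypothetical expected counts obtained from the Source System
expected = {"product__v": 2, "study__v": 3}
mismatch_report = count_mismatches(conn, expected)
print(mismatch_report)
```

A small mismatch is not necessarily an error, since records may have changed in Vault after the file's stop_time.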

Open Source AWS Connector

Our Vault Developer Support team has built a sample connector that you can use as is with your Vault or as a starting point for your own custom integration. The connector is hosted on AWS and uses different AWS services to do the following:

  1. Download zipped Direct Data files to an AWS S3 bucket
  2. Unzip Direct Data files
  3. Load the data within the Direct Data file into an AWS Redshift database

You can access the sample code here.

Integration Best Practices

The following best practices should be kept in mind when incorporating Direct Data into new or existing integrations.

Calling the API

The Direct Data API publishes Direct Data files at fixed times in a day. For example, for a data extraction window of 13:00 to 13:15 UTC, the corresponding Incremental file will be published at 13:30 UTC.

This means an integration can check and request the file at 13:30 UTC or after. This reduces the number of requests to the API to check if a file is available and makes your code simpler to manage.
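The scheduling logic can be sketched as a small helper that works out which Incremental file should already be published at a given moment. It assumes that the 15-minute gap between the end of the window and the publish time in the example above applies to every interval; the function name and constants are hypothetical.

```python
from datetime import datetime, timedelta, timezone

INTERVAL_MINUTES = 15  # Incremental files cover 15-minute windows
# Assumption generalized from the example: a window ending 13:15 is published at 13:30
PUBLISH_DELAY = timedelta(minutes=15)


def latest_published_stop_time(now: datetime) -> datetime:
    """Return the stop_time of the newest Incremental file expected to be published by `now`."""
    cutoff = now.astimezone(timezone.utc) - PUBLISH_DELAY
    # Round down to the nearest 15-minute boundary
    minute = cutoff.minute - cutoff.minute % INTERVAL_MINUTES
    return cutoff.replace(minute=minute, second=0, microsecond=0)


now = datetime(2024, 1, 23, 13, 41, tzinfo=timezone.utc)
latest = latest_published_stop_time(now)
print(latest.isoformat())
```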

Handling File Parts

A Direct Data file name always includes a file part number. If the compressed file is over 1 GB, then the file is broken into 1 GB parts to simplify the downloading of very large files. After downloading all the file parts, you should concatenate them, in file part order, into a valid .gzip file before use. An individual part is not readable on its own.

Below is a code example to handle multiple file parts:

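# Start the multipart upload (s3 is assumed to be a boto3 S3 client).
# The returned upload ID identifies this upload in every later call.
upload_id = s3.create_multipart_upload(Bucket=bucket_name, Key=object_key)['UploadId']
parts = []

try:
    for file_part in directDataItem.filepart_details:
        file_part_number = file_part.filepart

        # Download one file part from Vault
        vault_response: VaultResponse = request.download_item(file_part.name, file_part_number)

        # Upload it to S3 under the same part number
        upload_response = s3.upload_part(
            Bucket=bucket_name,
            Key=object_key,
            UploadId=upload_id,
            PartNumber=file_part_number,
            Body=vault_response.binary_content
        )

        # S3 needs each part's number and ETag to assemble the final object
        parts.append({'PartNumber': file_part_number, 'ETag': upload_response['ETag']})

    # Combine the uploaded parts into a single object
    s3.complete_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id,
        MultipartUpload={'Parts': parts}
    )
except Exception as e:
    # Abort the multipart upload in case of an error
    s3.abort_multipart_upload(Bucket=bucket_name, Key=object_key, UploadId=upload_id)
    log_message(log_level='Error',
                message='Multi-file upload aborted',
                exception=e,
                context=None)
    raise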
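If the file parts are downloaded to a local disk rather than to S3, the same idea can be sketched with the standard library: append the parts in file part order, then open the result as a gzip-compressed tar archive. The helper name join_parts is hypothetical, and the demo builds a tiny stand-in archive instead of using a real Direct Data file.

```python
import io
import shutil
import tarfile
import tempfile
from pathlib import Path


def join_parts(part_paths: list[Path], target: Path) -> Path:
    """Concatenate file parts, ordered by their numeric suffix, into one archive."""
    with target.open("wb") as out:
        for part in sorted(part_paths, key=lambda p: int(p.suffix.lstrip("."))):
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a small stand-in archive in memory and split it into two parts
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        data = b"extract,records\n"
        member = tarfile.TarInfo("manifest.csv")
        member.size = len(data)
        archive.addfile(member, io.BytesIO(data))
    raw = buffer.getvalue()
    half = len(raw) // 2
    name = "1-20240123-0000-F.tar.gz"  # hypothetical file name
    (tmp_path / f"{name}.002").write_bytes(raw[half:])
    (tmp_path / f"{name}.001").write_bytes(raw[:half])

    # Join the parts and read the archive contents
    joined = join_parts(list(tmp_path.glob(f"{name}.*")), tmp_path / name)
    with tarfile.open(joined, mode="r:gz") as archive:
        member_names = archive.getnames()

print(member_names)
```

Sorting by the numeric suffix matters because directory listings do not guarantee that parts are returned in order.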

First Full File

When the Direct Data API is turned on for the first time, the first Full file Vault prepares may take a long time to complete, depending on when the feature was enabled and how much data is present in your Vault. Veeva can help provide an estimate for when the first Full file will be available after the feature has been enabled for your Vault.

Filtering Files

The Direct Data API provides the following options to help filter the Direct Data files available for download according to the data you want to retrieve:

Vault Upgrades and Direct Data API

During a Vault release, a production Vault is typically unavailable for up to 10 minutes in a 2-hour timeframe. Each Vault release may introduce configuration updates and new components. You should expect to see a large number of updates within a short period of time in your Vault’s Direct Data files.

Picklist References in Object & Document Extracts

Object or document fields that reference picklist values are classified with a type of Picklist or MultiPicklist in the metadata.csv file.

The picklist extract allows you to retrieve the picklist value labels corresponding to the picklist value names referenced in other extracts. Picklist extracts should be handled in the following ways:

Below is an example using the masking__v picklist field on the study__v object from a Safety Vault:

id | modified_date__v | masking__v | name__v | organization__v | study_name__v
V17000000001001 | 2023-12-06T17:57:05.000Z | open_label__v | Study 1 | V0Z000000000201 | Study for Evaluation of Study Product

The corresponding entry in the metadata.csv for the picklist field would be:

extract | extract_label | column_name | column_label | type | length | related_extract
Object.study__v | Study | masking__v | Masking | Picklist | 46 | Picklist.picklist__sys

Below is an example of the picklist__sys.csv row for the open_label__v value of the masking__v picklist field on the study__v object:

modified_date__v | object | object_field | picklist_value_name | picklist_value_label | status__v
2023-12-14T00:06:28.867Z | study__v | masking__v | open_label__v | Open Label | active__v
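The lookup from a picklist value name to its label can be sketched with a dictionary keyed on object, field, and value name. The sample data mirrors the example rows above; the helper name build_label_lookup is hypothetical.

```python
import csv
import io

# Picklist extract row taken from the example above
PICKLIST_CSV = """modified_date__v,object,object_field,picklist_value_name,picklist_value_label,status__v
2023-12-14T00:06:28.867Z,study__v,masking__v,open_label__v,Open Label,active__v
"""


def build_label_lookup(picklist_file) -> dict:
    """Map (object, object_field, picklist_value_name) to the picklist value label."""
    return {
        (row["object"], row["object_field"], row["picklist_value_name"]): row["picklist_value_label"]
        for row in csv.DictReader(picklist_file)
    }


labels = build_label_lookup(io.StringIO(PICKLIST_CSV))

# A row from the study__v extract, trimmed to the relevant columns
study_row = {"id": "V17000000001001", "masking__v": "open_label__v"}
masking_label = labels[("study__v", "masking__v", study_row["masking__v"])]
print(masking_label)
```

In a data warehouse, the same lookup is typically expressed as a join between the object table and the picklist table on these three columns.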

Transaction Times & Data for Incremental Files

With the availability of Incremental files via the Direct Data API, Vault has a consistent way to capture and report on data changes that are committed to the database in 15-minute increments. This ensures that the Direct Data API provides a fully consistent data state based on the committed data.

However, Incremental files only include Vault events that are committed to the database within a given time frame. The database commit time may differ from the last modified time. The last modified time may be updated by long-running transactions such as jobs, cascading state changes, or even triggers that create additional entities as part of an SDK job.