Data sources have a global scope in a Domino deployment and are accessible to anyone with the appropriate permissions in any project.
Some data stores require additional steps. The Data Source connector page has more details about specific connections.
Note
Data Sources can be set up to use either service account credentials or individual user credentials. Verify the credentials you’ll need with your administrator before connecting to Data Sources.
These steps show you how to connect to a Data Store:
- From the navigation pane, click Data > Data Sources.
- Click Create a Data Source.
- Choose an option from the Select Data Store dropdown.
- Enter credentials for the Data Source and configure the remaining options.
- Click Finish Setup.
You can add a Data Source to a project in two ways: add it directly from the project’s data page, or let Domino add it automatically when it is used in the project’s code.
You can add a Data Source to a project once it is set up, provided you have access to it. This step is optional, but it helps you see which Data Sources are used in your projects.
If you don’t add a Data Source to a project, you can still use it in your code if you have permission to access it.
- In your project, go to Data > Data Sources > Add a Data Source.
- Select an existing Data Source from the list.
- Click Add to Project.
After a Data Source is properly configured, use the Domino Data API to retrieve data without installing drivers or Data Source-specific libraries.
The auto-generated code snippets provided in your workspace are based on the Domino Data API, which supports tabular and file-based Data Sources. The API supports Python and R.
The Data API comes pre-packaged in the Domino Standard Environment (DSE). You can install the Data API in custom environments if needed.
The Data API’s Data Source client uses environment variables available in the workspace to automatically authenticate your identity. You can override this behavior using custom authentication.
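For example, creating a client with no arguments authenticates automatically from the workspace environment. If you need custom authentication, you can pass credentials explicitly instead; the api_key and token_file parameter names below are an assumption used to illustrate the override, so check the Domino Data API reference for the exact names your version supports:
import os

from domino.data_sources import DataSourceClient

# Default: authenticate automatically from workspace environment variables
client = DataSourceClient()

# Assumed override: supply credentials explicitly instead of relying on
# the environment (parameter names are an assumption; verify in the docs)
client = DataSourceClient(
    api_key=os.environ.get("DOMINO_USER_API_KEY"),
    token_file=os.environ.get("DOMINO_TOKEN_FILE"),
)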
Get code snippets
Domino creates code snippets to help you access data sources for your project using the Domino Data API. Code snippets are available for Python and R and customized for tabular and file-based Data Sources. The Data Source must be added to your project to enable snippets.
Here’s how to get a code snippet that you can copy and paste into your workspace:
- In your workspace, go to Data > Data Sources.
- Click the copy icon to display language options.
- Select Python or R to copy the code snippet in the desired language.
- Paste the copied snippet into your own code and modify it as needed.
Note
Domino Data Sources do not support querying nested objects. The workaround is to unnest the nested objects within the query itself. The following example unnests a transactions column:
res = ds.query("""
    select account_id, t1
    from sample_analytics.transactions
    cross join unnest (transactions) as t(t1, t2, t3, t4, t5, t6)
""")
Query a tabular store
You can query a tabular store with the following code, assuming a Data Source named redshift-test is configured with valid credentials for the current user:
from domino.data_sources import DataSourceClient
# instantiate a client and fetch the datasource instance
redshift = DataSourceClient().get_datasource("redshift-test")
query = """
SELECT
firstname,
lastname,
age
FROM
employees
LIMIT 1000
"""
# res is a simple wrapper of the query result
res = redshift.query(query)
# to_pandas() loads the result into a pandas dataframe
df = res.to_pandas()
# check the first 10 rows
df.head(10)
List
Get the data source from the client:
from domino.data_sources import DataSourceClient
s3_dev = DataSourceClient().get_datasource("s3-dev")
You can list objects available in the data source. You can also specify a prefix:
objects = s3_dev.list_objects()
objects_under_path = s3_dev.list_objects("path_prefix")
By default, the number of returned objects is limited by the underlying data source. You can specify how many keys you want with an optional parameter:
objects = s3_dev.list_objects(page_size=1500)
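The returned entries are object handles. As a quick sketch, you can iterate over them and print each key name; the key attribute used here is an assumption, so confirm the attribute name in the Domino Data API reference for your version:
# List objects under a prefix and print each object's key name
# (the .key attribute is an assumption; verify it for your version)
for obj in s3_dev.list_objects("path_prefix"):
    print(obj.key)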
Read
You can get object content, without having to create object entities, by using the data source API and specifying the object key name:
import io

# Get content as binary
content = s3_dev.get("key")
# Download content to file
s3_dev.download_file("key", "./path/to/local/file")
# Download content to file-like object
f = io.BytesIO()
s3_dev.download_fileobj("key", f)
You can also get the data source entity content from an object entity (Python only):
# Key object
my_key = s3_dev.Object("key")
# Get content as binary
content = my_key.get()
# Download content to file
my_key.download_file("./path/to/local/file")
# Download content to file-like object
f = io.BytesIO()
my_key.download_fileobj(f)
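Because get() returns raw bytes, you can also parse an object in memory without writing it to disk. For example, assuming an object at the hypothetical key employees.csv contains CSV data:
import io

import pandas as pd

# Fetch the object's bytes and parse them directly into a DataFrame
# ("employees.csv" is a hypothetical key used for illustration)
content = s3_dev.get("employees.csv")
df = pd.read_csv(io.BytesIO(content))
df.head()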
Write
Similar to the read/get APIs, you can also write data to a specific object key. From the data source:
# Put binary content to given object key
s3_dev.put("key", b"content")
# Upload file content to specified object key
s3_dev.upload_file("key", "./path/to/local/file")
# Upload file-like content to specified object key
f = io.BytesIO(b"content")
s3_dev.upload_fileobj("key", f)
You can also write from the object entity (Python only):
# Key object
my_key = s3_dev.Object("key")
# Put content as binary
my_key.put(b"content")
# Upload content from file
my_key.upload_file("./path/to/local/file")
# Upload content from file-like object
f = io.BytesIO(b"content")
my_key.upload_fileobj(f)
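To sanity-check a write, you can round-trip a small payload with the put and get calls shown above; the key name scratch/roundtrip.txt is hypothetical:
# Write a small payload, read it back, and verify the bytes match
# ("scratch/roundtrip.txt" is a hypothetical key for illustration)
payload = b"hello from Domino"
s3_dev.put("scratch/roundtrip.txt", payload)
assert s3_dev.get("scratch/roundtrip.txt") == payload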
Parquet
Because Domino uses PyArrow to serialize and transport data, the query result is easily written to a local parquet file. You can also use pandas as shown in the CSV example.
from domino.data_sources import DataSourceClient

redshift = DataSourceClient().get_datasource("redshift-test")
res = redshift.query("SELECT * FROM wines LIMIT 1000")
# to_parquet() accepts a path or file-like object
# the whole result is loaded and written once
res.to_parquet("./wines_1000.parquet")
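To confirm the written file and keep working with it, you can read it back with pandas; pandas.read_parquet requires a Parquet engine such as PyArrow, which Domino already uses for transport:
import pandas as pd

# Load the Parquet file written above and inspect the first rows
df = pd.read_parquet("./wines_1000.parquet")
df.head()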
CSV
Because serializing to CSV is lossy, Domino recommends using the pandas.DataFrame.to_csv API so you can leverage its many options.
from domino.data_sources import DataSourceClient

redshift = DataSourceClient().get_datasource("redshift-test")
res = redshift.query("SELECT * FROM wines LIMIT 1000")
# See the pandas.DataFrame.to_csv documentation for all options
csv_options = {"header": True, "quotechar": "'"}
res.to_pandas().to_csv("./wines_1000.csv", **csv_options)
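When you read the file back, pass the same quoting options so the values parse correctly; for example, with pandas.read_csv:
import pandas as pd

# Use the same quotechar that was used when writing the CSV
df = pd.read_csv("./wines_1000.csv", quotechar="'")
df.head()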
Your identity is verified automatically to make sure you have the right permissions to use the Domino Data API. The client first tries a Domino JWT token; if one is not available, it falls back to your user API key.
- In a Domino Nexus deployment, Data Sources can be accessed on both the Local and remote data planes, with the exception of Starburst Trino. A Data Source may not be usable in every data plane due to network restrictions.
- Connectivity issues may originate anywhere between your Domino deployment and the external data store. Consult your administrator to verify that the Data Source is accessible from your Domino deployment.
- Create training sets using the Domino Data API.
- Run or schedule a job to train your model or update your model’s predictions using the latest data.
- Learn more about developing and deploying models in Domino.
- Use model monitoring to detect data drift.