Step 6: Running dataload.py

Before You Begin. If you are using webhooks as part of your Identity Cloud implementation, it’s absolutely imperative that you disable those webhooks before you do an actual data migration. (You can leave webhooks running if you’re doing a dry run, because no data is actually copied to the user profile store during a dry run.) Webhooks are designed to send a notification whenever certain events – such as creating a new user profile – take place. If you’re migrating 2 million user accounts, that’s 2 million webhook notifications, enough to completely overwhelm your system.

Actually, Akamai would probably shut your system down before you could completely overwhelm everything. That’s the good news. The bad news? Shutting down your system will also shut down your data migration. All of which adds up to this: turn off your webhooks before doing a data migration.

To run the dataload.py script (and to migrate your data), you need to include several of the command-line arguments listed below. Note that some of these arguments are optional and some are required; each required argument is labeled (required):

Named Parameters

-h, --help

Displays the parameters that can be used with dataload.py. For example:

python3 dataload.py -h

There is no need to include other parameters when using -h; if you do include additional parameters, they are ignored and the help information is displayed anyway.

This, by the way, is the same help you see if you call the script without any arguments.

-u, --apid_uri (required)

The URI to your Identity Cloud Capture domain. For example:

-u https://educationcenter.us-dev.janraincapture.com

You can find the URI to your Capture domain by looking at the Manage Application page in the Console.

-i, --client-id (required)

Client ID for the API client used to do the data migration. For example:

-i 382bgrkj4w28984myp7298pzh35sj2q

-s, --client-secret (required)

Client secret for the API client used to do the data migration. For example:

-s b2gfp7mgk9332annghwcf0po57xzqht5

-k, --config-key

Reserved for Akamai internal use.

-d, --default-client

Reserved for Akamai internal use.

-t, --type-name

Name of the entity type that user records should be written to. For example:

-t user

If not specified, the script defaults to the user entity type.

-b, --batch-size

Number of records to be included in each batch; the default value is 10. For example:

-b 20

Each batch is uploaded with a single call to the entity.bulkCreate endpoint; the batch size simply determines how many records are included in each of those calls.
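If you want a rough sense of what a given batch size means, a little arithmetic helps. The following Python snippet is purely illustrative (it isn’t part of dataload.py), and the record total is hypothetical:

import math

total_records = 2_000_000  # hypothetical size of the CSV file
batch_size = 20            # the value passed with -b

# each batch equals one call to the entity.bulkCreate endpoint
api_calls = math.ceil(total_records / batch_size)
print(f"{api_calls:,} calls to entity.bulkCreate")  # prints: 100,000 calls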

-a, --start-at

Record number (i.e., line number within the CSV file) where the migration process should start; the default value is 1. For example:

-a 100

This parameter is typically used if a previous import failed at an identifiable point in the process (e.g., if the first 99 records were successfully imported before network connectivity was lost).
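For instance, reusing the sample domain and credentials shown elsewhere in this article, a run that picks up at record 100 might look like this:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 -a 100 test_users.csv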

-w, --workers

Total number of worker threads; the default value is 4. Adding threads can speed up the data migration process. 

For example:

-w 6
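If you’re wondering what worker threads actually do, the general pattern is a pool of threads that upload batches in parallel. Here’s a minimal, generic Python sketch of that pattern; it is not dataload.py’s actual code, and the upload function and sample batches are placeholders:

from concurrent.futures import ThreadPoolExecutor

def upload(batch):
    # placeholder: in the real script, this would call entity.bulkCreate
    print(f"uploading {len(batch)} records")

batches = [["record1", "record2"], ["record3", "record4"]]
with ThreadPoolExecutor(max_workers=6) as pool:  # 6 threads, as with -w 6
    pool.map(upload, batches)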

-o, --timeout

Amount of time that can elapse, in seconds, before an API call times out; the default value is 10. For example:

-o 5

It’s recommended that you never set the API timeout to be greater than 10 seconds. As a general rule, it’s better to change the batch size than it is to change the timeout interval.

-r, --rate

Maximum number of API calls that can be made per second; the default value is 1. For example:

-r 2

If you receive a 510 (rate limit) error while running the script, use this parameter to reduce the maximum number of API calls that can be made per second.
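Under the hood, a “maximum number of calls per second” limit boils down to spacing out API calls over time. Here’s a simplified Python sketch of the idea (not dataload.py’s actual implementation):

import time

def throttle(calls, rate):
    # issue calls one at a time, never exceeding `rate` calls per second
    interval = 1.0 / rate
    for call in calls:
        start = time.monotonic()
        call()  # e.g., one entity.bulkCreate request
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)

throttle([lambda: print("API call")] * 3, rate=2)  # 3 calls, at most 2 per second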

-x, --dry-run

Runs through the full data migration process, but without copying records from the legacy data file to the user profile store. Note that you must include all the required parameters in order to successfully complete a dry run. For example:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 -x test_users.csv

-m, --delta-migration

Performs the migration as a delta migration, which overwrites any matching records already in the user store.

-p, --primary-key

Used during a delta migration. Specifies an existing attribute in the target entity type that is used to identify duplicate records to be updated. The attribute must be unique in the schema (the default primary key is email).

For example, suppose you use email as your primary key and your CSV file includes a record that has the email address karim.nafir@mail.com. If dataload.py finds an existing user profile record with that email address, the existing record will be replaced by the record found in the CSV file.
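Putting -m and -p together, a delta migration that matches records on email address might look like this (reusing the sample domain and credentials, and assuming a hypothetical file of updated records named updated_users.csv):

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 -m -p email updated_users.csv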

Positional Parameter

datafile (required)

Path to the datafile containing the user records being migrated to the user profile store. This must be the last item in your command-line call. For example:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv

When all is said and done, a call to dataload.py will look something like this:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv

Two things to keep in mind here. First, command-line arguments have both a short name and a long name; these two partial commands are identical:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com
python3 dataload.py --apid_uri https://educationcenter.us-dev.janraincapture.com

Second, there’s no datafile command line argument. Instead, the datafile is simply the last item in the command:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv

If the datafile is anywhere except at the very end of the command, your script will run and you won’t get any error messages. However, no data will be copied to the user profile store.

Incidentally, you can run dataload.py as many times as you want; there’s nothing to prevent you from doing that. Is there any reason why you’d even want to run dataload.py on multiple occasions? We can think of at least two possibilities:

  • By using the “dry run” option, you can “practice” your migration by running the script against your actual datafile a million times (or more) without ever copying any data to the user profile store. That’s a good way to rehearse the data migration, and to identify and resolve problems before you do everything for real. You should perform several dry runs before attempting the actual migration.

    On a related note, you can do as many dry runs as you want (and need), but sooner or later you’ll have to migrate some real data to the real user profile store. When that time comes, we don’t recommend trying to migrate all 9 million of your user accounts on your first attempt. Instead, you might want to migrate 3 or 4 user accounts, and make sure that all the fields are copied over successfully. If so, you can then try 50 or 100 accounts and do the same thing. When you’re fully confident that the process is working, you can copy over the entire datafile. (For one way to stage this, see the sketch following this list.)
     
  • If you have multiple legacy systems, you might want to copy them one at a time. For example, suppose you have separate systems for users in North and South America, users in Europe, and users in Africa and Asia. Instead of combining the CSV files, you might choose to do separate migrations, one for each legacy system.
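For example, assuming the first row of your CSV files is a header row (so that head -n 5 keeps the header plus 4 records), a staged, system-by-system migration might look something like this. The regional file names are hypothetical; the domain and credentials are the samples used throughout this article:

head -n 5 americas_users.csv > pilot_users.csv
python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 pilot_users.csv
python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 americas_users.csv
python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 europe_users.csv

Verify the pilot accounts in the Console before moving on to the full files.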