Important. We should emphasize that the data migration script relies on the entity.bulkCreate endpoint to create user accounts. That's important because this endpoint does not generate webhook events: if you use entity.bulkCreate to create 10,000 user accounts, you won't receive a single webhook notification. Because the endpoint doesn't generate entityCreated events, there's nothing for Webhooks v3 to report. Among other things, this means that, whenever you do a data migration, you don't have to worry about disabling any webhook subscriptions that listen for the entityCreated event.
To run the dataload.py script (and migrate your data), you need to include several of the command-line arguments shown in the following table. Note that some of these arguments are optional and some are required; required arguments are marked in yellow:
Displays the parameters that can be used with dataload.py. For example:
There is no need to include other parameters when using -h. In fact, if you do include additional parameters, those parameters are ignored and the help information is displayed anyway.
This, by the way, is the same help you see if you call the script without any arguments.
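For instance, assuming the script follows the standard convention of using -h (or --help) to display usage information, the call is as simple as this:

    python dataload.py -h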
The URI to your Identity Cloud Capture domain. For example:
You can find the URI to your Capture domain by looking at the Manage Application page in the Console.
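As a purely illustrative example (the application name shown here is made up; substitute the value shown on your own Manage Application page), a Capture domain URI generally looks something like this:

    https://my-application.janraincapture.com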
Client ID for the API client used to do the data migration. For example:
Client secret for the API client used to do the data migration. For example:
Reserved for Akamai internal use.
Reserved for Akamai internal use.
Name of the entity type that user records should be written to. For example:
If not specified, the script defaults to the user entity type.
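As an illustration, if your records belong in a custom entity type named migrated_users rather than the default user type, the relevant fragment of the call might look like this (the long-form flag name is a guess; confirm the real spelling by using -h):

    --type_name migrated_users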
Number of records to be included in each batch; the default value is 10. For example:
The batch size determines the number of records sent with each call to the entity.bulkCreate endpoint.
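For instance, to send 50 records per bulkCreate call, the relevant fragment might look like the following (the flag name is a placeholder; verify it with -h):

    --batch_size 50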
Record number (i.e., line number within the CSV file) where the migration process should start; the default value is 1. For example:
This parameter is typically used if a previous import failed at an identifiable point in the process (e.g., if the first 99 records were successfully imported before network connectivity was lost).
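For instance, picking up that scenario, restarting the migration at record 100 (after records 1 through 99 were imported successfully) might look like this fragment (flag name illustrative):

    --start_at 100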
Total number of worker threads; the default value is 4. Adding threads can speed up the data migration process.
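For example, doubling the worker count from the default of 4 to 8 might look like this fragment (again, the flag name is a placeholder; check the help output):

    --workers 8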
Amount of time that can elapse, in seconds, before an API call times out; the default value is 10. For example:
It’s recommended that you never set the API timeout to be greater than 10 seconds. As a general rule, it’s better to change the batch size than it is to change the timeout interval.
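For instance, explicitly setting the timeout to the default (and recommended maximum) of 10 seconds might look like this fragment (flag name illustrative):

    --timeout 10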
Maximum number of API calls that can be made per second; the default value is 1. For example:
If you receive a 510 (rate limit) error while running the script, use this parameter to reduce the maximum number of API calls that can be made per second.
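For example, a hypothetical fragment that allows the script to make two API calls per second (assuming your account's rate limits permit that) might look like this:

    --rate_limit 2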
Runs through the full data migration process, but without copying records from the legacy data file to the user profile store. Note that you must include all the required parameters in order to successfully complete a dry run. For example:
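A hypothetical dry run might therefore look like the following; the flag spellings are placeholders, and the angle-bracketed values stand in for your own credentials:

    python dataload.py -u <capture_domain> -c <client_id> -s <client_secret> --dry_run example_users.csv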
Performs the migration as a delta migration, which overwrites any matching records already in the user store.
Used during a delta migration. Specify an existing attribute in the target Entity Type to identify duplicate records that will be updated. The attribute must be unique in the schema (the default primary key is email).
For example, suppose you use email as your primary key and your CSV file includes a record that has the email address email@example.com. If dataload.py finds an existing user profile record with that email address, the existing record will be replaced by the record found in the CSV file.
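As an illustration, a delta migration that matches records on the email attribute might include a fragment along these lines (both flag names are guesses; confirm them with -h):

    --delta_migration --primary_key email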
Path to the datafile containing the user records being migrated to the user profile store. This parameter should be the last parameter in your command-line call. For example:
When all is said and done, a call to dataload.py will look something like this:
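(The following is a hypothetical command; the flag names, domain, credentials, and filename are all placeholders rather than values copied from a real migration.)

    python dataload.py -u https://my-application.janraincapture.com -c <client_id> -s <client_secret> -b 25 -w 4 example_users.csv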
Two things to keep in mind here. First, command-line arguments have both a short name and a long name; these two partial commands are identical:
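(The short and long names shown here are hypothetical, assuming the batch size argument uses -b and --batch_size; substitute whichever argument you're actually using.)

    -b 25
    --batch_size 25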
Second, there’s no datafile command line argument. Instead, the datafile is simply the last item in the command:
If the datafile is anywhere except at the very end of the command, your script will run and you won’t get any error messages. However, no data will be copied to the user profile store.
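In other words, structure the call so that everything else comes first and the datafile comes last, as in this illustrative fragment (the filename is a placeholder):

    python dataload.py [other arguments] example_users.csv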
Incidentally, you can run dataload.py as many times as you want; there’s nothing to prevent you from doing that. Is there any reason why you’d even want to run dataload.py on multiple occasions? We can think of at least two possibilities:
- By using the “dry run” option, you can “practice” by running your script against your actual datafile a million times (or more) without ever copying any data to the user profile store. That’s a good way to rehearse the data migration, and to identify and resolve problems before you do everything for real. You should perform several dry runs before attempting the actual migration.
On a related note, you can do as many dry runs as you want (and need), but sooner or later you’ll have to migrate some real data to the real user profile store. When that time comes, we don’t recommend trying to migrate all 9 million of your user accounts on your first attempt. Instead, you might want to migrate 3 or 4 user accounts and make sure that all the fields are copied over successfully. If so, you can then try 50 or 100 accounts and do the same thing. When you’re fully confident that the process is working, you can copy over the entire datafile.
- If you have multiple legacy systems, you might want to copy them one at a time. For example, suppose you have separate systems for users in North and South America, users in Europe, and users in Africa and Asia. Instead of combining the CSV files, you might choose to do separate migrations, one for each legacy system.