Data Migration Log Files

During a data migration, dataload.py updates the following log files in real time:

dataload.log

dataload.log provides a running commentary on pretty much everything that happens during a data migration. For example:

[2018-12-17 11:33:17,592] DEBUG urllib3.connectionpool: Starting new HTTPS connection (1): greg-stemp.us-dev.janraincapture.com:443
[2018-12-17 11:33:17,984] DEBUG urllib3.connectionpool: https://greg-stemp.us-dev.janraincapture.com:443 "POST /entity.bulkCreate HTTP/1.1" 200 None
[2018-12-17 11:33:21,560] INFO janrain-dataload.py: Done!
[2018-12-17 11:39:13,227] DEBUG dataload.cli: https://greg-stemp.us-dev.janraincapture.com
[2018-12-17 11:39:13,228] INFO janrain-dataload.py: Loading data from test_users.csv into the 'user' entity type.
[2018-12-17 11:39:13,228] DEBUG janrain-dataload.py: Minimum processing time per worker: 4.0
[2018-12-17 11:39:13,228] INFO dataload.reader: Validating UTF-8 encoding
[2018-12-17 11:39:13,232] DEBUG dataload.reader: Transform 'birthday': 15/01/2000 => 2000-01-15 00:00:00
[2018-12-17 11:39:13,232] DEBUG dataload.reader: Transform 'password': $P$8lChq8ox9WOf8B06d2JWafiY2qFJ6z. => {'type': 'password-phpass-md5', 'value': '$P$8lChq8ox9WOf8B06d2JWafiY2qFJ6z.'}
[2018-12-17 11:39:13,232] DEBUG dataload.reader: Transform 'primaryAddress.country': Mexico => MX
[2018-12-17 11:39:13,232] DEBUG janrain-dataload.py: [{'email': 'karim.nafir@mail.com', 'givenName': ' Karim', 'familyName': ' Nafir', 'birthday': '1990-02-19 00:00:00', 'gender': ' cisgender', 'displayName': ' Karim Nafir', 'password': {'type': 'password-phpass-md5', 'value': '$P$8lChq8ox9WOf8B06d2JWafiY2qFJ6z.'}, 'primaryAddress': {'city': ' Mexico City', 'country': 'MX'}}]

The preceding log entries recap the migration for a single record, one that contains fewer than 10 fields. As you can see, dataload.log is the very definition of “comprehensive,” and is your go-to tool if you need to do some serious data migration debugging.

Of course, that also means that dataload.log has the potential to grow quite large, especially if you are migrating millions of records (something many organizations need to do). Make sure that you have ample disk space available before you begin your data import.

And just how much disk space is “ample” disk space? That’s a difficult question to answer: it depends on the number of records you need to migrate, the number of fields you need to migrate, the number of data transforms you need to perform, etc. As a general rule, you should take a look at the size of your datafile, multiply that value by 3, then make sure you have at least that much free disk space.
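The rule of thumb above can be checked before a migration starts. The following is a minimal sketch, not part of dataload.py: the function names are hypothetical, and it assumes the log files are written to the current working directory.

```python
import os
import shutil

def required_log_space(datafile_bytes, multiplier=3):
    """Rule of thumb from the text: reserve three times the datafile size."""
    return datafile_bytes * multiplier

def has_ample_space(datafile_bytes, free_bytes, multiplier=3):
    """True if the free space covers the estimated log growth."""
    return free_bytes >= required_log_space(datafile_bytes, multiplier)

if __name__ == "__main__":
    datafile = "test_users.csv"  # datafile name taken from the log excerpt
    if os.path.exists(datafile):
        size = os.path.getsize(datafile)
        # Assumption: logs are written to the current directory.
        free = shutil.disk_usage(".").free
        print("required:", required_log_space(size), "free:", free)
        print("ample:", has_ample_space(size, free))
```

The multiplier is only the general guideline given above; a migration with many fields or transforms may need more headroom.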

Incidentally, dataload.log is a “rotating log” with a maximum size of 500 MB. When the log file reaches 500 MB (524288000 bytes), two things happen. First, the current log file is closed and renamed dataload.log.1. Second, a new dataload.log file is opened and picks up where the first file left off. This continues as long as there is data that needs to be logged.
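To see the rotation in miniature, the sketch below uses Python's standard logging.handlers.RotatingFileHandler with a deliberately tiny size limit. The logger name, the 200-byte limit, and the backupCount value are assumptions chosen for the demonstration and are not taken from dataload.py. Note that Python only rolls the file over when backupCount is greater than zero, so the demo sets it explicitly.

```python
import logging
import logging.handlers
import os
import tempfile

def demo_rotation(log_dir, max_bytes=200, backup_count=2):
    """Write enough records to force rollovers; return the files created."""
    path = os.path.join(log_dir, "dataload.log")
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count)
    logger = logging.getLogger("rotation_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep demo output out of the root logger
    logger.addHandler(handler)
    for i in range(50):
        # About 1 KB in total, far more than the 200-byte limit.
        logger.debug("record %03d processed", i)
    logger.removeHandler(handler)
    handler.close()
    return sorted(os.listdir(log_dir))

with tempfile.TemporaryDirectory() as tmp:
    files = demo_rotation(tmp)
    print(files)
```

With backup_count=2, at most two older files (dataload.log.1 and dataload.log.2) are kept next to the active dataload.log; the oldest data is discarded on each further rollover.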

This behavior, including the maximum file size, is configurable: you just need to edit the logging_config.json file. For example, here we’ve set the maximum file size to 200 MB (209715200 bytes):

"debug_file_handler": {
           "class": "logging.handlers.RotatingFileHandler",
           "level": "DEBUG",
           "formatter": "standard",
           "filename": "dataload.log",
           "maxBytes": 209715200

Another thing to keep in mind is that dataload.log is a cumulative (i.e., an “append”) file: it does not automatically reset each time you run dataload.py. That means that data from any previous data migrations will always be available; it also means that your log files can grow very large. If you want to reset the file, you can open dataload.log in a text editor and delete its contents.
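If a text editor is impractical for a very large file, the same reset can be done from Python. This is a minimal sketch with a hypothetical helper name; it assumes dataload.py is not running at the time, and it leaves any rotated files (dataload.log.1 and so on) untouched.

```python
import os
import tempfile

def reset_log(path="dataload.log"):
    """Empty the log in place: opening in 'w' mode truncates it to zero bytes."""
    with open(path, "w", encoding="utf-8"):
        pass

# Demonstration against a throwaway file rather than a real log.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "dataload.log")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("entries from a previous migration\n")
    reset_log(demo_path)
    size_after = os.path.getsize(demo_path)
    print(size_after)
```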


dataload_info.log

dataload_info.log provides an abridged listing of events as the script runs. This log is also configurable in logging_config.json:

"info_file_handler": {
           "class": "logging.handlers.RotatingFileHandler",
           "level": "INFO",
           "formatter": "simple",
           "filename": "dataload_info.log",
           "maxBytes": 524288000


success.csv

success.csv keeps a running tally of the records that were successfully migrated to the user profile store. For example, this excerpt shows that information was migrated, and a new user profile created, for the user karim.nafir@mail.com. That user has also been assigned the UUID bacfa66d-16e2-492e-a019-da6220df2ae9:

batch,line,uuid,email
1,2,bacfa66d-16e2-492e-a019-da6220df2ae9,karim.nafir@mail.com

Unlike dataload.log, success.csv is not cumulative: each time dataload.py is executed, a new success CSV file is created and a timestamp is appended to the name. For example:

success_May_22_2019_10_55.csv
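Because each run produces its own timestamped file, a small script can pick out the most recent one and read its rows. The sketch below is illustrative only: the helper names are hypothetical, the timestamp layout (abbreviated English month, day, year, hour, minute) is inferred from the single example above, and the second file name in the list is made up for the demonstration.

```python
import csv
import io
from datetime import datetime

def parse_run_time(filename):
    """Read the run time out of a name such as success_May_22_2019_10_55.csv."""
    stem = filename.rsplit(".", 1)[0]   # drop the .csv extension
    stamp = stem.split("_", 1)[1]       # drop the success/fail prefix
    # Assumed layout, inferred from one example file name.
    return datetime.strptime(stamp, "%b_%d_%Y_%H_%M")

def latest_run(filenames):
    """Pick the file name with the most recent timestamp."""
    return max(filenames, key=parse_run_time)

def read_rows(csv_text):
    """Parse CSV text into one dict per record, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Sample content copied from the success.csv excerpt above.
sample = (
    "batch,line,uuid,email\n"
    "1,2,bacfa66d-16e2-492e-a019-da6220df2ae9,karim.nafir@mail.com\n"
)
rows = read_rows(sample)
newest = latest_run([
    "success_May_22_2019_10_55.csv",
    "success_May_21_2019_18_30.csv",  # hypothetical earlier run
])
print(newest, rows[0]["uuid"])
```

Sorting on the parsed timestamp rather than on the raw file name matters here, because month names do not sort chronologically as text.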


fail.csv

fail.csv keeps a running tally of the records that were not successfully migrated to the user profile store. For example, here we see that the first record in the datafile (line 2, line 1 being the data header) failed because of a duplicate value; in this case, that means that there’s already a user profile using the email address referenced in the datafile:

batch,line,email,error
1,2,Attempted to update a duplicate value

Like success.csv, fail.csv is not cumulative: each time dataload.py is executed, a new fail CSV file is created and a timestamp is appended to the name:

fail_May_22_2019_10_55.csv

In addition to logging all its activity, dataload.py continuously reports its progress on the command line, letting you know how many records have been processed, along with some other useful real-time measurements. For example:

Total Success (S) Total Fails (F) Success Rate (SR) Average Records per Minute (AVG)
S:990 F:30 SR:97.06% AVG:12090rec/m : 50%
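If you capture this output (for example, to monitor a long-running migration), the status line can be parsed with a regular expression. The sketch below is an assumption-laden illustration: the pattern is derived only from the sample line above, the function name is hypothetical, and the success rate is assumed to be successes divided by total records processed, which is consistent with the sample (990 of 1020). The trailing percentage is not parsed.

```python
import re

# Pattern inferred from the sample status line; other layouts are not handled.
PROGRESS = re.compile(r"S:(\d+)\s+F:(\d+)\s+SR:([\d.]+)%\s+AVG:(\d+)rec/m")

def parse_progress(line):
    """Return the counters from a status line, or None if it does not match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    success, fails = int(match.group(1)), int(match.group(2))
    total = success + fails
    return {
        "success": success,
        "fails": fails,
        "reported_rate": float(match.group(3)),
        # Assumed formula: successes as a percentage of processed records.
        "computed_rate": round(100 * success / total, 2) if total else 0.0,
        "records_per_minute": int(match.group(4)),
    }

status = parse_progress("S:990 F:30 SR:97.06% AVG:12090rec/m : 50%")
print(status)
```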