A task I had to face recently was to upload tens millions of records in a MySQL table - as fast as possible: - reducing the wallclock and CPU time to a minimum - and more important, optimizing the MySQL downtime
There are many solutions to this problem, but few of them are really interesting in terms of time.
Let's play a bit with some of them.
As we need some test data, let's try to work with IEEE OUI numbers which are used for instance to identify the vendors of a NIC based on its MAC Address. There is a dataset of more than ten thousands entries there. It's not a lot, but it enough to show some of the basic concepts.
The first step is to get the OUI list - the list has currently about 10 000 vendors, but continuously increases.
cd ~
mkdir tmp
wget http://standards.ieee.org/regauth/oui/oui.txt
A look at the beginning of the file shows its structure:
OUI Organization
company_id Organization
Address
00-00-00 (hex) XEROX CORPORATION
000000 (base 16) XEROX CORPORATION
M/S 105-50C
800 PHILLIPS ROAD
WEBSTER NY 14580
UNITED STATES
A reasonable assumption is that we want to store: - the 3-octets prefix as a string. This field will be indexed (it should probably be 'unique' as well, but I'm afraid the IEEE database has duplicates there) - the vendor name as a string
This translates quite easily in MySQL:
mysql> create database tmp;
Query OK, 1 row affected (0.00 sec)
mysql> use tmp;
Database changed
mysql> create table oui (id int(11) primary key auto_increment, oui char(6) not null, vendor varchar(255) not null);
Query OK, 0 rows affected (0.01 sec)
mysql> create index idx_oui_oui on oui (oui);
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> grant all privileges on tmp.oui to 'test_perl'@'localhost';
Query OK, 0 rows affected (0.07 sec)
The naive approach is the following: - parse the data - look for a new OUI number - when found, prepare the statement - and insert the value in the DB
Translated in Perl code, the result is the following:
#!/usr/bin/perl -w
use strict;
use DBI;
my $line;
my $dbh = DBI->connect("DBI:mysql:database=tmp", "test_perl");
$dbh->do ("TRUNCATE TABLE oui");
while ($line = <>)
{
next unless $line =~ /^([0-F]{6})\s+\(base 16\)\s+(.*)$/;
my ($oui, $vendor) = ($1, $2);
my $sth = $dbh->prepare ("INSERT INTO oui(oui, vendor) VALUES (?, ?)") or die "Cannot prepare $!";
$sth->execute ($oui, $vendor) or die "Cannot insert $!";
}
This runs in consistently 9.9 seconds of real time on my server:
mac@dax:~/tmp$ time ./test.pl oui.txt
real 0m9.943s
user 0m4.935s
sys 0m0.437s
The 'prepared' statements are exactly what they sound like: statements which get first prepared, then executed. The actual preparations has to be done one single time - and the statement can be executed many times... Let's try to factorize the statement preparation:
#!/usr/bin/perl -w
use strict;
use DBI;
my $line;
my $dbh = DBI->connect("DBI:mysql:database=tmp", "test_perl");
$dbh->do ("TRUNCATE TABLE oui");
my $sth = $dbh->prepare ("INSERT INTO oui(oui, vendor) VALUES (?, ?)") or die "Cannot prepare $!";
while ($line = <>)
{
next unless $line =~ /^([0-F]{6})\s+\(base 16\)\s+(.*)$/;
my ($oui, $vendor) = ($1, $2);
$sth->execute ($oui, $vendor) or die "Cannot insert $!";
}
Bingo! We're down to an average of 7 seconds...
mac@dax:~/tmp$ time ./test.pl oui.txt
real 0m6.811s
user 0m2.406s
sys 0m0.355s
But there's much faster...
LOAD DATA INFILE is MySQL's high-speed solution to load data in bulk. It works by parsing a file on disk... That means that we have to customize a bit out Perl script to store the data in a file after parsing:
#!/usr/bin/perl -w
use strict;
use File::Temp qw/ tempfile /;
use DBI;
my $line;
my $dbh = DBI->connect("DBI:mysql:database=tmp", "test_perl");
# Create the temporary file
my ($fh, $filename) = tempfile(undef, UNLINK => 1);
while ($line = <>)
{
next unless $line =~ /^([0-F]{6})\s+\(base 16\)\s+(.*)$/;
my ($oui, $vendor) = ($1, $2);
$fh->printf ("%s\t%s\n", $oui, $vendor);
}
$fh->close();
$dbh->do ("TRUNCATE TABLE oui");
$dbh->do ("LOAD DATA LOCAL INFILE '$filename' INTO TABLE oui(oui, vendor)");
And the timing gives...
mac@dax:~/tmp$ time ./test.pl oui.txt
real 0m0.493s
user 0m0.322s
sys 0m0.032s
Speedup of more than 10 times... And the DB's downtime is now limited to one single instruction. Difficult to do much better !