#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2010
    Posts
    10
    Rep Power
    0

    Arrow Query Performance


    I am having difficulty figuring out how to improve the performance of a query. The query is used to obtain a summary of individuals from a family tree database based upon their names. Note that each individual may have more than one name. For example a married person has both a birth name and a married name (which may be the same). A person may also have an adopted name, or a nom de plume, etc.

    Code:
    SELECT DISTINCT tblNX.IDIR, tblNX.Surname,
    			  tblIR.Surname, tblNX.GivenName,
    			  tblIR.BirthD, tblIR.DeathD, tblIR.Gender
    			FROM tblNX JOIN tblIR on tblIR.IDIR=tblNX.IDIR
    			WHERE ((tblNX.Surname='Campbell' AND 
    			                         tblNX.GivenName>='John') OR
    			             tblNX.Surname>'Campbell') AND
                                                      tblNX.`Order`>=0 
    			ORDER BY tblNX.SurName, tblNX.GivenName,
    				 tblIR.BirthSD, tblIR.DeathSD LIMIT 50
    The table containing the basic data on the individual is tblIR:

    Code:
    CREATE TABLE `tblIR` ( 
      `ID` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, 
      `IDIR` INT(10) UNSIGNED DEFAULT NULL, 
      `FSID` VARCHAR(255) DEFAULT NULL, 
      `Surname` VARCHAR(120) DEFAULT NULL, 
      `SoundsLike` VARCHAR(4) DEFAULT NULL, 
      `GivenName` VARCHAR(120) DEFAULT NULL, 
      `Prefix` VARCHAR(120) DEFAULT NULL, 
      `Title` VARCHAR(120) DEFAULT NULL, 
      `NameNote` LONGTEXT, 
      `Gender` TINYINT(3) UNSIGNED NOT NULL DEFAULT 0, 
      `BirthD` VARCHAR(100) DEFAULT NULL, 
      `BirthSD` INT(10) NOT NULL DEFAULT -99999999, 
      `IDLRBirth` INT(10) UNSIGNED DEFAULT NULL, 
      `ChrisD` VARCHAR(100) DEFAULT NULL, 
      `ChrisSD` INT(10) NOT NULL DEFAULT -99999999, 
      `IDLRChris` INT(10) UNSIGNED DEFAULT NULL, 
      `ChrTerm` VARCHAR(100) DEFAULT NULL, 
      `DeathD` VARCHAR(100) DEFAULT NULL, 
      `DeathSD` INT(10) NOT NULL DEFAULT -99999999, 
      `IDLRDeath` INT(10) UNSIGNED DEFAULT NULL, 
      `BuriedD` VARCHAR(100) DEFAULT NULL, 
      `BuriedSD` INT(10) NOT NULL DEFAULT -99999999, 
      `IDLRBuried` INT(10) UNSIGNED DEFAULT NULL, 
      ...
      PRIMARY KEY (`ID`), 
      UNIQUE KEY `IDIR` (`IDIR`), 
      KEY `Surname` (`Surname`), 
      KEY `GivenName` (`GivenName`) 
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    The nominal index, which includes all of the alternative names, is in table tblNX:
    Code:
    CREATE TABLE `tblNX` ( 
      `IDNX` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, 
      `IDNR` INT(10) UNSIGNED DEFAULT NULL, 
      `IDIR` INT(10) UNSIGNED DEFAULT NULL, 
      `Order` smallint(5) NOT NULL DEFAULT 0, 
      `MarriedNameCreatedBy` TINYINT(3) UNSIGNED DEFAULT NULL, 
      `MarriedNameMarIDID` INT(10) UNSIGNED DEFAULT NULL, 
      `Prefix` VARCHAR(120) DEFAULT NULL, 
      `Title` VARCHAR(120) DEFAULT NULL, 
      `Surname` VARCHAR(120) NOT NULL DEFAULT '', 
      `GivenName` VARCHAR(120) NOT NULL DEFAULT '', 
      `SoundsLike` VARCHAR(4) NOT NULL DEFAULT '', 
      `UserRef` VARCHAR(50) DEFAULT NULL, 
      `AKANote` LONGTEXT, 
      `PreferredAKA` TINYINT(3) UNSIGNED NOT NULL DEFAULT 0, 
      `BirthSD` INT(10) NOT NULL DEFAULT -99999999, 
      `SrchTag` TINYINT(3) UNSIGNED NOT NULL DEFAULT 0, 
      `qsTag` TINYINT(3) UNSIGNED NOT NULL DEFAULT 0, 
      PRIMARY KEY (`IDNX`), 
      KEY `IDIR` (`IDIR`), 
      KEY `Surname` (`Surname`), 
      KEY `GivenName` (`GivenName`), 
      KEY `SoundsLike` (`SoundsLike`), 
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    problem
    Explain on this query returns:
    Code:
    +----+-------------+-------+------+------------------------+------+---------+------------------------------+-------+---------------------------------+
    | id | select_type | table | type | possible_keys          | key  | key_len | ref                          | rows  | Extra                           |
    +----+-------------+-------+------+------------------------+------+---------+------------------------------+-------+---------------------------------+
    |  1 | SIMPLE      | tblIR | ALL  | IDIR                   | NULL | NULL    | NULL                         | 67863 | Using temporary; Using filesort |
    |  1 | SIMPLE      | tblNX | ref  | IDIR,Surname,GivenName | IDIR | 5       | jcobban_genealogy.tblIR.IDIR |     1 | Using where                     |
    +----+-------------+-------+------+------------------------+------+---------+------------------------------+-------+---------------------------------+
    As I understand this, the server is performing a linear search of the entire table tblIR, which currently has 67863 rows, and then uses IDIR to examine the associated rows of tblNX. This has poor performance and will only get worse as more rows are added. I do not understand why the primary search is not on tblNX, which is the only table referenced in the WHERE clause, or why the key of tblIR, which is explicitly mentioned in the JOIN, is not being used.

    If I extract the core of the query I get:
    Code:
    explain select idir, surname, givenname from tblNX where ((surname='Campbell' and givenname>='John') or surname>'Campbell') order by surname, givenname, birthsd limit 20;
    +----+-------------+-------+-------+-------------------+---------+---------+------+-------+-----------------------------+
    | id | select_type | table | type  | possible_keys     | key     | key_len | ref  | rows  | Extra                       |
    +----+-------------+-------+-------+-------------------+---------+---------+------+-------+-----------------------------+
    |  1 | SIMPLE      | tblNX | range | Surname,GivenName | Surname | 363     | NULL | 78515 | Using where; Using filesort |
    +----+-------------+-------+-------+-------------------+---------+---------+------+-------+-----------------------------+
    1 row in set (0.00 sec)
    This does show the query exploiting the surname index, although not the given name index. The performance of this query is probably acceptable. The problem is that while tblNX contains all of the name information it does not contain the "maiden" surname, gender, birth date, and death date information which I need to display to the end user, which is why I did the join of tblIR.

    How do I restructure this query to get MySQL to do the tblNX query exploiting the surname index and then join the tblIR information into the response?
  2. #2
  3. No Profile Picture
    Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    Originally Posted by jcobban
    Code:
    SELECT DISTINCT tblNX.IDIR, tblNX.Surname,
    			  tblIR.Surname, tblNX.GivenName,
    			  tblIR.BirthD, tblIR.DeathD, tblIR.Gender
    			FROM tblNX JOIN tblIR on tblIR.IDIR=tblNX.IDIR
    			WHERE ((tblNX.Surname='Campbell' AND 
    			                         tblNX.GivenName>='John') OR
    			             tblNX.Surname>'Campbell') AND
                                                      tblNX.`Order`>=0 
    			ORDER BY tblNX.SurName, tblNX.GivenName,
    				 tblIR.BirthSD, tblIR.DeathSD LIMIT 50
    What are you actually testing for with your above conditions?
    Because the operators >= and > are numeric operators and should not be used on strings in the way that you are using them. And most times when you force implicit data conversion like this indexes can't be used properly.
    /Stefan
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    74
    Rep Power
    19
    Originally Posted by sr
    What are you actually testing for with your above conditions?
    Because the operators >= and > are numeric operators and should not be used on strings in the way that you are using them. And most times when you force implicit data conversion like this indexes can't be used properly.
    Actually to my belief (and trial/error experience) using the > and < operators do work for Strings.
    I agree on the statement that the index does not work, because of the inequality operators used. The index is built for a specific value in the column, and not on any conversion of that value.
  6. #4
  7. No Profile Picture
    Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    Originally Posted by Aurum84
    Actually to my belief (and trial/error experience) using the > and < operators do work for Strings.
    Yes but usually not in the way that is expected (if one can expect anything from a numeric operator operating on strings ).

    Since it performs a byte-by-byte comparison of the two strings until it finds a difference and those bytes decide which is larger/smaller of the two strings you get comparison results like:
    Code:
    mysql> select 'AndersenHasAVeryLongName' > 'Campbell';
    +-----------------------------------------+
    | 'AndersenHasAVeryLongName' > 'Campbell' |
    +-----------------------------------------+
    |                                       0 |
    +-----------------------------------------+
    1 row in set (0.00 sec)
    
    mysql> select 'Springsteen' > 'Campbell';
    +----------------------------+
    | 'Springsteen' > 'Campbell' |
    +----------------------------+
    |                          1 |
    +----------------------------+
    1 row in set (0.00 sec)
    That often doesn't really make any sense, I mean why is Springsteen larger than Campbell ?
    /Stefan
  8. #5
  9. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
    Originally Posted by sr
    ... I mean why is Springsteen larger than Campbell ?
    because "larger" is the wrong semantic interpretation of the operator

    look, you use > and < on dates too, right?

    would you say one date is "larger" than another (hint: no)

    so why insist on using "numeric size" for all three different purposes of the operators?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  10. #6
  11. No Profile Picture
    Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    Originally Posted by r937
    look, you use > and < on dates too, right?

    would you say one date is "larger" than another (hint: no)
    I beg to differ older man and say yes.

    Date is time and time is an ever incrementing value that you can add, subtract and calculate the difference between two of them.

    And what part of a date is not a numeric series:
    Code:
    mysql> create table mydate (col1 date not null);
    Query OK, 0 rows affected (0.05 sec)
    
    mysql> insert into mydate values (current_date);
    Query OK, 1 row affected (0.01 sec)
    
    mysql> insert into mydate select max(col1) + interval 1 day from mydate;
    Query OK, 1 row affected (0.00 sec)
    
    ...
    
    
    mysql> select * from mydate;
    +------------+
    | col1       |
    +------------+
    | 2013-03-01 |
    | 2013-03-02 |
    | 2013-03-03 |
    | 2013-03-04 |
    | 2013-03-05 |
    | 2013-03-06 |
    | 2013-03-07 |
    | 2013-03-08 |
    | 2013-03-09 |
    | 2013-03-10 |
    | 2013-03-11 |
    | 2013-03-12 |
    | 2013-03-13 |
    | 2013-03-14 |
    | 2013-03-15 |
    | 2013-03-16 |
    | 2013-03-17 |
    | 2013-03-18 |
    | 2013-03-19 |
    | 2013-03-20 |
    | 2013-03-21 |
    | 2013-03-22 |
    | 2013-03-23 |
    | 2013-03-24 |
    | 2013-03-25 |
    | 2013-03-26 |
    | 2013-03-27 |
    | 2013-03-28 |
    | 2013-03-29 |
    | 2013-03-30 |
    | 2013-03-31 |
    | 2013-04-01 |
    | 2013-04-02 |
    | 2013-04-03 |
    +------------+
    34 rows in set (0.00 sec)
    The fact that you guys on the other side of the puddle don't usually use ISO8601 format to represent a date doesn't make it less of an incrementing series of values (albeit a bit odd base varying between 28-31 for the days).
    /Stefan
  12. #7
  13. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
    Originally Posted by sr
    Date is time and time is an ever incrementing value that you can add, subtract and calculate the difference between two of them.

    And what part of a date is not a numeric series:
    strings are characters and characters are ever incrementing in the collating sequence

    and what part of characters are not in a collating sequence?

    also, i beg to differ, you can ~not~ add two dates

    plus, stop lumping me in with the americans... we canadians are a lot smarter than they are, and we always use yyyy-mm-dd format

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

IMN logo majestic logo threadwatch logo seochat tools logo