#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2012
    Posts
    12
    Rep Power
    0

    Access speed for very large files


    I have a customer who has a table that contains almost 6 million records. Several years ago, when the record count was 4.8 million, I wrote a program in Delphi using Paradox. The customer needs to display a read-only grid containing the records ordered by customer name, page through the grid (up/down arrow, PgUp/PgDn, first/last), and also search by name. With Paradox it was simple: all I needed to do was open the table, Paradox loaded the grid, and access to any page and name search was instant. Unfortunately the table has grown larger than Paradox allows, so I rewrote the program to use MySQL for access.

    I have indexed the table by record number and also by customer name. Although it is possible to simply put a grid on the page and let MySQL handle the access, the access time is completely unacceptable: it takes 5 minutes just to open the table. I then wrote my own controls for the grid so that I could access the table in small pages. This way, from the customer's perspective they are looking at the entire table, but in reality they are only looking at 20 records at a time. For example, if the customer presses the down arrow while at the end of the grid, the next 20 records are loaded. I access the table as SELECT * FROM table ORDER BY customername LIMIT currentposition,20, so all I need to do is change the current position to access 20 records from anywhere in the table.

    My assumption was that, since I am only reading 20 records at a time, access would be quick. It is not. Can someone help me with the proper way to handle pagination with very large files? Thanks in advance
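    A minimal sketch of the setup described above, using the illustrative names from the post (table, customername, currentposition); the index name is hypothetical and the offset of 100 is just an example:
    Code:
    -- Hypothetical index so the ORDER BY can be served from an index
    CREATE INDEX idx_customername ON `table` (customername);

    -- One page of 20 rows; currentposition is the zero-based offset of the page,
    -- e.g. 100 returns rows 100-119 in customer-name order
    SELECT *
    FROM `table`
    ORDER BY customername
    LIMIT 100, 20;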
  2. #2
  3. Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,317
    Rep Power
    7170
    My assumption was that, since I am only reading 20 records at a time access would be quick.
    That assumption is actually correct, provided the query is written well and you have the right indexes on the table.

    What is your actual query and table structure?
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  4. #3
  5. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2012
    Posts
    12
    Rep Power
    0

    More detail


    The file is supplied to my customer in ASCII comma-delimited format. I use the following SQL to import the file:

    LOAD DATA INFILE "E:/File.txt"
    INTO TABLE table1
    FIELDS TERMINATED BY "^"
    LINES TERMINATED BY "\r\n";

    The first issue I have is that the above times out and the load fails. My assumption was that the ASCII file was too large, so I broke it into 6 equal pieces. Once done, I am able to change the above to load file1, file2, and so on. This works, even though it would be easier if I could load it in one piece.
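    Presumably the split loads then look something like this (a sketch; the file names are illustrative, not from the post):
    Code:
    -- One LOAD DATA per piece of the original file
    LOAD DATA INFILE "E:/File1.txt"
    INTO TABLE table1
    FIELDS TERMINATED BY "^"
    LINES TERMINATED BY "\r\n";

    -- ...repeated for File2.txt through File6.txt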

    The table structure is as follows:

    CREATE TABLE table1
    (
      basevalue DOUBLE NULL,
      owner VARCHAR(50) NULL,
      street VARCHAR(40) NULL,
      id INT NULL,
      rdate VARCHAR(10),
      INDEX valuekey (basevalue, owner, id)
    )
    ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;

    As I stated previously, my first idea was to do the same as I did with Paradox: I put a grid on the form and used SQL that accessed all records:

    SELECT * FROM table1 ORDER BY basevalue,owner,id

    This works, but when the table is opened it takes about 5 minutes before the grid is populated, and then when I attempt to PgUp or PgDn there is a lag. If I click End or Home (to go to the last or first record) there is another 5-minute lag.

    Obviously not acceptable.

    I then thought that I would simply write my own pagination code and trap the keystrokes for up/down arrow, PgUp, PgDn, End and Home so that I could control how many records are read. I again put a grid on the form with the following SQL:

    SELECT * FROM table1 ORDER BY basevalue,owner,id LIMIT 0,20

    In this way I am only loading the first 20 records into the grid, starting with record 0. This works; the file now opens instantly. I assumed the issue was solved.

    I then implemented the PgDn and PgUp routines. Basically, PgDn simply adds 20 to the record position each time it is pressed: LIMIT 20,20 / LIMIT 40,20, etc. This seemed to work, but as I kept pressing PgDn the access started to slow. I then used EXPLAIN to see what was happening, and this is what I see:
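    The EXPLAIN results below were presumably gathered with statements along these lines (a sketch; the offsets match the ones discussed):
    Code:
    EXPLAIN SELECT * FROM table1 ORDER BY basevalue, owner, id LIMIT 0,20;
    EXPLAIN SELECT * FROM table1 ORDER BY basevalue, owner, id LIMIT 100,20;
    EXPLAIN SELECT * FROM table1 ORDER BY basevalue, owner, id LIMIT 1000,20;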

    When LIMIT is 0,20 EXPLAIN shows

    possible keys NULL
    key valuekey
    rows 20

    I am confused about possible keys being NULL but, other than that, all seemed OK.

    However, if I change the LIMIT to 100,20, which is what happens after a few PgDns, EXPLAIN now shows:

    possible keys NULL
    key valuekey
    rows 120

    This tells me that all rows from 0 to 120 are being read. I would expect only rows 100 to 120 to be read, and that rows would always be 20, since the LIMIT should start reading at 100 for 20 rows.

    I then changed the LIMIT to 1000,20. Now things get really strange. If it worked as above, I would expect EXPLAIN to show:

    possible keys NULL
    key valuekey
    rows 1020

    What it shows is

    possible keys NULL
    key valuekey
    rows 5287912

    It appears that all 5+ million rows are being read and, based on the lag I see, I believe it.

    My assumption was that a LIMIT of startrow, numberofrows would always return only numberofrows rows, read from whatever position startrow dictated.

    What am I missing? Thanks
  6. #4
  7. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2012
    Posts
    12
    Rep Power
    0

    Issue with EXPLAIN?


    Noticed something else that seems odd. If I use anything from LIMIT 0,20 up to LIMIT 268,20, EXPLAIN shows

    type index
    possible_keys NULL
    key valuekey
    rows 288 (FOR LIMIT 268,20)
    extra

    If I change to LIMIT 269,20 I get

    type ALL
    possible_keys NULL
    key NULL
    rows 5287912
    extra Using filesort

    As a matter of fact, any combination of LIMIT whose offset and row count total 289 or more will change EXPLAIN to ALL / Using filesort,
    i.e. LIMIT 269,20 or LIMIT 268,21 (anything in which the offset and row count total 289 or more).

    Is this a bug?
  8. #5
  9. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2012
    Posts
    12
    Rep Power
    0

    Further tests with LIMIT


    In some further testing with LIMIT x,y I think I see the lag problem. I still can't explain why the access type seems to change from index to ALL when reading 289 or more records, but I do believe I see the problem with using LIMIT for pagination. It appears that LIMIT does not do what I would imagine it would do. My thought was that LIMIT x,y would start reading at record x and return y records. In reality it seems that LIMIT x,y reads the table from record 0, reads through x+y records, then simply discards x records and returns the remaining y records. So if I say LIMIT 1000,10, what LIMIT does is read 1010 records, discard 1000, and return 10.

    This, in my opinion, is nuts. It works OK if the table has a small number of records. I have tested it on a table with 100,000 records and, since the entire table fits in memory, it works well. The problem is that with tables that don't fit in memory it is very slow.
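    In SQL terms, the query below returns 10 rows but still has to walk the first 1010 rows of the sort order (a sketch restating the point above, not a quote from the post):
    Code:
    -- Offset pagination: the server reads 1000 + 10 rows in ORDER BY order,
    -- throws away the first 1000, and returns the remaining 10.
    -- The cost therefore grows with the offset, not with the page size.
    SELECT *
    FROM table1
    ORDER BY basevalue, owner, id
    LIMIT 1000, 10;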
  10. #6
  11. Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    1.
    I think the first thing you should do is create a primary key on your table. InnoDB actually stores the data rows in the primary key index (an index-organized, or clustered, table).
    If you don't specify a primary key in your design, it will create one implicitly, and this is usually a big disadvantage.
    The primary key should be small and AUTO_INCREMENT, because it is used as the row reference in all secondary indexes.

    So in your case, even though you might not intend to use it, I recommend something like:
    Code:
    CREATE TABLE table1
    (
      pid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
      basevalue DOUBLE NULL,
      owner VARCHAR(50) NULL,
      street VARCHAR(40) NULL,
      id INT NULL,
      rdate VARCHAR(10),
      INDEX valuekey (basevalue, owner, id)
    );
    This will usually also speed up the import, since InnoDB doesn't have to rearrange and rebalance the clustered index all the time.
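    If the table already exists and you would rather not reload it, one way to retrofit such a key looks like the sketch below (hypothetical, not from the post; it rewrites the whole table, so expect it to take a while on 5+ million rows):
    Code:
    -- Add a small AUTO_INCREMENT primary key to the existing table
    ALTER TABLE table1
      ADD COLUMN pid INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;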

    2.
    Yes, the way you are describing it is how ORDER BY ... LIMIT x,y works.
    My though was that LIMIT x,y would start reading at record x and return y records.
    How would you otherwise imagine LIMIT to find record x?
    LIMIT is used last in the internal execution of a query to reduce the number of rows returned after all other logic has been applied.
    If you want to use indexes in a better way, you will have to do it in the _WHERE clause_ instead, which can be a bit bothersome when it comes to pagination, but you can usually find something else to paginate on. Besides, clicking 50 times through pagination is usually not what a user wants to do, so some other sort of grouping is usually necessary.
    /Stefan
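    A minimal sketch of the WHERE-clause approach described above (keyset or "seek" pagination), assuming the table1 structure from earlier in the thread. The @last_* variables and their literal values are made up for illustration; they hold the sort key of the last row on the current page, and the sketch assumes those columns are not NULL on the boundary row (NULLs need extra handling):
    Code:
    -- Boundary values copied from the last row currently shown in the grid
    SET @last_basevalue = 125000, @last_owner = 'SMITH JOHN', @last_id = 40412;

    -- Next page of 20: seek past the boundary instead of counting an offset,
    -- which gives MySQL a chance to use the (basevalue, owner, id) index
    -- rather than scanning from row 0
    SELECT *
    FROM table1
    WHERE basevalue > @last_basevalue
       OR (basevalue = @last_basevalue AND owner > @last_owner)
       OR (basevalue = @last_basevalue AND owner = @last_owner AND id > @last_id)
    ORDER BY basevalue, owner, id
    LIMIT 20;
    Paging backwards (PgUp) would use the mirrored comparison with < and ORDER BY ... DESC, reversing the 20 rows client-side before display.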
