Tuesday, April 10, 2012

The Data Charmer: Finding tables without primary keys

Finding tables without primary keys

I was checking a third party server, and I needed to find if there were tables without primary keys. This is important to know, not only because the lack of primary keys affects performance and data accuracy in general, but also because in row-based replication performance can degrade beyond belief when updating tables without primary keys. Anyway, I did not remember off the bat any method to get this information from a server with thousands of tables, and thus I went to find a solution on my own. My first instinct called for using the COLUMNS table from the INFORMATIOn_SCHEMA, and so I came up with this query, where I sum the number of columns that are inside either a PRIMARY or UNIQUE key and filter only the ones where such sum is zero (i.e. no primary or unique keys):
select 
    table_schema,table_name 
from  
    information_schema.columns  
group by 
    table_schema,table_name   
having 
    sum(if(column_key in ('PRI','UNI'), 1,0)) = 0;
This query got the job done, and it was quite quick as well. Then, since I was chatting with Sheeri Cabral about other matters, I asked her if she could come up with an alternative solution. She suggested a LEFT JOIN between the information_schema.tables and information_schema.statistics, which I translated into this query:
select 
    t.table_schema, t.table_name 
from 
    information_schema.tables  t 
    left join information_schema. statistics s 
       on t.table_schema=s.table_schema and t.table_name=s.table_name 
       and s.non_unique=0 
where 
    s.table_name is null;
This query works on the principle that it removes from the tables list all the ones for which there is no corresponding table in the statistics table. This query also works. Using both queries in a relatively empty server did not show any significant difference. But since I knew that I had to use this method on a very busy server, with a lot of tables, I quickly created 1,000 databases, each containing 5 tables, two of which did not have any primary or unique key. Now came the first surprise. The query with GROUP BY took about 0.5 seconds, while the one using LEFT JOIN used 11 seconds. I was about to congratulate myself for my acumen, when I realized that, in addition to schema and table names, I also needed the table engine. For the second query, that is not a problem. Adding the engine to the columns list works OK, and the query runs in 11 seconds like before. The first query, though, can't list the engine. There is no 'engine' in the COLUMNS table. So I needed a JOIN. Thus my query became
select 
    t.table_schema,t.table_name,engine 
from 
    information_schema.tables t 
    inner join information_schema .columns c  
        on t.table_schema=c.table_schema and t.table_name=c.table_name 
group by 
    t.table_schema,t.table_name   
having 
    sum(if(column_key in ('PRI','UNI'), 1,0)) =0;
Guess what? This query ran in 17 seconds (!). So much for my instinct! Joins without keys are not efficient in MySQL, and tables in the information schema are no exception. If anyone has a more efficient method of getting a list of tables without primary key (the list including schema name, table name, and engine), I am curious to know.

No comments:

Post a Comment