Friday, November 30, 2018

Mysql Replication Crash Safe Slaves Replication Feature introduced in 5.6

What to fix:

2014-07-21 06:29:58 3850 [Warning] Slave SQL: If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0

Current Settings

Add 2 lines to server 1/2 my.cnf

master_info_repository=TABLE
relay_log_info_repository=TABLE

Restart mysqld

Verify the Setting

root@server1> show variables like '%info_repository';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| master_info_repository    | TABLE |
| relay_log_info_repository | TABLE |
+---------------------------+-------+
2 rows in set (0.00 sec)
root@server1> select * from mysql.slave_master_info \G
*************************** 1. row ***************************
       Number_of_lines: 23
       Master_log_name: bin-log.000008
        Master_log_pos: 120
                  Host: 127.0.0.1
             User_name: repl_user
         User_password: repl_pass
                  Port: 3307
         Connect_retry: 60
           Enabled_ssl: 0
                Ssl_ca:
            Ssl_capath:
              Ssl_cert:
            Ssl_cipher:
               Ssl_key:
Ssl_verify_server_cert: 0
             Heartbeat: 1800
                  Bind:
    Ignored_server_ids: 0
                  Uuid: 37f99013-0fa4-11e4-ae2f-e13c88f43118
           Retry_count: 86400
               Ssl_crl:
           Ssl_crlpath:
Enabled_auto_position: 0
1 row in set (0.00 sec)
root@server1> select * from mysql.slave_relay_log_info \G
*************************** 1. row ***************************
Number_of_lines: 7
   Relay_log_name: /mysql56/mysql-server1/slave/relay-bin.000021
    Relay_log_pos: 281
Master_log_name: bin-log.000008
   Master_log_pos: 120
        Sql_delay: 0
Number_of_workers: 0
               Id: 1
1 row in set (0.00 sec)

Friday, November 9, 2018

[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT

[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. CREATE... IGNORE SELECT is unsafe because the order in which rows are retrieved by the SELECT determines which (if any) rows are ignored. This order cannot be predicted and may differ on master and the slave. Statement: create temporary table IF NOT EXISTS vt_tmp_u1098(id int(11) primary key) ignore (SELECT 1098 as id) UNION (SELECT vtiger_user2role.userid AS userid FROM vtiger_user2role INNER JOIN vtiger_users ON vtiger_users.id=vtiger_user2role.userid INNER JOIN vtiger_role ON vtiger_role.roleid=vtiger_user2role.roleid WHERE vtiger_role.parentrole like 'H1::H2::H31::H32::%') UNION (SELECT groupid FROM vtiger_groups where groupid in (777,778))

23:17:39' /* user:maintenance/update_mysql_from_logs.pl bt:main,486,445,364,164,79 */

2018-11-09T16:42:03.193203-08:00 8747303 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE on a table with more than one UNIQUE KEY is unsafe Statement: INSERT INTO demographics (pwsid, active, sex, bday, bday2, lat, lon, lastlogin, handle, race, body_type, sex_orient, marital, state, country, cupid_sex) VALUES ('2155832_48095', '1', '1', '1954-02-01', '1900-00-00', '25.7979', '-80.3071', '2018-11-09 16:41:16', 'e409', '4', '1', '0', '0', 'Florida', 'United States', '2') ON DUPLICATE KEY UPDATE active = '1',lastlogin = '2018-11-09 16:41:16' /* user:maintenance/update_mysql_from_logs.pl bt:main,486,445,364,164,79 */

The error message it gives you is pretty good actually. Basically the statement that it is warning you about could potentially end up with different results when it runs on the slave versus when it ran on the master. This is something that a tool like Percona's pt-table-checksum (see below link) can help with to detect what is called slave drift or data drift, which is when the slave has different data than the master.

What to actually do about this (other than check your tables with the tool I mentioned) depends on the situation. The query it is warning you about may end up working perfectly fine, but you would not know for sure without checking, and even then it may work sometimes and not others.

The best option in my opinion is to switch to row based replication to avoid this, as then MySQL is replicating the data itself and not the statement. Row based replication has some downsides, mainly that more data is transferred, so slave lag may be more of an issue in extreme cases, so you'll want to read up on that first before making any changes.

The other option would be to re-write the query to avoid any functions / formatting that results in unsafe queries for statement best replication

Mixed format does pretty much what it sounds like, but I've never used it myself and have not seen many other people using it. So that choice is up to you, but I would make sure to research / test it first.

MIXED mode is a good compromise between statement and ROW, and will work for the statement you had in warning message. Basically, for any data update that is "unsafe" from replication point of view, MySQL will log all changed rows values instead of the original statement. But it will keep logging other updates the same way as in statement mode.

In "MIXED" mode, replicated transactions are written in either of two formats: row based logging, or statement based login. MySQL decides to use statement or row based log entry if it considers a given statement is "safe" or not. So "safe" updates will be logged just like in the STATEMENT binlog format, and "not-safe" will be logged as row-based events. This way the MIXED mode is more safe then STATEMENT from data integrity point of view. It's just a compromise between pure STATEMENT or pure ROW modes.
Now, whether to use pure ROW format, we can argue, as there are many pros and cons of this solution

NOTE:
The binlog_format variable is dynamic, so you may change it at any point. The cleanest way would be to set it in your my.cnf on the master and restart the server, otherwise you can stop replication on your slave, change the setting dynamically on the master, and then start replication on your slave so that it picks up the change since a change in global variables only applies to new connections.

Secondly to just switch to "ROW" based replication, I can see that all i need to do is specify the command globally for all.
like this :- mysql> SET GLOBAL binlog_format = 'ROW';

Friday, November 2, 2018

SQL Skills Missing in Developers

Database Design and Support Skills

The biggest assumption is that just because a developer knows SQL that they have to skills to perform database design, data profiling or even data querying. Here are my thoughts on typical skills needed in system development:

Database Modelling Skills

Mission-critical databases should be designed by data architects or at least trained data modellers. Many clients I have worked for used their application developers to model and build their databases. Database design is a skill. The just-build-it approach (just add a table or a column on the fly) is not proper database development – it is a data slum.

Database Design Skills

Too many database are being designed by developers in an ad-hoc approach – missing a table, just add it; missing a column, just add it. Designing databases is not about creating SQL. The designer must know what makes a good model and must be knowledgeable and keen on data management best practises. In short, good database design requires data architecture experience.

Moreover, it is about asking the right questions to come up with a design that will meet the business needs, and be performant and resilient to business change. One should always use a data modelling tool to design a database and generate the SQL needed to create/modify the database – it should never be hand-crafted.

Avoid hard-coded SQL statements in application programs that can be used by developers to override expected system logic.
Design Referential Integrity (RI) into the model so that the database automatically creates the constraints to enforce RI.
Build all business rules into the tables thereby minimizing and preferably eliminating the hard-coded rules in the applications.

Data Profiling Skills

Just because a developer may know SQL does not mean that s/he knows how to perform data profiling to find all frequent data issues and determine their causes. S/he needs to have keen powers of observation and an unwillingness to ignore data issues.

Data Discovery Skills

Just because a developer may know SQL does not mean that s/he knows how to find patterns in the data and interpret their meaning. S/he may not understand the relationships between the tables.

Query Design Skills

It is not enough for a developer to know SQL to design queries and reports. S/he may not understand the business needs so s/he cannot possibly design the SQL queries needed to produce the outputs.

Database Indexing Knowledge

Just because a developer knows SQL does not mean s/he knows proper SQL construction. The following examples of an actual study showed that many query performance problems were caused by a lack of SQL knowledge, poor design or poor SQL coding.

Ensure criteria in a WHERE clause are indexed in the database;
Avoid an equation on left side of a WHERE clause, such as WHERE Amount + X < 5000;
Avoid WHERE clauses that contain columns inside an equation such as WHERE account_name = bestValue (acc_name, account_name).

These kind of queries cannot use indexing and may result in a full table scan and thus very poor performance.

What Makes Good Primary Key?

The following question was recently asked on Quora:

How do I choose the primary key when there are two unique identifiers?

I disagree that choosing an integer over a character field is the best answer for a Primary Key (PK). I am providing a synopsis of the standards I defined in my Data Architecture Handbook and Data Management Framework. There is additional detail and diagrams under the section Data Design Considerations below.

Data Type

When I first started my 35-year+ career as a data architect, it was true that integer fields were more performant than character. This has not been true for a while yet the belief still persists.

When I was building a major (multi-terabyte) Enterprise Data Warehouse (EDW) for a law enforcement system, I actually had to prove this to my DBAs via an extensive test and the difference was nano-seconds. What was also proved is that fixed length character is superior to variable-length character. Names and descriptions are not good candidates for a PK for many reasons.

Key Length

Length not data type has an impact on index performance.

Composite Keys

However, my DBs and EDWs often use composite PKs consisting of multiple columns and there is an insignificant impact on performance. There is also the extra benefit of being able to select by the highest order columns without having to know the exact value you are looking for in the lower level columns; e.g. find all country-subdivisions (state/province/canton…) for a selected country.

Using a Composite Key is far better than using Junction Tables (see Figure 2 below) in between every two related parent and child tables. Increasing the number of tables joined in queries has a definite performance hit and SQL coding complexity.

Two Unique Identifiers

So back to the original question of which column to choose. Generally, having two columns that uniquely identify each row is against Third Normal Form (3NF) Rules. 3NF is the basis on which relational databases were originally designed by their founder, Codd.

First you need to identify is what is the origin and purpose of each column. Why was a second unique key added to the table? I will give you one example I have come across a lot. Many governmental departments used to create a table of Countries identified by an arbitrary integer number. At a later date they added the ISO two-character Country Code. In this case, the two-character code is superior to the four or eight-byte integer. But not for the reason of length.

Reference Code Tables

The point of ISO and other international standards is to provide a set of code values that are independent of language and can be exchanged between all countries and companies. Therefore, it is the ISO, IATA, ITU, United Nations, Microsoft… code that MUST be the primary key and must be exchanged between organizations. Therefore, all Reference Code Tables should use the code values defined by the international or industry-specific standards (for instance, Microsoft defined the Language codes).

Operational Data Tables

This is where a judgement call needs to be made. Look at the purpose of each column and choose the one that makes the most sense for your business. So, for a Purchase Order table one would want to use the PO Number as defined by the business. Generally, these keys are integer but not always. For instance, the General Ledger usually looks like a number but is actually defined as character by most financial accounting systems.

Data Design Considerations

Primary Key Choices

Natural Key

A natural key is formed from attributes that already exist in the real world.^[2] Especially if it is alphabetic, it can indicate the meaning of the data held in a row. An example in the figure below is Country, where a mnemonic acronym defined by ISO is used to identify each country. The two-character ISO Country is pervasive as it is used by the banking, travel, financial, foreign exchange (prefix for the ISO Currency Code), and internet industries (used in web domains).

While some of these attributes will be natural keys, many are auto-incrementing integer numbers such as Purchase Order Number, Transaction Id, Contract Id, Invoice Number, and Customer Id.

Surrogate Key

Surrogate Key is a unique identifier for an object or row in a table and is frequently a sequential auto-number. It has no relationship to the real-world meaning of the data held in a row. Surrogate Key is essential for EDWs, which integrate data into a table from multiple data sources. One cannot use the key defined in the source system as it may conflict, be the same as a key in another system and/or may be changed by the source system.

Data Modelling

If a database has been designed properly there is no problem changing a primary key. The problem is that many developers with limited data architecture experience are still designing databases. In many instances, databases are not developed with industry-strength Data Modelling (DM) tools. If a DM tool is used, Referential Integrity business rules will be built right into the database. There will be no need to do your own “coding”. The RI rules will take care of changes or deletions, and when they are allowed.

Composite Keys

When a PK consists of two or more columns it is called a composite primary key. A composite key is required when one table is dependent on another table. For instance, Country Region (province, state, canton, département…) is only unique within (i.e. dependent on) Country. As shown in the figure below, the Country Region Codes table has a primary key consisting of Country and Ctry Region.

Figure 1: Composite Primary Key

Junction Tables

This following figure is an example of introducing an extraneous table into the model/database.[1] It provides no value-add and makes queries more complex and less performant due to the extra joins. Junction tables lead to table proliferation.

Figure 2: Junction Table between every Parent & Child Table

Identifying or Dependent Relationship

Instead directly link the tables using an identifying or dependent relationship; in this case, phones belong to or are used by an entity. A real plus is that a query can be made that finds all phones used by an entity. You don’t have to know the phone numbers used by the entity. This would be a single SELECT statement against a single table.

Figure 3: Child Table with Identifying Relationship

SQL Coding

Poorly coded SQL has probably the most significant impact on query performance.

10 Most Critical Database Mistakes

Top 10 Database Design Issues

After well over 30 years in the industry, I am amazed how some basic problems still exist. It is like the industry has not matured at all. You can look through the Internet and see similar problems described in the past. The list is not in order of significance but some of the biggest issues are the last three.

Summary of Top 10 Design Issues

Assigning Atrocious Element Names
Missing Data Documentation
Disregarding Normalization
Misusing SQL Functions and Procedures
Ignoring Data Integrity
Dismissing Interoperability & Reuse
Promoting Poor Performance
Dismissing Data Modelling & Design
Ignoring Object-Oriented Design
Bypassing Data Management

Each post will discuss one of these issues. If you can think of some examples from your experience, please share them as comments.

1. Assigning Atrocious Element Names

Should names of entities and attributes be a developer’s personal choice? Do we design IT solutions for the benefit of the developer or the Business? If you agree it’s the Business, then why would you use names that do not clearly explain the purpose of the data element? For example, what does the column name CODE really mean? Yes I have seen that column name!

In fact, when management says they want a data dictionary, this is really a reflection on the fact they do not understand the meaning of the names that were used. Attribute/column names must be clear in meaning. Future users and developers should not have to search endlessly for documentation – if it can be found – that explains their meaning.

One space should be used between words because that is how human languages work and that is what should be shown to Business users in conceptual/logical models and data dictionaries. Spaces improve readability. Moreover, the Physical Data Model automatically changes the space to an underscore with no effort on the part of the modeller.

Abbreviations

If you agree that a database/system should be designed for the Business, then why would you use unintelligible, highly-abbreviated names without vowels? For instance, what does X20 mean?

Yes, there is a limit of 30 characters for names in Oracle. But one should be able to name an element with 30 characters. In fact, developers drop vowels and use totally non-standard abbreviations even when the fully-spelled name would have been less than 30 characters. This has to stop.

To improve this situation on my projects, I developed a routine to convert all physical names to improved conceptual business names. The abbreviations were expanded to business language and very few exceeded the 30-character limit.

If the designer has to use abbreviations, it should be ones that are used throughout the world, generally accepted in the industry, found in the dictionary or the www.abbreviations.com web site. For example, the standard abbreviation for elevation is ELEV not ELVTN.

How do you manage and standardize acceptable abbreviations? You make a “small” list that developers are allowed to use and vet all names against that list. There is no point abbreviated Date to DT to save a mere two characters. Also consider clarity. If you abbreviate Street to ST how does the reader know if ST means street, saint or standard time? Clearly, one should focus on abbreviating longer words. Here is a example list:

Table 1 Standard Abbreviations Sorted by Meaning

Abbreviation	Meaning
Abbrev	Abbreviation
Accom	Accommodation
Admin	Administration
Desc	Description

Name Prefixes and Suffixes

Prefixes and suffixes are popular in application languages but they should not be used in data models and databases.

Names should not contain data type prefixes or suffixes like txtGender.

That is developer-speak. You look in the database or the data dictionary IN the data model to know data types and other metadata. A similar example is the superfluous use of “text” when describing a message or description, or “number” when describing a count column; e.g. RETURN_MESSAGE_TEXT or PERSON_COUNT_NBR.

Names should not contain system or table prefixes like ERS_GENDER_TYPE, AMS_ADDRESS or CRM_ERROR_MSG.

It takes up too many characters when name is limited to length 30. Moreover, a data modelling tool will migrate the key of a parent table to the child table AS IS. It will not change Country Code to Purchase Order Country Code, Requisition Country Code… Similarly, a good data modelling or database tool will automatically join tables using the named data elements that are common to both tables. Another way to look at it is element reuse…

Element Reuse

Table and Data Elements are meant to be reused.

In fact, if you believe in object-oriented design then data elements that mean the same thing should have exactly the same name. Normally database elements χ1 and χ2 are physically separate entities even if they have the same name. But it tells the users they are the same. It also means that if I search for element χ then all instances of element χ will be found.

Some Data Modelling tools enable Table and Data Element reuse.

In fact, the best data modelling tools allow model element reuse and inheritance. This means that if I update element χ then all instances of element χ in the model will be modified. This facilitates changes to the description and other metadata in the base object as you know they will be automatically inherited by all other instances – a real improvement for data dictionaries and metadata repositories.

If your data modelling tool does not allow inheritance, then use the DOMAIN concept. A DOMAIN applies a set of metadata properties to all members of the Domain. The data elements do not have to have the same name but will have the same data types, defaults, maximum values and other constraint rule properties. DOMAIN is an actual SQL construct that dates back to the origin of Relational Databases.

Three main categories to SQL and database performance

There are three main categories to SQL and database performance.

SQL Coding Techniques
Database Indexing
Database Design

SQL CODING TECHNIQUES

SQL is a language and you can control how well it is being used. Most of these are things you should not do:

Avoid an equation on left side of a WHERE clause, such as WHERE Amount + 3000 > Past_Amount;
Never use function calls on the left side of a WHERE clause such as WHERE UPPER (Last_Name) LIKE β;
Do not concatenate columns in the WHERE clause as it cannot use the indexes; e.g. Last_Name || Given_Names;
Avoid UNION of datasets: UNION requires a sort operation to eliminate any rows that are duplicated across the two data sets;
Avoid SQL statements that improperly handle null values;
Code the most important selection criteria first in the WHERE clause.

DATABASE INDEXING

This is one of most important things that affects performance. But only a DBA can add or change indexes. So a developer has no control over this. You may not even have the authority to look at the database to see how it’s defined. What are the things that need to be looked at? The candidate action items are:

Ensure you always define Primary Keys (PK) and Foreign Keys (FK)

If these are missing then your database was not designed by a data modeller using a data model. A data model is not just a pretty picture. It automatically creates FK constraints and indexes when you join two tables. Moreover, the parent tables — usually your reference codes or dimensions — are the most important selection criteria in your BI queries and reports.

See What Makes Good Primary Key? – no, PKs do not need to be integers!

If there is no data model it probably means your database was not designed at all and probably is a victim of the “Just Build It” approach performed by developers. In other words, it’s a slum not an architecture. I have seen databases where only half the tables had a PK, which means you cannot have an FK, which means you cannot have Referential Integrity. RI is a far bigger and more important problem than performance.

Use Clustered Indexes appropriately
Ensure that the most Important Selection Criteria are Indexed

Selection Criteria are not only in your own SQL scripts but also implicitly in stored procedures and the SQL run by BI and other analytical tools. The candidates are:

Index Columns used frequently in a WHERE clause
Index Columns used in JOIN, GROUP BY, ORDER BY and TOP clauses
Index Columns used in Aggregation Clauses
Avoid Indexes with a Long Length

It is not true that integer keys are faster than character-based keys. That was true eons ago but not in todays databases. What is true is that a two-character key will be faster than a set of big integer fields or a > 30 variable character key.

Avoid Creating Tables with Less Than 6 Rows

If you use an index based on a table with 3 or 4 values (e.g. Gender F, M, T, and unassigned) you will be performing a full table scan — go for coffee. There will be probably be a Gender value of type F on every page of the table so the index is non-performant. Do not even do it.

P.S.: use a value list (aka enumeration) constraint instead.

DATABASE DESIGN

If the DB is poorly designed it may be near impossible or very inefficient to create the query your users want. For instance, one database created a separate table for each request form or document. Now management wants to get a cross-tab of the number of requests received, reviewed, approved or rejected and the average number of days spent in each approval stage by document type. Because the data is split between different tables, you would have to do multiple queries and then use the very inefficient UNION clause. That assumes that each table uses the same columns, names, definitions and business rules. Alternatively, the users extract the data to Excel and spend significant manual effort (days) massaging the data. The answer should really be possible in one query taking seconds.