2011年9月29日 星期四

SQL Antipatterns: Avoiding the Pitfalls of Database Programming

Bear在這裡買的
amazon的參考

SQL反模式/圖靈程序設計叢書
作者:(美)卡爾文|譯者:譚振林
出版社:人民郵電
ISBN:9787115261274
出版日期:2011/09/01
頁數:253
人民幣:RMB 59 元


Bear: 第三章Bear覺得很重要,講述的內容是 和 CTE 密切相關的東西,一定要多看幾次。

P.20(原文P.36)
這樣的設計叫做鄰接表。

This design is called Adjacency List. It’s probably the most common design software developers use to store hierarchical data. The following is some sample data to show a hierarchy of comments, and an illustration of the tree is shown in Figure 3.2, on the following page.

但是,就算如此多的程序員會將鄰接表作為默認的解決方案,它仍然可能成為一個反模式,原因在於它無法完成樹操作中最普通的一項:查詢一個節點的所有後代。

Adjacency List can be an antipattern when it’s the default choice of so many developers yet it fails to be a solution for one of the most common tasks you need to do with a tree: query all descendants.

P.23(原文P.40)
鄰接表設計的優勢在於能快速地獲取一個給定節點的直接父子節點,它也很容易插入新節點。如果這樣的需求就是你的應用程序對於分層數據的全部操作,那使用鄰接表就可以很好地工作了。

The Adjacency List design might be just fine to support the work you need to do in your application. The strength of the Adjacency List design is retrieving the direct parent or child of a given node. It’s also easy to insert rows. If those operations are all you need to do with your hierarchical data, then Adjacency List can work well for you.

SQL-99標準定義了遞迴查詢的表達式規範,使用 WITH 關鍵定加上公共表表達式。
(Bear: 就是 MS SQL Server 中的 CTE)

Some brands of RDBMS support extensions to SQL to support hierarchies stored in the Adjacency List format. The SQL-99 standard defines recursive query syntax using the WITH keyword followed by a common table expression.

P.33(原文P.52)
鄰接表是最方便的設計,並且很多軟件開發者都了解它。

Adjacency List is the most conventional design, and many software developers recognize it.

如果你使用的數據庫支援 with 或者 connect by prior 的遞迴查詢,那能使得鄰接表的查詢更為高效。

Recursive Queries using WITH or CONNECT BY PRIOR make it more efficient to use the Adjacency List design, provided you use one of the database brands that supports the syntax.

P.35(原文P.55)
這章的目標就是要確認那些使用了主鍵,卻混淆了主鍵的本質而造成的一種反模式。

The objective is to make sure every table has a primary key, but confusion about the nature of a primary key has resulted in an antipattern.

在這樣的表中,需要引入一個對於表的域模型無義的新列來存儲一個偽值。這一列被用作這張表的主鍵,從而通過它來確定表中的一條記錄,即便其他的列允許出現適當的重覆項。這種類型的主鍵列我們通常稱其為偽主鍵或者代理鍵。

A new column is needed in such tables to store an artificial value that has no meaning in the domain modeled by the table. This column is used as the primary key, so you can address rows uniquely while allowing any other attribute column to contain duplicates, if that’s appropriate. This type of primary key column is sometimes called a pseudokey or a surrogate key.

如果你不使用主鍵約束,就只有一個選擇:檢查是否有重覆行。

If you don’t use primary key constraints, you create a chore for yourself: checking for duplicate rows.

P.36(原文P.56)
為主鍵直到SQL 2003才成為一個標準,因而每個數據庫都使用自已特有的SQL擴展來實現偽主鍵,甚至不同數據庫中對於偽主鍵都有不同的名稱(不同的表述)。

Pseudokeys weren’t standardized until SQL:2003, so each database uses its own extension to SQL to implement them. Even the terminology for pseudokeys is vendor-dependent, as shown by the following table:
....
IDENTITY Microsoft SQL Server

P.38(原文P.59)
You’re probably familiar with the SQL syntax for a join, using the keywords JOIN and ON preceding an expression to evaluate matching rows in the two tables.

SELECT * FROM Bugs AS b JOIN BugsProducts AS bp ON (b.bug_id = bp.bug_id);

SQL also supports a more concise syntax for expressing a join between two tables. You can rewrite the previous query in the following way if the columns have the same name in both tables:

SELECT * FROM Bugs JOIN BugsProducts USING (bug_id);

Bear: 在這之前,真的沒聽過也沒看過這樣寫法,真的是太神奇了。

P.41(原文P.63)
如果你的表中包含一列能確保唯一,非空以及能夠用來定位一條記錄,就別僅僅因為傳統而覺得有必要再加上一個偽主鍵。

If your table contains an attribute that’s guaranteed to be unique, is non-null, and can serve to identify the row, don’t feel obligated to add a pseudokey solely for the sake of tradition.

P.47(原文P.70)
日語中有個短語 poka-yoke,意思是"防差錯技術"。這是一種制造工藝,通過在錯誤發生時對錯誤加以阻止、糾正或者引起注意來幫助消除產品缺陷。這項工藝能夠顯著地提昇產品質量,幫助減少糾錯的必要,相比於使用這種工藝的開銷,其所獲得收益更高。

The Japanese phrase poka-yoke means “mistake-proofing.” This term refers to a manufacturing process that helps eliminate product defects by preventing, correcting, or drawing attention to errors as they occur. This practice improves quality and decreases the need for correction, which more than makes up for the cost of its use.

P.52(原文P.75)
這樣的設計稱為實體-屬性-值,簡稱 EAV。有時也稱之為:開放架構、無模式或者名-值對。

This design is called Entity-Attribute-Value, or EAV for short. It’s also sometimes called open schema, schemaless, or name-value pairs.

P.56(原文P.81)
如果你有非關係數據管理的需求,最好的答案是使用非關係技術。

If you have nonrelational data management needs, the best answer is to use a nonrelational technology. This is a book about SQL, not about SQL alternatives, so I’ll list only a sampling of these technologies:

P.57-58(原文P.82)
對於所有的子類型來說,既有一些公共屬性,但同時又有一些子類型特有屬性。這些子類型特有屬性列必須支持空值,因為根據子類型的不同,有些屬性並不需要填寫,從而對於一條記錄來說,那些非空的項會變得比較零散。

The simplest design is to store all related types in one table, with distinct columns for every attribute that exists in any type. Use one attribute to define the subtype of a given row. In this example, this attribute is called issue_type. Some attributes are common to all subtypes. Many attributes are subtype-specific, and these columns must be given a null value on any row storing an object for which the attribute does not apply; the columns with non-null values become sparse.

P.58(原文P.82)
CREATE TABLE Issues (
issue_id SERIAL PRIMARY KEY,
reported_by BIGINT UNSIGNED NOT NULL,
product_id BIGINT UNSIGNED,
priority VARCHAR(20),
version_resolved VARCHAR(20),
status VARCHAR(20),
issue_type VARCHAR(10), -- BUG or FEATURE
severity VARCHAR(20), -- only for bugs
version_affected VARCHAR(20), -- only for bugs
sponsor VARCHAR(50), -- only for feature requests
FOREIGN KEY (reported_by) REFERENCES Accounts(account_id)
FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

Bear: 這裡的意思是,有些欄位在 issue_type 值為 bug時才會被填值,有些欄位是在 issue_type 為 feature 時才會填值。請參考 version_affected severity sponsor 欄位。

P.60(原文P.85)
CREATE TABLE Issues (
issue_id SERIAL PRIMARY KEY,
reported_by BIGINT UNSIGNED NOT NULL,
product_id BIGINT UNSIGNED,
priority VARCHAR(20),
version_resolved VARCHAR(20),
status VARCHAR(20),
FOREIGN KEY (reported_by) REFERENCES Accounts(account_id),
FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

CREATE TABLE Bugs (
issue_id BIGINT UNSIGNED PRIMARY KEY,
severity VARCHAR(20),
version_affected VARCHAR(20),
FOREIGN KEY (issue_id) REFERENCES Issues(issue_id)
);

CREATE TABLE FeatureRequests (
issue_id BIGINT UNSIGNED PRIMARY KEY,
sponsor VARCHAR(50),
FOREIGN KEY (issue_id) REFERENCES Issues(issue_id)
);

Bear:這三個table之間以 issue_id 這個primary key關聯起來,根據這本書中多次引用這種模式,此模式應是最被推薦的方法。

P.68(原文P.94)
就像 EAV 的設計一樣,你應該懷疑任何聲稱有無限擴展性的設計。

Like in EAV, you should be suspicious of any claims of unlimited flexibility.

P.69(原文P.95)
你應該盡可能地避免使用多態關聯,應該使用外鍵約束等來確保引用完整性。

You should avoid the Polymorphic Associations antipattern—use constraints like foreign keys to ensure referential integrity. Polymorphic Associations often relies too much on application code instead of metadata.

P.69(原文P.96)
既要避免多態關聯的缺點,又同時支持你所需要的數據模型,最好的選擇是重新設計數據庫。

It’s better to redesign your database to avoid the weaknesses of Polymorphic Associations but still support the data modeling you need. The following sections describe a few solutions that accommodate the data relationship but make better use of metadata to enforce integrity.

P.77(原文P.104)
The syntax required to search for a single value over multiple columns is lengthy and tedious to write. You can make it more compact by using an IN predicate in a slightly untraditional manner:

SELECT * FROM Bugs
WHERE 'performance' IN (tag1, tag2, tag3)
AND 'printing' IN (tag1, tag2, tag3);

Bear: 這裡又長知識了,欄位被拿來放在 IN 中比對(一般是把值放在IN中比對)

P.77(原文P.105)
NULLIF()在SQL中是一個標準函數,它被除 Informix 和 Ingres 以外的所有廠商支持。

The NULLIF( ) is a standard function in SQL; it’s supported by all brands except Informix and Ingres.

P.88(原文P.117)
手動分割表的一個合理使用場景是歸檔數據 - 將歷史數據從日常使用的數據中移除。通常在過期數據的查詢變得非常移少的情況下,才會進行如此的操作。
如果你沒有同時查詢當前數據和歷史數據的需求,將老數據從當前活動的表轉移到其它地方是很合適的操作。
將數據歸檔到和當前表結構相兼容的新表中,既能支持偶爾做數據分析時的查詢,同時也能讓日常數據查詢變得非常高效。

One good use of manually splitting tables is forarchiving—removing historical data from day-to-day use. Often the need to run queries against
historical data is greatly reduced after the data is no longer current.

If you have no need to query current data and historical data together, it’s appropriate to copy the older data to another location and delete it from the active tables. Archiving keeps the data in a compatible table structure for occasional analysis but allows queries against current data to run with greater performance.

Bear: 這種架構方式在日月光看過。

P.89(原文P.118)
當一張表的數據量變得非常巨大時,除了手動拆分這張表,還有更好的辦法來提昇查詢性能。這些方法就包括了水平分區、垂直分區以及使用關聯表。

There are better ways to improve performance if a table gets too large, instead of splitting the table manually. These include horizontal partitioning, vertical partitioning, and using dependent tables.

P.95(原文P.124)
SQL中的 FLOAT 類型,就和其它大多數編程語言的 float 一樣,根據 IEEE 754 標準使用二進制格式編碼實數數據。

The FLOAT data type in SQL, like float in most programming languages, encodes a real number in a binary format according to the IEEE 754 standard. You need to understand some characteristics of floatingpoint numbers in this format to use them effectively.

P.99(原文P.130)
如果你需要精確地表示十進制數,使用 NUMERIC 類型。FLOAT 類型無法表示很多十進制的有理數,因此它們應該當成非精確值來處理。

If you need exact decimal values, use the NUMERIC data type. The FLOAT data type is unable to represent many decimal rational numbers, so they should be treated as inexact values.

盡可能不要使用浮點數。

Do not use FLOAT if you can avoid it.

P.117(原文P.151)
大多數數據庫都會自動地為主鍵建立索引,因此額外再定義一個索引就是一個冗餘操作。這個額外定義的索引並無任何好處,它只會成為額外的開銷。

bug_id: Most databases create an index automatically for a primary key, so it’s redundant to define another index. There’s no benefit to it, and it could just be extra overhead. Each database brand has its own rules for when to create an index automatically. You need to read the documentation for the database you use.

P.118(原文P.152-153)
Some examples of queries that can’t benefit from this index include the following:

1.SELECT * FROM Accounts ORDER BY first_name, last_name;

這個查詢就是之前電話號碼簿的情況。如果你創建了一種先 last_name 再 first_name 順序的聯合索引(就如同電話號碼簿),它是不會幫你先按照 first_name 進行排序。

This query shows the telephone book scenario. If you create a compound index for the columns last_name followed by first_name (as in a telephone book), the index doesn’t help you sort primarily by first_name.

2.SELECT * FROM Bugs WHERE MONTH(date_reported) = 4;
Even if you create an index for the date_reported column, the order of the index doesn’t help you search by month. The order of this index is based on the entire date, starting with the year. But each year has a fourth month, so the rows where the month is equal to 4 are scattered through the table.

Some databases support indexes on expressions, or indexes on generated columns, as well as indexes on plain columns. But you have to define the index prior to using it, and that index helps only for the expression you specify in its definition.

3. SELECT * FROM Bugs WHERE last_name = 'Charles' OR first_name = 'Charles';
We’re back to the problem that rows with that specific first name are scattered unpredictably with respect to the order of the index we defined. The result of the previous query is the same as the result of the following:

SELECT * FROM Bugs WHERE last_name = 'Charles'
UNION
SELECT * FROM Bugs WHERE first_name = 'Charles';

The index in our example helps find that last name, but it doesn’t help find that first name.

4. SELECT * FROM Bugs WHERE description LIKE '%crash%';

由於這個查詢斷言的匹配子串可能出現在該字段的任何部份,因此即使經過排序的索引結構也幫不上任何忙。

Because the pattern in this search predicate could occur anywhere in the string, there’s no way the sorted index data structure can help.

P.120(原文P.156)
如果一個查詢很少被調用,那即使它是單次調用耗時最多的一個,也不見得是最耗時間的查詢。其它更簡單一點的查詢可能被調得很頻繁,比你所預期的還要頻繁,因此它們所花費的總時間就會更多。專注於這些能夠讓你事半功倍的查詢優化。

The area of greatest cost in your application isn’t necessarily the most time-consuming query if that query is run only rarely. Other simpler queries might be run frequently, more often than you would expect, so they account for more total time. Giving attention to optimizing these queries gives you more bang for your buck.

記住在做查詢性能測試的時候要禁止所有的查詢結果緩存。這些緩存被設計用來繞過查詢過程和索引使用,因此,如果不禁止這些緩存,你是得不到準確信息的。

Disable any query result caching while you’re measuring query performance. This type of cache is designed to bypass query execution and index usage, so it won’t give an accurate measurement.

P.131(原文P.168)
Microsoft .NET 2.0 以上版本,為 Web 界面提供了一個叫做 ConvertEmptyStringToNull 的方法。參數和綁定定段會自動地將空字符串轉換成 NULL。

Similarly, user input cannot represent a null directly. An application that accepts user input may provide a way to map some special input sequence to null. For example, Microsoft .NET 2.0 and newer supports a property called ConvertEmptyStringToNull for web user interfaces. Parameters and bound fields with this property automatically convert an empty string value ("") to null.

P.132(原文P.169)

Expression Expected Actual Because
NULL AND TRUE FALSE NULL Null is not false.
NULL AND FALSE FALSE FALSE Any truth value AND FALSE is false.
NULL OR FALSE FALSE NULL Null is not false.
NULL OR TRUE TRUE TRUE Any truth value OR TRUE is true.
NOT (NULL) TRUE NULL Null is not false.

P.134(原文P.171)
為此可以使用 COALESCE( ) 函數。這個函數接受一系列的值作為入參,並且返回第一個非 null 的參數。

In some queries, you may need to force a column or expression to be non-null for the sake of simplifying the query logic, but you don’t want that value to be stored. What you need is a way to set a default for a given column or expression ad hoc, in a specific query only. For this you should use the COALESCE( ) function. This function accepts a variable number of arguments and returns its first non-null argument.

P.142(原文P.181)
JOIN 解決方案適用於針對大量數據查詢並且可伸縮性比較關鍵時。盡管這個方案比較難以理解和維護,但它總是能比基於子查詢的解決方案更好地適應數據量的變化。記住一定要對不同類型的查詢的性能進行實際的測量,而不是僅靠猜測來判斷那個更好。

Use the JOIN solution when the scalability of the query over large sets of data is important. Although it’s a tougher concept to grasp and therefore more difficult to maintain, it often scales better than a subquerybased solution. Remember to measure the performance of several query forms, instead of assuming that one performs better than the other.

P.149(原文P.189)
每種數據庫都可能針對這個需求提供獨有的解決方案,比如 Microsoft SQL server 2005 增加了一個 TABLESAMPLE 子句

Any given brand of database might implement its own solution for this kind of task. For example, Microsoft SQL Server 2005 added a TABLESAMPLE clause: SELECT * FROM Bugs TABLESAMPLE (1 ROWS);

P.169(原文P.213)
執行多次SQL查詢或者多條SQL語句可能並不是解決問題最高效的辦法,但你應該在效率和解決問題之間找到平衡點。

Executing so many SQL queries or statements may not be the most efficient way to accomplish a task. But you should balance the goal of efficiency against the goal of getting the task done.

P.175(原文P.220)
無論如何你都需要放棄使用通配符

You Need to Give Up Wildcards Anyway

你不可避免地要在查詢中引入列別名、函數或者從列表中排除某列。如果你從一開始就不使用通配符,那之後要對查詢進行修改就會變得更加方便。

You’ll inevitably need to treat some columns in a query individually by employing a column alias or a function or removing a column from the list. If you skip the use of wildcards from the beginning, it’ll be easier to change your query later.

P.186(原文P.231)
第一個方案是當用戶忘記他們的密碼請求幫助的時候,程序發送一封帶有臨時生成密碼的郵件給用戶,而不是直接發給他自已的密碼。為了安全起見,在一個較短的時間之內,這個臨時密碼就會過期,即使這封 email 被截獲了,程序還是不會允許未驗證的訪問。同時程序應該設計成一旦用戶使用臨時密碼登錄後,就應該被強制要求修改密碼。

The first alternative is that when a user who has forgotten his password requests help, instead of emailing his password to him, your application can send an email with a temporary password generated by the application. For additional security, the application may expire the temporary password after a short time, so if the email is intercepted, it’s more likely that it will not allow unauthorized access. Also, the application should be designed so that the user is forced to change the password as his first action when he logs in.

P.192(原文P.239)
某些數據庫可會隱式地將字符串轉換成一個等價的數字,但在標准SQL中,在將字符串轉換成數字時,一定要明確使用 CAST() 函數。

You can’t compare a numeric column directly to a string containing digits in all brands of database. Some databases may implicitly cast the string to a sensible numeric equivalent, but in standard SQL you have to use the CAST( ) function deliberately to convert a string to a numeric data type.

P.198(原文P.)
然而,如果你用了一個參數 is_active = ? 來構造這個表達式,優化器不知道在預處理這條語句的時候你最終會傳入那個值,因此很有可能就選擇了錯誤的優化方案。
(Bear: 這裡要配合書上的範例一起看才清楚。)

Quoting Dynamic Values

Query parameters are usually the best solution, but in rare cases a query with parameter placeholders causes the query optimizer to make odd decisions about which indexes to use.

For example, suppose you have a column in the Accounts table called is_active. This column stores a true value for 99 percent of the rows, giving it an uneven distribution of values. A query that searches for is_active = false would benefit from an index, but it would be a waste to read the index for a query searching for is_active = true. However, if you used a parameter in the expression is_active = ?, the optimizer can’t know which value you will supply when you execute the prepared query, so it’s liable to choose the wrong optimization plan.

In exotic cases like this, it could be better to interpolate values directly into the SQL statement, in spite of the general recommendation to use query parameters. If you do this, you should quote the strings carefully.

$quoted_active = $pdo->quote($_REQUEST["active"]);
$sql = "SELECT * FROM Accounts WHERE is_active = {$quoted_active}";
$stmt = $pdo->query($sql);

Make sure you use a function that is mature and well-tested against obscure SQL security issues. Most data access libraries include such a string-quoting function. For example, in PHP, use PDO::quote( ). Don’t try to implement your own quoting function unless you have studied the security risks thoroughly.

P.199(原文P.246)
下面的PHP例子使用了一些PHP內置的數組函數來生成一個佔位符數組,其條目個數和 $bug_list 的條目個數一致,然後在插入到SQL表達式之前,將這些數組用逗號拼接在一起。

The following example in PHP uses some built-in array functions to produce an array of placeholders the same length as $bug_list and then joins that array with comma separators before interpolating it into the SQL expression.

$sql = "SELECT * FROM Bugs WHERE bug_id IN ("
. join(",", array_fill(0, count($bug_list), "?")) . ")";
$stmt = $pdo->prepare($sql);
$stmt->execute($bug_list);

Use this technique to parameterize a list of values

P.200(原文P.248)
選項需要硬編碼在程序中,但對於表明、列名、和SQL關鍵字來說,這麼做還是很恰當的。
(Bear: 這裡要配合書上的範例一起看才清楚。)

Using this technique has several advantages:
You never combine user input with your SQL query, so you reduce the risk of SQL Injection.
You can make any part of an SQL statement dynamic, including identifiers, SQL keywords, and even entire expressions.
You have an easy and efficient way to validate user choices.
You decouple the internal details of your database queries from the user interface.

The choices are hard-coded in your application, but this is appropriate for table names, column names, and SQL keywords. Choices over the full range of strings or numbers are typical for data values, but not for identifiers or syntax.

P.201(原文P.249)
讓用戶輸入內容,但永遠別讓用戶輸入代碼。

Let users input values, but never let users input code.

P.204(原文P.)
當資料庫生成一個新的偽鍵時,這個值是根據上次生成的值來計算的(即使上次生成的這條記錄已經被刪除了,也不會有任何影響),並不是按照現有記錄中的最大值來計算的,這種一些數據庫開發人員的假設相違背。

Even if you do accomplish this cleanup, it’s short-lived. When a pseudokey generates a new value, the value is greater than the last value it generated (even if the row with that value has since been deleted or changed), not the highest value currently in the table, as some database programmers assume. Suppose you update the row with the greatest bug_id value 4 to the lower unused value to fill a gap. The next row you insert using the default pseudokey generator will allocate 5, leaving a new gap at 4.

P.207(原文P.256)
我們可以算一下:如果你每天24小時不停地插入新的數據,平均每秒產生 1000 行記錄,用完32位整數型所能表示的這些數字總共需要136年。

But do the math: if you generate unique primary key values as you insert 1,000 rows per second, 24 hours per day, you can continue for 136 years before you use all values in an unsigned 32-bit integer.