SQL query for co-authored books
SQL expert Rudy Limeback gives advice for writing a query that involves a number of tables.
My book table has 4 instances of co-authored books, but I cannot retrieve them. So in other words, I'm trying to list the book (id, title, and cost) and author (id, name) and publisher (id, name) info for all co-authored books in the collection. This is what I have so far to join the tables together:
SELECT book.id, book.title, book.cost, author.id, author.name, publisher.id, publisher.name FROM book, author, publisher, writtenby WHERE book.id = writtenby.bookid AND writtenby.authorid = author.id AND book.publid = publisher.id;
I'm not sure where to go from here to get the four rows.
Whenever a query involves a number of tables, I find it helpful always to start writing the join with the most restrictive table first.
For example, if you are writing a query to retrieve all the books from a specific publisher, the restriction in this query would be the specific publisher. Think of it as the difference between:
-
Retrieve all books, then retrieve the publisher for each book, and then throw away both the book and the publisher if the publisher isn't the right one.
versus
-
Retrieve the specific publisher, then retrieve only the books for that publisher.
Does that make sense? See the difference?
Okay, so my rules for writing the FROM clause are:
-
Start with the most restrictive table first.
-
If the first table has any many-to-one relationships, join those tables next. Specifically, for each row of the first table, there is at most one row of the joined table(s).
-
If any of the tables thus far joined has any one-to-many relationships, you can join only one of these. Otherwise, if you join more than one one-to-many table, you get cross join effects.
-
Continue by joining many-to-one tables as necessary.
So, in your query, the table you want to start with is books. Why is this most restrictive? Because you want only co-authored books, and there are many fewer co-authored books than those written by a single author.
Next, join to publishers, because a given book has just one publisher. (If a book has two publishers, it isn't the same book. It might have the same name, but it'll have a different ISBN.)
Now you can join to writtenby, which is your books-to-authors relationship table. Specifically, for a given book, it is a one-to-many relationship.
Finally, you can join to authors, because from writtenby to authors, it's a many-to-one relationship.
This should all make a lot more sense if you were to abandon the old style FROM clause comma-delimited list of tables with WHERE conditions, and started using JOIN syntax:
SELECT book.id , book.title , book.cost , author.id , author.name , publisher.id , publisher.name FROM book INNER JOIN publisher ON publisher.id = book.publid INNER JOIN writtenby ON writtenby.bookid = book.id INNER JOIN author ON author.id = writtenby.authorid
Doesn't that look better? The intent is certainly clearer, I think.
Now the only remaining task is to restrict the first table, the books table, to only co-authored books. That requires a subquery:
WHERE book.id IN ( SELECT bookid FROM writtenby GROUP BY bookid HAVING COUNT(*) > 1 )
The subquery is uncorrelated, which means it can be executed before the main query, to produce a list of book ids for only co-authored books. Thus the main query is most efficient because it retrieves the publisher and author data only for those books.