SQL query for co-authored books

SQL expert Rudy Limeback gives advice for writing a query that involves a number of tables.

My book table has 4 instances of co-authored books, but I cannot retrieve them. So in other words, I'm trying to list the book (id, title, and cost) and author (id, name) and publisher (id, name) info for all co-authored books in the collection. This is what I have so far to join the tables together:

SELECT book.id, book.title, book.cost,
       author.id, author.name,
       publisher.id, publisher.name
FROM book, author, publisher, writtenby
WHERE book.id = writtenby.bookid
AND writtenby.authorid = author.id
AND book.publid = publisher.id;

I'm not sure where to go from here to get the four rows.

Whenever a query involves a number of tables, I find it helpful always to start writing the join with the most restrictive table first.

For example, if you are writing a query to retrieve all the books from a specific publisher, the restriction in this query would be the specific publisher. Think of it as the difference between:

  • Retrieve all books, then retrieve the publisher for each book, and then throw away both the book and the publisher if the publisher isn't the right one.

versus

  • Retrieve the specific publisher, then retrieve only the books for that publisher.

Does that make sense? See the difference?

Okay, so my rules for writing the FROM clause are:

  1. Start with the most restrictive table first.

  2. If the first table has any many-to-one relationships, join those tables next. Specifically, for each row of the first table, there is at most one row of the joined table(s).

  3. If any of the tables thus far joined has any one-to-many relationships, you can join only one of these. Otherwise, if you join more than one one-to-many table, you get cross join effects.

  4. Continue by joining many-to-one tables as necessary.

So, in your query, the table you want to start with is books. Why is this most restrictive? Because you want only co-authored books, and there are many fewer co-authored books than those written by a single author.

Next, join to publishers, because a given book has just one publisher. (If a book has two publishers, it isn't the same book. It might have the same name, but it'll have a different ISBN.)

Now you can join to writtenby, which is your books-to-authors relationship table. Specifically, for a given book, it is a one-to-many relationship.

Finally, you can join to authors, because from writtenby to authors, it's a many-to-one relationship.

This should all make a lot more sense if you were to abandon the old style FROM clause comma-delimited list of tables with WHERE conditions, and started using JOIN syntax:

SELECT book.id
     , book.title
     , book.cost
     , author.id
     , author.name
     , publisher.id
     , publisher.name
  FROM book
INNER
  JOIN publisher
    ON publisher.id = book.publid
INNER
  JOIN writtenby
    ON writtenby.bookid = book.id
INNER
  JOIN author
    ON author.id = writtenby.authorid

Doesn't that look better? The intent is certainly clearer, I think.

Now the only remaining task is to restrict the first table, the books table, to only co-authored books. That requires a subquery:

 WHERE book.id IN
       ( SELECT bookid
           FROM writtenby
         GROUP 
             BY bookid
         HAVING COUNT(*) > 1 )

The subquery is uncorrelated, which means it can be executed before the main query, to produce a list of book ids for only co-authored books. Thus the main query is most efficient because it retrieves the publisher and author data only for those books.

Dig Deeper on Oracle development languages