SQL: joins and duplicates.

Here’s a little SQL problem for the lazy web. It’s something that I’m trying to implement in Glom for the Repository Analyzer:

Let’s say we have these tables:

Packages:

package_name package_description
something something description
somethingelse something else description
somethingmore something more description

and Package Scans:

package_name version license_id
something 0.1 43
something 0.2 43
something 0.3 44
somethingelse 1.5 43
somethingmore 0.9 40

Now, I want to get the package.package_description for all packages that appear in package_scans with license_id 43, which would look like this:

‘something description’
‘something else description’

The best I can do so far is a SELECT on package_scans, doing a LEFT OUTER JOIN:

SELECT “relationship_package”.”package_description” FROM “package_scans” LEFT OUTER JOIN “packages” AS “relationship_package” ON (“package_scans”.”package_name” = “relationship_package”.”package_name”) WHERE “package_scans”.”license_id” = 43;

which gives me duplicates, like so:

‘something description’
‘something description’
‘something else description’

If possible I’d like to do this without GROUP BY. I feel there must be a simpler way to say “give me a row for each record in packages for which the (indirect) relationship is true”. If the main FROM table could somehow be packages, instead of package_scans, then the LEFT OUTER JOIN would cause me to have only one row for each relevant packages record. In general, Glom never gives you repeat rows because that’s confusing.

If I can figure out what SQL should be generated, I could imagine that I might get that result in Glom by defining a relationship in terms of a doubly-related field. So the user could say “Show me records from the packages_with_package_scan_license_id” relationship. That relationship would be defined as something like

records from packages where licenses.license_id == packages::package_scans::license_id

But maybe a GROUP BY option really is the clearest.

Update: I feel like the sub-select idea might be what I want, if I can figure the syntax out. I like the idea of showing records from a relationship that is itself defined by a link between a value and a field in another related set of records.

9 thoughts on “SQL: joins and duplicates.

  1. You could do:

    SELECT DISTINCT r.package_description
    FROM relationship_package r
    LEFT JOIN package_scans p ON r.package_name=p.package_name
    WHERE p.license_id=43

    Of course, this is really just another way of writing “GROUP BY”…
    You could also use a subselect, if that suits you better.

  2. Have you tried: SELECT DISTINCT “relationship_package”.”package_description” FROM …..

  3. Been awhile since I worked in something other than MSSQL, but assuming you have an EXISTS clause of some sort, it would be a matter of:

    SELECT package_description
    FROM packages p
    WHERE EXISTS(
    SELECT 1
    FROM package_scans s
    WHERE p.package_name = s.package_name AND s.license_id = 43
    )

  4. “Exists” is the right choice. Distinct will work, but it’s slow on mass data (the database has to pull lots of data, then do all those comparisons within the resultset just to get rid of duplicate rows again). So it’s considered bad style to use distinct for cleaning up deficient queries (just like group-by – group-by is meant for segmenting data for aggregate functions, not for removing duplicates).

    From http://www.databasejournal.com/features/postgresql/article.php/3437821:
    A SELECT DISTINCT query is frequently a “code smell”; it indicates that something is not quite right. That is because using the DISTINCT keyword means that redundant data is being pulled from the database and then discarded. Usually it is better to rewrite the query’s FROM and WHERE clauses to use a subquery to filter the data correctly so that you only get back what you want.

  5. EXISTS will work, but if with many database engines using ‘IN’ will be more efficient in this case:

    SELECT * FROM packages WHERE package_name IN (SELECT package_name FROM package_scans WHERE license_id = 43);

    EXISTS is is more efficient if few rows match, IN is often more efficient where many rows match. This certainly true for postgres.

  6. I chose this quote not because it refers to PostgreSql, but because it points out that DISTINCT is likely to be the wrong choice. On the database engines I worked with, EXISTS and IN either lead to an identical execution plan, or EXISTS performed better, mostly on large subquery results. That has to do with the fact that with IN all rows might be fetched, while EXISTS may return as soon as it finds a single match. There are situations where IN is faster though, namely with small subquery resultsets – but then the difference will hardly matter anyway. Another reason why I prefer EXISTS over IN is because IN will fail on NULL values – that’s often overseen.

    On modern DBs you shouldn’t have to worry too much about IN vs. EXISTS, the query optimizer will most likely take care of it and make the right decision. The execution plans tend to be equal.

  7. P.S. To sum up, IN vs. EXISTS performance will depend on the underlying database implementation. But I don’t have any experience with PostgreSql on this issue.

  8. apologies if this isn’t worthwhile.. my brain is fried right now. (not that it’s all that great when it’s working.)

    I ran into a similar problem a while back (before mysql supported subqueries) and someone suggested this solution:
    (I of course have no idea what the performance impact is.. but it works for me.)

    SELECT * FROM jobs j

    LEFT JOIN employees_jobs ej ON (ej.job_id = j.id)

    LEFT JOIN employees_jobs AS ej2 ON ej.job_id=ej2.job_id

    WHERE ej2.job_id IS NULL

Comments are closed.