Below is the data in TestingTable1 BUYER_ID | ITEM_ID | CREATED_TIME -----------+-------------------+------------------------ 1345653 110909316904 2012-07-09 21:29:06 1345653 151851771618 2012-07-09 19:57:33 1345653 221065796761 2012-07-09 19:31:48 1345653 400307563710 2012-07-09 18:57:33 1345653 310411560125 2012-07-09 16:09:49 1345653 120945302103 2012-07-09 13:40:23 1345653 261060982989 2012-07-09 09:02:21
Below is the data in TestingTable2 USER_ID | PRODUCT_ID | LAST_TIME -----------+-------------------+------------------- 1345653 110909316904 2012-07-09 21:30:06 1345653 152851771618 2012-07-09 19:57:33 1345653 221065796761 2012-07-09 19:31:48 1345653 400307563710 2012-07-09 18:57:33
I need to Compare `TestingTable2` with `TestingTable1` on `BUYER_ID` and `USER_ID`. And I need to find all (basically the count) the missing and mismatch entries in `TestingTable2` after comparing from `TestingTable1`. I created SQL fiddle for this-[url]http://sqlfiddle.com/#!3/d87b2/1[/url]If you run my query in the SQL Fiddle, you will get output as- BUYER_ID ERROR 1345653 5
which is right as last `three` rows from `TestingTable1` is missing in `TestingTable2` and rest `two` are mismatch after comparison from `TestingTable1` on `BUYER_ID` and `USER_ID`.Now the complicated thing is starting.**Problem Statement-**In my current output, I am getting ERROR count as `5`. So if you see first row in both the tables `ITEM_ID` and `PRODUCT_ID` are same but `CREATED_TIME` and `LAST_TIME` is not same, and difference between those two times is of only `1 minute`. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within `15 minutes range`, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as `4` because difference is within `15 minutes range` for the first row.So below is my solution for this problem, and below is the sql query that works fine in SQL server(which will give error count as 4) but not in Hive as `Hive supports only equality JOINS` and I cannot run the below query in Hive. So I need some other way (alternate way) of doing this problem. Is it possible to do the date difference condition in where clause somehow? `Basically how I can rewrite the below SQL query` in some other way such that it would fulfill my all the requirements above. SELECT TT.BUYER_ID, COUNT(*) FROM ( SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM testingtable2 RIGHT JOIN testingtable1 ON ( testingtable1.item_id = testingtable2.product_id AND testingtable1.BUYER_ID = testingtable2.USER_ID AND ABS(DATEDIFF(mi, testingtable1.created_time, testingtable2.last_time)) <= 15 ) WHERE testingtable2.product_id IS NULL ) TT GROUP BY TT.BUYER_ID;
Expected Output that I need after implementing the above feature- BUYER_ID ERROR 1345653 4