`
lovejuan1314
  • 浏览: 337075 次
  • 性别: Icon_minigender_1
  • 来自: 北京
社区版块
存档分类
最新评论

PostgreSQL 的Distinct on

阅读更多
今天上午处理一个SQL语句的问题,情况是,有一个clicks表,以及geoip和geoip_location表。
clicks表中存取的是客户的点击数,geoip和geoip_location分别存储了各个国家的ip网段和地址信息。
通过如下SQL可以看出每一个ip符合条件的点击数
select count(click_ip) as count,click_ip from 
fb_fy_clicks 
where earn_id = 191 AND if_pay='t' AND date(crdate)='2007-12-24' 
group by click_ip;

下面通过ip找到所在的国家
select * from 
(select count(click_ip) as count,click_ip from fb_fy_clicks 
where earn_id = 191 AND if_pay='t' AND date(crdate)='2007-12-24' 
group by click_ip) as tb 
left join geoip G 
on ip4(tb.click_ip) <<= G.network 
left join geoip_location as geoip 
on G.location_id = geoip.id

这样虽然说通过ip找到国家了,但是有的ip是别人通过代理或者伪装的ip,所以有多条ip的重复记录,尽管这些只是少数,但是我们需要把这些对应多个国家的ip按照1个来算。如果仅仅使用distinct我们无法过滤掉多出来的国家。正好postgresql有distinct on

文档记载:

我们还可以用任意表达式来判断什么行可以认为是独立的:

SELECT DISTINCT ON (expression [, expression ...]) select_list ...

这里 expression 是任意值表达式, 它为所有行计算。如果一个行集合里所有行计算出的该表达式的值是一样的, 那么我们认为它们是重复的并且因此只有第一行保留在输出中。 请注意这里的一个集合的"第一行"是不可预料的, 除非你在足够多的字段上对该查询排了序,保证到达DISTINCT过滤器的行的顺序是唯一的。 (DISTINCT ON处理是发生在ORDER BY排序后面的。)

DISTINCT ON子句不是 SQL 标准的一部分, 有时候有人认为它是一个糟糕的风格,因为它的结果是不可判定的。 如果用有选择的GROUP BY和在FROM中的子查询,那么我们可以避免使用这个构造, 但是通常它是更方便的候选方法。

正好符合需求:

将SQL改为:

select distinct on (click_ip) click_ip,network,country,count from (
(select count(click_ip) as count,click_ip from fb_fy_clicks
 where earn_id = 191 AND if_pay='t' AND date_trunc('day',crdate)='2007-12-24' 
group by click_ip) as tb left join geoip G 
on ip4(tb.click_ip) <<= G.network 
left join geoip_location as geoip 
on G.location_id = geoip.id) as tmp


还有一种情况就是,有些client_ip重复的记录不需要被过滤掉,比如一些client_ip重复的记录,但是他的data_platform字段是不一样的,这样的记录是不应该被过滤掉的。这时,我们使用
 SELECT DISTINCT ON (client_ip,data_platform) client_ip,SUM(impressions) AS  impressions,crdate,data_platform from views a,ads b 
 WHERE a.ad_id = b.id  
 GROUP BY client_ip,data_platform,crdate


真实例子:

SELECT ad_id,country,SUM(impressions) AS impressions,dates,source FROM 
(SELECT  DISTINCT ON (client_ip,ad_id,source) client_ip,ad_id,COALESCE(country,'OTHER') AS country,impressions,dates,source FROM 
(((SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.fb_uid) AS impressions, 0 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id  AND a.fb_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc) 
UNION ALL 
(SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.bebo_uid) AS impressions, 1 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id  AND a.bebo_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc )UNION ALL 
(SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.ms_uid) AS impressions, 2 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id  AND a.ms_uid IS NOT NULL  GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc ) 
UNION ALL 
(SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.hi5_uid) AS impressions, 3 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id  AND a.hi5_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc)) AS tmp_ads_imprs_country_stats 
LEFT JOIN geoip G ON ip4(tmp_ads_imprs_country_stats.client_ip) <<= G.network  LEFT JOIN geoip_location AS geoip_loc ON G.location_id = geoip_loc.id ) AS tmp_ads_imprs_country_stats_t ) AS FOO GROUP BY ad_id,country,dates,source ORDER BY ad_id ASC;

分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics