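Visualizing Data from the Jiaozhen (较真) Fact-Checking Platform

In the previous post we scraped the Jiaozhen fact-checking platform. This post visualizes that data, using the pyecharts library to draw the charts.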

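Introduction to pyecharts

ECharts is a data visualization library open-sourced by Baidu. Its good interactivity and polished chart design have earned it recognition from many developers, and pyecharts is a Python wrapper around ECharts. Its features include: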

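  • Covers 30+ common chart types
  • Supports the mainstream Jupyter environments
  • Integrates with mainstream web frameworks such as Flask and Django
  • Ships 400+ map files plus native Baidu Maps support for geographic data visualization
  • Provides detailed documentation to help you get started quickly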

Installing pyecharts

Install it directly with pip:

pip install pyecharts

In [2]:
import pyecharts

# Import pyecharts and print its version to confirm the installation succeeded
print(pyecharts.__version__)
1.6.2

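Showing the proportion of true and false rumors with pie charts

In the scraped rumor data, the result column holds the verdict for each rumor. A verdict falls into one of three cases: 真 (true), 假 (false), or 疑 (doubtful). False verdicts are further divided into 钓鱼贴 (bait post), 都市传说 (urban legend), 假新闻 (fake news), 旧闻重炒 (recycled old news), 伪常识 (false common knowledge), 伪科学 (pseudoscience), 洋葱新闻 (Onion-style satire news), 谣言 (rumor), and 疑似诈骗 (suspected fraud). We draw one pie chart for each of these two breakdowns to compare the data.

Extracting the data

Pull the counts for the three verdicts (真, 假, 疑) and for the nine subcategories of 假 from the sqlite3 database: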

In [73]:
import sqlite3

conn = sqlite3.connect('jiaozhen.db')

zhen = conn.execute("select count(*) from yaoyan where yaoyan.result like '真%'")
jia = conn.execute("select count(*) from yaoyan where yaoyan.result like '假%'")
yi = conn.execute("select count(*) from yaoyan where yaoyan.result like '疑%'")

zhenSum = zhen.fetchone()[0]
jiaSum = jia.fetchone()[0]
yiSum = yi.fetchone()[0]

jia1 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%钓鱼贴'")
jia2 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%都市传说'")
jia3 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%假新闻'")
jia4 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%旧闻重炒'")
jia5 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%伪常识'")
jia6 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%伪科学'")
jia7 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%洋葱新闻'")
jia8 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%谣言'")
jia9 = conn.execute("select count(*) from yaoyan where yaoyan.result like '%疑似诈骗'")

jia1Sum = jia1.fetchone()[0]
jia2Sum = jia2.fetchone()[0]
jia3Sum = jia3.fetchone()[0]
jia4Sum = jia4.fetchone()[0]
jia5Sum = jia5.fetchone()[0]
jia6Sum = jia6.fetchone()[0]
jia7Sum = jia7.fetchone()[0]
jia8Sum = jia8.fetchone()[0]
jia9Sum = jia9.fetchone()[0]
conn.close()
print('真:%d 假:%d 疑:%d'%(zhenSum,jiaSum,yiSum))
print('钓鱼贴:%d 都市传说:%d 假新闻:%d 旧闻重炒:%d 伪常识:%d 伪科学:%d 洋葱新闻:%d 谣言:%d 疑似诈骗:%d'%(jia1Sum,jia2Sum,jia3Sum,jia4Sum,jia5Sum,jia6Sum,jia7Sum,jia8Sum,jia9Sum))
真:759 假:3234 疑:905
钓鱼贴:3 都市传说:3 假新闻:33 旧闻重炒:14 伪常识:302 伪科学:170 洋葱新闻:8 谣言:2688 疑似诈骗:13
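
The repeated queries above could also be collapsed into a loop over the labels. The following is only a sketch of that idea, not the code used for this post: the helper name count_like is hypothetical, and so that it runs without jiaozhen.db it queries an in-memory table filled with a few made-up rows. The '假-谣言' style of the result values is an assumption for illustration.

```python
import sqlite3

def count_like(conn, pattern):
    """Return how many rows of yaoyan have a result matching the LIKE pattern."""
    # The pattern is passed as a bound parameter rather than pasted into the SQL.
    cur = conn.execute("select count(*) from yaoyan where result like ?", (pattern,))
    return cur.fetchone()[0]

# Stand-in for jiaozhen.db: an in-memory table with made-up rows.
# The format of the result values is assumed, not taken from the real data.
conn = sqlite3.connect(":memory:")
conn.execute("create table yaoyan (title text, result text)")
conn.executemany(
    "insert into yaoyan values (?, ?)",
    [
        ("t1", "真-确实如此"),
        ("t2", "假-谣言"),
        ("t3", "假-伪科学"),
        ("t4", "疑-尚无定论"),
        ("t5", "假-谣言"),
    ],
)

verdicts = ["真", "假", "疑"]
false_kinds = ["钓鱼贴", "都市传说", "假新闻", "旧闻重炒", "伪常识",
               "伪科学", "洋葱新闻", "谣言", "疑似诈骗"]

# Prefix match for the verdict, suffix match for the subcategory,
# mirroring the LIKE patterns used in the cell above.
verdict_counts = [(v, count_like(conn, v + "%")) for v in verdicts]
kind_counts = [(k, count_like(conn, "%" + k)) for k in false_kinds]
conn.close()

print(verdict_counts)
print(kind_counts)
```

Both results are lists of (label, count) tuples, so they could be handed to a chart without further reshaping.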

Drawing the pie charts

We have pulled the data we need directly from the database. The next step is to draw the pie charts:

In [80]:
# Build the data for the two pie charts
data = [('真',zhenSum),('假',jiaSum),('疑',yiSum)]
data1 = [('钓鱼贴',jia1Sum),('都市传说',jia2Sum),('假新闻',jia3Sum),('旧闻重炒',jia4Sum),('伪常识',jia5Sum),('伪科学',jia6Sum),('洋葱新闻',jia7Sum),('谣言',jia8Sum),('疑似诈骗',jia9Sum)]

from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    # Create the Pie chart object
    Pie()
    # Add the pie data
    # The data format is [(key1, value1), (key2, value2), ...]
    .add("",data,center=['25%','50%'])
    .add("",data1,center=['75%','50%'])
    # Set the global options, which here configure the chart title
    .set_global_opts(title_opts=opts.TitleOpts(title='谣言分析',subtitle='基于较真查证平台',pos_left='center',pos_top=20))
)
# This was written in a Jupyter notebook, so the chart is rendered with render_notebook()
# Calling render() instead writes the chart to an HTML file
c.render_notebook()
Out[80]:
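
Because the rendered chart is not reproduced in this text, the shares can also be checked numerically. The sketch below converts the counts printed earlier into percentages. The helper name shares is hypothetical and is not part of pyecharts.

```python
def shares(pairs):
    """Convert [(label, count), ...] into [(label, percent), ...], rounded to 0.1."""
    total = sum(count for _, count in pairs)
    return [(label, round(100 * count / total, 1)) for label, count in pairs]

# Counts copied from the query output earlier in this post.
data = [('真', 759), ('假', 3234), ('疑', 905)]

result = shares(data)
for label, pct in result:
    print('%s: %.1f%%' % (label, pct))
```

On these counts, roughly two thirds of the entries are rated 假 (false).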

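Keyword analysis of the rumor data and a word cloud

We use the jieba Chinese word segmentation library to analyze the title column of the rumor data with the TF-IDF algorithm, then use the resulting weights to build a word cloud.

Fetching the rumor titles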

In [106]:
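# Get a database connection
conn = sqlite3.connect('jiaozhen.db')
# Run the SQL statement and get the results
titles = conn.execute('select title from yaoyan;')
sentence = ''
# Join the titles into one string, each terminated by a full stop
for title in titles.fetchall():
    sentence = sentence + title[0] + '。'
# Close the connection once the titles have been read
conn.close()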

print(sentence[:100])
癌症不是病。浸润癌是真正的癌症。癌前病变就是癌症。心脏也会得癌症。癌症是不治之症。癌症会传染。癌症不能治愈。癌症不能预防。癌症要忌口。多数癌症是可以预防的。癌症会传染。癌症是不治之症。大麻能治愈癌症。

Keyword analysis with jieba

In [111]:
import jieba.analyse

# Load a stop-word list so that filler words such as 的 and 如果 do not skew the results
jieba.analyse.set_stop_words("chineseStopWordsYaoyan.txt")
# Extract keywords with the TF-IDF algorithm
keywords = jieba.analyse.extract_tags(sentence,topK=100,withWeight=True)
print(keywords)
[('冠状病毒', 0.058462898304318875), ('致癌', 0.05249288642782296), ('癌症', 0.044930645168488395), ('治疗', 0.04100180852651826), ('预防', 0.03770966414055841), ('新型', 0.03495441936984637), ('减肥', 0.033171112928573054), ('感染', 0.03210196397596318), ('健康', 0.03064994393034247), ('病毒', 0.028577967767414385), ('食物', 0.027370793884137174), ('糖尿病', 0.026066404343296233), ('有害', 0.025829027541285677), ('抗癌', 0.025709770498291477), ('肺炎', 0.02352451071092846), ('疫苗', 0.023450680414313165), ('鸡蛋', 0.02271386672373288), ('儿童', 0.021035823714621385), ('乳腺癌', 0.02070524907422755), ('子宫', 0.01896980083371385), ('维生素', 0.018887474482755424), ('牛奶', 0.01831721888173516), ('微信', 0.018195993155098934), ('营养', 0.01816315961228691), ('患者', 0.018116197396575343), ('手机', 0.017907207549244672), ('狂犬病', 0.017607885921211947), ('肝癌', 0.017144520502232686), ('保健品', 0.01697170326900114), ('HPV', 0.01649011879680841), ('酸奶', 0.016410622309974313), ('艾滋病', 0.016385864956773213), ('人体', 0.016186411558358545), ('胃癌', 0.01615116397435788), ('饮料', 0.01600367907998478), ('致癌物', 0.015879122331192923), ('食品', 0.015360368286472602), ('性早熟', 0.015352869224614726), ('宫颈癌', 0.015094682086177702), ('宝宝', 0.01490595426535388), ('传染', 0.014756011069948628), ('肺癌', 0.014721429769439211), ('女性', 0.014642102190112251), ('补钙', 0.014591642068036528), ('感冒', 0.014364737822926655), ('脱发', 0.014311467809788814), ('10', 0.014215619652421043), ('12', 0.014215619652421043), ('香蕉', 0.014085349125788623), ('喝咖啡', 0.01400463875650685), ('中毒', 0.013729865690777207), ('巧克力', 0.013634333921062594), ('螺杆菌', 0.013446466609993341), ('孕妇', 0.01342731927464041), ('小龙虾', 0.01339637737788242), ('辐射', 0.013249654811379852), ('吃水果', 0.013180340101284245), ('风险', 0.01313332912140411), ('消毒', 0.01304317885075057), ('微波炉', 0.012853840779837328), ('隐形眼镜', 0.01270549375009418), ('食用', 0.012624148558953576), ('喝酒', 0.012602136391763698), ('新冠', 0.012509745294130518), ('口罩', 0.012348060078310503), ('孩子', 0.012322489241985353), ('死亡', 
0.012198748004877758), ('服用', 0.01206118536812072), ('幽门', 0.01200296704028729), ('含有', 0.011993499265277778), ('手术', 0.0118077432833809), ('蔬菜', 0.011775545714452053), ('可乐', 0.011660879339980498), ('喝牛奶', 0.011642854148687216), ('疾病', 0.011455410813752854), ('草莓', 0.011444077640981735), ('高血压', 0.01138953167480023), ('自闭症', 0.011372495721936833), ('输液', 0.011364241141219558), ('腹泻', 0.011360402847593227), ('功效', 0.01134499398224648), ('肌瘤', 0.011336056779018264), ('引发', 0.011180936395094179), ('转基因', 0.011133646506493532), ('卵巢', 0.010854002829472032), ('婴儿', 0.010738659863227739), ('接种', 0.010675639249339801), ('吸烟', 0.010638578622829624), ('避孕药', 0.010623740816630993), ('有毒', 0.010618174308700056), ('长期', 0.010474132821150589), ('牙膏', 0.010391385380766742), ('胆固醇', 0.01027775670656155), ('神奇', 0.010274974698274828), ('神药', 0.01023524614974315), ('爆炸', 0.010210389986429796), ('降低', 0.01020517757141933), ('农药', 0.01016193612065449), ('阿司匹林', 0.010127902790253995), ('水果', 0.010120922814326486)]
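
Each weight in this output is a TF-IDF score: a word's relative frequency in the text multiplied by an inverse document frequency that penalizes words common everywhere. The toy example below is a rough sketch of that idea, not jieba's actual implementation, which relies on its own bundled IDF data. The word list, the IDF values, and the function name tfidf_top are all made up for illustration.

```python
from collections import Counter

def tfidf_top(words, idf, default_idf, top_k):
    """Score each distinct word by (relative frequency * IDF), highest first."""
    counts = Counter(words)
    total = sum(counts.values())
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Sort by descending score; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]

# Hypothetical, already-segmented text and made-up IDF values.
words = ["cancer", "is", "contagious", "cancer", "is", "curable"]
idf = {"cancer": 6.0, "contagious": 9.0, "curable": 9.0, "is": 0.5}

top = tfidf_top(words, idf, default_idf=8.0, top_k=3)
print(top)
```

Here "is" occurs as often as "cancer" but drops out of the top three because its IDF is low, which is the same effect that keeps everyday filler words from dominating the keyword list.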

Drawing the word cloud

In [114]:
from pyecharts.charts import WordCloud

c = (
    WordCloud()
    .add('',keywords)
)
c.render_notebook()
Out[114]:
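
The keyword list printed earlier includes bare numbers such as '10' and '12', which carry little meaning in a word cloud. One option is to drop them before plotting. A minimal sketch, using a few entries copied from that output with the weights rounded; the helper name drop_numeric is hypothetical:

```python
def drop_numeric(pairs):
    """Remove (word, weight) pairs whose word consists only of digits."""
    return [(word, weight) for word, weight in pairs if not word.isdigit()]

# A few entries copied from the extract_tags output (weights rounded).
keywords = [('冠状病毒', 0.0585), ('致癌', 0.0525), ('10', 0.0142),
            ('12', 0.0142), ('香蕉', 0.0141)]

filtered = drop_numeric(keywords)
print(filtered)
```

The filtered list keeps the same (word, weight) shape, so it could be passed to WordCloud().add('', filtered) in place of keywords.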