NLP Topic Extraction: LDA in Practice with the gensim Package
Published: 2019-05-24



Original work; please credit the source when reposting: [     ]

From RxNLP.

This post shares a hands-on exercise: using the LDA model from the gensim package for topic extraction, a classic NLP task.

A side note: for NLP tasks, the best approach is to get the code running first, then dig into the theory, and finally implement the model or algorithm yourself as a DIY exercise.

One more note: to get an NLP or ML task running, I recommend working in Python with mature packages such as sklearn and numpy; it's efficient. If you want to hold yourself to a higher standard, run the same task again in Java with the corresponding libraries, and you can then compare the differences between the two language platforms.

Enough talk; here is the code. The comments make everything clear (they are written in English; please bear with them, thanks).

import os
from pprint import pprint

import gensim
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from sklearn.datasets import fetch_20newsgroups

# Step 1: prepare the data; fetch_20newsgroups comes from sklearn's datasets

news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = news_dataset.data
print("In the dataset there are", len(documents), "textual documents")
print("And this is the first one:\n", documents[0])
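Before moving on, it helps to know that the object returned by fetch_20newsgroups also carries the category labels; step 4 below uses them to sanity-check the extracted topics. A minimal sketch of those fields (my addition, using only the variables defined above):

# The sklearn Bunch also exposes the newsgroup labels (used again in step 4)
print(len(news_dataset.target_names))   # 20 category names
print(news_dataset.target_names[:3])    # the first few category names
print(news_dataset.target[0])           # integer label of documents[0]
print(news_dataset.target_names[news_dataset.target[0]])  # its category name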

# Step 2: tokenize the sentences (segmentation, stopword removal, etc.) and build bag-of-words vectors

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

# print("After the tokenizer, the previous document becomes:\n", tokenize(documents[0]))

# Next step: tokenize all the documents and build a count dictionary that holds
# the counts of the tokens over the complete text corpus.
processed_docs = [tokenize(doc) for doc in documents]
word_count_dict = Dictionary(processed_docs)
# print("In the corpus there are", len(word_count_dict), "unique tokens")

# Keep only tokens that appear in at least 20 documents and in no more than 10% of all documents
word_count_dict.filter_extremes(no_below=20, no_above=0.1)
# print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")

# Bag-of-words representation for every document in the corpus
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
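If the bag-of-words format is new to you, here is a small sketch (my addition, built only on the variables defined above) of what doc2bow produces: each document becomes a sparse list of (token_id, count) pairs, and the dictionary maps the ids back to words.

# Peek at the bag-of-words encoding of the first document
example_bow = bag_of_words_corpus[0]
print(example_bow[:5])  # e.g. [(token_id, count), ...]
# Map ids back to the actual tokens for readability
print([(word_count_dict[token_id], count) for token_id, count in example_bow[:5]])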
# Step 3: fit the LDA model

model_name = "./model.lda"
if os.path.exists(model_name):
    lda_model = gensim.models.LdaModel.load(model_name)
    print("loaded from old")
else:
    # num_topics: the number of latent topics to extract from the corpus
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=100,
                                       id2word=word_count_dict, passes=5)
    lda_model.save(model_name)
    print("loaded from new")
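The script above never evaluates the model, but if you want a rough quality signal (for example, to compare different num_topics settings), gensim provides CoherenceModel. A sketch, assuming processed_docs and word_count_dict from step 2 are still in scope:

from gensim.models import CoherenceModel

# c_v coherence over the tokenized corpus; higher generally means more interpretable topics
coherence_model = CoherenceModel(model=lda_model, texts=processed_docs,
                                 dictionary=word_count_dict, coherence='c_v')
print("Coherence score:", coherence_model.get_coherence())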

# Step 4: test topic extraction on unseen sentences or documents, with three experiments

# Experiment 1.
# Without a target document, print_topics returns the top topics learned
# over the whole training corpus; here, 30 topics with 6 words each.
pprint(lda_model.print_topics(30, 6))
print("\n")

# Experiment 2.
# Infer the topic distribution of one particular training document:
for index, score in sorted(lda_model[bag_of_words_corpus[0]], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
print()
print(news_dataset.target_names[news_dataset.target[0]])  # bag_of_words_corpus is aligned with news_dataset
print("\n")

# Experiment 3.
# Process an unseen document
unseen_document = "In my spare time I either play badminton or drive my car"
print("The unseen document is composed of the following text:", unseen_document)
print()
bow_vector = word_count_dict.doc2bow(tokenize(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 7)))
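A design note on experiments 2 and 3: indexing the model with lda_model[bow_vector] silently drops topics whose probability falls below a small threshold. gensim's get_document_topics performs the same inference but lets you set that cutoff explicitly; a sketch with an arbitrary cutoff of 0.05:

# Same inference as lda_model[bow_vector], with an explicit probability cutoff
topics = lda_model.get_document_topics(bow_vector, minimum_probability=0.05)
for index, score in sorted(topics, key=lambda tup: -tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 7)))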

refs:

http://nbviewer.jupyter.org/gist/boskaiolo/cc3e1341f59bfbd02726
http://www.voidcn.com/blog/u010297828/article/p-4995136.html
http://radimrehurek.com/gensim/models/ldamodel.html
http://blog.csdn.net/accumulate_zhang/article/details/62453672
