6.3. Topic Extraction in RSS-Feed Corpus#
Author: Johannes Maucher
Last update: 2018-11-16
In the notebook 01gensimDocModelSimple, the concepts of dictionaries, document models, tf-idf and similarity were described using a small document collection. The notebook 02LatentSemanticIndexing then introduced LSI-based topic extraction and document clustering, again on a small playground example.
The current notebook applies these concepts to a real corpus of RSS feeds, which was generated and accessed in previous notebooks of this lecture.
6.3.1. Read documents from a corpus#
The contents of the RSS-feed corpus are imported with NLTK’s CategorizedPlaintextCorpusReader, as already done in previous notebooks of this lecture:
#!pip install wordcloud
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.corpus import stopwords
stopwordlist=stopwords.words('english')
from wordcloud import WordCloud
rootDir="../Data/ENGLISH"
filepattern=r"(?!\.)[\w_]+(/RSS/FeedText/)[\w-]+/[\w-]+\.txt"
#filepattern=r"(?!\.)[\w_]+(/RSS/FullText/)[\w-]+/[\w-]+\.txt"
catpattern=r"([\w_]+)/.*"
rssreader=CategorizedPlaintextCorpusReader(rootDir,filepattern,cat_pattern=catpattern)
singleDoc=rssreader.paras(categories="TECH")[0]
print("The first paragraph:\n",singleDoc)
print("Number of paragraphs in the corpus: ",len(rssreader.paras(categories="TECH")))
The first paragraph:
[['Radar', 'trends', 'to', 'watch', ':', 'May', '2022', 'April', 'was', 'the', 'month', 'for', 'large', 'language', 'models', '.'], ['There', 'was', 'one', 'announcement', 'after', 'another', ';', 'most', 'new', 'models', 'were', 'larger', 'than', 'the', 'previous', 'ones', ',', 'several', 'claimed', 'to', 'be', 'significantly', 'more', 'energy', 'efficient', '.'], ['The', 'largest', '(', 'as', 'far', 'as', 'we', 'know', ')', 'is', 'Google', '’', 's', 'GLAM', ',', 'with', '1', '.', '2', 'trillion', 'parameters', '–', 'but', 'requiring', 'significantly', 'less', 'energy', 'to', 'train', 'than', 'GPT', '-', '3', '.'], ['Chinchilla', 'has', '[…]']]
Number of paragraphs in the corpus: 40
techdocs=[[w.lower() for sent in singleDoc for w in sent if (len(w)>1 and w.lower() not in stopwordlist)] for singleDoc in rssreader.paras(categories="TECH")]
print("Number of documents in category Tech: ",len(techdocs))
Number of documents in category Tech: 40
generaldocs=[[w.lower() for sent in singleDoc for w in sent if (len(w)>1 and w.lower() not in stopwordlist)] for singleDoc in rssreader.paras(categories="GENERAL")]
print("Number of documents in category General: ",len(generaldocs))
Number of documents in category General: 40
alldocs=techdocs+generaldocs
print("Total number of documents: ",len(alldocs))
Total number of documents: 80
6.3.1.1. Remove duplicate news#
def removeDuplicates(nestedlist):
    # lists are not hashable, so each document is converted to a tuple before building the set
    listOfTuples=[tuple(liste) for liste in nestedlist]
    uniqueListOfTuples=list(set(listOfTuples))
    return [list(menge) for menge in uniqueListOfTuples]
techdocs=removeDuplicates(techdocs)
generaldocs=removeDuplicates(generaldocs)
alldocs=removeDuplicates(alldocs)
print("Number of unique documents in category Tech: ",len(techdocs))
print("Number of unique documents in category General: ",len(generaldocs))
print("Total number of unique documents: ",len(alldocs))
Number of unique documents in category Tech: 20
Number of unique documents in category General: 18
Total number of unique documents: 38
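The set-based function above does not preserve the order of the documents, and because Python randomizes string hashes per process, the order may differ between runs. Indices such as techdocs[0] can therefore point to different documents in different sessions. The following is a minimal sketch of an order-preserving alternative; the function name remove_duplicates_keep_order and the toy documents are illustrative assumptions, not part of the original notebook.

```python
def remove_duplicates_keep_order(nested_list):
    """Drop repeated token lists while keeping first-occurrence order."""
    # Lists are unhashable, so each document is converted to a tuple first.
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    unique = dict.fromkeys(tuple(doc) for doc in nested_list)
    return [list(doc) for doc in unique]


# Hypothetical toy documents (already tokenized).
docs = [["ai", "news"], ["rss", "feed"], ["ai", "news"]]
unique_docs = remove_duplicates_keep_order(docs)
print(unique_docs)
```

With a stable document order, indices stay reproducible across runs.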
alltechString=" ".join([w for doc in techdocs for w in doc])
print(len(alltechString))
allgeneralString=" ".join([w for doc in generaldocs for w in doc])
print(len(allgeneralString))
7364
2146
wordcloudTech=WordCloud().generate(alltechString)
wordcloudGeneral=WordCloud().generate(allgeneralString)
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(20,18))
plt.subplot(1,2,1)
plt.title("Tech News") # set the title after the subplot has been created, so that it is attached to the left panel
plt.imshow(wordcloudTech, interpolation='bilinear')
plt.axis("off")
plt.subplot(1,2,2)
plt.imshow(wordcloudGeneral, interpolation='bilinear')
plt.title("General News")
plt.axis("off")
(-0.5, 399.5, 199.5, -0.5)

6.3.2. Gensim-representation of imported RSS-feeds#
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(alldocs)
dictionary.save('../Data/feedwordsDE.dict') # store the dictionary, for future reference
print(len(dictionary))
820
import random
first_doc = techdocs[0]
print(first_doc)
first_vec = dictionary.doc2bow(first_doc)
print(f"Sparse BoW representation of single document: {first_vec}")
for word in random.choices(first_doc, k=3):
    print(f"Index of word {word} is {dictionary.token2id[word]}")
['recommendations', 'us', 'live', 'household', 'communal', 'device', 'like', 'amazon', 'echo', 'google', 'home', 'hub', 'probably', 'use', 'play', 'music', 'live', 'people', 'may', 'find', 'time', 'spotify', 'pandora', 'algorithm', 'seems', 'know', 'well', 'find', 'songs', 'creeping', '[…]']
Sparse BoW representation of single document: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1)]
Index of word echo is 6
Index of word device is 5
Index of word spotify is 24
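To make the sparse representation above more concrete, the following sketch rebuilds the idea of a token-to-id mapping and a bag-of-words vector in plain Python. It is not gensim's implementation: the helper names build_token2id and doc_to_bow are hypothetical, and ids are assigned in order of first appearance, which is an assumption made here and may differ from the ids gensim assigns.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to every distinct token (order of first appearance)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return a sorted list of (token id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical toy corpus.
toy_docs = [["live", "music", "live"], ["music", "news"]]
toy_token2id = build_token2id(toy_docs)
toy_bow = doc_to_bow(toy_docs[0], toy_token2id)
print(toy_token2id)
print(toy_bow)
```

Only tokens that occur in a document appear in its vector, which is why the representation is called sparse.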
Sparse BoW representation of entire tech-corpus and entire general-news-corpus:
techcorpus = [dictionary.doc2bow(doc) for doc in techdocs]
generalcorpus = [dictionary.doc2bow(doc) for doc in generaldocs]
print(generaldocs[:3])
[['women', 'waiting', 'hear', 'vanished', 'loved', 'ones', 'stop', 'village', 'region', 'west', 'kyiv', 'hear', 'story', 'someone', 'vanished'], ['sainsbury', 'says', 'shoppers', 'watching', 'every', 'penny', 'supermarket', 'profits', 'jump', 'warns', 'tougher', 'times', 'ahead', 'consumers', 'finances', 'squeezed'], ['brings', 'back', 'passengers', 'cross', 'channel', 'route', 'sackings', 'tuesday', 'marks', 'first', 'time', 'drive', 'passengers', 'tourists', 'use', 'almost', 'six', 'weeks']]
6.3.3. Find similar documents#
index = similarities.SparseMatrixSimilarity(techcorpus, num_features=len(dictionary))
sims = index[first_vec]
#print(list(enumerate(sims)))
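simlist = sims.argsort() # document indices sorted by ascending similarity
print(simlist)
mostSimIdx=simlist[-2] # the last entry is the reference document itself, so take the second to last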
[ 2 7 18 4 11 5 9 8 6 12 16 19 17 3 14 13 10 15 1 0]
print("Reference document is:\n",first_doc)
print("Most similar document:\n",techdocs[mostSimIdx])
Reference document is:
['microsoft365r', 'outlook', 'support', 'cran', 'hong', 'ooi', 'happy', 'announce', 'microsoft365r', 'cran', 'outlook', 'email', 'support', 'quick', 'summary', 'new', 'features', 'send', 'reply', 'forward', 'emails', 'optionally', 'composed', 'blastula', 'emayili', 'copy', 'move', 'emails', 'folders', 'create', 'delete', 'copy', 'move', 'folders', 'add', 'remove', 'download', 'attachments', 'sample', 'write', 'email', 'using', 'blastula', 'library', 'microsoft365r', '1st', 'one', 'personal', 'microsoft', 'account', '2nd', 'work', 'school', 'account', 'outl']
Most similar document:
['outlook', 'client', 'support', 'microsoft365r', 'available', 'beta', 'test', 'hong', 'ooi', 'announcement', 'beta', 'outlook', 'email', 'client', 'part', 'microsoft365r', 'package', 'install', 'github', 'repository', 'devtools', '::', 'install_github', '("', 'azure', 'microsoft365r', '")', 'client', 'provides', 'following', 'features', 'send', 'reply', 'forward', 'emails', 'optionally', 'composed', 'blastula', 'emayili', 'copy', 'move', 'emails', 'folders', 'create', 'delete', 'copy', 'move', 'folders', 'add', 'remove', 'download', 'attachments', 'plan', 'submit', 'cran', 'sometime', 'next', 'month', 'period', 'public', 'testing', 'please', 'give', 'try', 'give', 'feedback', 'either', 'via', 'email', 'opening', 'issue', '...']
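The similarity index used above scores documents by cosine similarity. As a rough illustration of what is computed for each pair of sparse vectors, here is a plain-Python sketch; the function cosine_similarity and the toy vectors are assumptions for illustration only and do not reproduce gensim's matrix-based implementation.

```python
import math


def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two sparse vectors given as (id, weight) lists."""
    a, b = dict(bow_a), dict(bow_b)
    dot = sum(weight * b[idx] for idx, weight in a.items() if idx in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus; the query is identical to the first document.
query = [(0, 1), (1, 2)]
corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 1)], [(3, 4)]]
scores = [cosine_similarity(query, doc) for doc in corpus]
# Ascending order, as with argsort: the best match is at the end.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i])
print(scores)
print(ranking)
```

Since the query is itself part of the indexed corpus, it ends up in the last position with a similarity of about 1, and the most similar other document is the second to last.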
6.3.4. Find topics by Latent Semantic Indexing (LSI)#
6.3.4.1. Generate tf-idf model of corpus#
tfidf = models.TfidfModel(techcorpus)
corpus_tfidf = tfidf[techcorpus]
print("Display TF-IDF- Model of first 2 documents of the corpus")
for doc in corpus_tfidf[:2]:
    print(doc)
Display TF-IDF- Model of first 2 documents of the corpus
[(13, 0.055648773453563255), (19, 0.15879721214500278), (20, 0.15879721214500278), (21, 0.31759442429000556), (22, 0.08531278174327886), (23, 0.10056217885820973), (24, 0.12205499694414082), (25, 0.24410999388828164), (26, 0.12205499694414082), (27, 0.24410999388828164), (28, 0.1469688608034478), (29, 0.08531278174327886), (30, 0.12205499694414082), (31, 0.12205499694414082), (32, 0.17062556348655772), (33, 0.24410999388828164), (34, 0.12205499694414082), (35, 0.08531278174327886), (36, 0.24410999388828164), (37, 0.12205499694414082), (38, 0.10056217885820973), (39, 0.04857056654241692), (40, 0.15879721214500278), (41, 0.08531278174327886), (42, 0.19145989097204336), (43, 0.24410999388828164), (44, 0.055648773453563255), (45, 0.04857056654241692), (46, 0.12205499694414082), (47, 0.15879721214500278), (48, 0.20112435771641946), (49, 0.15879721214500278), (50, 0.15879721214500278), (51, 0.12205499694414082), (52, 0.12205499694414082), (53, 0.15879721214500278), (54, 0.15879721214500278), (55, 0.10056217885820973), (56, 0.12205499694414082), (57, 0.11129754690712651), (58, 0.10056217885820973), (59, 0.12205499694414082), (60, 0.12205499694414082)]
[(22, 0.06757935386523892), (24, 0.09668419738474388), (25, 0.09668419738474388), (26, 0.09668419738474388), (27, 0.19336839476948775), (28, 0.058209687039009896), (29, 0.06757935386523892), (30, 0.09668419738474388), (31, 0.09668419738474388), (32, 0.13515870773047783), (33, 0.19336839476948775), (34, 0.09668419738474388), (35, 0.06757935386523892), (36, 0.19336839476948775), (37, 0.09668419738474388), (39, 0.03847451034573397), (42, 0.15166233545091412), (43, 0.19336839476948775), (45, 0.03847451034573397), (46, 0.09668419738474388), (48, 0.15931791067295262), (51, 0.09668419738474388), (52, 0.09668419738474388), (55, 0.07965895533647631), (57, 0.04408141519405023), (61, 0.12578904090424883), (62, 0.09668419738474388), (63, 0.058209687039009896), (64, 0.09668419738474388), (65, 0.09668419738474388), (66, 0.09668419738474388), (67, 0.058209687039009896), (68, 0.25157808180849767), (69, 0.37736712271274647), (70, 0.09668419738474388), (71, 0.12578904090424883), (72, 0.12578904090424883), (73, 0.09668419738474388), (74, 0.07965895533647631), (75, 0.25157808180849767), (76, 0.09668419738474388), (77, 0.09668419738474388), (78, 0.12578904090424883), (79, 0.07965895533647631), (80, 0.12578904090424883), (81, 0.12578904090424883), (82, 0.06757935386523892), (83, 0.12578904090424883), (84, 0.12578904090424883), (85, 0.12578904090424883), (86, 0.12578904090424883), (87, 0.09668419738474388), (88, 0.12578904090424883), (89, 0.12578904090424883), (90, 0.12578904090424883), (91, 0.12578904090424883), (92, 0.12578904090424883), (93, 0.12578904090424883), (94, 0.12578904090424883), (95, 0.12578904090424883)]
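The weights above can be reproduced in principle with a few lines of plain Python. The sketch below assumes the weighting that gensim's TfidfModel applies by default, as far as I am aware: raw term frequency multiplied by log2(N/df), followed by L2 normalisation of each document vector. The function tfidf_transform and the toy corpus are illustrative assumptions.

```python
import math


def tfidf_transform(bow_corpus):
    """Apply tf * log2(N / df) weighting and L2-normalise each document."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents does each term id occur?
    doc_freq = {}
    for doc in bow_corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    result = []
    for doc in bow_corpus:
        weights = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in doc]
        # Terms that occur in every document get weight 0 and are dropped.
        weights = [(t, w) for t, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(t, w / norm) for t, w in weights] if norm > 0 else [])
    return result


# Hypothetical toy corpus: term 0 occurs in both documents.
toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
toy_tfidf = tfidf_transform(toy_corpus)
print(toy_tfidf)
```

Terms that appear in every document carry no discriminative information, so they vanish from the tf-idf vectors.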
6.3.4.2. Generate LSI model from tf-idf model#
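techdictionary = corpora.Dictionary(techdocs) # not used below: techcorpus was built with the shared dictionary, which is therefore passed as id2word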
NumTopics=20
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=NumTopics) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf]
Display first 10 topics:
lsi.print_topics(10)
[(0,
'-0.349*"microsoft365r" + -0.189*"microsoft" + -0.156*"outlook" + -0.155*"move" + -0.155*"emails" + -0.155*"folders" + -0.155*"copy" + -0.142*"teams" + -0.133*"365" + -0.132*"email"'),
(1,
'0.195*"move" + 0.195*"folders" + 0.195*"emails" + 0.195*"copy" + 0.174*"client" + -0.163*"app" + 0.150*"blastula" + 0.144*"outlook" + 0.137*"account" + -0.125*"shiny"'),
(2,
'-0.180*"gpt" + -0.179*"models" + -0.179*"significantly" + -0.179*"energy" + -0.144*"tool" + -0.144*"verbal" + -0.144*"descriptions" + -0.144*"surprising" + -0.133*"metaverse" + -0.128*"2022"'),
(3,
'-0.225*"packages" + -0.215*"azurer" + 0.188*"azure" + 0.159*"cosmos" + 0.159*"db" + -0.156*"update" + -0.126*"june" + -0.126*"caching" + 0.119*"azurecosmosr" + 0.114*"functions"'),
(4,
'-0.204*"teams" + -0.181*"team" + 0.145*"app" + 0.137*"metaverse" + 0.135*"azure" + -0.135*"()" + -0.133*"list_teams" + -0.126*"chats" + -0.116*"list" + 0.115*"cosmos"'),
(5,
'0.254*"metaverse" + 0.164*"could" + 0.163*"live" + 0.134*"barr" + 0.134*"cause" + 0.134*"epstein" + 0.133*"think" + 0.131*"get" + -0.129*"packages" + 0.127*"find"'),
(6,
'0.316*"ai" + 0.237*"adoption" + 0.200*"security" + 0.200*"companies" + 0.200*"data" + 0.181*"secure" + 0.181*"future" + 0.125*"ukraine" + 0.107*"like" + 0.101*"help"'),
(7,
'-0.249*"party" + -0.249*"middleman" + -0.246*"two" + -0.213*"ukraine" + -0.142*"cyber" + -0.142*"offensive" + -0.124*"building" + -0.124*"actually" + -0.124*"despite" + -0.124*"comes"'),
(8,
'0.305*"pendulum" + 0.305*"slowly" + 0.305*"swing" + 0.239*"way" + 0.153*"underneath" + 0.153*"oscillate" + 0.153*"watching" + 0.153*"pendulums" + 0.153*"cliche" + 0.153*"earth"'),
(9,
'-0.293*"barr" + -0.293*"cause" + -0.293*"epstein" + 0.226*"find" + -0.196*"ms" + 0.178*"live" + -0.147*"think" + -0.115*"app" + 0.113*"hub" + 0.113*"probably"')]
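LSI is based on a truncated singular value decomposition of the term-document matrix. The following NumPy sketch illustrates this on a hypothetical 4x4 toy matrix; it is not gensim's incremental algorithm. It also shows why the sign of a topic carries no meaning: flipping the signs of both the term and the document factors yields the same approximation.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
k = 2  # number of topics to keep

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each column of U_k is a topic, given as weights over the terms.
# Projecting the documents onto the topics gives one row per document.
doc_topic = (U_k.T @ X).T
print(doc_topic.shape)

# Rank-k approximation, and the same approximation with all signs flipped.
X_k = U_k @ np.diag(s_k) @ Vt_k
X_k_flipped = (-U_k) @ np.diag(s_k) @ (-Vt_k)
print(np.allclose(X_k, X_k_flipped))
```

For the topics printed above, this means that a topic with mostly negative term weights is just as valid as the same topic with all signs inverted.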
6.3.4.3. Determine the most relevant documents for a selected topic#
Generate a numpy array docTopic. The entry in row \(i\), column \(j\) of this array is the relevance value for topic \(j\) in document \(i\).
import numpy as np
numdocs= len(corpus_lsi)
docTopic=np.zeros((numdocs,NumTopics))
for d,doc in enumerate(corpus_lsi): # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    for topicIdx,value in doc: # each entry is a (topic id, value) pair; use the topic id as column index
        docTopic[d,topicIdx]=value
print(docTopic.shape)
print(docTopic)
(20, 20)
[[-6.30607635e-01 5.36371130e-01 -5.55603925e-02 -1.10429913e-02
1.02211588e-01 -4.19538185e-02 5.23341721e-02 -2.01324536e-02
9.36476226e-03 1.48590022e-02 -5.64484933e-02 -1.76516387e-02
-2.52711931e-02 3.12620379e-02 3.97842969e-03 9.22752537e-03
-2.30506582e-02 -6.48237130e-02 -2.02715658e-01 -4.94701015e-01]
[-6.04091276e-01 5.73628893e-01 -6.18943219e-02 6.50059353e-02
1.36255030e-01 -6.08886307e-02 1.62551604e-02 -4.69463769e-02
1.80244057e-03 2.64716158e-02 -4.91570907e-02 4.11850780e-04
-1.38490360e-02 2.17868203e-02 -5.68861208e-02 -1.21444467e-03
1.17206225e-03 -1.12703328e-02 8.58949481e-02 5.08620690e-01]
[-3.03491346e-02 -3.01121930e-02 -6.77552212e-02 3.23638202e-02
-4.63767220e-02 1.53101006e-01 1.56397015e-01 -7.49283302e-01
8.67874972e-02 3.13360382e-02 1.88340293e-01 5.51809912e-01
4.37718308e-02 -1.52561947e-01 -5.61926302e-02 5.06158688e-02
-2.66355793e-02 -7.51413228e-03 2.26857889e-02 -1.96536220e-02]
[-2.38455142e-01 -2.19308458e-01 3.91593208e-02 -5.52749115e-01
8.52552324e-02 -3.03535230e-01 1.20765156e-01 -4.48972721e-02
-4.98519173e-02 1.09683911e-01 -3.41206046e-01 6.74118871e-02
1.00182191e-01 -2.17779384e-01 1.85474429e-01 2.48904436e-01
6.05801293e-03 4.01479785e-01 -1.39144820e-01 4.57341981e-02]
[-1.30514812e-01 -2.61599673e-01 -2.26025793e-02 3.51731394e-01
3.09718477e-01 -1.38122021e-01 -2.03790332e-01 2.26756102e-02
2.37883789e-02 2.09293409e-01 -4.34966679e-01 2.98409037e-01
-2.08772531e-01 -6.63282576e-02 7.43404012e-02 -4.95913075e-01
8.03070307e-02 -1.54263843e-02 -6.70225583e-02 -4.82225129e-03]
[-2.39291785e-02 -9.08826976e-02 -1.50914455e-01 1.84879449e-01
1.21793894e-01 -8.71784281e-02 6.09229394e-01 2.38374387e-01
1.17823122e-01 8.31075571e-02 4.36596145e-02 1.42232253e-01
6.09593569e-01 2.07261955e-01 4.14597348e-02 -1.56094605e-01
-4.61079434e-03 2.59587458e-03 -1.30098110e-02 5.16299877e-03]
[-6.00555645e-02 -4.24333870e-02 -5.37637107e-01 8.42846021e-02
-2.83060151e-01 -3.15977611e-01 -2.30785026e-01 2.72557924e-03
-4.79833413e-02 -1.04435348e-01 1.02944148e-01 7.97312448e-02
5.60029929e-03 9.24503801e-02 3.87596688e-01 -2.15310685e-02
-5.24117200e-01 -1.97832407e-02 -1.64142487e-02 2.01135662e-02]
[-5.20350369e-02 -1.27413098e-01 -1.70000161e-01 1.52500791e-01
-2.59232036e-01 1.88208007e-03 2.92959380e-01 -4.84237389e-01
-3.82058687e-02 2.65143039e-02 -3.73275357e-01 -5.49920209e-01
-7.65006791e-02 2.18507134e-01 -1.21632828e-01 -1.31537160e-01
-3.59551551e-02 9.40152611e-02 3.95207661e-02 -9.41403597e-03]
[-3.31438174e-02 -5.85697458e-02 -2.37501046e-01 -7.34280933e-02
1.38103118e-01 3.49474204e-01 1.25963065e-02 9.85750386e-02
-2.14843025e-01 -6.98404822e-01 -3.62263659e-01 1.47303665e-01
1.09220812e-01 -1.05286508e-01 -2.31339278e-01 -1.09470554e-03
-1.33637238e-01 -3.11444068e-02 -1.79309295e-02 6.94238150e-03]
[-1.61050673e-01 -2.91378078e-01 3.24281690e-03 4.59682332e-01
3.28382424e-01 -1.46496542e-01 -1.22444828e-01 -5.54832087e-02
9.10591673e-02 -1.30813831e-02 -1.11762204e-01 -1.33900871e-02
1.55657016e-02 2.04509065e-01 -5.91564921e-02 6.69807910e-01
-2.56235721e-04 -1.22461081e-01 -1.82093736e-04 -1.17298188e-02]
[-3.29249374e-01 -1.25962351e-01 5.96766984e-02 -1.05894085e-01
-3.82966304e-01 1.41928175e-01 5.42660489e-03 2.31757937e-01
1.26086962e-01 3.81256938e-02 -5.32890500e-02 3.62634554e-01
-2.47599249e-01 5.31280907e-01 -2.13788006e-01 3.97709499e-02
-5.12263062e-03 2.61484597e-01 1.68221955e-01 -3.40522935e-02]
[-3.25893490e-02 -8.76820895e-02 -2.27412859e-01 1.41942539e-01
6.05588332e-02 -1.16822246e-01 5.94238497e-01 2.00987313e-01
2.67688072e-02 -1.37257949e-01 1.87424278e-01 1.36874494e-02
-6.05406630e-01 -2.74189112e-01 4.97206939e-02 7.97070518e-02
4.99653184e-02 2.22867194e-02 9.69866279e-03 8.62316810e-03]
[-5.33905602e-02 -4.24298675e-02 -6.17203363e-01 -1.34125655e-02
-2.08692510e-01 -2.45048817e-01 -2.14462005e-01 2.33784858e-02
-1.35940941e-01 9.44141370e-03 9.69567755e-02 2.76592924e-02
1.23863952e-01 -6.73404319e-02 -2.27623925e-01 2.65831198e-02
6.01991599e-01 7.77353107e-03 -1.74550715e-03 -2.23350919e-02]
[-4.00844315e-01 -3.24949514e-01 9.11471660e-02 1.04395173e-01
2.31670203e-01 1.16353600e-02 -1.46794990e-01 -9.07407616e-02
-9.44173151e-03 -2.30344118e-01 4.97195570e-01 -1.91447431e-01
2.74699492e-02 7.53051304e-02 -1.39743044e-01 -2.12480301e-01
-4.06660410e-02 3.60845244e-01 -2.97923148e-01 4.70168334e-02]
[-4.45924921e-01 -3.05884540e-01 1.57109756e-01 2.12990044e-02
-4.84610759e-01 2.37641647e-01 6.68994379e-02 8.80260282e-02
-3.12489221e-02 6.06898283e-02 -5.06910900e-02 9.57178901e-03
3.55987544e-02 -1.19518735e-01 1.26107051e-01 5.05165707e-02
8.27138312e-02 -3.89276668e-01 -4.03182809e-01 1.29000337e-01]
[-5.81560226e-01 -2.45734286e-01 1.23075810e-01 1.54166545e-01
-1.14514682e-01 1.14518802e-01 -6.11618709e-02 5.62301732e-02
-7.11764977e-02 -7.41040091e-03 6.73110933e-02 -1.22910218e-01
1.74988079e-01 -3.68605996e-01 1.40620323e-01 -3.26573462e-02
1.16039205e-03 4.75057047e-02 5.48853241e-01 -1.17337520e-01]
[-5.44930230e-02 -5.21152589e-02 -3.05858855e-01 -1.76446419e-01
2.99841550e-01 5.42571242e-01 -2.17581122e-02 -6.70149820e-02
3.75545737e-02 2.77580366e-02 3.92058027e-02 -8.15804008e-02
-8.16578008e-02 2.71991030e-01 5.64157257e-01 2.15813842e-02
2.61723574e-01 1.77175147e-02 3.96573579e-02 1.52695717e-02]
[-2.40429387e-01 -3.00937333e-01 4.86646674e-03 -5.38328944e-01
2.71375403e-01 -3.04424712e-01 5.61104075e-02 -1.10447283e-01
5.53415322e-03 -6.77444081e-02 1.14757571e-01 -4.16385133e-02
-8.48757857e-02 2.19092393e-01 -1.42329942e-01 -1.51535623e-01
-4.75179702e-02 -4.80079755e-01 1.71547693e-01 -5.50752265e-03]
[-4.17200254e-02 -1.12661003e-01 -3.56508178e-01 -9.65223444e-02
1.83043297e-01 3.61581754e-01 -6.04764837e-03 1.27011215e-01
-2.72832045e-01 5.84164466e-01 6.55580141e-02 -5.56132826e-02
-2.06603903e-02 -1.24119099e-01 -3.47351088e-01 7.34893572e-02
-3.20369927e-01 -6.98185724e-04 -3.36751412e-02 2.06396280e-03]
[-2.87846724e-02 -3.94687297e-02 -2.39531313e-01 -1.17457712e-01
-4.48507424e-03 1.19559026e-01 -1.25202515e-01 6.12179492e-02
8.91104369e-01 -7.86395688e-03 -7.40060549e-02 -1.31303915e-01
3.36466165e-02 -2.16138123e-01 -1.46654709e-01 -1.72530754e-02
-7.55198851e-02 -8.55902389e-03 -1.88849640e-02 8.46486444e-03]]
Select an arbitrary topic id and determine the documents with the most negative and the most positive relevance values for this topic:
topicId=7 #select an arbitrary topic-id
topicRelevance=docTopic[:,topicId]
docsoftopic= np.array(topicRelevance).argsort() # document indices sorted by ascending relevance value
relevanceValue= np.sort(topicRelevance)
print(docsoftopic) #document with the most negative value is at first position, the one with the most positive value at last position
print(relevanceValue) #relevance values in ascending order, matching the document order above
[ 2 7 17 13 16 9 1 3 0 6 4 12 15 19 14 8 18 11 10 5]
[-0.7492833 -0.48423739 -0.11044728 -0.09074076 -0.06701498 -0.05548321
-0.04694638 -0.04489727 -0.02013245 0.00272558 0.02267561 0.02337849
0.05623017 0.06121795 0.08802603 0.09857504 0.12701122 0.20098731
0.23175794 0.23837439]
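Because the sign of an LSI topic is arbitrary, it can be useful to rank documents by the absolute value of their topic weight instead of inspecting the negative and positive ends separately. The sketch below shows one way to do this with NumPy; the function rank_docs_for_topic and the toy matrix are assumptions for illustration.

```python
import numpy as np


def rank_docs_for_topic(doc_topic, topic_id, top=3):
    """Return (document index, value) pairs with the largest absolute topic weight."""
    scores = doc_topic[:, topic_id]
    # Negating the absolute values makes argsort return the largest magnitudes first.
    order = np.argsort(-np.abs(scores))[:top]
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical document-topic matrix: 4 documents, 2 topics.
toy_doc_topic = np.array([
    [0.1, -0.7],
    [0.5, 0.2],
    [-0.3, 0.05],
    [0.0, -0.4],
])
top_docs = rank_docs_for_topic(toy_doc_topic, topic_id=1, top=2)
print(top_docs)
```

The signed value is kept in the result, so it remains visible on which side of the topic axis a document lies.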
TOP=8
print("Selected Topic:\n",lsi.show_topic(topicId))
print("#"*50)
print("Docs with the highest negative value w.r.t the selected topic")
for idx in docsoftopic[:TOP]:
    print("-"*20)
    print(idx,"\n",techdocs[idx])
print("#"*50)
print("Docs with the highest positive value w.r.t the selected topic")
for idx in docsoftopic[-TOP:]:
    print("-"*20)
    print(idx,"\n",techdocs[idx])
Selected Topic:
[('party', -0.2488676878930281), ('middleman', -0.2488676878930281), ('two', -0.24596576983629964), ('ukraine', -0.2134234697117422), ('cyber', -0.1422823131411615), ('offensive', -0.1422823131411615), ('building', -0.12443384394651405), ('actually', -0.12443384394651405), ('despite', -0.12443384394651405), ('comes', -0.12443384394651405)]
##################################################
Docs with the highest negative value w.r.t the selected topic
--------------------
2
['building', 'better', 'middleman', 'comes', 'mind', 'hear', 'term', 'two', 'sided', 'market', '?”', 'maybe', 'imagine', 'party', 'needs', 'something', 'interact', 'party', 'provides', 'despite', 'number', 'two', 'name', 'actually', 'someone', 'else', 'involved', 'middleman', 'entity', 'sits', 'parties', 'make', '[…]']
--------------------
7
['day', 'kyiv', 'experience', 'working', 'ukraine', 'offensive', 'cyber', 'team', 'jeffrey', 'carrmarch', '22', '2022', 'russia', 'invaded', 'ukraine', 'february', '24th', 'working', 'two', 'offensive', 'cyber', 'operators', 'gurmo', 'main', 'intelligence', 'directorate', 'ministry', 'defense', 'ukraine', 'several', 'months', 'trying', 'help', 'raise', 'funds', 'expand', 'development', 'osint', 'open', '[…]']
--------------------
17
['azurer', 'update', 'new', 'may', 'june', 'hong', 'ooi', 'summary', 'updates', 'azurer', 'family', 'packages', 'may', 'june', '2021', 'azureauth', 'change', 'default', 'caching', 'behaviour', 'disable', 'cache', 'running', 'inside', 'shiny', 'update', 'shiny', 'vignette', 'clean', 'redirect', 'page', 'authenticating', 'thanks', 'tyler', 'littlefield', ').', 'add', 'create_azurer_dir', 'function', 'create', 'caching', 'directory', 'manually', 'useful', 'non', 'interactive', 'sessions', 'also', 'jupyter', 'notebooks', 'technically', 'interactive', 'sense', 'cannot', 'read', 'user', 'input', 'console', 'prompt', 'azuregraph', 'add', 'enhanced', 'support', 'paging', 'api', 'many', '...']
--------------------
13
['using', 'microsoft365r', 'shiny', 'hong', 'ooi', 'article', 'lightly', 'edited', 'version', 'microsoft365r', 'shiny', 'vignette', 'latest', 'microsoft365r', 'release', 'describe', 'incorporate', 'microsoft365r', 'interactive', 'authentication', 'azure', 'active', 'directory', 'aad', 'shiny', 'web', 'app', 'steps', 'involved', 'register', 'app', 'aad', 'use', 'app', 'id', 'authenticate', 'get', 'oauth', 'token', 'pass', 'token', 'microsoft365r', 'functions', 'app', 'registration', 'default', 'microsoft365r', 'app', 'registration', 'works', 'package', 'used', 'local', 'machine', 'support', 'running', 'remote', 'server', '...']
--------------------
16
['identity', 'problems', 'get', 'bigger', 'metaverse', 'hype', 'surrounding', 'metaverse', 'results', 'something', 'real', 'could', 'improve', 'way', 'live', 'work', 'play', 'could', 'create', 'hellworld', 'get', 'want', 'whatever', 'people', 'think', 'read', 'metaverse', 'originally', 'imagined', 'snow', 'crash', 'vision', '[…]']
--------------------
9
['azurecosmosr', 'interface', 'azure', 'cosmos', 'db', 'hong', 'ooi', 'last', 'week', 'announced', 'azurecosmosr', 'interface', 'azure', 'cosmos', 'db', 'fully', 'managed', 'nosql', 'database', 'service', 'azure', 'post', 'gives', 'short', 'rundown', 'main', 'features', 'azurecosmosr', 'explaining', 'azure', 'cosmos', 'db', 'tricky', 'excerpt', 'official', 'description', 'azure', 'cosmos', 'db', 'fully', 'managed', 'nosql', 'database', 'modern', 'app', 'development', 'single', 'digit', 'millisecond', 'response', 'times', 'automatic', 'instant', 'scalability', 'guarantee', 'speed', 'scale', 'business', 'continuity', 'assured', 'sla', 'backed', 'availability', 'enterprise', 'grade', 'security', 'app', 'development', 'faster', 'productive', 'thanks', 'turnkey', 'multi', 'region', '...']
--------------------
1
['outlook', 'client', 'support', 'microsoft365r', 'available', 'beta', 'test', 'hong', 'ooi', 'announcement', 'beta', 'outlook', 'email', 'client', 'part', 'microsoft365r', 'package', 'install', 'github', 'repository', 'devtools', '::', 'install_github', '("', 'azure', 'microsoft365r', '")', 'client', 'provides', 'following', 'features', 'send', 'reply', 'forward', 'emails', 'optionally', 'composed', 'blastula', 'emayili', 'copy', 'move', 'emails', 'folders', 'create', 'delete', 'copy', 'move', 'folders', 'add', 'remove', 'download', 'attachments', 'plan', 'submit', 'cran', 'sometime', 'next', 'month', 'period', 'public', 'testing', 'please', 'give', 'try', 'give', 'feedback', 'either', 'via', 'email', 'opening', 'issue', '...']
--------------------
3
['new', 'azurer', 'hong', 'ooi', 'update', 'happening', 'azurer', 'suite', 'packages', 'first', 'may', 'noticed', 'holiday', 'season', 'packages', 'updated', 'cran', 'change', 'maintainer', 'email', 'non', 'microsoft', 'address', 'left', 'microsoft', 'role', 'westpac', 'bank', 'australia', 'sad', 'leaving', 'intend', 'continue', 'maintaining', 'updating', 'packages', 'end', 'changes', 'recently', 'submitted', 'cran', 'shortly', 'azureauth', 'allows', 'obtaining', 'tokens', 'organizations', '”...']
##################################################
Docs with the highest positive value w.r.t the selected topic
--------------------
15
['microsoft365r', 'interface', 'microsoft', '365', 'suite', 'happy', 'announce', 'microsoft365r', 'package', 'working', 'microsoft', '365', 'formerly', 'known', 'office', '365', 'suite', 'cloud', 'services', 'microsoft365r', 'extends', 'interface', 'microsoft', 'graph', 'api', 'provided', 'azuregraph', 'package', 'provide', 'lightweight', 'yet', 'powerful', 'interface', 'sharepoint', 'onedrive', 'support', 'teams', 'outlook', 'soon', 'come', 'microsoft365r', 'available', 'cran', 'install', 'development', 'version', 'github', 'devtools', '::', 'install_github', '("', 'azure', 'microsoft365r', '").', 'authentication', 'first', 'time', 'call', 'one', 'microsoft365r', 'functions', 'see', '),', 'use', 'internet', 'browser', 'authenticate', 'azure', 'active', 'directory', 'aad', '),...']
--------------------
19
['general', 'purpose', 'pendulum', 'pendulums', 'swing', 'one', 'way', 'swing', 'back', 'way', 'oscillate', 'quickly', 'slowly', 'slowly', 'watch', 'earth', 'rotate', 'underneath', 'cliche', 'talk', 'technical', 'trend', 'pendulum', ',”', 'though', 'accurate', 'often', 'enough', 'may', 'watching', 'one', '[…]']
--------------------
14
['teams', 'support', 'microsoft365r', 'hong', 'ooi', 'happy', 'announce', 'version', 'microsoft365r', 'interface', 'microsoft', '365', 'cran', 'version', 'adds', 'support', 'microsoft', 'teams', 'much', 'requested', 'feature', 'access', 'team', 'microsoft', 'teams', 'use', 'get_team', '()', 'function', 'provide', 'team', 'name', 'id', 'also', 'list', 'teams', 'list_teams', '().', 'return', 'objects', 'r6', 'class', 'ms_team', 'methods', 'working', 'channels', 'drives', 'list_teams', '()', 'team']
--------------------
8
['epstein', 'barr', 'cause', 'cause', 'one', 'intriguing', 'news', 'stories', 'new', 'year', 'claimed', 'epstein', 'barr', 'virus', 'ebv', 'cause', 'multiple', 'sclerosis', 'ms', '),', 'suggested', 'antiviral', 'medications', 'vaccinations', 'epstein', 'barr', 'could', 'eliminate', 'ms', 'md', 'epidemiologist', 'think', 'article', 'forces', 'us', 'think', '[…]']
--------------------
18
['recommendations', 'us', 'live', 'household', 'communal', 'device', 'like', 'amazon', 'echo', 'google', 'home', 'hub', 'probably', 'use', 'play', 'music', 'live', 'people', 'may', 'find', 'time', 'spotify', 'pandora', 'algorithm', 'seems', 'know', 'well', 'find', 'songs', 'creeping', '[…]']
--------------------
11
['ai', 'adoption', 'enterprise', '2022', 'december', '2021', 'january', '2022', 'asked', 'recipients', 'data', 'ai', 'newsletters', 'participate', 'annual', 'survey', 'ai', 'adoption', 'particularly', 'interested', 'anything', 'changed', 'since', 'last', 'year', 'companies', 'farther', 'along', 'ai', 'adoption', 'working', 'applications', 'production', 'using', 'tools', 'like', 'automl', 'generate', '[…]']
--------------------
10
['microsoft365r', 'testers', 'wanted', 'hong', 'ooi', 'microsoft365r', 'author', 'updated', 'package', 'github', 'following', 'features', 'add', 'support', 'shared', 'mailboxes', 'get_business_outlook', '().', 'access', 'shared', 'mailbox', 'supply', 'one', 'arguments', 'shared_mbox_id', 'shared_mbox_name', 'shared_mbox_email', 'specifying', 'id', 'displayname', 'email', 'address', 'mailbox', 'respectively', 'add', 'support', 'teams', 'chats', 'including', 'one', 'one', 'group', 'meeting', 'chats', ').', 'use', 'list_chats', '()', 'function', 'list', 'chats', 'participating', 'get_chat', '()`', 'function', 'retrieve', 'specific', 'chat', 'chat', 'object', 'class', 'ms_chat', 'similar', 'methods', 'channel', 'send', 'list', 'retrieve', 'messages', ',...']
--------------------
5
['future', 'security', 'future', 'cybersecurity', 'shaped', 'need', 'companies', 'secure', 'networks', 'data', 'devices', 'identities', 'includes', 'adopting', 'security', 'frameworks', 'like', 'zero', 'trust', 'help', 'companies', 'secure', 'internal', 'information', 'systems', 'data', 'cloud', 'sheer', 'volume', 'new', 'threats', 'today', 'security', 'landscape', 'become', 'complex', '[…]']
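import gensim
# Train an LDA model with 20 topics. Note: LDA is usually fit on raw bag-of-words counts (e.g. techcorpus); here the tf-idf weighted corpus is used.
lda = gensim.models.ldamodel.LdaModel(corpus_tfidf, num_topics=20, id2word = dictionary)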
#!pip install pyLDAvis
#import pyLDAvis.gensim as gensimvis
#import pyLDAvis
#vis_en = gensimvis.prepare(lda, corpus_tfidf, dictionary)
#pyLDAvis.display(vis_en)