1.3. Accessing Text with LangChain#

LangChain is a framework for developing LLM applications. It provides modules and functions to

  • access text from different sources

  • interact with a large variety of LLMs

  • build AI agents such as chatbots

  • build advanced applications, e.g. entire Retrieval-Augmented Generation (RAG) pipelines

The main package langchain contains the chains, agents, and retrieval strategies that make up an application’s cognitive architecture. In addition, the package langchain-community contains third-party integrations that are maintained by the LangChain community, i.e. the integrations for components such as LLMs, vector stores, and retrievers.
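The split between the two packages is reflected in the imports. A minimal sketch (assuming a recent LangChain release; in older versions the loaders are also available under langchain.document_loaders, as used in the cells below):

from langchain.chains import LLMChain                          # core package: chains, agents, retrieval strategies
from langchain_community.document_loaders import TextLoader    # community package: third-party integrations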

#!pip install langchain
#!pip install -U langchain-community
#!pip install unstructured
#!pip install playwright
#!pip install asyncio

1.3.1. Accessing Text Files#

For accessing text from different sources like text files, PDF documents, CSV files, and web pages, LangChain provides different loaders, e.g. the TextLoader, which is applied below to read a local text file.

As can be seen in the outputs below, each loader returns a list of Document objects. Each Document object contains two components (see also the small sketch after this list):

  • page_content contains the textual content extracted from the document page.

  • metadata is a dictionary of additional details, such as the document’s source (the file it originates from), the page number, the file type, and other information.
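A Document object can also be constructed by hand, which makes these two components explicit. A minimal sketch (content and metadata below are made up for illustration):

from langchain.docstore.document import Document

doc = Document(page_content="Some example text.",
               metadata={"source": "example.txt"})
print(doc.page_content)   # the raw text
print(doc.metadata)       # the metadata dictionary, here only the source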

from langchain.document_loaders import TextLoader

# read a local text file; the encoding must be given explicitly for non-ASCII content
loader = TextLoader("../Data/ZeitOnlineLandkartenA.txt", encoding="utf-8")
text = loader.load()
text
[Document(metadata={'source': '../Data/ZeitOnlineLandkartenA.txt'}, page_content='Landkarten mit Mehrwert\nOb als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen. \n\nZEIT ONLINE stellt einige der interessantesten Dienste vor.\n\nDie Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen. Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen. Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus. Dank der Integration von Geodaten in Smartphones werden soziale \nKartendienste immer beliebter. Auch sie nutzen die offenen Schnittstellen. Neben kommerziellen Diensten profitieren aber auch Privatpersonen und unabhängige \nProjekte von den Möglichkeiten des frei zugänglichen Kartenmaterials. Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu sammeln und neue \nMöglichkeiten für Bürger herauszuarbeiten. So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden übermitteln.\nUnter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches Pilotprojekt für Frankfurt am Main. Hier geht es um weit mehr als Reparaturen. Die Seite soll \neinen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort. Tausende dieser Mashups und Initiativen gibt es inzwischen. Sie bewegen sich zwischen bizarr und faszinierend, unterhaltsam und informierend. ZEIT ONLINE stellt einige der interessantesten vor. Sie zeigen, was man mit öffentlichen Datensammlungen alles machen kann.')]
print("Returned list contains %d Document objects"%len(text))
Returned list contains 1 Document objects
print("Metadata of the single document-object: ",text[0].metadata)
Metadata of the single document-object:  {'source': '../Data/ZeitOnlineLandkartenA.txt'}
print("Page Content of the single document-object: \n",text[0].page_content)
Page Content of the single document-object: 
 Landkarten mit Mehrwert
Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen. 

ZEIT ONLINE stellt einige der interessantesten Dienste vor.

Die Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen. Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen. Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus. Dank der Integration von Geodaten in Smartphones werden soziale 
Kartendienste immer beliebter. Auch sie nutzen die offenen Schnittstellen. Neben kommerziellen Diensten profitieren aber auch Privatpersonen und unabhängige 
Projekte von den Möglichkeiten des frei zugänglichen Kartenmaterials. Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu sammeln und neue 
Möglichkeiten für Bürger herauszuarbeiten. So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden übermitteln.
Unter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches Pilotprojekt für Frankfurt am Main. Hier geht es um weit mehr als Reparaturen. Die Seite soll 
einen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort. Tausende dieser Mashups und Initiativen gibt es inzwischen. Sie bewegen sich zwischen bizarr und faszinierend, unterhaltsam und informierend. ZEIT ONLINE stellt einige der interessantesten vor. Sie zeigen, was man mit öffentlichen Datensammlungen alles machen kann.

1.3.2. Accessing CSV Files#

from langchain.document_loaders.csv_loader import CSVLoader


#loader = CSVLoader(file_path='../Data/healthcare_dataset.csv') #comma-separated file
loader = CSVLoader(file_path='../Data/imdb/labeledTrainData.tsv',csv_args={'delimiter': '\t'}) #tab-separated file
csvdata = loader.load()

As can be seen below, the CSVLoader returns one Document object for each row of the CSV file:

print("Returned list contains %d Document objects"%len(csvdata))
Returned list contains 25000 Document objects
print(csvdata[0].metadata)
{'source': '../Data/imdb/labeledTrainData.tsv', 'row': 0}
print(csvdata[0].page_content)
id: 5814_8
sentiment: 1
review: With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.
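By default the source entry in the metadata points to the file itself. If one column of the file identifies the individual records, it can be promoted to the source via the optional source_column argument of the CSVLoader. A short sketch, assuming the same TSV file with its id column:

loader = CSVLoader(file_path='../Data/imdb/labeledTrainData.tsv',
                   csv_args={'delimiter': '\t'},
                   source_column="id")
csvdata = loader.load()
print(csvdata[0].metadata)   # source now holds the row's id, e.g. '5814_8'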

1.3.3. Accessing all Files in a Directory#
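The DirectoryLoader loads all files in a directory that match a given glob pattern. By default it parses each file with the unstructured package, which is why the additional installations below are needed.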

#!pip install unstructured
#!pip install python-magic-bin
#!pip install --upgrade nltk
#nltk.download()
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('../Data/textdir', glob="*.txt",show_progress=True)
textdocs=loader.load()
100%|█████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]
len(textdocs)
2
print(textdocs[1].page_content[:500])
PROJECT GUTENBERG AND DUNCAN RESEARCH SHAREWARE

(c)1991

Project Gutenberg has made arrangements with Duncan Research for the distribution of Duncan Research Electronic Library text. No money is solicited by Project Gutenberg. All donations go to:

Barbara  Duncan Duncan Research P.O. Box  2782 Champaign,   IL 61825 - 2782

Please, if you send in a request for information, donate enough, or more than enough to cover the cost of writing, printing, etc. as well as the cost of postage.

This is Sh
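If the directory contains only plain text files, the dependency on unstructured can be avoided by specifying which loader class the DirectoryLoader should use per file. A small sketch, assuming the same directory as above:

from langchain.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader('../Data/textdir', glob="*.txt",
                         loader_cls=TextLoader,
                         loader_kwargs={"encoding": "utf-8"},
                         show_progress=True)
textdocs = loader.load()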

1.3.4. Accessing HTML#

Below we demonstrate the use of two different loaders provided by LangChain, which can be applied for crawling websites:

  • The WebBaseLoader class returns a single document for each provided URL. The page content of the returned Document object contains the entire text of the web page; restricting it to specific HTML elements requires configuring the underlying BeautifulSoup parser (a short sketch follows this list).

  • The UnstructuredURLLoader can be configured such that it returns the individual elements of the HTML page, each element as a separate Document object. In this way, for example, only the elements containing narrative text can be kept, as demonstrated below.
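For the first case, the WebBaseLoader passes the optional bs_kwargs argument on to BeautifulSoup, so parsing can already be restricted to selected tags at load time. A hedged sketch (restricting to <p> elements is only an illustration, not taken from the notebook below):

import bs4
from langchain.document_loaders import WebBaseLoader

url = "https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/"
loader = WebBaseLoader(url, bs_kwargs={"parse_only": bs4.SoupStrainer("p")})
page = loader.load()   # page[0].page_content now contains only the text of the <p> elements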

from langchain.document_loaders import UnstructuredURLLoader, WebBaseLoader
from langchain.docstore.document import Document
#from unstructured.cleaners.core import remove_punctuation,clean,clean_extra_whitespace,replace_unicode_quotes
url="https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/"

1.3.4.1. WebBaseLoader#

loader=WebBaseLoader(url)
page=loader.load()
print("Length of returned list: ",len(page))
Length of returned list:  1
for d in page[0].metadata.keys():
    print(d,"\t",page[0].metadata[d])
#print(page[0].metadata)
source 	 https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/
title 	 GaLore, A Memory-Saving Method for Pretraining and Fine-Tuning LLMs
description 	 Low-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method...
language 	 en
page[0].page_content
"GaLore, A Memory-Saving Method for Pretraining and Fine-Tuning LLMs✨ New course! Enroll in  Retrieval Optimization: From Tokenization to Vector QuantizationExplore CoursesAI NewsletterThe BatchAndrew's LetterData PointsML ResearchBlogCommunityForumEventsAmbassadorsAmbassador SpotlightResourcesCompanyAboutCareersContactStart LearningWeekly IssuesAndrew's LettersData PointsML ResearchBusinessScienceAI & SocietyCultureHardwareAI CareersAboutSubscribeThe BatchMachine Learning ResearchArticleLike LoRA, But for Pretraining GaLore, a memory-saving method for pretraining and fine-tuning LLMsMachine Learning ResearchGenerative AILarge Language Models (LLMs)PublishedJul 10, 2024Reading time3 min readShareLow-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method that achieves similar memory savings but works well for both fine-tuning and pretraining.What’s new:\xa0Jiawei Zhao and colleagues at California Institute of Technology, Meta, University of Texas at Austin, and Carnegie Mellon proposed\xa0Gradient Low-Rank Projection\xa0(GaLore), an optimizer modification that saves memory during training by reducing the sizes of optimizer states. They used this approach to pretrain a 7B parameter transformer using a consumer-grade Nvidia RTX 4090 GPU.Key insight:\xa0LoRA\xa0saves memory during training by learning to approximate a change in the weight matrix of each layer in a neural network using the product of two smaller matrices. This approximation results in good performance when fine-tuning (though not quite as good as fine-tuning all weights) but worse performance when pretraining from a random initialization. The authors proved theoretically that updating weights according to an approximate gradient matrix — which reduces the memory required to store optimizer states — can yield the same performance as using the exact gradient matrix (at least for deep neural networks with ReLU activation functions and classification loss functions). Updating weights only once using an approximate gradient matrix is insufficient. However, updating weights repeatedly using gradient approximations\xa0that change with each training step (because the inputs change between training steps)\xa0achieves an effect similar to training weights in the usual way.\xa0How it works:\xa0GaLore approximates a network’s gradient matrix divided into layer-wise matrices. Given a layer’s gradient matrix G (size m x n), GaLore computes a smaller matrix P (size r x m). It uses PG, a smaller approximation of the gradient matrix (size r x n), to update optimizer states.\xa0To further save memory, it updates layers one at a time instead of all at once, following\xa0LOMO.\xa0At each training step, for each layer, GaLore computed the layer-wise gradient matrix normally.GaLore computed a smaller matrix P that, when multiplied by the gradient matrix, yielded a smaller matrix that approximated the weight update. GaLore computed P every 200 training steps (that is, it used the same P for 200 training steps at a time before computing a new P).GaLore multiplied P by the gradient matrix to compute a smaller, approximate version of the gradient matrix. It used this smaller version to update the Adam optimizer’s internal states, requiring less memory to store the optimizer’s internal states. 
Then the optimizer used its internal states to update the smaller matrix.GaLore multiplied P by the smaller matrix to produce a full-sized approximation of the gradient matrix. It used the full-sized approximation to update the current layer’s weights.\xa0Results:\xa0The authors tested GaLore in both pretraining and fine-tuning scenarios.The authors compared GaLore to Adam while pretraining five transformer architectures from 60 million to 7 billion parameters to generate the next token in\xa0web text. GaLore (set up to represent its internal states using 8-bit numbers) pretrained LLaMA 7B from scratch using 22GB of memory, while Adam (modified to represent its internal states using 8-bit numbers) needed 46GB of memory. After training on 19.7 billion tokens, LLaMA 7B achieved 14.65 perplexity, while Adam achieved 14.61 perplexity (a measure of how well a model reproduces validation examples, lower is better).They also used GaLore to fine-tune RoBERTaBase on the multi-task benchmark\xa0GLUE. GaLore needed 253MB of memory and achieved a score of 85.89 (averaging eight of 11 GLUE tasks), while LoRA needed 257MB of memory and reached 85.61.Why it matters:\xa0LoRA’s ability to fine-tune large models using far less memory makes it a very popular fine-tuning method. GaLore is a theoretically motivated approach to memory-efficient training that’s good for both pretraining and fine-tuning.We're thinking:\xa0LoRA-style approximation has been unlocking data- and memory-efficient approaches in\xa0a\xa0variety\xa0of machine learning situations — an exciting trend as models grow and demand for compute resources intensifies.ShareSubscribe to The BatchStay updated with weekly AI News and Insights delivered to your inboxCoursesThe BatchCommunityCareersAbout"

1.3.4.2. UnstructuredURLLoader#

loader=UnstructuredURLLoader([url],mode="elements")
page=loader.load()
print("Length of returned list: ",len(page))
Length of returned list:  25
for d in page[0].metadata.keys():
    print(d,"\t",page[0].metadata[d])
category_depth 	 0
languages 	 ['eng']
filetype 	 text/html
url 	 https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/
category 	 Title
for p in page:
    print("Category: \t",p.metadata["category"])
Category: 	 Title
Category: 	 Title
Category: 	 Title
Category: 	 Title
Category: 	 Title
Category: 	 UncategorizedText
Category: 	 Title
Category: 	 NarrativeText
Category: 	 Title
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 ListItem
Category: 	 ListItem
Category: 	 ListItem
Category: 	 ListItem
Category: 	 NarrativeText
Category: 	 ListItem
Category: 	 ListItem
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 Title
Category: 	 Title
Category: 	 NarrativeText
def generate_document(url):
    """Return a single Document containing only the narrative text of the given URL."""
    loader = UnstructuredURLLoader(urls=[url], mode="elements")
    elements = loader.load()
    # keep only the elements classified as narrative text
    selected_elements = [e for e in elements if e.metadata['category'] == "NarrativeText"]
    # concatenate their contents into one string
    full_clean = " ".join([e.page_content for e in selected_elements])
    return Document(page_content=full_clean, metadata={"source": url})
doc=generate_document(url=url)
print(doc.page_content)
3 min read Low-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method that achieves similar memory savings but works well for both fine-tuning and pretraining. What’s new: Jiawei Zhao and colleagues at California Institute of Technology, Meta, University of Texas at Austin, and Carnegie Mellon proposed Gradient Low-Rank Projection (GaLore), an optimizer modification that saves memory during training by reducing the sizes of optimizer states. They used this approach to pretrain a 7B parameter transformer using a consumer-grade Nvidia RTX 4090 GPU. Key insight: LoRA saves memory during training by learning to approximate a change in the weight matrix of each layer in a neural network using the product of two smaller matrices. This approximation results in good performance when fine-tuning (though not quite as good as fine-tuning all weights) but worse performance when pretraining from a random initialization. The authors proved theoretically that updating weights according to an approximate gradient matrix — which reduces the memory required to store optimizer states — can yield the same performance as using the exact gradient matrix (at least for deep neural networks with ReLU activation functions and classification loss functions). Updating weights only once using an approximate gradient matrix is insufficient. However, updating weights repeatedly using gradient approximations that change with each training step (because the inputs change between training steps) achieves an effect similar to training weights in the usual way. How it works: GaLore approximates a network’s gradient matrix divided into layer-wise matrices. Given a layer’s gradient matrix G (size m x n), GaLore computes a smaller matrix P (size r x m). It uses PG, a smaller approximation of the gradient matrix (size r x n), to update optimizer states. To further save memory, it updates layers one at a time instead of all at once, following LOMO. Results: The authors tested GaLore in both pretraining and fine-tuning scenarios. Why it matters: LoRA’s ability to fine-tune large models using far less memory makes it a very popular fine-tuning method. GaLore is a theoretically motivated approach to memory-efficient training that’s good for both pretraining and fine-tuning. We're thinking: LoRA-style approximation has been unlocking data- and memory-efficient approaches in a variety of machine learning situations — an exciting trend as models grow and demand for compute resources intensifies. Stay updated with weekly AI News and Insights delivered to your inbox

1.3.5. Accessing PDF#
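The PyPDFLoader parses a PDF file with the pypdf package. Its load() method returns one Document object per page; load_and_split() additionally runs the pages through a text splitter (a RecursiveCharacterTextSplitter with default settings, if none is passed explicitly).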

#!pip install pypdf
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../Data/2015-lecun.pdf")
pages = loader.load_and_split()
pages[0]
Document(metadata={'source': '../Data/2015-lecun.pdf', 'page': 0}, page_content='See discussions, st ats, and author pr ofiles f or this public ation at : https://www .researchgate.ne t/public ation/277411157\nDeep Learning\nArticle \xa0\xa0 in\xa0\xa0Nature · May 2015\nDOI: 10.1038/nat ure14539 \xa0·\xa0Sour ce: PubMed\nCITATIONS\n57,367READS\n199,643\n3 author s, including:\nY. Bengio\nUniv ersité de Montr éal\n894 PUBLICA TIONS \xa0\xa0\xa0475,801  CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nAll c ontent f ollo wing this p age was uplo aded b y Y. Bengio  on 28 A ugust 2015.\nThe user has r equest ed enhanc ement of the do wnlo aded file.')
pages[1].metadata
{'source': '../Data/2015-lecun.pdf', 'page': 1}
print(pages[1].page_content[:1000])
1Facebook AI Research, 770 Broadway, New York, New York 10003 USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations 
Research Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128  Centre-Ville STN Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 
94043, USA. 5Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada.Machine-learning technology powers many aspects of modern 
society: from web searches to content filtering on social net -
works to recommendations on e-commerce websites, and 
it is increasingly present in consumer products such as cameras and 
smartphones. Machine-learning systems are used to identify objects 
in images, transcribe speech into text, match news items, posts or 
products with users’ interests, and select relevant results of search. 
Increasingly, these applications make use of a class
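An explicit splitter can also be supplied to load_and_split(); the chunk sizes in the following sketch are chosen arbitrarily for illustration:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = loader.load_and_split(text_splitter=splitter)
print("Number of chunks: %d" % len(chunks))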