# Accessing Text with LangChain
[LangChain](https://python.langchain.com/) is a framework for developing LLM applications. It provides modules and functions to
* access text from different sources
* interact with a large variety of LLMs
* build AI agents such as chatbots
* build advanced applications, e.g. entire Retrieval-Augmented Generation (RAG) pipelines

The main package `langchain` contains the *chains*, *agents*, and *retrieval strategies* that make up an application's cognitive architecture. In addition, the package `langchain-community` contains **third-party integrations** that are maintained by the LangChain community. It covers integrations for components such as LLMs, vector stores, and retrievers.

In [1]:
#!pip install langchain
#!pip install -U langchain-community
#!pip install unstructured
#!pip install playwright
# asyncio is part of the Python standard library and does not need to be installed

## Accessing Text Files

For accessing text from different sources such as text files, PDF documents, CSV files, and web pages, LangChain provides different loaders, e.g. the `TextLoader`, which is applied below to a local text file.

As can be seen in the outputs below, each loader returns a list of `Document` objects. Each `Document` object has two components:
* **page_content** contains the text extracted from the document or document page.
* **metadata** is a dictionary of additional details, such as the document's source (the file it originates from), the page number, and the file type.
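
The structure described above can be mimicked without LangChain. The following is a minimal sketch, not LangChain's implementation: `SimpleDocument` and `load_text` are hypothetical names chosen for illustration, and the temporary file exists only to keep the example self-contained.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two components of a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path, encoding="utf-8"):
    """Read one text file and return it as a single-element list of documents."""
    with open(path, encoding=encoding) as f:
        content = f.read()
    return [SimpleDocument(page_content=content, metadata={"source": path})]


# Write a small demo file so that the example does not depend on local data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", encoding="utf-8", delete=False) as tmp:
    tmp.write("Landkarten mit Mehrwert\n")
    demo_path = tmp.name

docs = load_text(demo_path)
os.remove(demo_path)

print(len(docs))             # number of returned documents
print(docs[0].metadata)      # dictionary with the source path
print(docs[0].page_content)  # text of the file
```

As with the `TextLoader`, one input file yields a list with exactly one document whose metadata records where the text came from.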

In [4]:
# Document loaders are third-party integrations and live in langchain-community
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../Data/ZeitOnlineLandkartenA.txt", encoding="utf-8")
text = loader.load()

In [5]:
text

[Document(metadata={'source': '../Data/ZeitOnlineLandkartenA.txt'}, page_content='Landkarten mit Mehrwert\nOb als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen. \n\nZEIT ONLINE stellt einige der interessantesten Dienste vor.\n\nDie Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen. Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen. Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum no

In [6]:
print("Returned list contains %d Document objects"%len(text))

Returned list contains 1 Document objects


In [7]:
print("Metadata of the single document-object: ",text[0].metadata)

Metadata of the single document-object:  {'source': '../Data/ZeitOnlineLandkartenA.txt'}


In [8]:
print("Page Content of the single document-object: \n",text[0].page_content)

Page Content of the single document-object: 
 Landkarten mit Mehrwert
Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen. 

ZEIT ONLINE stellt einige der interessantesten Dienste vor.

Die Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen. Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen. Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus. Dank d

## Accessing CSV Files

In [9]:
from langchain_community.document_loaders.csv_loader import CSVLoader

# loader = CSVLoader(file_path='../Data/healthcare_dataset.csv')  # comma-separated file
loader = CSVLoader(file_path='../Data/imdb/labeledTrainData.tsv', csv_args={'delimiter': '\t'})  # tab-separated file
csvdata = loader.load()

As can be seen below, the `CSVLoader` returns one `Document` object for each row of the CSV file:

In [10]:
print("Returned list contains %d Document objects"%len(csvdata))

Returned list contains 25000 Document objects


In [11]:
print(csvdata[0].metadata)

{'source': '../Data/imdb/labeledTrainData.tsv', 'row': 0}


In [12]:
print(csvdata[0].page_content)

id: 5814_8
sentiment: 1
review: With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual featu
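
The outputs above suggest how each row is turned into a document: the page content holds one `column: value` line per field, and the metadata stores the source file and the row index. The following sketch reproduces this layout with the standard `csv` module. It is an illustration under these assumptions, not the `CSVLoader` implementation; `rows_to_documents` is a hypothetical helper and the two sample rows are made up.

```python
import csv
import io

# Tiny tab-separated sample with made-up rows; the column names follow the output above.
tsv_text = "id\tsentiment\treview\n1_a\t1\tGreat film\n2_b\t0\tToo long\n"


def rows_to_documents(handle, source, delimiter="\t"):
    """Turn each data row into a dict with 'page_content' and 'metadata'."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(handle, delimiter=delimiter)):
        # One "column: value" line per field of the row.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


csv_docs = rows_to_documents(io.StringIO(tsv_text), source="sample.tsv")

print(len(csv_docs))
print(csv_docs[0]["metadata"])
print(csv_docs[0]["page_content"])
```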

## Accessing All Files in a Directory

In [21]:
#!pip install unstructured
#!pip install python-magic-bin
#!pip install --upgrade nltk

In [3]:
#import nltk
#nltk.download()

In [4]:
from langchain_community.document_loaders import DirectoryLoader

In [5]:
# Load every .txt file in the directory; show_progress displays a progress bar
loader = DirectoryLoader('../Data/textdir', glob="*.txt", show_progress=True)
textdocs = loader.load()

100%|█████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]


In [7]:
len(textdocs)

2

In [10]:
print(textdocs[1].page_content[:500])

PROJECT GUTENBERG AND DUNCAN RESEARCH SHAREWARE

(c)1991

Project Gutenberg has made arrangements with Duncan Research for the distribution of Duncan Research Electronic Library text. No money is solicited by Project Gutenberg. All donations go to:

Barbara  Duncan Duncan Research P.O. Box  2782 Champaign,   IL 61825 - 2782

Please, if you send in a request for information, donate enough, or more than enough to cover the cost of writing, printing, etc. as well as the cost of postage.

This is Sh


## Accessing HTML

Below we demonstrate two loaders provided by LangChain that can be applied to load web pages:
* The `WebBaseLoader` class returns a single document for each provided URL. The page content of the returned `Document` object contains the entire text of the web page, so a parser like `BeautifulSoup` is required to extract dedicated elements of the HTML.
* The `UnstructuredURLLoader` can be configured to return the individual elements of the HTML, each as a separate `Document` object. In this way, for example, only the elements that contain *narrative text* can be selected, as demonstrated below.
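
To illustrate what extracting dedicated elements from raw HTML means, the following sketch collects only the text of `<p>` elements. The text above names `BeautifulSoup` for this job; to stay dependency-free, the sketch uses the standard library's `html.parser` as a stand-in. `ParagraphExtractor` is a hypothetical helper and the HTML snippet is made up.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Close the paragraph and store its text if it is not empty.
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = []
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


# Made-up page with navigation clutter around two paragraphs.
html = """
<html><body>
  <nav><a href="/courses">Explore Courses</a><a href="/batch">The Batch</a></nav>
  <h1>Example title</h1>
  <p>First paragraph of the <b>article</b>.</p>
  <p>Second paragraph.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(html)
paragraphs = parser.paragraphs
print(paragraphs)
```

Navigation links and headings are dropped because they lie outside `<p>` elements, which is the kind of filtering that a single, unstructured page text does not offer on its own.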

In [11]:
from langchain_community.document_loaders import UnstructuredURLLoader, WebBaseLoader
from langchain_core.documents import Document
#from unstructured.cleaners.core import remove_punctuation,clean,clean_extra_whitespace,replace_unicode_quotes

In [12]:
url="https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/"

### WebBaseLoader

In [13]:
loader = WebBaseLoader(url)
page = loader.load()
print("Length of returned list: ", len(page))


Length of returned list:  1


In [14]:

# Print every metadata entry of the single returned document
for key, value in page[0].metadata.items():
    print(key, "\t", value)

source 	 https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/
title 	 GaLore, A Memory-Saving Method for Pretraining and Fine-Tuning LLMs
description 	 Low-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method...
language 	 en


In [15]:
page[0].page_content

"GaLore, A Memory-Saving Method for Pretraining and Fine-Tuning LLMs✨ New course! Enroll in  Retrieval Optimization: From Tokenization to Vector QuantizationExplore CoursesAI NewsletterThe BatchAndrew's LetterData PointsML ResearchBlogCommunityForumEventsAmbassadorsAmbassador SpotlightResourcesCompanyAboutCareersContactStart LearningWeekly IssuesAndrew's LettersData PointsML ResearchBusinessScienceAI & SocietyCultureHardwareAI CareersAboutSubscribeThe BatchMachine Learning ResearchArticleLike LoRA, But for Pretraining GaLore, a memory-saving method for pretraining and fine-tuning LLMsMachine Learning ResearchGenerative AILarge Language Models (LLMs)PublishedJul 10, 2024Reading time3 min readShareLow-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method that achieves similar memory savings but works well for both fine-tuning and pretraining.What’s new:\xa0Jiawei Zhao and colleague

### UnstructuredURLLoader

In [16]:
# mode="elements" returns one Document per HTML element instead of one per URL
loader = UnstructuredURLLoader(urls=[url], mode="elements")
page = loader.load()
print("Length of returned list: ", len(page))


Length of returned list:  25


In [17]:
# Print every metadata entry of the first element
for key, value in page[0].metadata.items():
    print(key, "\t", value)

category_depth 	 0
languages 	 ['eng']
filetype 	 text/html
url 	 https://www.deeplearning.ai/the-batch/galore-a-memory-saving-method-for-pretraining-and-fine-tuning-llms/
category 	 Title


In [18]:
for p in page:
    print("Category: \t",p.metadata["category"])

Category: 	 Title
Category: 	 Title
Category: 	 Title
Category: 	 Title
Category: 	 Title
Category: 	 UncategorizedText
Category: 	 Title
Category: 	 NarrativeText
Category: 	 Title
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 ListItem
Category: 	 ListItem
Category: 	 ListItem
Category: 	 ListItem
Category: 	 NarrativeText
Category: 	 ListItem
Category: 	 ListItem
Category: 	 NarrativeText
Category: 	 NarrativeText
Category: 	 Title
Category: 	 Title
Category: 	 NarrativeText
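
For a quick overview of the element types, the categories can be counted. The sketch below works on the category list transcribed from the output above, so it runs without network access; with the real loader result, the same count could be obtained from `p.metadata["category"]` for each `p` in `page`.

```python
from collections import Counter

# Categories transcribed from the output above, in document order.
categories = (
    ["Title"] * 5
    + ["UncategorizedText", "Title", "NarrativeText", "Title"]
    + ["NarrativeText"] * 4
    + ["ListItem"] * 4
    + ["NarrativeText"]
    + ["ListItem"] * 2
    + ["NarrativeText"] * 2
    + ["Title"] * 2
    + ["NarrativeText"]
)

category_counts = Counter(categories)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

Of the 25 elements, 9 are categorized as `NarrativeText`; these are the elements that the function in the next cell keeps.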


In [19]:
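def generate_document(url):
    """Return a LangChain Document with the narrative text of the given URL."""
    # Load the page as a list of elements, one Document per HTML element.
    loader = UnstructuredURLLoader(urls=[url], mode="elements")
    elements = loader.load()
    # Keep only the elements categorized as narrative text.
    selected_elements = [e for e in elements if e.metadata["category"] == "NarrativeText"]
    # Join the selected elements into one text and wrap it in a single Document.
    full_clean = " ".join(e.page_content for e in selected_elements)
    return Document(page_content=full_clean, metadata={"source": url})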

In [20]:
doc=generate_document(url=url)

In [21]:
print(doc.page_content)

3 min read Low-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method that achieves similar memory savings but works well for both fine-tuning and pretraining. What’s new: Jiawei Zhao and colleagues at California Institute of Technology, Meta, University of Texas at Austin, and Carnegie Mellon proposed Gradient Low-Rank Projection (GaLore), an optimizer modification that saves memory during training by reducing the sizes of optimizer states. They used this approach to pretrain a 7B parameter transformer using a consumer-grade Nvidia RTX 4090 GPU. Key insight: LoRA saves memory during training by learning to approximate a change in the weight matrix of each layer in a neural network using the product of two smaller matrices. This approximation results in good performance when fine-tuning (though not quite as good as fine-tuning all weights) but worse performance when pretraining fr

## Accessing PDF

In [22]:
#!pip install pypdf

In [23]:
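from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../Data/2015-lecun.pdf")
# load_and_split() loads the PDF and splits it into chunks with LangChain's default text splitter
pages = loader.load_and_split()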

In [24]:
pages[0]

Document(metadata={'source': '../Data/2015-lecun.pdf', 'page': 0}, page_content='See discussions, st ats, and author pr ofiles f or this public ation at : https://www .researchgate.ne t/public ation/277411157\nDeep Learning\nArticle \xa0\xa0 in\xa0\xa0Nature · May 2015\nDOI: 10.1038/nat ure14539 \xa0·\xa0Sour ce: PubMed\nCITATIONS\n57,367READS\n199,643\n3 author s, including:\nY. Bengio\nUniv ersité de Montr éal\n894 PUBLICA TIONS \xa0\xa0\xa0475,801  CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nAll c ontent f ollo wing this p age was uplo aded b y Y. Bengio  on 28 A ugust 2015.\nThe user has r equest ed enhanc ement of the do wnlo aded file.')

In [25]:
pages[1].metadata

{'source': '../Data/2015-lecun.pdf', 'page': 1}

In [26]:
print(pages[1].page_content[:1000])

1Facebook AI Research, 770 Broadway, New York, New York 10003 USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations 
Research Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128  Centre-Ville STN Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 
94043, USA. 5Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada.Machine-learning technology powers many aspects of modern 
society: from web searches to content filtering on social net -
works to recommendations on e-commerce websites, and 
it is increasingly present in consumer products such as cameras and 
smartphones. Machine-learning systems are used to identify objects 
in images, transcribe speech into text, match news items, posts or 
products with users’ interests, and select relevant results of search. 
Increasingly, these applications make use of a class 