A workaround for this until the feature is built out: use a website downloader to download the website pages as HTML files, convert the HTML files to TXT files, then upload the TXT files to the AI app and use the Instructions section to tell the bot to use those files to answer questions.
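If it helps, the HTML-to-TXT step can be scripted. Here is a minimal sketch in Python using BeautifulSoup, assuming the downloaded pages sit in a local folder; the site_pages and site_text folder names are just placeholders:

# Convert downloaded HTML pages to plain-text files for upload.
from pathlib import Path
from bs4 import BeautifulSoup

input_dir = Path("site_pages")    # folder of downloaded .html files (placeholder name)
output_dir = Path("site_text")    # where the .txt files will be written (placeholder name)
output_dir.mkdir(exist_ok=True)

for html_file in input_dir.glob("*.html"):
    html = html_file.read_text(encoding="utf-8", errors="ignore")
    soup = BeautifulSoup(html, "html.parser")
    # get_text() drops the markup; the separator and strip options keep the text readable
    text = soup.get_text(separator="\n", strip=True)
    (output_dir / f"{html_file.stem}.txt").write_text(text, encoding="utf-8")
    print(f"Converted {html_file.name}")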
Hey there @Dan4, and welcome to the Community!
Great suggestion, thanks for sharing!
If you have a lot of pages on your website, consider using a free AI code generator, like MS Visual Studio Code with MS Co-Pilot, to write web scraping code from a sitemap. In my case, I used my WooCommerce product sitemap (105 pages). The script output JSON, which I converted to a text file and then added to my Elfsight AI-bot files. You can query the chatbot here.
Here's the Python code with references…
To scrape an entire website using its sitemap and format the content for an LLM, you can modify your script to first fetch and parse the sitemap, extract all the URLs, and then scrape each URL. Here’s how you can do it:
Updated Script to Use a Sitemap
import requests
from bs4 import BeautifulSoup
import json
import xml.etree.ElementTree as ET

def fetch_sitemap_urls(sitemap_url):
    try:
        # Fetch the sitemap
        response = requests.get(sitemap_url)
        response.raise_for_status()
        # Parse the sitemap XML
        root = ET.fromstring(response.content)
        urls = [url.text for url in root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')]
        print(f"Found {len(urls)} URLs in the sitemap.")
        return urls
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the sitemap: {e}")
        return []

def scrape_website(url):
    try:
        # Send a GET request to the website
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the main content (modify the tag and class as needed)
        content = soup.find_all(['p', 'h1', 'h2', 'h3'])  # Extract paragraphs and headers
        # Clean and structure the content
        formatted_content = []
        for element in content:
            text = element.get_text(strip=True)
            if text:  # Skip empty text
                formatted_content.append({
                    "type": element.name,
                    "content": text
                })
        return formatted_content
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL {url}: {e}")
        return []

def scrape_from_sitemap(sitemap_url):
    # Fetch all URLs from the sitemap
    urls = fetch_sitemap_urls(sitemap_url)
    # Scrape each URL and collect the content
    all_content = []
    for url in urls:
        print(f"Scraping URL: {url}")
        content = scrape_website(url)
        if content:
            all_content.append({
                "source_url": url,
                "content": content
            })
    # Save all content to a JSON file
    output_file = "sitemap_scraped_content.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_content, f, indent=4, ensure_ascii=False)
    print(f"All content successfully scraped and saved to {output_file}")

# Example usage
if __name__ == "__main__":
    sitemap_url = input("Enter the sitemap URL: ")
    scrape_from_sitemap(sitemap_url)
How It Works:
- Fetch Sitemap: The `fetch_sitemap_urls` function fetches the sitemap XML and extracts all URLs using the `xml.etree.ElementTree` module.
- Scrape Each URL: The `scrape_website` function scrapes the content of each URL, extracting paragraphs (`p`) and headers (`h1`, `h2`, `h3`).
- Aggregate Content: The `scrape_from_sitemap` function iterates through all URLs, scrapes their content, and aggregates it into a list.
- Save to JSON: The aggregated content is saved to a JSON file (`sitemap_scraped_content.json`).
Example Output:
The JSON file will look like this:
[
    {
        "source_url": "https://example.com/page1",
        "content": [
            {
                "type": "h1",
                "content": "Page Title"
            },
            {
                "type": "p",
                "content": "This is a paragraph of text."
            }
        ]
    },
    {
        "source_url": "https://example.com/page2",
        "content": [
            {
                "type": "h2",
                "content": "Section Title"
            },
            {
                "type": "p",
                "content": "Another paragraph of text."
            }
        ]
    }
]
Notes:
- Sitemap URL: Ensure the sitemap URL is accessible (e.g., https://example.com/sitemap.xml).
- Tags to Extract: Modify the `soup.find_all()` call to include additional tags if needed.
- Respect Robots.txt: Ensure the website allows scraping by checking its `robots.txt` file (see the sketch after these notes).
- Error Handling: The script includes basic error handling for failed requests.
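On the Respect Robots.txt note, here is a small helper you could bolt onto the script above. It is only a sketch using Python's built-in urllib.robotparser, and the allowed_by_robots name is mine, not part of the script:

# Check robots.txt before scraping a URL.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

def allowed_by_robots(url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching the given URL."""
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}/", "robots.txt")
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return True  # no readable robots.txt; treating that as allowed is a judgment call
    return rp.can_fetch(user_agent, url)

# Inside the scraping loop you could then skip disallowed pages:
# if not allowed_by_robots(url):
#     continue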
This is exactly what I was doing for our chatbot from Elfsight… I am still in the testing phase. We pointed the bot at our website and then asked it for information on a specific artist, and it could not answer. So we have taken our list of artists available at www.metrogallerylincoln.com and are about to give it to the bot. Our list comes from our database, not the website, and has to be edited from 164k characters down to 28k characters because the list also includes the title of each piece. Once this is completed, it would be great for customers to be able to get answers about any artist or piece of art in the gallery just by asking the bot, and since our inventory updates constantly, it would be much easier for us to have the software update the bot.
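If the database export can be saved as a CSV, trimming that list down could be scripted as well. This is only a sketch under assumed names: inventory_export.csv and the artist column are stand-ins for whatever the database actually exports. It keeps one line per unique artist and drops the titles:

# Reduce a product export to a de-duplicated list of artist names.
import csv

artists = set()
with open("inventory_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        name = (row.get("artist") or "").strip()   # assumed column name
        if name:
            artists.add(name)

with open("artists.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(artists)))

print(f"Wrote {len(artists)} unique artist names to artists.txt")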
I see you have a webpage for each artist. That means you have, or should have, a sitemap for those pages. You can run the attached Python script and enter your artist pages' sitemap URL. The script will grab each page's h1, h2, h3, and paragraph text along with the page URLs and output it to a JSON file. Simply open the file in Notepad and save it as a text file (or automate that step, as sketched below), then upload the text file to your Elfsight chatbot.
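If you want to skip the Notepad step, the JSON can also be flattened into a text file with a few lines of Python. A sketch, assuming the sitemap_scraped_content.json layout shown earlier in the thread:

# Flatten sitemap_scraped_content.json into a plain-text file for upload.
import json

with open("sitemap_scraped_content.json", encoding="utf-8") as f:
    pages = json.load(f)

lines = []
for page in pages:
    lines.append(f"Source: {page['source_url']}")
    for item in page["content"]:
        lines.append(item["content"])
    lines.append("")  # blank line between pages

with open("sitemap_scraped_content.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Wrote text for {len(pages)} pages to sitemap_scraped_content.txt")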
Attached is the import_requests_sitemap.py file. I had MS Co-pilot write the code in the Visual Studio Code app, which is free. It is the same script as in my post above, which Co-pilot created to first fetch and parse the sitemap, extract all the URLs, and then scrape each URL; see the notes below that script.
In Visual Studio Code, tell Co-pilot to install Python, then run the script. The first time you run it, you may need to install the modules it imports (requests and beautifulsoup4, e.g. via `pip install requests beautifulsoup4`).
I wish we had a web page for each artist. We don't; we have them for only a few. What we do have is a list we pulled out of our database for each product (piece of art) available in our online store. We then deleted each title from the list and were left with a list of names. Less than 1 percent of those have their own web page. This whole thing is the reason we found this site in the first place: we were looking for a forum where artists could upload their own information and images, which would rapidly expand the art and artists available at Metro. We are also using Square, and it is a WYSIWYG setup, so we are not able to do much.
Your list of artists needs to include a URL for the page with their art. I see a bunch under Nebraska Artists. You then tell the bot to always include a URL in the reply. For those artists without a page, you feed the bot the list of names with whatever bio info you have. Here is a page that shows you how to access your sitemap.
For your site, it would be https://www.metrogallerylincoln.com/sitemap.xml.
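One thing to check first: that sitemap.xml may turn out to be a sitemap index (a list of child sitemaps, one per section) rather than a flat list of pages, and the script above only reads a flat urlset. Here is a small sketch for peeking at what the file contains; the inspect_sitemap name is mine:

# Report whether a sitemap URL is a flat urlset or a sitemap index,
# and list the child sitemaps if it is an index.
import requests
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def inspect_sitemap(sitemap_url):
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    locs = [loc.text for loc in root.iter(NS + "loc")]
    if root.tag == NS + "sitemapindex":
        print(f"Sitemap index with {len(locs)} child sitemaps:")
        for child in locs:
            print(" ", child)
    else:
        print(f"Flat sitemap with {len(locs)} page URLs.")

# Example, using the URL from this thread:
# inspect_sitemap("https://www.metrogallerylincoln.com/sitemap.xml")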