Here's a workaround until the feature is built out: use a website downloader to save the website pages as HTML files, convert the HTML files to TXT files, then upload the TXT files to the AI app and use the Instructions section to tell the bot to answer questions from those files.
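If you end up with a folder full of downloaded pages, the HTML-to-TXT step can be scripted. Here's a minimal sketch, assuming the pages were saved as .html files into a local folder named downloaded_site (both folder names are just placeholders):

import glob
import os

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Folder names are assumptions -- point these at wherever your downloader saved the pages.
input_folder = "downloaded_site"
output_folder = "txt_pages"
os.makedirs(output_folder, exist_ok=True)

for html_path in glob.glob(os.path.join(input_folder, "*.html")):
    with open(html_path, "r", encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Strip the markup and keep only the visible text
    text = soup.get_text(separator="\n", strip=True)
    txt_path = os.path.join(output_folder, os.path.splitext(os.path.basename(html_path))[0] + ".txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Converted {html_path} -> {txt_path}")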
Hey there, @Dan4 and welcome to the Community
Great suggestion, thanks for sharing
If your website has a lot of pages, consider using a free AI code generator, like Microsoft Visual Studio Code with Copilot, to write web-scraping code that works from a sitemap. In my case, I used my WooCommerce product sitemap (105 pages). The script outputs JSON, which I converted to a text file and then added to my Elfsight AI chatbot's files. You can query the chatbot here.
Here’s the Python code with references…
To scrape an entire website using its sitemap and format the content for an LLM, you can modify your script to first fetch and parse the sitemap, extract all the URLs, and then scrape each URL. Here’s how you can do it:
Updated Script to Use a Sitemap
import requests
from bs4 import BeautifulSoup
import json
import xml.etree.ElementTree as ET

def fetch_sitemap_urls(sitemap_url):
    try:
        # Fetch the sitemap
        response = requests.get(sitemap_url)
        response.raise_for_status()
        # Parse the sitemap XML
        root = ET.fromstring(response.content)
        urls = [url.text for url in root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')]
        print(f"Found {len(urls)} URLs in the sitemap.")
        return urls
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the sitemap: {e}")
        return []

def scrape_website(url):
    try:
        # Send a GET request to the website
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the main content (modify the tags and classes as needed)
        content = soup.find_all(['p', 'h1', 'h2', 'h3'])  # Extract paragraphs and headers
        # Clean and structure the content
        formatted_content = []
        for element in content:
            text = element.get_text(strip=True)
            if text:  # Skip empty text
                formatted_content.append({
                    "type": element.name,
                    "content": text
                })
        return formatted_content
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL {url}: {e}")
        return []

def scrape_from_sitemap(sitemap_url):
    # Fetch all URLs from the sitemap
    urls = fetch_sitemap_urls(sitemap_url)
    # Scrape each URL and collect the content
    all_content = []
    for url in urls:
        print(f"Scraping URL: {url}")
        content = scrape_website(url)
        if content:
            all_content.append({
                "source_url": url,
                "content": content
            })
    # Save all content to a JSON file
    output_file = "sitemap_scraped_content.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_content, f, indent=4, ensure_ascii=False)
    print(f"All content successfully scraped and saved to {output_file}")

# Example usage
if __name__ == "__main__":
    sitemap_url = input("Enter the sitemap URL: ")
    scrape_from_sitemap(sitemap_url)
How It Works:
- Fetch Sitemap: The fetch_sitemap_urls function fetches the sitemap XML and extracts all URLs using the xml.etree.ElementTree module.
- Scrape Each URL: The scrape_website function scrapes the content of each URL, extracting paragraphs (p) and headers (h1, h2, h3).
- Aggregate Content: The scrape_from_sitemap function iterates through all URLs, scrapes their content, and aggregates it into a list.
- Save to JSON: The aggregated content is saved to a JSON file (sitemap_scraped_content.json).
Example Output:
The JSON file will look like this:

[
    {
        "source_url": "https://example.com/page1",
        "content": [
            {
                "type": "h1",
                "content": "Page Title"
            },
            {
                "type": "p",
                "content": "This is a paragraph of text."
            }
        ]
    },
    {
        "source_url": "https://example.com/page2",
        "content": [
            {
                "type": "h2",
                "content": "Section Title"
            },
            {
                "type": "p",
                "content": "Another paragraph of text."
            }
        ]
    }
]
Notes:
- Sitemap URL: Ensure the sitemap URL is accessible (e.g., https://example.com/sitemap.xml).
- Tags to Extract: Modify the soup.find_all() call to include additional tags if needed.
- Respect robots.txt: Ensure the website allows scraping by checking its robots.txt file.
- Error Handling: The script includes basic error handling for failed requests.
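If you'd rather automate that robots.txt check than eyeball it, here's a minimal sketch using Python's built-in urllib.robotparser; the user-agent string is just a placeholder assumption:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="SitemapScraper"):
    # Build the robots.txt URL from the page URL's scheme and host
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

# Example: call this before scrape_website(url) and skip pages the site disallows
if __name__ == "__main__":
    test_url = "https://example.com/page1"
    print(f"Allowed to fetch {test_url}: {is_allowed(test_url)}")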
This is exactly what I was doing for our Elfsight chatbot… I am still in the testing phase. We directed the bot to our website and then asked it for information on a specific artist, and it could not answer. So we have taken our list of artists available at www.metrogallerylincoln.com and are about to give it to the bot. Our list comes from our database, not the website, and has to be edited down from 164k characters to 28k because it also includes the title of each piece. Once this is completed, it would be great for customers to get answers about any artist or piece of art in the gallery just by asking the bot, and since our inventory updates constantly, it would be much easier for us to have the software update the bot.
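For the trimming step, a short script can drop the titles and keep only the unique artist names. This is purely a hypothetical sketch: it assumes your database export is a plain text file named artist_export.txt with one "Artist Name - Piece Title" entry per line, which may not match your actual format:

# Hypothetical cleanup: reduce an "Artist - Title" export to unique artist names.
# The file name and the " - " separator are assumptions about the export format.
input_file = "artist_export.txt"
output_file = "artist_names.txt"

artists = []
seen = set()
with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        # Keep only the artist portion before the first " - " separator
        name = line.split(" - ", 1)[0].strip()
        if name and name not in seen:
            seen.add(name)
            artists.append(name)

with open(output_file, "w", encoding="utf-8") as f:
    f.write("\n".join(artists))

print(f"Kept {len(artists)} unique artist names in {output_file}")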
I see you have a web page for each artist. That means you have, or should have, a sitemap for those pages. You can run the attached Python script and enter your artist-pages sitemap URL. The script will grab each page's h1, h2, h3, and paragraph text along with the page URL, and output it to a JSON file. Simply open the file in Notepad and save it as a text file, then upload the text file to your Elfsight chatbot.
Attached is the import_requests_sitemap.py file. I had Copilot write the code in the Visual Studio Code app, which is free; it is the same script and notes I posted above. To scrape an entire website using its sitemap and format the content for an LLM, Copilot created a script that first fetches and parses the sitemap, extracts all the URLs, and then scrapes each URL.
The script scrapes an entire website using its sitemap and formats the content for an LLM: it first fetches and parses the sitemap, extracts all the URLs, and then scrapes each URL. In Visual Studio Code, ask Copilot to install Python, then run the script. On the first run you will need to install the third-party modules the script imports (requests and beautifulsoup4), for example with pip install requests beautifulsoup4.
(Attachment import_requests_sitemap.py is missing)
I wish we had a web page for each artist. We don't; we only have them for a few. What we do have is a list we pulled from our database of each product (piece of art) available in our online store. We then deleted each title from the list and were left with a list of names. Less than 1 percent of those have their own web page. This whole thing is the reason we found this site in the first place: we were looking for a platform where artists could upload their own information and images, which would rapidly expand the art and artists available at Metro. We are also using Square, and it is a WYSIWYG setup, so we are not able to do much.
Your list of artists needs to include a URL for the page with their art. I see a bunch under Nebraska Artists. You then tell the bot to always include a URL in the reply. For those artists without a page you feed the bot the list of names with whatever bio info you have. Here is a page that shows you how to access your sitemap.
For your site it would be https://www.metrogallerylincoln.com/sitemap.xml
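One thing to check: some sites publish a sitemap index (a sitemap that lists other sitemaps) at that address instead of a flat list of pages; whether that is the case for metrogallerylincoln.com is an assumption you would need to verify. If it is, a small sketch like this can expand the index into page URLs before handing them to the scraping script above:

import requests
import xml.etree.ElementTree as ET

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def expand_sitemap(sitemap_url):
    """Return page URLs, following one level of <sitemap> entries if this is a sitemap index."""
    root = ET.fromstring(requests.get(sitemap_url).content)
    page_urls = []
    # <sitemap><loc> entries point to child sitemaps; <url><loc> entries are pages
    for child_sitemap in root.iter(f'{NS}sitemap'):
        loc = child_sitemap.find(f'{NS}loc')
        if loc is not None:
            child_root = ET.fromstring(requests.get(loc.text).content)
            page_urls.extend(u.text for u in child_root.iter(f'{NS}loc'))
    if not page_urls:  # no <sitemap> entries, so it was already a flat sitemap
        page_urls = [u.text for u in root.iter(f'{NS}loc')]
    return page_urls

if __name__ == "__main__":
    for url in expand_sitemap("https://www.metrogallerylincoln.com/sitemap.xml"):
        print(url)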
I urgently need the same functionality. Either the bot reads the data directly from the website and all its subpages, or it searches a created XML file with the URLs for its results.
I also have the same request
Hey there and welcome to the Community
Many thanks for pointing us in this direction! We see the interest from the Community in this feature and understand its importance for the users. Hopefully the devs will be able to consider this request in the future.
We’ll keep you posted here in case of any changes
You need a sitemap that lists the links to all the pages you want the bot to include. Browse to YOURSITE.COM/sitemap.xml to see if it already exists. If not, your web hosting company's support team should be able to help you find or create one.
I’ve used Visual Studio Code with GitHub Copilot to write Python code that scrapes the sitemap and creates a JSON file of the contents. You can then upload that file to the Elfsight bot or convert it to a TXT file for the bot. In Instructions, you tell the bot to always include the URL link when referencing metadata from the pages.
Friends, I’m happy to say that we’ve started working on this feature!
It’s currently in the Design stage, and we’ll keep you updated on the progress here
I tried the AI chatbot and thought this was already a base feature. I was surprised it isn't! Once it's there, I'll go for it.
Greetings, @Tom_Fuchs and welcome to the Community
Many thanks for the feedback!
I completely understand your point. Let’s hope the devs will release it as soon as possible and we’ll keep you posted here on any progress
Attached is a Python file I developed with GitHub AI to scrape a website sitemap (best used on a website's content-pages sitemap). Load the file into Visual Studio Code or another Python editor. Running the code will prompt you for the website's sitemap URL and generate both a JSON file and a text file of the sitemap data, including all the metadata and the URL of each page. The script removes the brackets, braces, and extra lines that appear in the JSON file. Upload either one to your Elfsight chatbot to have it learn your site.
Regards,
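Since the attachment did not come through for some readers, here is a minimal sketch of that flattening step. It assumes you already have the sitemap_scraped_content.json file produced by the script posted earlier in this thread, and it writes the contents out as plain text without the JSON brackets and braces, keeping each page's URL:

import json

# Assumes the JSON produced by the scraping script posted earlier in this thread.
input_file = "sitemap_scraped_content.json"
output_file = "sitemap_scraped_content.txt"

with open(input_file, "r", encoding="utf-8") as f:
    pages = json.load(f)

lines = []
for page in pages:
    # Keep the page URL so the bot can include it in replies
    lines.append(f"URL: {page['source_url']}")
    for item in page["content"]:
        lines.append(f"{item['type']}: {item['content']}")
    lines.append("")  # blank line between pages

with open(output_file, "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Wrote {output_file}")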
Hi Salvatore, I don't see the attached file. Could you send it to me? What you've created is amazing and I'm looking forward to trying it.
Thank you very much
On Wed, 11 Jun 2025 at 21:48, Salvatore Salvia via Elfsight Community wrote:
Robofi.ai looks interesting. But my next step is to build my own chat bot and own it.
Update: the feature was moved to the Development stage!