> ## Documentation Index
> Fetch the complete documentation index at: https://docs.hellotars.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Learn how to create a Knowledge Base by scraping webpages

> Scrape and index content from websites for your knowledge base

Transform your website content into intelligent knowledge bases that power your AI Agents. This method automatically discovers, scrapes, and processes web pages to create comprehensive knowledge repositories.

### What you get

* **Automatic content discovery** from your website structure
* **Text extraction** from web pages and structured data
* **Link following** to discover related content
* **Content organization** and indexing for optimal search
* **Dynamic content support** for JavaScript-heavy sites

### Perfect for

* **Company websites** with product information and documentation
* **Help centers** and support documentation
* **Blog content** and articles
* **Documentation sites** and wikis
* **E-commerce sites** with product catalogs

### Step-by-step creation process

Navigate to the Create Knowledge Base section and click on the `Website URL` card to begin creating a website-based Knowledge Base.

[Image: Create Knowledge Base page showing Website URL card selection]

This opens the Knowledge Base configuration modal where you can set up your website scraping parameters.

Set up the fundamental information for your Knowledge Base in the basic configuration section.

[Image: Basic configuration showing Knowledge Base name and URL fields]

**Required fields:**

* **Knowledge Base Name**: Enter a descriptive name for your Knowledge Base
* **Website Base URL**: Provide the base URL of the website you want to scrape

The system will use this URL as the starting point for content discovery and crawling.

Customize the scraping behavior and parameters for optimal content extraction.

[Image: Advanced settings showing scraping configuration options]

**Advanced configuration options:**

* **Number of URLs to scrape**: Specify the maximum number of pages to crawl. The default is 200, with a maximum limit of 10,000.
* **Crawl depth**: Set the number of levels deep to crawl within the website hierarchy. The default is 20, with a maximum of 100.
* **Custom user agent**: Provide an optional custom identification string for web requests.
* **Base URL crawling**: Limit crawling to a single level by using only base URLs, ignoring the depth setting.
* **Browser rendering**: Enable this option to render content using a browser for sites with dynamic JavaScript.
* **Scraper engine**: Choose between the legacy Scrapy engine and the modern Crawl4AI engine.

**Optimal settings**: For most websites, default settings work well. Use browser rendering for JavaScript-heavy sites and adjust depth based on your site structure.

Click the `Add Knowledge Base` button to create your Knowledge Base with the configured settings.

[Image: Add Knowledge Base button to proceed with creation]
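To make the limits in the advanced options concrete, the sketch below models them as a small Python class that rejects out-of-range values. It is only an illustration: `ScrapeSettings` and its field names are hypothetical and not part of any Tars API, the lower bound of 1 is an assumption, and the default engine shown is a guess, since the options above list the two engines without naming a default.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults and maximums as stated in the advanced configuration options.
DEFAULT_MAX_URLS, MAX_URLS_LIMIT = 200, 10_000
DEFAULT_DEPTH, MAX_DEPTH_LIMIT = 20, 100


@dataclass
class ScrapeSettings:
    """Hypothetical container for the advanced options (not a Tars API)."""

    max_urls: int = DEFAULT_MAX_URLS
    crawl_depth: int = DEFAULT_DEPTH
    user_agent: Optional[str] = None  # optional custom identification string
    base_url_only: bool = False       # single-level crawl, depth is ignored
    use_browser: bool = False         # render JavaScript-heavy pages
    engine: str = "crawl4ai"          # or "scrapy"; this default is an assumption

    def __post_init__(self) -> None:
        # The lower bound of 1 is an assumption; only the maximums are documented.
        if not 1 <= self.max_urls <= MAX_URLS_LIMIT:
            raise ValueError(f"max_urls must be between 1 and {MAX_URLS_LIMIT}")
        if not 1 <= self.crawl_depth <= MAX_DEPTH_LIMIT:
            raise ValueError(f"crawl_depth must be between 1 and {MAX_DEPTH_LIMIT}")
        if self.engine not in ("scrapy", "crawl4ai"):
            raise ValueError("engine must be 'scrapy' or 'crawl4ai'")

    def effective_depth(self) -> int:
        """Base URL crawling limits the crawl to one level."""
        return 1 if self.base_url_only else self.crawl_depth


settings = ScrapeSettings(max_urls=500, use_browser=True)
print(settings.effective_depth())
```

Validating at construction time mirrors what the form does in the UI: an invalid value is caught before any crawling starts.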

Your Knowledge Base is created and you're redirected to the configuration page where you can manage data resources.

View your newly created Knowledge Base configuration page with the added data resource.

[Image: Knowledge base configuration page showing added data source]

**Configuration overview:**

* **Data resources**: Displays the base URL of the website you have added.
* **Fetch links**: Initiates the scraping process.
* **Indexing type**: Choose from Sentence level Indexing or Section level Indexing. The default is Sentence level Indexing.
* **Training status**: Shows the current processing state of your Knowledge Base.
* **Load method**: Choose how the system retrieves web content. Select between Direct request, which fetches data directly from the server, or Browser environment, which simulates a browser to render and extract content. The default is Direct request if `Use browser` is disabled.
* **Menu Actions**: Provides options to delete the added data resource.

This page serves as your central hub for managing all aspects of your Knowledge Base.

Click the `Fetch Links` button to begin the web scraping process.

[Image: Active fetching process showing real-time scraping progress and logs]
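The two indexing types in the configuration overview are not described in detail here, so the following is only one plausible reading of the difference: sentence-level indexing stores many small units, while section-level indexing stores fewer, larger ones. The functions `sentence_units` and `section_units` are hypothetical names, and treating a blank line as a section boundary is an assumption.

```python
import re


def sentence_units(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def section_units(text: str) -> list[str]:
    """Split text into sections separated by one or more blank lines (assumed boundary)."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


SAMPLE = (
    "Plans start at $10. Upgrades are instant.\n\n"
    "Refunds take five days. Contact support."
)

print(len(sentence_units(SAMPLE)), "sentence units")
print(len(section_units(SAMPLE)), "section units")
```

Smaller units tend to give more precise matches for short factual questions, while larger units keep more surrounding context in each retrieved piece.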

**Fetching process:**

* **Content discovery**: The system identifies and navigates through web pages.
* **Text extraction**: Content is extracted from each identified page.
* **Data processing**: Extracted content is cleaned and organized.
* **Progress monitoring**: Real-time logs display scraping activity and status updates.

**Scraping Time**: The fetching process may take several minutes depending on website size and complexity. Monitor the progress logs for detailed status updates.

After fetching completes, review the extracted content and select pages for training.

[Image: Post-fetch state showing completed scraping with content ready for training]
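The content discovery step, together with the URL and depth limits set earlier, can be pictured as a bounded breadth-first traversal. The sketch below is a simplified model rather than the actual scraper: `crawl` and `get_links` are hypothetical names, the site is an in-memory dictionary instead of real HTTP requests, and counting the start page as depth 0 is an assumption.

```python
from collections import deque

# Toy site: each URL maps to the links found on that page (illustrative data).
SITE = {
    "https://example.com/": ["https://example.com/docs", "https://example.com/blog"],
    "https://example.com/docs": ["https://example.com/docs/setup", "https://example.com/"],
    "https://example.com/blog": [],
    "https://example.com/docs/setup": [],
}


def crawl(start_url, get_links, max_urls=200, max_depth=20):
    """Discover pages breadth-first, stopping at max_urls pages or max_depth levels.

    get_links is a callable returning the links on a page; a real crawler
    would fetch and parse HTML there.
    """
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("https://example.com/", lambda url: SITE.get(url, []), max_urls=3)
print(pages)
```

Breadth-first order means that when the URL limit is reached, the pages closest to the base URL have been collected first, which is usually what you want from a partial crawl.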

**Review and selection:**

* **Extracted pages**: Access a list of all discovered and scraped web pages, each with clickable links.
* **Word count**: Review the number of words extracted from each page.
* **Page selection**: Select specific pages to include in the training process.
* **Pre-filtering**: Filter pages based on content relevance and word count to optimize training preparation.

All content has been successfully extracted, and you can now select the desired pages for the training phase.

Initiate the training process and monitor the real-time progress as your content is processed and indexed.

[Image: Pre-training state showing content ready for Knowledge Base training]
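As a rough illustration of the word-count side of pre-filtering, the snippet below keeps only pages with enough extracted text. The `ScrapedPage` type, the `select_pages` helper, and the 50-word threshold are all assumptions made for the example; the product does not document a specific cutoff here.

```python
from dataclasses import dataclass


@dataclass
class ScrapedPage:
    """Hypothetical record for one fetched page."""

    url: str
    text: str

    @property
    def word_count(self) -> int:
        # Count whitespace-separated tokens as words.
        return len(self.text.split())


def select_pages(pages, min_words=50):
    """Keep pages with at least min_words words (the threshold is an assumption)."""
    return [page for page in pages if page.word_count >= min_words]


fetched = [
    ScrapedPage("https://example.com/pricing", "word " * 120),
    ScrapedPage("https://example.com/login", "Sign in to continue"),
]
for page in select_pages(fetched):
    print(page.url, page.word_count)
```

Dropping near-empty pages such as login screens before training keeps low-value text out of the index.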

**Training phase:**

* **Initiate training**: Begin by clicking the `Train` button to start the AI processing.
* **Data reading**: The system reads and imports the stored data resources for processing.
* **URL filtering**: Unnecessary URLs are filtered out to ensure only relevant data is processed.
* **Text chunking**: The text is divided into smaller, manageable chunks for efficient processing.
* **Embedding generation**: These text chunks are converted into vector embeddings using AI models.
* **Vector storage**: The generated embeddings are stored in vector databases like Qdrant and Weaviate for efficient retrieval.
* **Index optimization**: The system optimizes the index to enhance search and retrieval performance.

Once the training is complete, your Knowledge Base will be fully functional and ready for integration with AI Agents.

To review the content after training, click any fetched link to view the text extracted from that page.

[Image: Preview of content ready for review]
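The chunking, embedding, and storage steps of the training phase can be sketched end to end in a few lines. Everything here is a stand-in chosen so the example runs with the standard library alone: `toy_embed` is a hashed bag-of-words vector, not a real embedding model, and `InMemoryVectorStore` is a plain Python list, not Qdrant or Weaviate. The chunk size and overlap are arbitrary assumptions.

```python
import hashlib
import math


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def toy_embed(text, dim=256):
    """Hash each word into one of `dim` buckets, then L2-normalize the counts.

    A deterministic stand-in for a real embedding model.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Minimal stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        self.items.append((toy_embed(chunk), chunk))

    def search(self, query, k=3):
        """Return the k chunks with the highest cosine similarity to the query."""
        q = toy_embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for vec, chunk in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]


store = InMemoryVectorStore()
page_text = "Refunds are processed within five business days. " * 30
for chunk in chunk_words(page_text, size=40, overlap=8):
    store.add(chunk)
print(len(store.items), "chunks stored")
```

Overlapping windows are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk.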

**Content viewing:**

* **Preview content**: Click on the links to view detailed content extracted from each page.
* **Content analysis**: Evaluate the quality and relevance of the extracted data for further refinement.

Optionally add more data sources to expand your Knowledge Base content.

[Image: Add new data source option for expanding Knowledge Base]

**Additional data sources:**

* **Add data resource**: Click to add more websites or content sources.
* **Same configuration**: Use the same modal interface for additional sources.
* **Multiple sources**: Combine content from multiple websites or URLs.
* **Unified knowledge**: All sources contribute to a single comprehensive Knowledge Base.

**Expanding content**: Adding multiple data sources creates more comprehensive knowledge bases with broader coverage of your domain.

Process and train any newly added data sources to integrate them into your Knowledge Base.

[Image: Training process for newly added data sources]

**Multi-source training:**

* **Fetch new content**: Scrape additional data sources.
* **Integrate content**: Combine with the existing Knowledge Base.
* **Retrain system**: Update embeddings with new content.
* **Unified knowledge**: All sources work together seamlessly.

[Image: Completed training for all data sources]
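One way to picture how several data resources feed a single Knowledge Base is a merge keyed by page URL. This is a conceptual sketch only: `merge_sources` is a hypothetical helper, and keeping the first copy of a page that appears under two sources is an assumption about deduplication, not documented behavior.

```python
def merge_sources(sources):
    """Combine pages from several data resources into one collection.

    sources maps a base URL to a list of (page_url, text) pairs. When the same
    page URL appears under more than one source, the first copy is kept
    (an assumed policy).
    """
    merged = {}
    for base_url, pages in sources.items():
        for page_url, text in pages:
            merged.setdefault(page_url, {"text": text, "source": base_url})
    return merged


sources = {
    "https://example.com": [
        ("https://example.com/pricing", "Plans start at $10."),
    ],
    "https://docs.example.com": [
        ("https://docs.example.com/setup", "Install the widget."),
        ("https://example.com/pricing", "Duplicate copy of the pricing page."),
    ],
}
knowledge = merge_sources(sources)
print(len(knowledge), "unique pages")
```

After a merge like this, retraining would re-run chunking and embedding over the combined collection so that all sources are searchable together.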

All data sources have been successfully trained and integrated into your Knowledge Base.

Your website-based Knowledge Base is now complete and available in your knowledge bases library.

[Image: Completed Knowledge Base showing in My Knowledge Bases with status and data sources]

**Knowledge base details:**

* **Name and description**: Your configured Knowledge Base information
* **Creation date**: When the Knowledge Base was created
* **Training status**: Confirmed as trained and ready
* **Data sources count**: Number of websites/content sources included
* **Configure button**: Access to modify settings and add more sources

**Ready to use**: Your Knowledge Base is immediately available for connecting to AI Agents and providing intelligent responses based on your website content.