# Learn how to create a Knowledge Base by scraping webpages

> Scrape and index content from websites for your knowledge base

Transform your website content into intelligent knowledge bases that power your AI Agents. This method automatically discovers, scrapes, and processes web pages to create comprehensive knowledge repositories.

### What you get

* **Automatic content discovery** from your website structure
* **Text extraction** from web pages and structured data
* **Link following** to discover related content
* **Content organization** and indexing for optimal search
* **Dynamic content support** for JavaScript-heavy sites

### Perfect for

* **Company websites** with product information and documentation
* **Help centers** and support documentation
* **Blog content** and articles
* **Documentation sites** and wikis
* **E-commerce sites** with product catalogs

### Step-by-step creation process

<Steps>
  <Step title="Select Website data source">
    Navigate to the Create Knowledge Base section and click on the `Website URL` card to begin creating a website-based Knowledge Base.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-1.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=b0f982b06137b285a1605e7518d42b0d" alt="Create Knowledge Base page showing Website URL card selection" width="1968" height="981" data-path="images/dashboard/agent/knowledge/website/step-1.png" />
    </Frame>

    <Check>
      This opens the Knowledge Base configuration modal where you can set up your website scraping parameters.
    </Check>
  </Step>

  <Step title="Configure basic settings">
    Set up the fundamental information for your Knowledge Base in the basic
    configuration section.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-2-basic.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=164488f3b13e330fbc716ee14fd8030f" alt="Basic configuration showing Knowledge Base name and URL fields" width="1930" height="981" data-path="images/dashboard/agent/knowledge/website/step-2-basic.png" />
    </Frame>

    **Required fields:**

    * **Knowledge Base Name**: Enter a descriptive name for your Knowledge Base
    * **Website Base URL**: Provide the base URL of the website you want to scrape

    <Info>
      The system will use this URL as the starting point for content discovery and
      crawling.
    </Info>
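
To make the role of the base URL concrete, here is a minimal sketch of how a crawler might resolve discovered links against a starting URL and decide whether they belong to the same site. The function names `normalize` and `in_scope` are hypothetical, and the scoping rule (same host, path under the base path) is an assumption for illustration; the Tars crawler may apply different rules.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(base_url: str, href: str) -> str:
    """Resolve a link against the base URL and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def in_scope(base_url: str, url: str) -> bool:
    """True when `url` is on the same host and under the base URL's path.

    Assumed rule for illustration only, not confirmed Tars behavior.
    """
    base, candidate = urlparse(base_url), urlparse(url)
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    same_host = candidate.netloc == base.netloc
    return same_host and (candidate.path + "/").startswith(base_path)


if __name__ == "__main__":
    start = "https://example.com/docs/"
    for href in ["setup#install", "/pricing", "https://other.example.org/"]:
        url = normalize(start, href)
        print(url, in_scope(start, url))
```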
  </Step>

  <Step title="Configure advanced scraping settings">
    Customize the scraping behavior and parameters for optimal content extraction.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-2-advanced.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=32e8a3f020a0e7d00a368f38aad70c25" alt="Advanced settings showing scraping configuration options" width="1930" height="981" data-path="images/dashboard/agent/knowledge/website/step-2-advanced.png" />
    </Frame>

    **Advanced configuration options:**

    * **Number of URLs to scrape**: Specify the maximum number of pages to crawl. The default is 200, with a maximum limit of 10,000.
    * **Crawl depth**: Set the number of levels deep to crawl within the website hierarchy. The default is 20, with a maximum of 100.
    * **Custom user agent**: Provide an optional custom identification string for web requests.
    * **Base URL crawling**: Limit crawling to a single level by using only base URLs, ignoring the depth setting.
    * **Browser rendering**: Enable this option to render content using a browser for sites with dynamic JavaScript.
    * **Scraper engine**: Choose between the legacy Scrapy engine and the modern Crawl4AI engine.

    <Tip>
      **Optimal settings**: For most websites, default settings work well. Use
      browser rendering for JavaScript-heavy sites and adjust depth based on your
      site structure.
    </Tip>
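
The limits listed above can be captured in a small settings object, which is handy if you plan crawl parameters in a script before entering them in the dashboard. This is a sketch only: `ScrapeSettings` and its field names are hypothetical and do not correspond to a Tars API, and the lower bound of 1 is an assumption, since the options above only state defaults and maxima.

```python
from dataclasses import dataclass
from typing import List, Optional

# Defaults and maxima quoted in the option list above.
DEFAULT_MAX_URLS, MAX_URLS_LIMIT = 200, 10_000
DEFAULT_DEPTH, MAX_DEPTH_LIMIT = 20, 100


@dataclass
class ScrapeSettings:
    """Hypothetical container for the advanced options; not a Tars API."""

    max_urls: int = DEFAULT_MAX_URLS
    crawl_depth: int = DEFAULT_DEPTH
    user_agent: Optional[str] = None
    base_url_only: bool = False
    use_browser: bool = False

    def validation_errors(self) -> List[str]:
        """Return one message per out-of-range value (empty list if valid)."""
        errors = []
        # The lower bound of 1 is an assumption; only maxima are documented.
        if not 1 <= self.max_urls <= MAX_URLS_LIMIT:
            errors.append(f"max_urls must be between 1 and {MAX_URLS_LIMIT}")
        if not 1 <= self.crawl_depth <= MAX_DEPTH_LIMIT:
            errors.append(f"crawl_depth must be between 1 and {MAX_DEPTH_LIMIT}")
        return errors

    def effective_depth(self) -> int:
        """Base URL crawling limits the crawl to one level, ignoring depth."""
        return 1 if self.base_url_only else self.crawl_depth


if __name__ == "__main__":
    settings = ScrapeSettings(max_urls=500, crawl_depth=3, use_browser=True)
    print(settings.validation_errors(), settings.effective_depth())
```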
  </Step>

  <Step title="Add Knowledge Base">
    Click the `Add Knowledge Base` button to create your Knowledge Base with the
    configured settings.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-3-add-knowledge.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=eaa11fd927707de977c3d109df335ef5" alt="Add Knowledge Base button to proceed with creation" width="1930" height="981" data-path="images/dashboard/agent/knowledge/website/step-3-add-knowledge.png" />
    </Frame>

    <Check>
      Your Knowledge Base is created and you're redirected to the configuration
      page where you can manage data resources.
    </Check>
  </Step>

  <Step title="Review Knowledge Base configuration">
    View your newly created Knowledge Base configuration page with the added data
    resource.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-4-overview.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=acda60247d3ae40c574d4af8963d16e0" alt="Knowledge base configuration page showing added data source" width="1965" height="981" data-path="images/dashboard/agent/knowledge/website/step-4-overview.png" />
    </Frame>

    **Configuration overview:**

    * **Data resources**: Displays the base URL of the website you have added.
    * **Fetch links**: Initiates the scraping process.
    * **Indexing type**: Choose from Sentence level Indexing or Section level Indexing. The default is Sentence level Indexing.
    * **Training status**: Shows the current processing state of your Knowledge Base.
    * **Load method**: Choose how the system retrieves web content: Direct request fetches data directly from the server, while Browser environment simulates a browser to render and extract content. Direct request is the default when `Use browser` is disabled.
    * **Menu actions**: Provides options to delete the added data resource.

    <Info>
      This page serves as your central hub for managing all aspects of your
      Knowledge Base.
    </Info>
  </Step>

  <Step title="Initiate content fetching">
    Click the `Fetch Links` button to begin the web scraping process.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-5-while-fetch.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=e12a92c4484aa30f110f1008affa4473" alt="Active fetching process showing real-time scraping progress and logs" width="1968" height="981" data-path="images/dashboard/agent/knowledge/website/step-5-while-fetch.png" />
    </Frame>

    **Fetching process:**

    * **Content discovery**: The system identifies and navigates through web pages.
    * **Text extraction**: Content is extracted from each identified page.
    * **Data processing**: Extracted content is cleaned and organized.
    * **Progress monitoring**: Real-time logs display scraping activity and status updates.

    <Warning>
      **Scraping time**: The fetching process may take several minutes depending
      on website size and complexity. Monitor the progress logs for detailed
      status updates.
    </Warning>
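
The discovery phase described above can be pictured as a breadth-first traversal bounded by the page limit and the crawl depth. The sketch below is an illustration under assumptions, not the Tars implementation: `crawl_order` is a hypothetical function, the `links` dictionary stands in for real page fetching so the example runs offline, and depth is counted from 0 at the starting URL, which may differ from how the product counts levels.

```python
from collections import deque
from typing import Dict, List


def crawl_order(
    links: Dict[str, List[str]],
    start_url: str,
    max_urls: int = 200,
    max_depth: int = 20,
) -> List[str]:
    """Return URLs in breadth-first order, honoring both limits.

    `links` maps each URL to the URLs found on that page.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: List[str] = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:  # skip pages that are already queued
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


if __name__ == "__main__":
    site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": ["/a/1/x"]}
    print(crawl_order(site, "/", max_depth=1))
```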
  </Step>

  <Step title="Review and select pages">
    After fetching completes, review the extracted content and select pages for
    training.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-7-pre-train.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=cf14d061ccc975c41015df326fe5412c" alt="Post-fetch state showing completed scraping with content ready for training" width="1968" height="981" data-path="images/dashboard/agent/knowledge/website/step-7-pre-train.png" />
    </Frame>

    **Review and selection:**

    * **Extracted pages**: Access a list of all discovered and scraped web pages, each with clickable links.
    * **Word count**: Review the number of words extracted from each page.
    * **Page selection**: Select specific pages to include in the training process.
    * **Pre-filtering**: Filter pages based on content relevance and word count to optimize training preparation.

    <Check>
      All content has been successfully extracted, and you can now select the
      desired pages for the training phase.
    </Check>
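
If you want to reason about which pages are worth training on before selecting them in the dashboard, the word-count criterion above can be sketched as a simple filter. `ScrapedPage`, `select_pages`, the 50-word threshold, and the URL exclusion list are all hypothetical choices for illustration; the dashboard's own filters may behave differently.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class ScrapedPage:
    """Hypothetical record for one fetched page."""

    url: str
    text: str

    @property
    def word_count(self) -> int:
        # Whitespace-separated tokens; a rough proxy for the displayed count.
        return len(self.text.split())


def select_pages(
    pages: Iterable[ScrapedPage],
    min_words: int = 50,
    skip_url_parts: Sequence[str] = (),
) -> List[ScrapedPage]:
    """Keep pages with enough text whose URL contains no excluded part."""
    return [
        page
        for page in pages
        if page.word_count >= min_words
        and not any(part in page.url for part in skip_url_parts)
    ]


if __name__ == "__main__":
    pages = [
        ScrapedPage("https://example.com/pricing", "word " * 120),
        ScrapedPage("https://example.com/login", "Sign in"),
    ]
    for page in select_pages(pages):
        print(page.url, page.word_count)
```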
  </Step>

  <Step title="Training and monitoring">
    Initiate the training process and monitor the real-time progress as your
    content is processed and indexed.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-7-post-train.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=8af5e0bcaec80adf471dc29a4fc151ed" alt="Post-training state showing content after Knowledge Base training" width="1968" height="981" data-path="images/dashboard/agent/knowledge/website/step-7-post-train.png" />
    </Frame>

    **Training phase:**

    * **Initiate training**: Begin by clicking the `Train` button to start the AI processing.
    * **Data reading**: The system reads and imports the stored data resources for processing.
    * **URL filtering**: Unnecessary URLs are filtered out to ensure only relevant data is processed.
    * **Text chunking**: The text is divided into smaller, manageable chunks for efficient processing.
    * **Embedding generation**: These text chunks are converted into vector embeddings using AI models.
    * **Vector storage**: The generated embeddings are stored in vector databases like Qdrant and Weaviate for efficient retrieval.
    * **Index optimization**: The system optimizes the index to enhance search and retrieval performance.

    <Check>
      Once the training is complete, your Knowledge Base will be fully functional and ready for integration with AI Agents.
    </Check>
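
The chunking, embedding, and storage stages listed above can be illustrated with a toy pipeline. This is a sketch under stated assumptions rather than what Tars runs: the word-window chunker and its sizes are hypothetical, `embed` is a bag-of-words stand-in for a real embedding model, and a plain Python list replaces a vector database such as Qdrant or Weaviate. The overlap between chunks is a common design choice that keeps sentences spanning a boundary retrievable from either side.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into word windows of `chunk_size` sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and below chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(chunk: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(chunk.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(index: List[Tuple[str, Counter]], query: str, top_k: int = 1) -> List[str]:
    """Return the `top_k` chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _vec in ranked[:top_k]]


if __name__ == "__main__":
    chunks = ["refunds are processed in five days", "shipping takes two weeks"]
    index = [(chunk, embed(chunk)) for chunk in chunks]  # in-memory "vector store"
    print(search(index, "how long does shipping take"))
```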
  </Step>

  <Step title="View extracted content">
    To inspect the content extracted from a page after training completes,
    click that page's link in the list of fetched links.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-7-pre-view-content.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=5245a35bc50e29bd6f73d2e49d5242fc" alt="Preview of content ready for review" width="1968" height="981" data-path="images/dashboard/agent/knowledge/website/step-7-pre-view-content.png" />
    </Frame>

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-7-extracted-content.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=3cafb83b15b4d75fd2efa0df39c97943" alt="Detailed view of extracted content" width="1920" height="981" data-path="images/dashboard/agent/knowledge/website/step-7-extracted-content.png" />
    </Frame>

    **Content viewing:**

    * **Preview content**: Click on the links to view detailed content extracted from each page.
    * **Content analysis**: Evaluate the quality and relevance of the extracted data for further refinement.
  </Step>

  <Step title="Add additional data sources (optional)">
    Optionally add more data sources to expand your Knowledge Base content.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-8-new-datasource.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=5336b7810e5c90d329d8e8ba3d185c87" alt="Add new data source option for expanding Knowledge Base" width="1939" height="981" data-path="images/dashboard/agent/knowledge/website/step-8-new-datasource.png" />
    </Frame>

    **Additional data sources:**

    * **Add data resource**: Click to add more websites or content sources.
    * **Same configuration**: Use the same modal interface for additional sources.
    * **Multiple sources**: Combine content from multiple websites or URLs.
    * **Unified knowledge**: All sources contribute to a single comprehensive Knowledge Base.

    <Tip>
      **Expanding content**: Adding multiple data sources creates more
      comprehensive knowledge bases with broader coverage of your domain.
    </Tip>
  </Step>

  <Step title="Train additional content">
    Process and train any newly added data sources to integrate them into your
    Knowledge Base.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-9-fetch-and-train-new.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=fb603f7bc3e729e1956d363a09763304" alt="Training process for newly added data sources" width="1965" height="981" data-path="images/dashboard/agent/knowledge/website/step-9-fetch-and-train-new.png" />
    </Frame>

    **Multi-source training:**

    * **Fetch new content**: Scrape additional data sources.
    * **Integrate content**: Combine with the existing Knowledge Base.
    * **Retrain system**: Update embeddings with new content.
    * **Unified knowledge**: All sources work together seamlessly.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/k37W35HpBrLrDZ5b/images/dashboard/agent/knowledge/website/step-9-trained-new.png?fit=max&auto=format&n=k37W35HpBrLrDZ5b&q=85&s=2fe4c89dd7d86538c4c9233198b5f639" alt="Completed training for all data sources" width="1941" height="981" data-path="images/dashboard/agent/knowledge/website/step-9-trained-new.png" />
    </Frame>

    <Check>
      All data sources have been successfully trained and integrated into your
      Knowledge Base.
    </Check>
  </Step>

  <Step title="Knowledge base ready">
    Your website-based Knowledge Base is now complete and available in your knowledge bases library.

    <Frame>
      <img src="https://mintcdn.com/tars-c52ebe98/pzfXZlbe5u4r3EET/images/dashboard/agent/knowledge/website/step-10-kb-added-to-my-kbs.png?fit=max&auto=format&n=pzfXZlbe5u4r3EET&q=85&s=c048df4b85419b86006c93afef7c931c" alt="Completed Knowledge Base showing in My Knowledge Bases with status and data sources" width="1952" height="981" data-path="images/dashboard/agent/knowledge/website/step-10-kb-added-to-my-kbs.png" />
    </Frame>

    **Knowledge base details:**

    * **Name and description**: Your configured Knowledge Base information
    * **Creation date**: When the Knowledge Base was created
    * **Training status**: Confirmed as trained and ready
    * **Data sources count**: Number of websites/content sources included
    * **Configure button**: Access to modify settings and add more sources

    <Note>
      **Ready to use**: Your Knowledge Base is immediately available for connecting to AI Agents and providing intelligent responses based on your website content.
    </Note>
  </Step>
</Steps>
