Opensolr Web Crawler - The Comprehensive, AI-Driven Web Search Solution

Introducing the Opensolr Web Crawler

The Opensolr Web Crawler is an innovative and comprehensive solution designed for businesses and individuals looking to rapidly deploy an advanced, AI-powered search engine. With just a few clicks, you can create and deploy an intelligent, fully functional search interface powered by Apache Solr, tailored to your specific content needs.

What Makes the Opensolr Web Crawler Unique?

Instant Setup and Effortless Management

  • Quick Index Creation: Sign up, create a Solr index in seconds, add your URL(s), and let the Web Crawler take care of the rest.
  • Fully Automated: Automatically crawls and indexes web pages, full HTML content, and rich file formats such as PDF, DOC, DOCX, XLS, and various image formats.
  • Scheduled Crawling: Regular, automated re-crawling to keep your content up-to-date without manual intervention.
  • Resume Capabilities: Crawl your website incrementally—never lose your progress.

Advanced AI Integration and NLP Capabilities

  • AI Reader: Automatically summarizes lengthy text pages, distilling the key information clearly and concisely, eliminating distractions such as advertisements and irrelevant content.
  • Robust NLP and NER (Named Entity Recognition): Integrated with OpenNLP, automatically detects languages and extracts key entities like names, locations, and organizations.
  • Sentiment Analysis: Detects and scores sentiment, highlighting positive, negative, or potentially hateful content for deeper insights.

Rich and Customizable Search UI

  • Responsive, Embeddable UI: Easily integrate a polished, responsive search interface directly into your own website, or use it standalone.
  • Customizable Parameters: Personalize search experiences with parameters such as:
  • &topbar=off - Hide the top search bar.
  • &q=SEARCH_QUERY - Start searches with predefined queries.
  • &in=web/media/images - Narrow results by content type.
  • &og=yes/no - Control display of OpenGraph images.
  • &source=WEBSITE - Restrict searches to specific domains.

Full RAG (Retrieval-Augmented Generation) Search Capabilities

  • Opensolr pioneers the integration of fully automated RAG capabilities, delivering instant, context-rich AI-generated responses from your indexed data.

Superior Features for Enhanced Search Precision

  • Spellcheck & Autocomplete: Built-in, intelligent suggestions and corrections to enhance search accuracy.
  • Geo-Location Capabilities: Extracts and indexes GPS metadata from image files, supporting powerful geo-location-based search queries.
  • SEO-Friendly Crawling: Receive live SEO insights directly from the Web Crawler UI or REST API, optimizing your content as you crawl.

Scalability and Customization for Large Datasets

  • Fully Scalable: Index massive datasets seamlessly, perfect for large-scale websites and extensive file collections.
  • Tailored Customization: Expert customization and dedicated support for unique crawling and indexing requirements, ensuring the best search performance for your needs.

Seamless API Integration

  • Full Apache Solr API Access: Direct access to your crawled data through Solr’s robust API, enabling powerful custom integrations and advanced analytics.
  • Automation REST API: Effortlessly manage crawling processes programmatically, schedule tasks, and receive real-time updates.
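As a minimal sketch of what direct Solr API access looks like, the snippet below builds a `/select` query URL for a crawler index using only Python's standard library. The host and index name are placeholders, not real endpoints; substitute the values from your own Opensolr dashboard, and add your Opensolr credentials when actually fetching.

```python
from urllib.parse import urlencode

# Hypothetical Opensolr endpoint and index name -- replace with the
# values shown in your Opensolr control panel.
SOLR_BASE = "https://your-server.opensolr.com/solr/your_crawler_index"

def build_select_url(query, rows=10, start=0):
    """Build a Solr /select URL for querying the crawled documents."""
    params = {
        "q": query,      # full-text query against the crawled content
        "rows": rows,    # number of results per page
        "start": start,  # pagination offset
        "wt": "json",    # request a JSON response from Solr
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"

url = build_select_url("open source search", rows=5)
print(url)
# Fetching is omitted here; e.g. urllib.request.urlopen(url) with your
# Opensolr credentials would return the JSON result set.
```

The same `q`, `rows`, `start`, and `wt` parameters work from any HTTP client, so this pattern carries over directly to curl, JavaScript, or server-side integrations.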

A Complete Search Solution Out-of-the-Box

The Opensolr Web Crawler is the first platform of its kind to deliver a fully automated, AI-powered search solution that combines effortless indexing, advanced NLP features, and seamless API integration. Whether you’re looking for a turnkey solution or a deeply customized experience, Opensolr offers unmatched flexibility, scalability, and support.


Harness the power of advanced AI, natural language processing, and Apache Solr—all packaged in one comprehensive solution, instantly ready for deployment.

Start your journey with Opensolr today and redefine how your users discover information.

Learn More & Sign Up at Opensolr

🤖 Opensolr Web Crawler

Discover a seamless, AI-powered way to index, enrich, and search your web content—automatically.
Learn even more here.


What is the Opensolr Web Crawler?

The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size.
It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.

  • 🚀 Instantly searchable: All content becomes instantly searchable via a fully responsive, embeddable search UI.
  • 🤖 AI-driven enrichment: Named entities, sentiment, language detection, and more are extracted on the fly.
  • 🕑 Get started in minutes: Launch a powerful, custom search engine on your data without manual setup.

🔎 See It In Action

Try the Solr API for a live crawl.


⚡ Key Features

  • Full NLP and NER:
    Extract people, locations, organizations, and more using OpenNLP.

  • Comprehensive Metadata Extraction:
    Collects meta tags, page structure, creation dates, and document fields.

  • AI-Hints:
    Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.

  • Automatic Content Language Detection:
    Indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.

  • Responsive, Embeddable Search UI:
    Integrate Opensolr search into your site, customize top bar, filters, and behavior.

  • Scheduled Recrawling & Live Stats:
    Only new and updated content is fetched, with live stats for crawling and SEO.

  • Secure & Flexible:
    Supports HTTP Auth for protected content, robust backup and replication, and fully managed by API or UI.

  • Rich Content Support:
    Indexes and analyzes HTML, doc, docx, xls, PDF, and most image formats—extracting content, meta, GPS/location data, and sentiment.

  • Crawl Resume:
    Pause and resume crawls anytime; supports cron jobs and incremental indexing.


⚙️ Embedding & Customization

You can embed your Opensolr Web Crawler Search Engine on any website.
Customize your search experience with parameters such as:

  • &topbar=off – Hide the top search tool
  • &q=SEARCH_QUERY – Set the initial search
  • &in=web/media/images – Filter by content type
  • &og=yes/no – Show/hide OG images per result
  • &source=WEBSITE – Restrict to a single domain
  • &fresh=... – Apply result freshness or sentiment bias
  • &lang=en – Filter by language
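The parameters above can be assembled programmatically when generating embed links. The sketch below does this with Python's standard library; the embed base URL is a placeholder (your actual search UI URL comes from your Opensolr control panel), and only the parameters documented above are used.

```python
from urllib.parse import urlencode

# Hypothetical embed URL for an Opensolr Web Crawler search UI; the
# real base URL is provided in your Opensolr control panel.
EMBED_BASE = "https://opensolr.com/web_crawler_demo/your_index"

def embed_url(query="", topbar=True, content_type=None,
              og=True, source=None, lang=None):
    """Assemble the documented customization parameters into an embed URL."""
    params = {}
    if query:
        params["q"] = query          # initial search query
    if not topbar:
        params["topbar"] = "off"     # hide the top search tool
    if content_type:
        params["in"] = content_type  # one of: web, media, images
    params["og"] = "yes" if og else "no"  # show/hide OG images
    if source:
        params["source"] = source    # restrict to a single domain
    if lang:
        params["lang"] = lang        # filter by language
    return f"{EMBED_BASE}?{urlencode(params)}"

print(embed_url(query="solr hosting", topbar=False,
                content_type="web", source="opensolr.com"))
```

The resulting URL can be used directly in an `<iframe src="...">` or as a plain link to the standalone search page.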

🚀 What’s New

  • AI-Hints: Enabled by default for every crawler index.
  • Automatic Language Detection and advanced NER via OpenNLP.
  • Customizable for any language and analysis pipeline.
  • Full support for spellcheck, autocomplete, backup, and replication.
  • Live SEO & crawling stats and sentiment analysis.
  • Automated scheduling and easy management via UI or REST API.

📥 Solr Configuration for Crawling

To enable smooth crawling and full feature support, use our ready-made Solr configs:
- Solr 9 Config Zip Archive

To ensure all features work as designed, do not manually modify the schema.xml of crawler indexes.




Opensolr Web Crawler Standards

1. A page must respond within 5 seconds (that's the page/website response time, not the page download time); otherwise the page will be omitted from indexing.

2. Our web crawler will follow, but never index, dynamic pages (pages with a ? query string in the URL), such as https://website.com?query=value

3. In order to be indexed, pages must not include a meta tag of the form:

<meta name="robots" content="noindex" />

4. In order for their links to be followed, pages must not include a meta tag of the form:

<meta name="robots" content="nofollow" />

5. As in #3 and #4, any page that should appear in search results must not include "noindex", "nofollow", or "none" in its robots meta tag.

6. Pages that should appear in the search results, and are meant to be indexed and crawled, must not be disallowed in the site's robots.txt file (website.tld/robots.txt).

7. Pages should have a clear, concise title, avoiding duplicate titles where possible. Pages without any title will always be omitted from indexing.

8. Article pages should present a creation date via one of the following meta tags:

article:published_time

or

og:updated_time

9. As a best practice, #8 also applies to all other pages, so that fresh content can be correctly and consistently presented at the top of the search results for any given query.

10. Including an author, og:author, or article:creator meta tag is a best practice, even if the value is something generic such as "Admin", as it provides better data structure for search in the future.

11. A category or og:category tag will also help with faceting and a more consistent data structure.

12. If two or more pages at different URLs present the same content, each should include a canonical meta tag indicating which of the URLs should be indexed. Otherwise, the search API will present duplicates in the results.
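The standards above can be checked programmatically before a crawl. The sketch below is a minimal, illustrative checker using Python's standard-library `html.parser`; it is not the crawler's actual implementation, only a way to verify a page's title, robots directives, canonical tag, and publication-date meta tags as described in this list.

```python
from html.parser import HTMLParser

class StandardsChecker(HTMLParser):
    """Collects the signals the crawler standards care about: a <title>,
    robots meta directives, a canonical link, and a publication date."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self.robots = ""
        self.canonical = None
        self.published = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if attrs.get("name") == "robots":
                self.robots = (attrs.get("content") or "").lower()
            if attrs.get("property") in ("article:published_time",
                                         "og:updated_time"):
                self.published = attrs.get("content")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

def check_page(html):
    """Return a summary of how the page fares against the standards."""
    c = StandardsChecker()
    c.feed(html)
    blocked = ("noindex" in c.robots) or ("none" in c.robots)
    return {
        "indexable": bool(c.title.strip()) and not blocked,      # #3, #5, #7
        "followable": "nofollow" not in c.robots
                      and "none" not in c.robots,                # #4, #5
        "has_date": c.published is not None,                     # #8, #9
        "canonical": c.canonical,                                # #12
    }
```

Running `check_page` over a sample of your pages before pointing the crawler at them is a quick way to catch missing titles, stray noindex directives, or absent date tags.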





