The Opensolr Web Crawler is an innovative and comprehensive solution designed for businesses and individuals looking to rapidly deploy an advanced, AI-powered search engine. With just a few clicks, you can create and deploy an intelligent, fully functional search interface powered by Apache Solr, tailored to your specific content needs.
&topbar=off - Hide or display the top search bar.
&q=SEARCH_QUERY - Start searches with predefined queries.
&in=web/media/images - Narrow results by content type.
&og=yes/no - Control display of OpenGraph images.
&source=WEBSITE - Restrict searches to specific domains.
The Opensolr Web Crawler is the first platform of its kind to deliver a fully automated, AI-powered search solution that combines effortless indexing, advanced NLP features, and seamless API integration. Whether you’re looking for a turnkey solution or a deeply customized experience, Opensolr offers unmatched flexibility, scalability, and support.
Harness the power of advanced AI, natural language processing, and Apache Solr—all packaged in one comprehensive solution, instantly ready for deployment.
Start your journey with Opensolr today and redefine how your users discover information.
Discover a seamless, AI-powered way to index, enrich, and search your web content—automatically.
The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size.
It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.
You can also try the Solr API for a live crawl.
Full NLP and NER:
Extract people, locations, organizations, and more using OpenNLP.
Comprehensive Metadata Extraction:
Collects meta tags, page structure, creation dates, and document fields.
AI-Hints:
Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.
Automatic Content Language Detection:
Indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.
Responsive, Embeddable Search UI:
Integrate Opensolr search into your site and customize the top bar, filters, and behavior.
Scheduled Recrawling & Live Stats:
Only new and updated content is fetched, with live stats for crawling and SEO.
Secure & Flexible:
Supports HTTP Auth for protected content, offers robust backup and replication, and can be fully managed by API or UI.
Rich Content Support:
Indexes and analyzes HTML, doc, docx, xls, PDF, and most image formats—extracting content, meta, GPS/location data, and sentiment.
Crawl Resume:
Pause and resume crawls anytime; supports cron jobs and incremental indexing.
You can embed your Opensolr Web Crawler Search Engine on any website.
Customize your search experience with parameters such as:
&topbar=off – Hide the top search tool
&q=SEARCH_QUERY – Set the initial search
&in=web/media/images – Filter by content type
&og=yes/no – Show/hide OG images per result
&source=WEBSITE – Restrict to a single domain
&fresh=... – Apply result freshness or sentiment bias
&lang=en – Filter by language
To enable smooth crawling and full feature support, use our ready-made Solr configs:
- Solr 9 Config Zip Archive
To ensure all features work as designed, do not manually modify the schema.xml of your crawler indexes.
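The embed parameters listed earlier (topbar, q, in, og, source, fresh, lang) are ordinary URL query parameters, so an embed link can be assembled programmatically. The following is a minimal sketch, not an official client: the base URL is a hypothetical placeholder (the address of your own search page is not given here), the helper name build_embed_url is made up for illustration, and the example assumes that in takes one of the values web, media, or images.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the address of your own
# Opensolr search page. It is an assumption, not a documented URL.
BASE_URL = "https://search.example.com/my-index"

# Parameter names taken from the list of embed parameters.
ALLOWED = ("topbar", "q", "in", "og", "source", "fresh", "lang")


def build_embed_url(base_url, params):
    """Return base_url with the given embed parameters as a query string."""
    unknown = set(params) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    # Emit parameters in a fixed order so generated URLs are stable.
    ordered = [(name, params[name]) for name in ALLOWED if name in params]
    return f"{base_url}?{urlencode(ordered)}"


url = build_embed_url(
    BASE_URL,
    {"topbar": "off", "q": "solr tutorial", "in": "web", "og": "no", "lang": "en"},
)
print(url)
# https://search.example.com/my-index?topbar=off&q=solr+tutorial&in=web&og=no&lang=en
```

A dictionary is used rather than keyword arguments because in is a reserved word in Python. urlencode takes care of escaping the search query.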
1. Each page has to respond in less than 5 seconds (this is the page or website response time, not the page download time); otherwise, the page will be omitted from indexing.
2. Our web crawler will follow, but never index, dynamic pages (pages with a ? query in the URL), such as https://website.com?query=value
3. In order to be indexed, pages must not contain a meta tag of the form:
<meta name="robots" content="noindex" />
4. In order to have their links followed, pages must not contain a meta tag of the form:
<meta name="robots" content="nofollow" />
5. As with #3 and #4, any page that should appear in search results must not include noindex, nofollow, or none in its robots meta tag.
6. Pages that should be crawled, indexed, and shown in search results must not be disallowed in the site's robots.txt file (website.tld/robots.txt).
7. Pages should have a clear, concise title, and duplicate titles should be avoided where possible. Pages without any title will always be omitted from indexing.
8. Article pages should present a creation date, by either one of the following meta tags:
article:published_time
or
og:updated_time
9. As a best practice, #8 applies to all other pages as well, so that fresh content can be correctly and consistently presented at the top of the search results for any given query.
10. Including an author, og:author, or article:creator meta tag is a best practice, even if its value is something generic such as "Admin", because it provides better data structure for search in the future.
11. Presence of a category or og:category tag will also help with faceting and more consistent data structure.
12. If two or more pages at different URLs present the same content, each of them should have a canonical meta tag indicating which of the URLs should be indexed. Otherwise, the search API will return duplicates in the results.
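Several of the rules above can be checked locally before a crawl. The sketch below uses only the Python standard library and is not Opensolr's own logic: the names check_page and _HeadScanner are invented for this example, and it assumes the canonical URL is declared with the standard link rel="canonical" element. It covers rules 2 to 5 and 7 to 12 for a single page; the response time and robots.txt rules need network access and are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _HeadScanner(HTMLParser):
    """Collect the <title>, <meta> values, and canonical link of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}  # lower-cased name/property -> content
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (a.get("name") or a.get("property") or "").lower()
            if key:
                self.meta[key] = a.get("content", "")
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page(url, html):
    """Return a list of problems found; an empty list means none were found."""
    scan = _HeadScanner()
    scan.feed(html)
    problems = []
    if urlsplit(url).query:
        problems.append("dynamic URL (has a ? query): followed but not indexed")
    robots = {t.strip() for t in scan.meta.get("robots", "").lower().split(",")}
    if robots & {"noindex", "none"}:
        problems.append("robots meta blocks indexing")
    if robots & {"nofollow", "none"}:
        problems.append("robots meta blocks link following")
    if not scan.title.strip():
        problems.append("missing <title>: page is omitted from indexing")
    if not any(k in scan.meta for k in ("article:published_time", "og:updated_time")):
        problems.append("best practice: no creation date meta tag")
    if not any(k in scan.meta for k in ("author", "og:author", "article:creator")):
        problems.append("best practice: no author meta tag")
    if not any(k in scan.meta for k in ("category", "og:category")):
        problems.append("best practice: no category meta tag")
    if scan.canonical is None:
        problems.append("best practice: no canonical link")
    return problems


sample = """<html><head>
<title>Hello</title>
<meta name="robots" content="noindex, follow" />
<meta property="article:published_time" content="2024-01-01T00:00:00Z" />
</head><body></body></html>"""

for problem in check_page("https://website.com/post?query=value", sample):
    print(problem)
```

The checks are kept as a flat list of messages so that hard requirements (robots directives, title, dynamic URLs) and best practices (date, author, category, canonical) can be told apart by their prefix.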