How to Improve Solr Crawl Times

The following options may shorten your Solr crawl and reduce CPU usage while it runs.

  1. Remove unnecessary items from the crawl registration page. Documents and community members require substantial resources to crawl, so if they are not needed, uncheck them and re-register the sites.
  2. Improve the hardware. Add RAM if the current amount is insufficient, upgrade the CPU, and/or use a faster HDD.
  3. Put the index and assetcache on separate disks. This can be done during installation.  
  4. Decide which folders do not need to be indexed and mark them as unsearchable. After changing the folder property see these steps.
  5. Check the ManifoldCF logs and compare their timestamps with periods of high CPU usage. The logs should contain content IDs, which can be looked up in the content table of the database. You may be able to identify common items that cause high CPU usage, such as PDFs or other file types. If they are not needed, they can be marked as unsearchable.
  6. If a site was ever on 8.02 or earlier, there will be unneeded text files that were used for Windows Indexing Search. Remove these unneeded txt files, which slow down crawls.
  7. Mark library images as unsearchable if they are not needed in search results. Even when documents are not selected on the registration page, library images in the content table are still crawled.

    This can be done with the following database query.

    update content
    set searchable = 0
    where content_id in
    (select content_id from library where libtype_id = 1)
  8. Mark MP3s and MP4s as unsearchable. 

    update content 
    set searchable=0 
    where asset_id in 
    (select id
    from assetdatatable
    where mimetype like '%mp3%' OR mimetype like '%mp4%')
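
Before running the updates in steps 7 and 8, it may help to check how many rows each one would touch. The queries below are a minimal sketch. They reuse only the tables and columns that appear in the queries above, and assume that any non-zero value of searchable means the item is still searchable. The column aliases are illustrative only.

```sql
-- Preview for step 7: library images that are still marked searchable.
-- Assumes searchable <> 0 means "searchable", mirroring the update above.
select count(*) as library_images_to_update
from content
where searchable <> 0
  and content_id in
    (select content_id from library where libtype_id = 1);

-- Preview for step 8: MP3 and MP4 assets that are still marked searchable.
select count(*) as media_assets_to_update
from content
where searchable <> 0
  and asset_id in
    (select id
     from assetdatatable
     where mimetype like '%mp3%' or mimetype like '%mp4%');
```

If either count is unexpectedly large or small, review the matching rows before applying the corresponding update, and take a database backup first.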