Nutch
Apache Nutch is a mature, highly extensible, and scalable open source web crawler, commonly run on Linux distributions. It supports fine-grained configuration and relies on Apache Hadoop data structures. It provides extensible interfaces, including parse, index, and scoring filters, for custom implementations. In addition, pluggable indexing is available for Apache Solr, Elasticsearch, and others. Nutch can run on a single machine, but it gains much of its strength when run on a Hadoop cluster.
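
The idea of pluggable parse, index, and scoring filters can be illustrated with a small, self-contained sketch. This is not Nutch's API: Nutch itself is written in Java and defines its own extension points, and every name below (`Page`, `run_filters`, the three filter functions) is hypothetical and chosen only for illustration. The sketch assumes each filter receives a page and either returns it, possibly modified, or returns `None` to drop it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    """Hypothetical crawled-page record; not a Nutch class."""
    url: str
    text: str
    fields: Dict[str, str] = field(default_factory=dict)
    score: float = 1.0


# A filter returns the (possibly modified) page, or None to drop it.
PageFilter = Callable[[Page], Optional[Page]]


def drop_empty(page: Page) -> Optional[Page]:
    """Parse-style filter: discard pages that contain no text."""
    return page if page.text.strip() else None


def add_length_field(page: Page) -> Optional[Page]:
    """Index-style filter: store the text length as an extra field."""
    page.fields["length"] = str(len(page.text))
    return page


def boost_short_urls(page: Page) -> Optional[Page]:
    """Scoring-style filter: favor shallow URLs (an arbitrary example rule)."""
    # Count path separators, ignoring the two slashes after the scheme.
    depth = page.url.rstrip("/").count("/") - 2
    page.score /= 1 + max(depth, 0)
    return page


def run_filters(page: Page, filters: List[PageFilter]) -> Optional[Page]:
    """Apply filters in order; stop early if any filter drops the page."""
    current: Optional[Page] = page
    for apply_filter in filters:
        current = apply_filter(current)
        if current is None:
            return None
    return current
```

A caller would assemble the chain from whichever filters are enabled, for example `run_filters(Page("http://example.com/a/b", "hello"), [drop_empty, add_length_field, boost_short_urls])`. Keeping each filter independent of the others is what lets custom implementations be added or removed without touching the crawler core.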