Filters work in recursive and non-recursive mode.
1. Crawl all URLs except URLs in the main news folder and its subfolders
2. Crawl all URLs except URLs in the main news folder, but do crawl the /important subfolder under news
3. Crawl only the products folders anywhere in the site, with all subfolders
4. Crawl only the products folders anywhere in the site, but exclude their /archive subfolder and their numbered archive subfolders such as /archive1, /archive2...
5. All use cases by example
URLs to be filtered:
http://www.msd-animal-health.com/foo
http://www.msd-animal-health.com/bar
http://www.msd-animal-health.com/foobar
Filter result:
exclude filter | include filter | filter priority | URLs to be crawled |
---|---|---|---|
none | none | | http://www.msd-animal-health.com/foo http://www.msd-animal-health.com/bar http://www.msd-animal-health.com/foobar |
foo | none | | http://www.msd-animal-health.com/bar |
foo | bar | exclude even if included | http://www.msd-animal-health.com/bar |
foo | bar | include even if excluded | http://www.msd-animal-health.com/bar http://www.msd-animal-health.com/foobar |
bar | none | | http://www.msd-animal-health.com/foo |
bar | foo | exclude even if included | http://www.msd-animal-health.com/foo |
bar | foo | include even if excluded | http://www.msd-animal-health.com/foo http://www.msd-animal-health.com/foobar |
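
The filter results above can be reproduced with a short sketch. This is not the crawler's actual implementation: it assumes that a filter matches when its text appears anywhere in the URL path (which is why `foo` also matches `/foobar`), and that a URL matching neither filter is crawled. The example URLs do not exercise that second case, so the real behavior may differ. The names `is_crawled`, `crawl`, `EXCLUDE_WINS` and `INCLUDE_WINS` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical constants mirroring the two "filter priority" settings.
EXCLUDE_WINS = "exclude even if included"
INCLUDE_WINS = "include even if excluded"

URLS = [
    "http://www.msd-animal-health.com/foo",
    "http://www.msd-animal-health.com/bar",
    "http://www.msd-animal-health.com/foobar",
]


def is_crawled(url, exclude=None, include=None, priority=EXCLUDE_WINS):
    """Decide whether a URL is crawled under the given filters.

    Assumption: a filter matches by plain substring search on the URL path.
    Assumption: a URL that matches neither filter is crawled.
    """
    path = urlparse(url).path
    excluded = exclude is not None and exclude in path
    included = include is not None and include in path
    if excluded and included:
        # Both filters match, so the priority setting decides.
        return priority == INCLUDE_WINS
    return not excluded


def crawl(urls, exclude=None, include=None, priority=EXCLUDE_WINS):
    """Return the URLs that pass the filters, in their original order."""
    return [u for u in urls if is_crawled(u, exclude, include, priority)]


if __name__ == "__main__":
    # Exclude "foo", include "bar", include has priority.
    for url in crawl(URLS, exclude="foo", include="bar", priority=INCLUDE_WINS):
        print(url)
```

Only `/foobar` matches both filters in these examples, so it is the only URL whose result changes with the priority setting.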