Leech

Crawling capabilities for Apache Tika



Leech crawls content out of, for example, file systems, http(s) sources (web crawling), or imap(s) servers. It does so by offering additional Apache Tika parsers that provide these crawling capabilities.
It is the RDF-free successor of Aperture from the DFKI GmbH Knowledge Management group and is available under the terms of the GPLv3. If you need a different license or want to start a project with us, feel free to contact us!

The key intentions of Leech:




Crawl something incrementally in 1 minute:

String strSourceUrl = "URL4FileOrDirOrWebsiteOrImapfolderOrImapmessageOrSomething";

Leech leech = new Leech();
CrawlerContext crawlerContext = new CrawlerContext();
// setting a history path enables incremental crawling: subsequent runs
// report only new, modified, or removed data entities
crawlerContext.setIncrementalCrawlingHistoryPath("./history/4SourceUrl");
leech.parse(strSourceUrl, new DataSinkContentHandlerAdapter()
{
    @Override
    public void processNewData(Metadata metadata, String strFulltext)
    {
        System.out.println("Extracted metadata:\n" + metadata + "\nExtracted fulltext:\n" + strFulltext);
    }
    @Override
    public void processModifiedData(Metadata metadata, String strFulltext)
    {
        // invoked for entities whose content changed since the last crawl
    }
    @Override
    public void processRemovedData(Metadata metadata)
    {
        // invoked for entities that disappeared since the last crawl
    }
    @Override
    public void processErrorData(Metadata metadata)
    {
        // invoked when an entity could not be processed
    }
}, crawlerContext.createParseContext());
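Because the crawling history persists on disk under the configured path, running the same crawl a second time reports only the differences. A minimal sketch of such a second run, using the same classes as above (the delta semantics per callback are an assumption based on the handler method names; verify against your Leech version):

```java
// Second crawl against the same source URL and the same history path:
// unchanged entities are skipped, only deltas reach the handler.
Leech leech = new Leech();
CrawlerContext crawlerContext = new CrawlerContext();
crawlerContext.setIncrementalCrawlingHistoryPath("./history/4SourceUrl");
leech.parse(strSourceUrl, new DataSinkContentHandlerAdapter()
{
    @Override
    public void processNewData(Metadata metadata, String strFulltext)
    {
        // entities added to the source since the first run
    }
    @Override
    public void processModifiedData(Metadata metadata, String strFulltext)
    {
        // entities whose content changed since the first run
    }
    @Override
    public void processRemovedData(Metadata metadata)
    {
        // entities deleted from the source since the first run
    }
    @Override
    public void processErrorData(Metadata metadata)
    {
        // entities that could not be processed
    }
}, crawlerContext.createParseContext());
```

Deleting the history directory resets the state, so the next crawl again treats every entity as new.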