Data Acquisition and Extraction
An enterprise-grade, large-scale web crawler and extraction engine built on the Datoin Platform for all your data needs
Benefits of using Datoin Platform for Crawling
Crawling, or data acquisition, is just one component in the complete extraction pipeline. The Datoin platform lets us configure extractions quickly and implement custom business logic easily using existing off-the-shelf components. As a result, we can build proofs of concept and deliver faster than any custom solution. And did we forget to say it is scalable too? The Datoin platform is built on top of Apache Hadoop and a customised Nutch crawler.
Extraction and Analysis
Transformation: Transform the different schemas of various websites into one standardised schema that you can readily consume in your app.
Filtering: Weed out undesired records, unwanted categories, and records with incomplete information.
De-duplication: Get only unique records by removing redundant ones.
Customised Extraction: A site-specific or site-agnostic scraping tool that can be customised to your data needs.
Predefined Extraction: NLP and machine-learning extractors for verticals such as restaurants, products, and news help you get started.
Classification: Classify your extracted data into predefined categories using our trained machine-learning components, e.g. news category, sentiment, etc.
Clustering: Carry out cluster analysis on crawled data to identify patterns and use them in your application.
Aggregation: Get great insights about the data using our customisable aggregation components.
Add-On Analysis: Easily add custom analysis components, such as enrichment or data sources other than the web, to your crawl pipeline.
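As a rough illustration of the first three stages above, the sketch below applies transformation, filtering, and de-duplication to records scraped from two hypothetical sites with different schemas. It is a minimal standalone example, not Datoin's actual API; the site names, field mappings, and record shapes are all assumptions made up for the demo.

```python
# Illustrative sketch only (not the Datoin API): transformation,
# filtering, and de-duplication over scraped records.

# Hypothetical per-site mappings into one standardised schema.
FIELD_MAPS = {
    "site_a": {"restaurant_name": "name", "stars": "rating"},
    "site_b": {"title": "name", "score": "rating"},
}

def transform(record, site):
    """Transformation: map a site-specific record onto the standardised schema."""
    mapping = FIELD_MAPS[site]
    return {std: record[src] for src, std in mapping.items() if src in record}

def keep(record):
    """Filtering: drop records with incomplete information."""
    return bool(record.get("name")) and record.get("rating") is not None

def dedupe(records):
    """De-duplication: keep only the first record seen for each name."""
    seen, unique = set(), []
    for r in records:
        if r["name"] not in seen:
            seen.add(r["name"])
            unique.append(r)
    return unique

raw = [
    ("site_a", {"restaurant_name": "Cafe Uno", "stars": 4.5}),
    ("site_b", {"title": "Cafe Uno", "score": 4.4}),   # duplicate of the above
    ("site_b", {"title": "Bistro Two"}),               # incomplete: no rating
]
records = dedupe([r for r in (transform(rec, site) for site, rec in raw) if keep(r)])
print(records)  # a single clean, standardised record for "Cafe Uno"
```

In a real pipeline each stage would be a configurable component rather than a hand-written function, but the data flow is the same: per-site schemas in, one standardised, filtered, unique record set out.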
Datoin Cloud Hosting: Forget about maintenance and scalability issues. Use the Datoin-hosted crawler to get data and focus on your core problem.
On-Premise or Private Cloud: The undocking feature of the Datoin Platform lets us deploy on your private cloud.
Focused Crawling: Our crawler learns the paths to pages that contain the needed information, optimising your crawl.
Continuous Crawling: Get continuously refreshed data such as news, product prices, etc.
Scheduled Crawling: Schedule your crawl to check for new data periodically.
Data Formats: Get the data in whatever format you need, such as XML, JSON, Excel, or CSV.
APIs: Convert your crawler into REST APIs and integrate data directly into your applications!
Web Hooks: Export data to cloud storage (Amazon S3), hosted search indices (Solr, Elasticsearch), databases (MongoDB, MySQL), and many more.
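To give a feel for the output side, here is a small standalone sketch, using only the Python standard library, of serialising the same standardised records into two of the supported formats: JSON (as a web hook or REST API would deliver it) and CSV (ready for Excel or a database bulk load). The record shape is assumed for illustration; it is not Datoin's actual export format.

```python
import csv
import io
import json

# Hypothetical standardised records produced by the extraction pipeline.
records = [
    {"name": "Cafe Uno", "rating": 4.5},
    {"name": "Bistro Two", "rating": 4.0},
]

# JSON: suitable as a web hook POST body or a REST API response.
as_json = json.dumps(records, indent=2)

# CSV: suitable for Excel or a bulk database import.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "rating"])
writer.writeheader()
writer.writerows(records)
as_csv = buf.getvalue()

print(as_json)
print(as_csv)
```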
Interested in how Datoin can be useful to you?