x-ray alternatives and similar modules
Based on the "Parsing" category.
markdown-it
Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
remark
remark is a popular tool that transforms markdown with plugins. These plugins can inspect and change your markup. You can use remark on the server, the client, CLIs, deno, etc.
nearley
Simple, fast, powerful parser toolkit for JavaScript.
@parcel/css
An extremely fast CSS parser, transformer, bundler, and minifier written in Rust.
parse5
HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
fast-xml-parser
Validate XML, Parse XML and Build XML rapidly without C/C++ based libraries and no callback.
csv-parser
Streaming csv parser inspired by binary-csv that aims to be faster than everyone else
google-libphonenumber
The up-to-date and reliable Google's libphonenumber package for node.js.
xlsx-populate
Excel XLSX parser/generator written in JavaScript with Node.js and browser support, jQuery/d3-style method chaining, encryption, and a focus on keeping existing workbook features and styles intact.
json-mask
Tiny language and engine for selecting specific parts of a JS object, hiding the rest.
strip-json-comments
Strip comments from JSON. Lets you use comments in your JSON files!
Awesome phonenumber parser
Google's libphonenumber pre-compiled with the closure compiler
binary-extract
Extract a value from a buffer of json without parsing the whole thing
parsec
Tiniest body parser in the universe. Built for modern Node.js
docx-to-pdf-on-AWS-Lambda
Microsoft Word doc/docx to PDF conversion on AWS Lambda using Node.js
README
var Xray = require('x-ray')
var x = Xray()
x('https://blog.ycombinator.com/', '.post', [
  {
    title: 'h1 a',
    link: '.article-title@href'
  }
])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')
Installation
npm install x-ray
Features
Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.
Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.
Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.
Pluggable drivers: Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
Selector API
xray(url, selector)(fn)
Scrape the url for the following selector, returning an object in the callback fn.
The selector takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is selecting the innerText.
Here are a few examples:
- Scrape a single tag
xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})
- Scrape a single class
xray('http://reddit.com', '.content')(fn)
- Scrape an attribute
xray('http://techcrunch.com', 'img.logo@src')(fn)
- Scrape innerHTML
xray('http://news.ycombinator.com', 'body@html')(fn)
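To make the attribute syntax concrete, here is a minimal sketch of how a selector string could be split into its selector and attribute parts. The splitSelector helper is hypothetical and is not part of x-ray; it only models the syntax described above, and it assumes the last "@" in the string marks the attribute.

```javascript
// Hypothetical helper, not part of x-ray: split "selector@attribute" into parts.
// Assumption: the last "@" separates the selector from the attribute, so a
// selector whose own text contains "@" after that point would be split wrongly.
function splitSelector(input) {
  var at = input.lastIndexOf('@')
  // No "@" means no attribute was requested; fall back to the documented default.
  if (at === -1) {
    return { selector: input.trim(), attribute: 'innerText' }
  }
  return {
    selector: input.slice(0, at).trim(),
    attribute: input.slice(at + 1).trim()
  }
}

console.log(splitSelector('a@href'))
console.log(splitSelector('title'))
```

Splitting on the last "@" keeps attribute selectors such as [data-src] intact on the selector side.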
xray(url, scope, selector)
You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).
xray(html, scope, selector)
Instead of a url, you can also supply raw HTML and all the same semantics apply.
var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})
API
xray.driver(driver)
Specify a driver to make requests through. Available drivers include:
- request - A simple driver built around request. Use this to set headers, cookies or http methods.
- phantom - A high-level browser automation library. Use this to render pages or when elements need to be interacted with, or when elements are created dynamically using javascript (e.g.: Ajax-calls).
xray.stream()
Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:
var app = require('express')()
var x = require('x-ray')()
app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream()
  stream.pipe(res)
})
xray.write([path])
Stream the results to a path.
If no path is provided, then the behavior is the same as .stream().
xray.then(cb)
Constructs a Promise object and invokes its then function with a callback cb. Be sure to invoke then() as the last step of xray method chaining, since the other methods are not promisified.
x('https://dribbble.com', 'li.group', [
  {
    title: '.dribbble-img strong',
    image: '.dribbble-img [data-src]@data-src'
  }
])
  .paginate('.next_page@href')
  .limit(3)
  .then(function(res) {
    console.log(res[0]) // prints first result
  })
  .catch(function(err) {
    console.log(err) // handle error in promise
  })
xray.paginate(selector)
Select a url from a selector and visit that page.
xray.limit(n)
Limit the amount of pagination to n requests.
xray.abort(validator)
Abort pagination if the validator function returns true.
The validator function receives two arguments:
- result: The scrape result object for the current page.
- nextUrl: The URL of the next page to scrape.
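As an illustration, a validator might look like the sketch below. The shouldAbort name, the empty-page rule, and the '/archive/' path check are all assumptions made for this example; only the (result, nextUrl) argument order comes from the description above.

```javascript
// Hypothetical validator: stop paginating when the current page produced no
// items, or when the next URL leaves an assumed "/archive/" section.
function shouldAbort(result, nextUrl) {
  var emptyPage = Array.isArray(result) && result.length === 0
  var leftArchive =
    typeof nextUrl === 'string' && nextUrl.indexOf('/archive/') === -1
  return emptyPage || leftArchive
}

// Intended use, in a chain such as:
//   x(url, scope, [schema]).paginate(selector).abort(shouldAbort)
console.log(shouldAbort([], 'https://example.com/archive/2'))
```

Keeping the validator a plain function of its two arguments makes it easy to check without touching the network.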
xray.delay(from, [to])
Delay the next request between from and to milliseconds.
If only from is specified, delay exactly from milliseconds.
var x = Xray().delay('1s', '10s')
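The rule above can be modelled with a small function. This is only a sketch of the documented behaviour, not x-ray's implementation: pickDelay is a hypothetical name, it works in plain millisecond numbers rather than strings like '1s', and it assumes the delay is drawn uniformly from the range.

```javascript
// Hypothetical model of the documented delay rule, in plain milliseconds.
function pickDelay(from, to) {
  // Only `from` given: delay exactly that long.
  if (to === undefined) return from
  var low = Math.min(from, to)
  var high = Math.max(from, to)
  // Assumption: a uniformly random whole number of milliseconds in [low, high].
  return low + Math.floor(Math.random() * (high - low + 1))
}

console.log(pickDelay(1000))
console.log(pickDelay(1000, 10000))
```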
xray.concurrency(n)
Set the request concurrency to n. Defaults to Infinity.
var x = Xray().concurrency(2)
xray.throttle(n, ms)
Throttle the requests to n requests per ms milliseconds.
var x = Xray().throttle(2, '1s')
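One way to picture "n requests per ms milliseconds" is a fixed-window counter, sketched below. This is an assumption about the general technique rather than a description of how x-ray throttles internally; createThrottle and tryAcquire are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```javascript
// Hypothetical fixed-window limiter; x-ray's own throttling may work differently.
function createThrottle(n, ms) {
  var windowStart = null
  var used = 0
  // Returns true if a request may start at time `now` (in milliseconds).
  return function tryAcquire(now) {
    // Open a new window on first use or once the current one has elapsed.
    if (windowStart === null || now - windowStart >= ms) {
      windowStart = now
      used = 0
    }
    if (used < n) {
      used += 1
      return true
    }
    return false
  }
}

var allow = createThrottle(2, 1000)
console.log(allow(0), allow(10), allow(20), allow(1000)) // true true false true
```

With a limit of 2 per 1000 ms, the third call inside the first window is refused, and the call at 1000 ms opens a fresh window.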
xray.timeout(ms)
Specify a timeout of ms milliseconds for each request.
var x = Xray().timeout(30)
Collections
X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.
Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).
Composition
X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:
Crawling to another site
var Xray = require('x-ray')
var x = Xray()
x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title') // follow link to google images
})(function(err, obj) {
  /*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})
Scoping a selection
var Xray = require('x-ray')
var x = Xray()
x('http://mat.io', {
  title: 'title',
  items: x('.item', [
    {
      title: '.item-content h2',
      description: '.item-content section'
    }
  ])
})(function(err, obj) {
  /*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})
Filters
Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.
var Xray = require('x-ray')
var x = Xray({
  filters: {
    trim: function(value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function(value) {
      return typeof value === 'string'
        ? value
            .split('')
            .reverse()
            .join('')
        : value
    },
    slice: function(value, start, end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
})
x('http://mat.io', {
  title: 'title | trim | reverse | slice:2,3'
})(function(err, obj) {
  /*
  {
    title: 'oi'
  }
*/
})
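To show how a chain of filters composes, the sketch below applies a filter string to a value from left to right. The applyFilters function is hypothetical and is not x-ray's parser; it takes only the filter part of the string (without the selector) and assumes that arguments after ":" are comma-separated numbers.

```javascript
// Hypothetical interpreter for a "filter | filter:arg,arg" chain.
// Assumption: arguments are numeric and separated by commas.
function applyFilters(value, chain, filters) {
  return chain
    .split('|')
    .map(function(part) {
      return part.trim()
    })
    .filter(Boolean)
    .reduce(function(current, part) {
      var pieces = part.split(':')
      var name = pieces[0].trim()
      var args = pieces[1] ? pieces[1].split(',').map(Number) : []
      if (typeof filters[name] !== 'function') {
        throw new Error('Unknown filter: ' + name)
      }
      // Each filter receives the current value first, then its arguments.
      return filters[name].apply(null, [current].concat(args))
    }, value)
}

var exampleFilters = {
  trim: function(value) {
    return value.trim()
  },
  reverse: function(value) {
    return value
      .split('')
      .reverse()
      .join('')
  },
  slice: function(value, start, end) {
    return value.slice(start, end)
  }
}

console.log(applyFilters('  hello ', 'trim | reverse | slice:0,2', exampleFilters)) // ol
```

Here '  hello ' is trimmed to 'hello', reversed to 'olleh', and sliced to its first two characters.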
Examples
- selector: simple string selector
- collections: selects an object
- arrays: selects an array
- collections of collections: selects an array of objects
- array of arrays: selects an array of arrays
In the Wild
- Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.
Resources
Backers
Support us with a monthly donation and help us continue our activities. [Become a backer]
Sponsors
Become a sponsor and get your logo on our website and on our README on Github with a link to your site. [Become a sponsor]
License
MIT
*Note that all licence references and agreements mentioned in the x-ray README section above
are relevant to that project's source code only.