Wednesday, May 28, 2014

Create RSS Feed from any web page

Step-by-step example of feed setup

Feed43 uses the following principles to convert a web page into an RSS feed:
  1. It applies the global search pattern once to extract the part of the web page that contains news items.
  2. It repeatedly applies the item search pattern to the part of the web page defined in the previous step to locate a list of news items and parse their attributes (title, link and, possibly, content).
  3. These attributes are substituted into the feed template to form the feed body.
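The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Feed43's actual implementation: the page, the regular expressions, and the function name build_feed_items are all invented for this example, and plain regular expressions stand in for Feed43's own pattern syntax.

```python
import re

# Hypothetical page, invented for illustration only.
PAGE = """
<html><body>
<div id="menu"><a href="/about">About</a></div>
<ul id="news">
<li><a href="/news/1">First headline</a></li>
<li><a href="/news/2">Second headline</a></li>
</ul>
</body></html>
"""

def build_feed_items(page: str) -> list[str]:
    # Step 1: apply the global pattern once to isolate the news block.
    block = re.search(r'<ul id="news">(.*?)</ul>', page, re.DOTALL)
    if block is None:
        return []
    # Step 2: apply the item pattern repeatedly inside that block only.
    items = re.findall(r'<a href="(.*?)">(.*?)</a>', block.group(1), re.DOTALL)
    # Step 3: substitute each item's attributes into a template.
    template = "<item><title>{title}</title><link>{link}</link></item>"
    return [template.format(title=title, link=link) for link, title in items]

feed_items = build_feed_items(PAGE)
for item in feed_items:
    print(item)
```

Because the item pattern only runs inside the block found in step 1, the "About" link in the menu never becomes a feed item, even though it would match the item pattern on its own.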
Now, straight to the example...

Let's imagine we want to track Orange careers for London. We go to the job search form on the Orange web site, select London as the location, and get the following page:
Web page screenshot
The URL of this page will be something like http://www.orangejobs.co.uk/fe/tpl_orange01.asp?KEY=3706675&C=742523864612&PAGESTAMP=dbetewllsnyvwhgtgg&nexts=INIT_JOBLISTSTART &nextss=&mode=1&newQuery=yes&searchrefno=&searchlocation=1340&searchdivision=0&searchtext= &formsubmit4=Search+and+Apply. Don't be put off by such a long URL; this is fine.
We create a new feed with Feed43, copy the URL of the page from the browser into the 'Address' field, and press the [Reload] button. Feed43 will download the page and show its HTML source with syntax highlighting (see the picture below).
Now we should find the part of the page with the content we need. In our case we can search for the string ‘Direct Acquisition Campain Manager’ (the first position in the list):
Viewing page source
After analyzing the HTML source we can see that every job item starts with <td class=searchresultsjoblink> and contains a link whose title="..." and href="..." attributes can be used as the news item title and link respectively. Each job description (which can be used as the news item content) starts with <td class=searchresultstopofjobdesc> and ends with </td>.
Now we can write an item search pattern that extracts these text snippets:
<td class=searchresultsjoblink>{*}
title="{%}"{*}
href="{%}"{*}
<td class=searchresultstopofjobdesc>{%}</td>
This means: find <td class=searchresultsjoblink>, then take anything between title=" and " as the first parameter, then take anything between href=" and " as the second parameter, then take anything between <td class=searchresultstopofjobdesc> and </td> as the third parameter.
Line breaks in search patterns are ignored; they serve only to format the pattern readably.
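To make the pattern's meaning concrete, here is a rough Python sketch that translates it into a regular expression. It assumes {*} means "skip any text", {%} means "capture text as the next parameter", and line breaks in the pattern are simply dropped, as described above. The function name pattern_to_regex and the sample HTML (including the second job) are invented for this example; Feed43's real matching rules may differ in details.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Feed43-style pattern into a compiled regex (a sketch)."""
    # Drop line breaks, since the tutorial says they play no role.
    flat = pattern.replace("\r", "").replace("\n", "")
    parts = []
    # Split on the two placeholders, keeping them as separate tokens.
    for token in re.split(r"(\{\*\}|\{%\})", flat):
        if token == "{*}":
            parts.append(".*?")             # skip any text
        elif token == "{%}":
            parts.append("(.*?)")           # capture text as a parameter
        else:
            parts.append(re.escape(token))  # literal text must match exactly
    return re.compile("".join(parts), re.DOTALL)

ITEM_PATTERN = """<td class=searchresultsjoblink>{*}
title="{%}"{*}
href="{%}"{*}
<td class=searchresultstopofjobdesc>{%}</td>"""

# Hypothetical markup shaped like the job list described in the tutorial.
SAMPLE = """
<tr><td class=searchresultsjoblink>
<a title="Direct Acquisition Campain Manager" href="/job/1">Details</a></td></tr>
<tr><td class=searchresultstopofjobdesc>Plan and run acquisition campaigns.</td></tr>
<tr><td class=searchresultsjoblink>
<a title="Retail Advisor" href="/job/2">Details</a></td></tr>
<tr><td class=searchresultstopofjobdesc>Help customers in store.</td></tr>
"""

items = pattern_to_regex(ITEM_PATTERN).findall(SAMPLE)
for title, link, description in items:
    print(title, link, description, sep=" | ")
```

Each match yields three captured values in pattern order: title, link, and description. Non-greedy matching keeps every capture as short as possible, so one job item never swallows the next.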
The item search pattern is specific enough that it will not match any text outside the content we need, so there is no need to narrow the page down first. The global pattern will therefore be the following:
{%}
This means: apply the item pattern to the whole page.
After typing the global pattern and the item pattern into the feed edit form and pressing the [Extract] button, we can see the clipped data:
Viewing extracted text snippets
Now we edit the feed properties as shown in the picture below. Note that we use the {%1} placeholder in the item title template, {%2} in the item link template, and {%3} in the item content template. These placeholders will be substituted with real values for each news item.
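The placeholder substitution can be pictured with a small Python sketch. It assumes {%1}, {%2}, {%3} refer to the captured parameters in the order they appear in the item pattern. The function name fill_template and the sample values are hypothetical, and this is not Feed43's own code.

```python
import re

def fill_template(template: str, params: tuple) -> str:
    """Replace {%1}, {%2}, ... with the matching captured parameter."""
    def substitute(match: re.Match) -> str:
        index = int(match.group(1)) - 1  # placeholders are 1-based
        # Leave out-of-range placeholders untouched rather than guessing.
        return params[index] if 0 <= index < len(params) else match.group(0)
    return re.sub(r"\{%(\d+)\}", substitute, template)

# Hypothetical values captured for one job item.
params = (
    "Direct Acquisition Campain Manager",
    "/job/1",
    "Plan and run acquisition campaigns.",
)

title = fill_template("{%1}", params)
link = fill_template("{%2}", params)
content = fill_template("{%3}", params)
print(title, link, content, sep="\n")
```

The same function is applied once per news item, so every item in the feed gets its own title, link, and content from its own captured values.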
Pressing the [Preview] button will render the feed in a viewable form:
Editing feed properties and previewing feed
Below we see the feed link and some additional options (for example, you can password-protect your feed or rename it, if necessary):
Resulting step with additional options
Now we copy the feed link and try it in a news reader:
Viewing the feed in a news reader
Voila! The news feed is up and running. As the original web page updates, new items will appear in the list.

Source: http://feed43.com/step-by-step.html
