#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2013
    Posts
    3
    Rep Power
    0

    Best way to get article content


    Hello. First of all, my English is not good, but I hope you understand what I mean. Also, I hope this is the right category for my question.

    I cannot figure out the best way to extract the content of articles. I mean articles on news sites such as CNN, BBC, Mashable and so on. Maybe something like Readability (but even that does not always work, and anyway I don't know how to build something similar).

    The first thing that crossed my mind was RSS. But even there, some sites offer only a partial feed and some no longer offer RSS at all. RSS lost "power" when Google decided to close Google Reader.

    I tried services like Full-TEXT RSS Feeds (fivefilters - search on Google). But even if I pay for it, it is not exactly what I want. Sometimes it does not work, and when it does, it returns "extra content".

    What should I ask my users for? To enter the RSS feed of that site? And then what? I would have the title, the URL and a limited text. I need the full text.
    Should I save the URL and pass it through the fivefilters filter? Then I will get different extra content depending on the feed entered.

    Waiting for your ideas, and questions if necessary! I hope I made myself clear; I tried my best.

    I hope you understand what I want and can help me get past this problem. I am not trying to steal content from other sites or anything like that. There is no reason to do that. I just want to give the articles a new format, alongside the original version of course. I just cannot find a way to get the full text.
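
    To make the RSS limitation above concrete, here is a minimal sketch of what a partial feed gives you: a title, a URL and a short teaser, but no full text. The feed below is a made-up sample (the site, titles and links are hypothetical), and parse_feed is just an illustrative helper name, not part of any service mentioned here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be downloaded from the site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
      <description>Short teaser only...</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
      <description>Another teaser...</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, URL and summary text."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            # Partial feeds put only a teaser here, not the article body.
            "summary": item.findtext("description", default=""),
        })
    return items


items = parse_feed(SAMPLE_FEED)
for entry in items:
    print(entry["title"], entry["url"])
```

    The full article body would still have to come from the page behind each URL, which is the part I cannot solve.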
  2. #2
  3. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,303
    Rep Power
    9400
    Originally Posted by Usetelm
    I hope you understand what I want and can help me get past this problem. I am not trying to steal content from other sites or anything like that. There is no reason to do that. I just want to give the articles a new format, alongside the original version of course. I just cannot find a way to get the full text.
    The problem is not stealing content. The problem is that you're hosting the content in a different place with the intention of having people go to your site instead of to theirs. At the very least you're stealing traffic and traffic is money.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Dec 2004
    Posts
    3,031
    Rep Power
    377
    Originally Posted by requinix
    The problem is not stealing content. The problem is that you're hosting the content in a different place with the intention of having people go to your site instead of to theirs. At the very least you're stealing traffic and traffic is money.
    How is this different from sites providing an RSS feed that you can put on your website?
  6. #4
  7. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,303
    Rep Power
    9400
    Because their RSS feeds don't (shouldn't) contain the whole article. You're actually driving traffic to them by getting the summary or preview out to more people, but those people still end up going to the original site to see the whole thing. (However they can still sic lawyers on you for it.)
    The problem arises when people don't need to go to the original site.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2013
    Posts
    3
    Rep Power
    0
    Originally Posted by requinix
    The problem is not stealing content. The problem is that you're hosting the content in a different place with the intention of having people go to your site instead of to theirs. At the very least you're stealing traffic and traffic is money.
    What about readability.com/articles/olxaobiw?readbar=1, readability.com/articles/zmeq2er9?readbar=1 and so on? Readability is very popular and nobody does anything to stop them from displaying that content.
  10. #6
  11. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,303
    Rep Power
    9400
    They make it easy to remove your (as a content publisher) site from their scraping and that's cheaper than hiring lawyers. But I don't know of any legal basis they can use to defend themselves.

    Now is the point I should say "Ask a lawyer first". If your lawyer says it's okay then I'd be surprised but whatever.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2013
    Posts
    3
    Rep Power
    0
    Originally Posted by requinix
    They make it easy to remove your (as a content publisher) site from their scraping and that's cheaper than hiring lawyers. But I don't know of any legal basis they can use to defend themselves.

    Now is the point I should say "Ask a lawyer first". If your lawyer says it's okay then I'd be surprised but whatever.
    I'll do the same thing: content publishers will have the option to remove their site from my "scraping". I cannot see the difference between what I want (I mean taking content) and what Readability does. Like I said, I will also display the original article, author, source and so on.
  14. #8
  15. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,303
    Rep Power
    9400
    Get a lawyer first. If someone decides to sue you for infringement, even if you're in the right it could still cost you a lot of money.
