YaCy

From LinuxReviews
YaCy
Developed by: Michael Peter Christen
Latest release: 1.3 / December 2012
OS: Java (platform independent: GNU/Linux, Windows, Mac OS X, etc.)
Type: Free Software Search Engine
License: GPL
Website: http://www.yacy.net

YaCy is a free, open-source, peer-to-peer (distributed) search engine program written in Java. It is designed to run on your personal desktop computer or on your server, and it doubles as a caching proxy. It comes complete with a web server, wiki, bookmark manager, blog and internal messaging system.

What is it?

YaCy's front page, by default available at 127.0.0.1:8090 when the YaCy software is running.

The YaCy software aims at becoming a global distributed Internet search-engine. Your YaCy indexes a small part of the web, your own searches are distributed among other peers, and your YaCy replies when other people search for web-pages it knows about.

This, in theory, will eventually allow you to search the whole indexable web with your own YaCy software even though your personal YaCy-installation only knows about a small portion of it.
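The idea can be illustrated with a toy model. This is a minimal sketch and not YaCy's actual protocol: the `Peer` class, its `index` dictionary and the `distributed_search` function are hypothetical names invented for this illustration, and the URLs are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Peer:
    """A toy peer that knows only a small slice of the web."""
    name: str
    # Maps a keyword to the set of URLs this peer has indexed for it.
    index: dict[str, set[str]] = field(default_factory=dict)

    def search(self, keyword: str) -> set[str]:
        """Return a copy of the URLs this peer knows for the keyword."""
        return set(self.index.get(keyword, set()))


def distributed_search(keyword: str, peers: list[Peer]) -> list[str]:
    """Ask each contacted peer and merge the answers, dropping duplicates."""
    merged: set[str] = set()
    for peer in peers:
        merged |= peer.search(keyword)
    return sorted(merged)


peers = [
    Peer("alice", {"linux": {"http://a.example/kernel"}}),
    Peer("bob", {"linux": {"http://b.example/distros"},
                 "java": {"http://b.example/jvm"}}),
    Peer("carol", {"java": {"http://c.example/yacy"}}),
]

# No single peer knows both "linux" pages, but the merged result has both.
results = distributed_search("linux", peers)
print(results)
```

Each toy peer holds only a fragment of the index, yet a search that is spread over several peers sees the union of what they know.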

But YaCy is much more than that. It is almost a complete all-in-one Internet assistant, because YaCy has many other interesting features, including:

  • A built-in web-server. The web-server which shows you the search-page also allows you to publish a homepage, files and so on.
  • A proxy, which caches the pages you visit and optionally indexes them so they become part of the global search-index.
  • A password-protected file-sharing area, which allows you to share files with your friends.
  • A very basic wiki, and a very simple blog function.

The Search Engine

The Search Engine is YaCy's main feature. It works like any other search engine: you type in what you want to read about and it gives you a list of results. You can optionally choose how long you want YaCy to search before it presents the results. The longer you ask it to search, the more peers are asked for results.

The search results

YaCy knows about quite a few pages regarding most keywords, but it does present far fewer search-results than most major search-engines. It also doesn't do a very good job at sorting the pages according to relevance.

Seeing is believing: you can try a few random searches using YaCySearch.com and compare them with search engines such as Google.com and MSN.

The mainstream press reported that YaCy covered 205 million Internet pages in January 2007[1], but the actual total of known URLs was closer to 214 million that month. This may sound like a lot, but search engines such as Exalead.com brag that they cover 8 billion pages. It is very hard to say just how big the web really is, so it is hard to know how much of it YaCy covers. The Internet has about 100,000,000 hostnames according to Netcraft[2], but one hostname could hold anything from zero to millions of actual pages.

The search results YaCy presents are very impressive if you consider that it is a free, open-source, GPL-licensed search engine running on home computers, and that it has already indexed two hundred million pages with only close to a hundred active peers in the YaCy network.

Blacklists and Censorship

The indexable web is what the crawler software used by major search-engines - and P2P search engines like YaCy - can index and make appear in the search-results.

The visible web is what is actually shown on the Internet's front pages[3][4].

There is intentionally a big difference between the visible web and the indexable web. Governments, especially in NATO countries, can and do instruct search-engines that they must "not embarrass us", and all major search-engines are more than willing to censor their services[5].

YaCy is a peer to peer search-engine, so there is no central authority that can say "blahblahblah.tld? We don't like that one. We want as few people as possible to be aware that it even exists. We'll just blacklist it and hope nobody notices."

You can add URLs to a blacklist to prevent them from showing up in your peer's search-results. And you can also import blacklists from other peers.

But you cannot blacklist or censor sites globally. You can make your own peer not know about a website, but you can't make every YaCy peer ignore it. This makes YaCy very censorship-resistant. It is simply not possible for someone to censor away parts of the indexable web from the visible web. The one exception is your Internet Service Provider, which can make sure you can't communicate with other YaCy peers, in which case you're pretty much screwed anyway.
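The local-only nature of a blacklist can be sketched in a few lines. This is an illustration under stated assumptions, not YaCy's implementation: `filter_results` is a hypothetical helper, it matches on the host name only, and YaCy's real blacklist format and matching rules are not reproduced here.

```python
from urllib.parse import urlparse


def filter_results(urls, blacklisted_hosts):
    """Drop results whose host is on this peer's own blacklist."""
    blocked = {host.lower() for host in blacklisted_hosts}
    return [u for u in urls if (urlparse(u).hostname or "") not in blocked]


# The same results arrive from the network at two different peers.
network_results = [
    "http://blahblahblah.tld/page",
    "http://example.org/article",
]

# My peer blacklists one host; my friend's peer blacklists nothing.
my_view = filter_results(network_results, {"blahblahblah.tld"})
friend_view = filter_results(network_results, set())
print(my_view)
print(friend_view)
```

The filter runs on each peer separately, so one peer's blacklist changes only what that peer shows and leaves every other peer's view untouched.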

YaCy's search-results are censorship-resistant, and that is its biggest quality.

Resource demands

YaCy is written in Java, so its resource usage can be high. However, system resource monitoring is integrated into YaCy and the amount of resources it uses is highly configurable, so YaCy can be set up to run unobtrusively on low-spec servers and still index websites effectively. In that sense its minimum system requirements are low: it has been demonstrated successfully indexing and performing web searches on smartphones and on devices clocked below 600 MHz.

Search Engine Optimization

The first thing that came to mind when reading about YaCy was: It must be real easy to do "Search Engine Optimization" (SEO) using YaCy. And it is.

See, YaCy isn't really that big yet, and there really aren't that many peers yet, so YaCy doesn't know about every single website out there - yet. Traditionally you'd have to wait for the big search-engines to finally decide to crawl your site(s), you'd have to guess what they think of them, and you'd have to hope that they would grab your entire site and not just grab your front page and then wait months before re-visiting. If you run a YaCy peer then you can simply ask your YaCy to spider and index your site, and voilà, it does, and voilà, it now shows up in the search-results at other YaCy peers. It does appear that some of the peers run YaCy for exactly this reason.

Further, YaCy's search-results have a link to "Info" on the search-results pages, and here you'll find interesting things like "Parsed text" and "Parsed sentences". This can be useful even if you're not running a YaCy peer, or don't even care what it is, since it shows you exactly what part of the text on your

Privacy

YaCy is designed to hash the search queries sent between computers. By definition, these hashes cannot be meaningless to other peers in the YaCy search network, since those peers must know which results to send back to the requester. YaCy therefore provides no strong privacy guarantees. However, since search queries sent to most mainstream, centralised search engines are sent in plain text, YaCy is at least as private as centralised search, if not more so, depending on who you wish to keep the data private from.

Clearly, if your goal is to keep information from organisations like Google and those they actively share search information with, YaCy is an improvement, since no such direct or active sharing is done with those organisations -- at least, not by design.

It must be noted, however, that peers directly connected to a YaCy server may see searches sent from that server, and the operators of those connected peers, as well as any computers in-between, may monitor search results.

Secondly, governmental or similar organisations with enough power and control of networks (such as the NSA and GCHQ) are very likely to be monitoring all traffic, or at least "interesting" traffic, between computers on the Internet.

Operators running YaCy servers do not see the cleartext keywords being searched for. To do so, they would need to modify YaCy to work them out and/or list the sites that are returned for a given keyword, which is relatively simple but does require programming skills. Identifying the actual machines and users performing the searches would be less straightforward, however. Another method would be to simply monitor traffic sent back to other peers.
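Why hashed keywords are only a weak protection can be shown with a dictionary lookup. The sketch below is an assumption-laden illustration: SHA-256 stands in for YaCy's own word-hash function, which is not reproduced here, and `word_hash`, `build_lookup` and `recover` are hypothetical helper names.

```python
import hashlib


def word_hash(word: str) -> str:
    """Stand-in hash. SHA-256 is an assumption, not YaCy's real function."""
    return hashlib.sha256(word.lower().encode("utf-8")).hexdigest()


def build_lookup(dictionary):
    """Precompute hash -> word for every word the observer can guess."""
    return {word_hash(w): w for w in dictionary}


def recover(observed_hashes, lookup):
    """Map observed hashes back to words where the guess list covers them."""
    return [lookup.get(h) for h in observed_hashes]


# The observer hashes a list of likely search terms in advance ...
lookup = build_lookup(["linux", "java", "privacy", "search"])

# ... and later sees two hashed queries pass through its peer.
observed = [word_hash("privacy"), word_hash("zebra")]
recovered = recover(observed, lookup)
print(recovered)
```

Any keyword that appears in the observer's guess list is recovered, while a keyword outside the list stays unknown (`None`). Since most searches use common words, an unsalted hash of the keyword hides little from a motivated peer operator.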

In short: If you're searching using a traditional search-engine then you're telling that search engine, its operators, those they share that information with, and all computer networks in between, what subjects you are interested in. With YaCy, you're telling a wider range of peers, and a wider range of networks, but in a less easily readable way. You are also perhaps not sending information to large organisations which are clear targets for information requests about what you search for.

A partial solution to the P2P side of the privacy issues is to use publicly available YaCy peers[6] such as YaCySearch to do your YaCy searching. But this brings you back to square one: you get the public server's privacy issues, with the P2P privacy issues still behind the scenes. The peer owners can easily look at the interface and see which IP is searching for which keyword, so you're basically back to the "All your keywords belong to us" privacy issues that traditional search-engine systems have. If you have three friends who are running YaCy, and you randomly use their peers to search while they randomly use yours, then it does become slightly harder for an adversary to see who is searching for which keywords.

Note that YaCy also logs local searches with the keywords in plain text at the YaCy peer where the search is done.

The best solution to YaCy's privacy issues is the same solution that applies to all other web-browsing: use the Tor Internet privacy system for all your traffic, including YaCy searches and all other searches for that matter. In addition, don't search using your own peer, but encourage others to search through it.

The censorship-resistant properties YaCy has are great. The privacy properties it has are not so great. You can and should use software like Tor in combination with YaCy to get good censorship-resistant properties and good privacy properties.

Note from YaCy developers: it is not true that a remote search asks every other peer for its results. The index is distributed into a DHT (distributed hash table), and during a search only some other peers are contacted, not all of them. The developers are aware that a malicious person can occupy a sensitive position inside the DHT and monitor search requests from outside. We have developed a peer-hopping procedure against such surveillance methods, which will be released sometime this year and will make surveillance of remote searches impossible. Furthermore, remote searches never use cleartext words in requests, only word hashes.
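The point that only some peers are contacted can be illustrated with a generic hash ring. This is a sketch of the general DHT idea, not of YaCy's actual DHT: the ring layout, the use of SHA-256, the `redundancy` parameter and the function names are all assumptions made for this example.

```python
import hashlib
from bisect import bisect_left


def ring_position(text: str) -> int:
    """Place a word or a peer name on a 64-bit ring (SHA-256 is an assumption)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def responsible_peers(word, peer_names, redundancy=2):
    """Return the peers whose ring positions follow the word's position."""
    ring = sorted((ring_position(name), name) for name in peer_names)
    start = bisect_left(ring, (ring_position(word), ""))
    count = min(redundancy, len(ring))
    # Walk clockwise from the word's position, wrapping around the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


peers = [f"peer{i}" for i in range(10)]

# Only two of the ten peers are asked about this word hash.
chosen = responsible_peers("linux", peers)
print(chosen)
```

Because the word's hash alone decides which peers are responsible, every searcher contacts the same small group for a given word. That is also why a peer sitting at the right position on the ring sees all requests for that word, which is the monitoring risk the developers describe.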

Corporate Website Search Engine?

YaCy can be set up as a back-end for a corporate website, but there are far better alternatives for adding a simple search function to a website. YaCy isn't designed to be a search back-end for a single site, and it requires too many resources to make it worth having on a server just to add a search function. Take a look at the DataparkSearch Engine if that's what you need.

References

  1. Heise.de (German): Jimmy Wales' Suchmaschine zum Mitmachen nimmt Gestalt an ("Jimmy Wales' participatory search engine takes shape")
  2. NetCraft: January 2007 Web Server Survey
  3. Alexa Top 100 Websites
  4. Winner-Take-All: Google and the Third Age of Computing
  5. Google censors itself for China
  6. YaCy Demopeers