Hello.
I have been using Findx as a secondary search engine for a few months. For a while now I have noticed that certain searches need to be exact in order to return proper results. For example, if I type Mesa3D.org, I see a link to the Mesa3D.org homepage, but if I type just Mesa3D, the Mesa3D homepage does not show up. The same thing happens when I search for GitHub.
Are there any plans to improve search results like this?
|
Hi,
thank you for your interest in Findx!
Yes, we have discussed how to improve queries like this, and your request has actually prompted us to start developing this feature already.
We will give domain matches a ranking boost, but only for single-term queries. I know Google used to boost domains matching concatenated queries, and it opened up a can of worms due to "spamming": creative marketers registered long domains like bestflatscreentv.com (not an actual example) to match the query "best flat screen tv". We think the risk is limited when we only enable the boost for single-term queries.
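Roughly, the heuristic could look like the sketch below (illustrative Python only, not our actual ranking code; the boost factor and the domain handling are made-up placeholders):

```python
# Illustrative sketch of the single-term domain boost described above.
# The boost factor and domain handling are placeholders, not real Findx code.
from urllib.parse import urlparse

DOMAIN_MATCH_BOOST = 2.0  # hypothetical multiplier

def score_with_domain_boost(query: str, url: str, base_score: float) -> float:
    terms = query.strip().lower().split()
    if len(terms) != 1:
        # Multi-term queries are never boosted, so "best flat screen tv"
        # cannot be gamed by registering bestflatscreentv.com.
        return base_score
    host = (urlparse(url).hostname or "").lower()
    host = host.removeprefix("www.")
    # "mesa3d" should match both "mesa3d.org" and "www.mesa3d.org".
    if terms[0] == host or terms[0] == host.split(".")[0]:
        return base_score * DOMAIN_MATCH_BOOST
    return base_score

# score_with_domain_boost("mesa3d", "https://www.mesa3d.org/", 1.0)       -> 2.0
# score_with_domain_boost("mesa3d docs", "https://www.mesa3d.org/", 1.0)  -> 1.0
```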
I'll update this issue once the feature is in place and our index has been updated.
|
I do not remember whether I have written about it in some of the Privacore/FindX issue reports before, but my current (2018_07) belief is that just like humans understand text differently, search engines can also analyze and classify texts differently.

The Core of the Matter (The Frame Problem)
One way to understand "the context" is that a text T_1 is an element of some container C_1, which can in turn reside in some other container C_2, which can in turn reside in another container C_3, and so on; the containers C_1, C_2, C_3, ... are the context of the text T_1. For example, if the T_1 is the phrase "I love You", then the C_1 that changes the meaning of the T_1 might be that the T_1 is said not in real life, but by an actor to another actor in a movie. The C_2 that contains the C_1 might be that the characters that the actors play in that movie are Americans, not Estonians, which renders the "I love You" to an Estonian-specific meaning. So, C_1 changes the meaning of the T_1 and the C_2 changes the meaning again.
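A crude way to model that nesting in code is the following sketch (purely my own illustration; the context functions are made-up toys):

```python
# Illustrative sketch of the T_1-inside-C_1-inside-C_2-... nesting.
# Each context is modeled as a function that may change the interpretation
# produced by the contexts inside it; the example contexts are made up.

def interpret(text, contexts):
    """Apply contexts innermost-first: C_1, then C_2, then C_3, ..."""
    meaning = text
    for context in contexts:
        meaning = context(meaning)
    return meaning

c1 = lambda m: m + " (said by an actor in a movie, not in real life)"
c2 = lambda m: m + " (the characters are Americans, not Estonians)"

print(interpret("I love You", [c1, c2]))
```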
It is not known how many contexts there are, but the number of contexts that a human is able to understand is limited, and the more intelligent the human, the more contexts he or she can notice and understand. I do not know the meaning or story behind the Matryoshka doll, but maybe it has something to do with the fact that context changes the meaning.

Real Life Example

That also explains why Google Translate will never work as well as a human translator does, unless Google Translate sends a robot, an android, to live among humans to GATHER INFORMATION that describes the context, or obtains that information by some other means, maybe by derivation and analysis of huge data sets. Computational-power-wise a few Google datacenters match the human brain pretty well; what is lacking is the information and the speed of information exchange. The datacenters have the problem that, since the speed of light is limited, signal transfers have physical-distance-related delays: it takes electrical signals/light time to travel from one place to another, and the human brain has much smaller physical dimensions than the huge Google datacenters have.

-----sidenote---start----
The same problem occurs with physically large microchips: it takes time for a signal to travel from one corner of a die to another, although, as I explained in a post titled "Multi-core CPU Production Economics", due to production economics the future multi-core CPUs with thousands of CPU-cores will likely be multi-die chips.
-----sidenote---end------
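To put rough, illustrative numbers on those distance related delays (a back-of-the-envelope sketch; the distances are made-up examples and the propagation speed is only a rule of thumb):

```python
# Back-of-the-envelope sketch of the distance-related signal delays mentioned
# above and in the sidenote. Signals in copper/fibre propagate at very roughly
# 2/3 of the speed of light in vacuum, i.e. about 2e8 m/s.
PROPAGATION_SPEED = 2.0e8   # m/s, rough figure

def one_way_delay_ns(distance_m: float) -> float:
    return distance_m / PROPAGATION_SPEED * 1e9

print(one_way_delay_ns(0.02))   # ~0.1 ns across a 2 cm die
print(one_way_delay_ns(0.15))   # ~0.75 ns across a human-brain-sized distance
print(one_way_delay_ns(200))    # ~1000 ns across a 200 m datacenter hall
```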
A Few Wild, Possibly Flawed Ideas on How to Approach This

Definitions

For the sake of simplicity, let's define the Internet to be a FINITE set of plain text documents, a Universal Set, and a search engine as a function that selects a subset of those documents. In practice the idea that the Internet is a finite set of documents actually holds for the end user of a classical, non-P2P search engine, because those search engines, including Google and FindX and Bing, have a finite index. Even if there are multiple search engines at play, the idea that the number of documents is finite for the end user still holds, because the Universal Set of documents may consist of all of the documents that the search engines have crawled, all of the links that are described in the crawled documents, and the documents that the end user has obtained by other means. The assumption is that the data size of a single document is always finite.

A search engine can be seen as a query-parametrized FILTER that is applied to ALL DOCUMENTS in the Universal Set of documents. (Maybe a more intuitive example is that, in the case of corporate information systems, the automatically generated reports are essentially predefined filters combined with some calculations that are applied to the data that the filter has selected. The search engine parameters, "queries", can be so complex that they have to be saved for later re-use.)
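Written down as code, those definitions are almost trivially short (an illustrative sketch of the definition above, not of any real engine):

```python
# Sketch of the definitions above: the "Internet" is a finite set of plain
# text documents and a search engine is a query-parametrized filter over it.

def search_engine(universal_set: set[str], query_predicate) -> set[str]:
    """Return the subset of documents selected by the query."""
    return {doc for doc in universal_set if query_predicate(doc)}

universal_set = {"Mesa3D is a graphics library.",
                 "GitHub hosts git repositories.",
                 "A page about cooking."}

# A saved, reusable "query" in the sense described above:
mesa_query = lambda doc: "mesa3d" in doc.lower()

print(search_engine(universal_set, mesa_query))
```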
An Ideal Solution

As the meaning depends on the context and the context is search engine end user specific, an ideal search result, an ideally selected subset of the Universal Set of text documents, is user specific. A half-user-specific solution might be that when a doctor uses a medical term as a search query, he or she gets subfield-specific documents as the search result; if a pharmacist uses the same medical term, the search results might consist of drugs that are used in the treatment of the issues related to that medical term; and if a non-medical professional searches for that term, the search results might consist of general discussions about the medical term, links to home pages of fine doctors, links to warnings about shoddy doctors, etc. As a single person can have many roles, one query parameter might be the role of the person.
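The role-as-a-query-parameter idea might be sketched like this (illustrative only; the roles and the ranking hints are made up by me):

```python
# Sketch of the half-user-specific idea: the same term, re-ranked according
# to the role of the person issuing the query. Roles and hints are made up.

ROLE_HINTS = {
    "doctor":     ["study", "subfield", "clinical"],
    "pharmacist": ["drug", "dosage", "treatment"],
    "layperson":  ["explained", "discussion", "warning"],
}

def role_aware_rank(results: list[str], role: str) -> list[str]:
    hints = ROLE_HINTS.get(role, [])
    def score(doc: str) -> int:
        return sum(hint in doc.lower() for hint in hints)
    return sorted(results, key=score, reverse=True)

results = ["Clinical study of term X",
           "Drug dosage for term X",
           "Term X explained for patients"]
print(role_aware_rank(results, "pharmacist"))  # drug/dosage document first
```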
The Mad Idea: a Distributed P2P Search Engine That Indexes Only Local Documents

I won't repeat myself here, but I have described part of the idea at my Silktorrent Fossil repository (archival copy). The distributed P2P search engine idea is not new; for me the inspiration has been YaCy. However, because the search results should be search engine end user specific, there need to be end user specific indexes, which means that the Universal Set of text documents needs to be fully indexed for (almost) every end user at least once and then re-indexed in an end user specific manner after the end user has intellectually changed. For example, the teenage me probably needs different search results than the adult me. A student studying topic TOPIC_01 probably needs beginner tutorials and early, introductory scientific papers about the TOPIC_01, but a student who has completed the course on the TOPIC_01 probably needs a different set of scientific papers about the TOPIC_01, maybe later advancements and derivatives of the TOPIC_01.
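The per-end-user (re)indexing could be sketched roughly like this (my own illustration; the "user context" model below is a deliberately naive placeholder):

```python
# Sketch of per-end-user (re)indexing: the same Universal Set of documents,
# indexed with a user-specific weighting that changes as the user changes.
# The "user context" model (a bag of interest terms) is a naive placeholder.

def build_user_index(universal_set, user_interest_terms):
    """Map each document to a user-specific relevance weight."""
    index = {}
    for doc in universal_set:
        weight = sum(term in doc.lower() for term in user_interest_terms)
        index[doc] = weight
    return index

docs = {"Beginner tutorial on TOPIC_01",
        "Recent advanced papers deriving from TOPIC_01"}

student_before_course = build_user_index(docs, ["beginner", "tutorial", "introductory"])
student_after_course  = build_user_index(docs, ["advanced", "deriving", "recent"])
# The user has "intellectually changed", so the whole collection gets
# re-indexed with the new, user-specific weights.
```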
To my current knowledge, the only way to do that scalably is to use some sort of scalable P2P system. Maybe it does not need to be the kind of P2P system where the end users run the nodes. After all, the 2018 Internet is a P2P system where the Internet Service Providers, as commercial entities, have teamed up to form the physical P2P network that is "the internet" (IPv4/IPv6). Maybe, instead of a YaCy-like solution where every person individually runs a P2P node, each household or a bigger house will host a local server room that is paid for as part of the house utilities. Server rooms of different multi-flat houses may form a P2P network, and maybe there are only a few gateways of that P2P network per neighborhood.

---sidenote--start--
In Estonia every smaller village has just a few, maybe just one, optical cables anyway. Those cables carry the whole traffic of the village, which means that physically the P2P-network-forming hardware tends to get pooled into local regional central points of failure anyway. Another example of such a local central point of failure is the household's internet connection: if it goes down, all family members are without a proper, non-mobile, internet connection. In 2018 Estonia the Toompea Supermafia, so to speak the Government, the State, has an optical network that covers literally every village, and the ISPs, including the companies that are considered big in the context of Estonia's ~1.4 million inhabitants, either run their own optical network or rent those optical cables from the Supermafia, so that broadband internet is affordable all around 2018 Estonia, even in some deep forest, if that's where one wants to live. By affordable I mean about 30€/month with at most about a 2000€ initiation/connecting fee in 2018 EURs, regardless of location. The overall loot/tax rate in 2018 for common Estonians is about 60%, so the Toompea Supermafia, the Government, can afford to "compensate" the optical cabling cost that exceeds the few-thousand-EUR initiation fee, but for most rural Estonians the initiation fee is at most a few hundred EUR, probably even less, depending on the circumstances.
---sidenote--end----
So, part 1 of the mad idea was that the indexed data should be stored locally for re-indexing according to the context of every person, and that there is a scalable and affordable way to do that by using a P2P system that can be maintained by professionals, so that non-IT-people do not have to bother with the maintenance of the P2P search engine. Part 2 of that mad idea is that the candidate systems for storing and sharing the local document collections are the ZeroNet, the IPFS and the Dat/Beaker.
My 2018 favorite is the ZeroNet, but as of 2018_07 it is NOT READY for prime time. According to my very subjective 2018_07 opinion, the IPFS is a failure in terms of implementation technology choice, because the Go compiler is hard to get working, hard to compile. (One version of the Go compiler needs an earlier version of the Go compiler to compile, which in turn needs an even earlier version of Go to be available, etc. If it can't be bootstrapped easily, then it can't be ported easily, and that is enough to disqualify Go for me.) Interestingly, as of 2018_07 the IPFS developers have worked on reimplementing the IPFS in JavaScript, but NodeJS is certainly NOT anything lightweight to run as a background process. The Beaker/Dat people have avoided the Go blunder and the Beaker/Dat is a serious contender, but at some point I suspected that they might lack in modularity by depending heavily on web browser integration. I hope that I'm mistaken about the Dat/Beaker. As the IPFS people have the social issue that they have to show something for the money of their supporters, the Dat/Beaker project and the ZeroNet project are socially much better positioned to scrap failed attempts, break backwards compatibility and create a clean implementation from scratch. That's why I think that the Dat/Beaker and the ZeroNet have a much better chance of creating a quality-oriented solution that will really scale, once they get their work completed to a state where the projects can technically withstand mainstream adoption. My 2018_07 favorite is the ZeroNet, and as of 2018_07 I think that the ZeroNet is the ideal candidate for maintaining the local document collection that a local P2P search engine node should index.
As regards projects like the Privacore and the FindX: as of 2018_07 I believe that one business model for the FindX/Privacore might be ... or the hosting of small, dedicated, private, P2P search engine nodes. As of 2018_07 the Gigablast author seems to have moved from server sales to a donation based financing/"business" model. (Quotes, because I can't call the gathering of donations a business, even if it brings in a lot of money, unless it is a public service, in which case the money transfers should NOT be called donations, but voluntary-keep-running-payments. I guess the Gigablast.com 2018_07 money gathering qualifies as voluntary-keep-running-payments.)

Future Services to Search Engine Runners
Just like the car industry does not produce paint, rubber or metal itself, the paint production being outsourced to the chemical industry, the search engine industry may divide its task into subtasks and outsource some of them. For example, one sub-task might be the collecting of links. There might be a Linux Foundation like, Apache Foundation like or Eclipse Foundation like non-profit that is jointly financed by many search engine providers, and that non-profit does only one thing: it creates a gigantic collection of links, without indexing anything. The link collection does not include any duplicate entries. There is absolutely NO LABELING; it is only a raw collection of links. Different search engines might use the same collection of raw links and index the documents by using different contexts. One indexes the documents according to the context of doctors, another according to the context of logistics specialists, etc.
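The jointly financed raw link collection could be as dumb as this sketch suggests (illustrative; the URL normalization rules here are made-up simplifications):

```python
# Sketch of the shared, label-free link collection: nothing but a
# deduplicated set of normalized URLs that different search engines can
# then index under different contexts. Normalization here is simplistic.
from urllib.parse import urlparse, urlunparse

def normalize(url: str) -> str:
    p = urlparse(url)
    return urlunparse((p.scheme.lower(), p.netloc.lower(),
                       p.path.rstrip("/") or "/", "", p.query, ""))

class RawLinkCollection:
    def __init__(self):
        self._links = set()            # no labels, no ranking, just links

    def add(self, url: str) -> None:
        self._links.add(normalize(url))

    def dump(self):
        return sorted(self._links)

collection = RawLinkCollection()
collection.add("https://www.Mesa3D.org/")
collection.add("https://www.mesa3d.org")   # duplicate after normalization
print(collection.dump())                    # one entry, not two
```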
Instead of a single "Bing" or "Google", people would pick a specialized search engine, much like some already existing specialized search engines. The specialized search engines might finance themselves by collecting voluntary-payments-to-keep-running (hereafter: VP2KR) and they may limit their services according to the IP-address ranges that were indicated in the VP2KR money transfers. Each VP2KR money transfer may include a region indicator, like "Estonia" or "town Foo", in its comment field, and then all of the resources of the search engine nonprofit are distributed according to the distribution that forms from the VP2KR money transfers. Those money transfers that do not indicate a specific region go to the "global serving" pool of the search engine.
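The region tagged VP2KR bookkeeping reduces to simple proportional arithmetic, roughly as in this sketch (the payment records and the pool name are invented examples):

```python
# Sketch of distributing a search engine's serving capacity in proportion
# to region-tagged voluntary-payments-to-keep-running (VP2KR).
from collections import defaultdict

payments = [
    ("Estonia", 30.0),
    ("town Foo", 10.0),
    ("", 60.0),          # no region in the comment field -> global pool
]

def capacity_shares(payment_records):
    totals = defaultdict(float)
    for region, amount in payment_records:
        totals[region or "global serving pool"] += amount
    grand_total = sum(totals.values())
    return {region: amount / grand_total for region, amount in totals.items()}

print(capacity_shares(payments))
# {'Estonia': 0.3, 'town Foo': 0.1, 'global serving pool': 0.6}
```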
As long as humanity uses deception, there will always be censorship, in various ways (archival copy). There will always remain a niche market for indexing those documents that are somehow banned, be it due to private interests or supermafia ("State" in newspeak terms) demands. As long as some members of humanity are superficial, plain stupid or lazy at educating themselves with self-sought-out materials, deception will be successful, and the work of Public Relations (read: lying for money) specialists and mainstream journalists (for whom almost anything goes, as long as it pays well) will not run out. As long as the supermafia ("State") exists, nonprofits that run public search engines will always have supermafia-induced censorship limits, but even if they serve only those materials that pass the censorship, they lessen the load on private P2P search engine nodes. Basically, in the future people are expected to run their personal search engine aggregators that use the censored, public, search engines, their private engine instances and their neighborhood server room P2P search engine instances.
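Such a personal aggregator is conceptually just a merge over several backends, as in this sketch (the backend names and interfaces are invented for illustration):

```python
# Sketch of a personal search aggregator that merges results from censored
# public engines, a private engine instance and a neighborhood P2P node.
# All backend names and interfaces here are invented for illustration.

def aggregate(query: str, backends) -> list[str]:
    seen, merged = set(), []
    for backend in backends:
        for result in backend(query):
            if result not in seen:
                seen.add(result)
                merged.append(result)
    return merged

public_engine         = lambda q: [f"public result for {q}"]
private_instance      = lambda q: [f"private result for {q}"]
neighborhood_p2p_node = lambda q: [f"neighborhood result for {q}",
                                   f"public result for {q}"]   # duplicate

print(aggregate("TOPIC_01",
                [public_engine, private_instance, neighborhood_p2p_node]))
```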
future "general IT-support". Just like in 2018 there are freelancing plumbers and car repairs specialists, in the future there might be freelancing P2P-search-engine node suppliers, who install a dedicated device on the demands of the "general IT-support". The jobs of applied statisticians, so called "data scientists",
is essentially describing and running mathematically advanced queries. The 2018 "data analytics" companies are essentially specialized search engine companies that run their own, internal, search engines on "small", temporary, "sets of documents"(client data) "one-query-series-at-a-time". As of 2018_07 I guess that the world of applied statisticians and the world of classical search engines can be combined in far more elegant way than the various year 2018 web page analytics and e-shop Artificial Intelligence applications are. I guess that the WolframAlpha is certainly a step to that direction, but there are laso other similar efforts. etc.etc.etc.
Hardware Trends

The trend of reducing power consumption and parallelizing as much as possible fits well together with Andrew Zonenberg's AntiKernel idea (an unofficial collection of materials resides at my repository). As of 2018_07 I suspect that one future development might be that, instead of letting the general purpose operating system's file system drivers handle file-system-related low level details like user permissions, error correction codes, journaling, post-power-failure cleanup, etc., the future "hard disks" (HDDs) might offer a POSIX file system protocol instead of a byte and address based protocol. Just like the modern flash memory cards handle wear leveling (archival copy) transparently, the future "HDDs" may handle all of the file system level errors transparently.
A step forward might be that, instead of a file system, some standardized, SIMPLISTIC database engine is used. That is to say, the communication protocol of the future "HDD" might be the communication protocol of a simplistic, standardized database engine. That idea is partly described at my site (archival copy). By "simplistic" I mean something simpler than SQLite, something where queries always have timing guarantees. The timing guarantees can vary by device, but the worst case timing parameters might be read out of the device, just like the IDs of year 2018 HDDs and CPUs can be read out of the device. The implementation of the database engine might use FPGAs that are programmed by using "High Level Synthesis". If the code is done to safety critical system quality, then software flaws are rare enough to be irrelevant.
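As an interface, such an "HDD" might look something like the following sketch (purely hypothetical; the method names and timing figures are invented):

```python
# Hypothetical sketch of a "smart HDD" that speaks a simplistic database
# protocol instead of a byte/address protocol. Method names and the timing
# figures are invented; the point is the worst-case timing guarantee that
# can be read out of the device, and the ability to push simple filtering
# down to the device itself.

class SmartHDD:
    WORST_CASE_GET_MS = 2                     # guaranteed upper bound, device-specific
    WORST_CASE_SCAN_MS_PER_1000_RECORDS = 5

    def __init__(self):
        self._records = {}                    # device hides errors, wear, journaling

    def put(self, key: str, value: bytes) -> None:
        self._records[key] = value

    def get(self, key: str) -> bytes:
        return self._records[key]

    def select(self, predicate) -> list[str]:
        """Filtering pushed down to the device, e.g. for a search index."""
        return [k for k, v in self._records.items() if predicate(v)]

hdd = SmartHDD()
hdd.put("doc:1", b"Mesa3D is a graphics library.")
print(hdd.select(lambda v: b"Mesa3D" in v))
print(SmartHDD.WORST_CASE_GET_MS)             # timing parameters readable like an ID
```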
implementation that uses data that is stored on such "HDDs". That is to say, when planning for search engine software architecture, then that's probably one thing to watch out for, specially in terms of how to design search algorithms modular enough to allow the leveraging of the computational power of such HDDs. Redesigning search algorithms, indexing algorithms, might be a lot of work, TOO MUCH WORK, if the work pile has accumulated over 10 years or so. The ConclusionNot even Microsoft and Google can afford to scale
Not even Microsoft and Google can afford to scale their search engines to the point where all known documents are re-indexed for every end-user according to an end-user-specific context. Probably only P2P systems can scale to that level. The P2P systems do not have to be run by every end-user individually; there can be small scale clustering of computation resources in the form of neighborhood servers, company servers and household servers. That is to say, it is hopeless for Privacore/FindX to offer proper query results to all end-users. Even Microsoft and Google can not TECHNICALLY do it, even if they wanted to. But it MIGHT be possible to get some sub-task completed in some very future-proof manner, so that the current investment will not be useless for offering future services.

Thank You for reading this "blog post"
:-)
|